Creating a Corpus for Georgian Language Modelling
Georgian is not just low-resource; for a living language spoken by millions, it is remarkably under-researched. Existing large language models handle it poorly: GPT-3's tokenizer spends 2–3 BPE tokens on every single Georgian character, compared with roughly one token per common English word.
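As a quick illustration (not from the paper), the effect is easy to reproduce with OpenAI's tiktoken library and the GPT-3-era r50k_base encoding; exact counts vary by text, but Georgian reliably costs several tokens per character:

```python
import tiktoken  # pip install tiktoken

# r50k_base is the encoding used by the original GPT-3 models.
enc = tiktoken.get_encoding("r50k_base")

georgian = "გამარჯობა"  # "hello" in Georgian, 9 characters
english = "hello"

print(len(enc.encode(georgian)))  # typically 2-3 BPE tokens per character
print(len(enc.encode(english)))   # a single token for a common English word
```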
To address the absence of well-organized training data, we built a complete data collection and processing pipeline and published an initial 37GB clean corpus on HuggingFace.
Pipeline
Data flows through four stages:
- Collection — site-specific web crawlers and PDF extraction (via PyMuPDF with custom Georgian encoding normalization) across 25 Georgian domains, plus CulturaX's Georgian subset (extraction sketch after this list).
- Metric-based filtering — Georgian character ratio, repetition/stopword/flagged-word ratios, and language detection via FastText (filtering sketch after this list).
- Cleaning — PII anonymization and Mtavruli→Mkhedruli character normalization (normalization sketch after this list).
- Deduplication & splitting — URL deduplication followed by MinHashLSH near-duplicate removal, then a 90/5/5 train/val/test split in JSONL format (dedup sketch after this list).
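For the collection stage, here is a minimal extraction sketch with PyMuPDF, which the pipeline names. The custom Georgian encoding normalization is project-specific and not reproduced here, so `normalize_encoding` below is only a hypothetical placeholder:

```python
import fitz  # PyMuPDF: pip install pymupdf

def extract_pdf_text(path: str) -> str:
    """Concatenate the plain text of every page in a PDF."""
    with fitz.open(path) as doc:
        return "\n".join(page.get_text("text") for page in doc)

raw = extract_pdf_text("document.pdf")  # "document.pdf" is a placeholder path
# The pipeline's custom Georgian encoding normalization (e.g. remapping
# legacy-font code points to proper Georgian letters) is not shown;
# normalize_encoding is a hypothetical name for that step.
# text = normalize_encoding(raw)
```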
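For the filtering stage, a sketch assuming FastText's public lid.176.bin language-ID model; the character-ratio and confidence thresholds here are illustrative choices of mine, not the pipeline's actual values:

```python
import fasttext  # pip install fasttext
# lid.176.bin: https://fasttext.cc/docs/en/language-identification.html

lid_model = fasttext.load_model("lid.176.bin")

def georgian_char_ratio(text: str) -> float:
    """Share of alphabetic characters in the Georgian Unicode blocks."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    georgian = sum("\u10d0" <= c <= "\u10ff" or "\u1c90" <= c <= "\u1cbf"
                   for c in letters)
    return georgian / len(letters)

def keep_document(text: str, min_ratio: float = 0.8, min_conf: float = 0.7) -> bool:
    """Illustrative thresholds, not the pipeline's actual values."""
    if georgian_char_ratio(text) < min_ratio:
        return False
    # FastText's predict expects a single line of text.
    labels, probs = lid_model.predict(text.replace("\n", " "))
    return labels[0] == "__label__ka" and probs[0] >= min_conf
```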
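For the cleaning stage, Mtavruli→Mkhedruli normalization comes down to a fixed code-point shift: the Mtavruli block (U+1C90–U+1CBA plus U+1CBD–U+1CBF) sits exactly 0xBC0 above the corresponding Mkhedruli letters. A minimal sketch (the pipeline's own implementation may differ):

```python
# Mtavruli letters sit at a fixed offset of 0xBC0 above Mkhedruli ones.
MTAVRULI_TO_MKHEDRULI = {
    cp: cp - 0xBC0
    for cp in [*range(0x1C90, 0x1CBB), *range(0x1CBD, 0x1CC0)]
}

def to_mkhedruli(text: str) -> str:
    return text.translate(MTAVRULI_TO_MKHEDRULI)

assert to_mkhedruli("ᲡᲐᲥᲐᲠᲗᲕᲔᲚᲝ") == "საქართველო"
```

On Python 3.7+ (Unicode 11 case data), `text.lower()` performs the same mapping, since Mtavruli is defined as the uppercase form of Mkhedruli; an explicit table simply avoids touching the casing of non-Georgian text.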
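For the final stage, a sketch of MinHashLSH near-duplicate removal and the 90/5/5 JSONL split using the datasketch library; `num_perm`, the similarity threshold, and the character-shingle size are all my assumptions, not the paper's settings:

```python
import json
import random
from datasketch import MinHash, MinHashLSH  # pip install datasketch

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """MinHash over character 5-gram shingles (shingle size is an assumption)."""
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - 4, 1)):
        m.update(text[i:i + 5].encode("utf-8"))
    return m

docs = ["..."]  # assumed: the URL-deduplicated document texts

lsh = MinHashLSH(threshold=0.85, num_perm=128)  # threshold is illustrative
unique_docs = []
for doc_id, text in enumerate(docs):
    m = minhash(text)
    if not lsh.query(m):  # no near-duplicate indexed yet
        lsh.insert(str(doc_id), m)
        unique_docs.append(text)

# 90/5/5 train/val/test split, written as JSONL.
random.seed(0)
random.shuffle(unique_docs)
n = len(unique_docs)
splits = {
    "train": unique_docs[:int(0.90 * n)],
    "val": unique_docs[int(0.90 * n):int(0.95 * n)],
    "test": unique_docs[int(0.95 * n):],
}
for name, rows in splits.items():
    with open(f"{name}.jsonl", "w", encoding="utf-8") as f:
        for text in rows:
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
```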
90% of the collected data is unique content not present in CulturaX.
Resources
- Paper PDF
- Submitted to ACL ARR February 2024