Creating a Corpus for Georgian Language Modelling
Georgian is not just low-resource; for a living language spoken by millions, it is remarkably under-researched. Existing large language models handle it poorly: GPT-3's tokenizer spends 2–3 BPE tokens on every single Georgian character, compared with roughly one token per common English word.
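As a quick illustration (not from the paper), the effect is easy to reproduce with OpenAI's tiktoken library and the GPT-3-era r50k_base encoding; exact counts vary by text, but Georgian reliably costs several tokens per character:

```python
import tiktoken  # pip install tiktoken

# r50k_base is the encoding used by the original GPT-3 models.
enc = tiktoken.get_encoding("r50k_base")

georgian = "გამარჯობა"  # "hello" in Georgian, 9 characters
english = "hello"

print(len(enc.encode(georgian)))  # typically 2-3 BPE tokens per character
print(len(enc.encode(english)))   # a single token for a common English word
```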
To address the absence of well-organized training data, we built a complete data collection and processing pipeline and published an initial 37GB clean corpus on HuggingFace.
Pipeline
Data flows through four stages:
- Collection — site-specific web crawlers and PDF extraction (via PyMuPDF with custom Georgian encoding normalization) across 25 Georgian domains, plus CulturaX's Georgian subset (extraction sketch after this list).
- Metric-based filtering — Georgian character ratio, repetition/stopword/flagged-word ratios, and language detection via FastText (filtering sketch after this list).
- Cleaning — PII anonymization and Mtavruli→Mkhedruli character normalization (normalization sketch after this list).
- Deduplication & splitting — URL deduplication followed by MinHashLSH near-duplicate removal, then a 90/5/5 train/val/test split in JSONL format (dedup sketch after this list).
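For the collection stage, here is a minimal extraction sketch with PyMuPDF, which the pipeline names. The custom Georgian encoding normalization is project-specific and not reproduced here, so `normalize_encoding` below is only a hypothetical placeholder:

```python
import fitz  # PyMuPDF: pip install pymupdf

def extract_pdf_text(path: str) -> str:
    """Concatenate the plain text of every page in a PDF."""
    with fitz.open(path) as doc:
        return "\n".join(page.get_text("text") for page in doc)

raw = extract_pdf_text("document.pdf")  # "document.pdf" is a placeholder path
# The pipeline's custom Georgian encoding normalization (e.g. remapping
# legacy-font code points to proper Georgian letters) is not shown;
# normalize_encoding is a hypothetical name for that step.
# text = normalize_encoding(raw)
```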
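For the filtering stage, a sketch assuming FastText's public lid.176.bin language-ID model; the character-ratio and confidence thresholds here are illustrative choices of mine, not the pipeline's actual values:

```python
import fasttext  # pip install fasttext
# lid.176.bin: https://fasttext.cc/docs/en/language-identification.html

lid_model = fasttext.load_model("lid.176.bin")

def georgian_char_ratio(text: str) -> float:
    """Share of alphabetic characters in the Georgian Unicode blocks."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    georgian = sum("\u10d0" <= c <= "\u10ff" or "\u1c90" <= c <= "\u1cbf"
                   for c in letters)
    return georgian / len(letters)

def keep_document(text: str, min_ratio: float = 0.8, min_conf: float = 0.7) -> bool:
    """Illustrative thresholds, not the pipeline's actual values."""
    if georgian_char_ratio(text) < min_ratio:
        return False
    # FastText's predict expects a single line of text.
    labels, probs = lid_model.predict(text.replace("\n", " "))
    return labels[0] == "__label__ka" and probs[0] >= min_conf
```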
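For the cleaning stage, Mtavruli→Mkhedruli normalization comes down to a fixed code-point shift: the Mtavruli block (U+1C90–U+1CBA plus U+1CBD–U+1CBF) sits exactly 0xBC0 above the corresponding Mkhedruli letters. A minimal sketch (the pipeline's own implementation may differ):

```python
# Mtavruli letters sit at a fixed offset of 0xBC0 above Mkhedruli ones.
MTAVRULI_TO_MKHEDRULI = {
    cp: cp - 0xBC0
    for cp in [*range(0x1C90, 0x1CBB), *range(0x1CBD, 0x1CC0)]
}

def to_mkhedruli(text: str) -> str:
    return text.translate(MTAVRULI_TO_MKHEDRULI)

assert to_mkhedruli("ᲡᲐᲥᲐᲠᲗᲕᲔᲚᲝ") == "საქართველო"
```

On Python 3.7+ (Unicode 11 case data), `text.lower()` performs the same mapping, since Mtavruli is defined as the uppercase form of Mkhedruli; an explicit table simply avoids touching the casing of non-Georgian text.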
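For the final stage, a sketch of MinHashLSH near-duplicate removal and the 90/5/5 JSONL split using the datasketch library; `num_perm`, the similarity threshold, and the character-shingle size are all my assumptions, not the paper's settings:

```python
import json
import random
from datasketch import MinHash, MinHashLSH  # pip install datasketch

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """MinHash over character 5-gram shingles (shingle size is an assumption)."""
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - 4, 1)):
        m.update(text[i:i + 5].encode("utf-8"))
    return m

docs = ["..."]  # assumed: the URL-deduplicated document texts

lsh = MinHashLSH(threshold=0.85, num_perm=128)  # threshold is illustrative
unique_docs = []
for doc_id, text in enumerate(docs):
    m = minhash(text)
    if not lsh.query(m):  # no near-duplicate indexed yet
        lsh.insert(str(doc_id), m)
        unique_docs.append(text)

# 90/5/5 train/val/test split, written as JSONL.
random.seed(0)
random.shuffle(unique_docs)
n = len(unique_docs)
splits = {
    "train": unique_docs[:int(0.90 * n)],
    "val": unique_docs[int(0.90 * n):int(0.95 * n)],
    "test": unique_docs[int(0.95 * n):],
}
for name, rows in splits.items():
    with open(f"{name}.jsonl", "w", encoding="utf-8") as f:
        for text in rows:
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
```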
90% of the collected data is unique content not present in CulturaX.
Resources
- Paper PDF
- Submitted to ACL ARR February 2024