Open Source · AI & NLP
Kurdish AI Corpus
Large-scale multi-source Kurdish dataset built for training AI language models. 1.8M documents, 625M tokens, four dialects.
PythonHuggingFaceNLP
Features
625M Tokens · 1.8M Documents
Sourced from Wikipedia, web and news sites.
4 Dialects
Covers Sorani, Kurmanji, Zazaki, and Hawrami.
Quality Scoring
Every document scored 0–100 to enable clean, filtered model training.
Open License
Published under CC-BY 4.0.
Challenge & Solution
The Challenge
Kurdish has multiple dialects with very limited and fragmented public text data. Existing datasets are too small or cover only one dialect.
The Solution
Built a collection pipeline gathering documents from news, Wikipedia, and the web - with automated quality scoring.
Have an idea. Let's build it.
Next Project