Skip to main content
Open Source · AI & NLP

Kurdish AI Corpus

Large-scale multi-source Kurdish dataset built for training AI language models. 1.8M documents, 625M tokens, four dialects.

PythonHuggingFaceNLP
Kurdish AI Corpus

Features

625M Tokens · 1.8M Documents

Sourced from Wikipedia, web and news sites.

4 Dialects

Covers Sorani, Kurmanji, Zazaki, and Hawrami.

Quality Scoring

Every document scored 0–100 to enable clean, filtered model training.

Open License

Published under CC-BY 4.0.

Challenge & Solution

The Challenge

Kurdish has multiple dialects with very limited and fragmented public text data. Existing datasets are too small or cover only one dialect.

The Solution

Built a collection pipeline gathering documents from news, Wikipedia, and the web - with automated quality scoring.

Have an idea. Let's build it.

Next Project