Open Source · AI & NLP

Kurdish AI Corpus

Large-scale multi-source Kurdish dataset built for training AI language models. 1.8M documents, 625M tokens, four dialects.

Python · HuggingFace · NLP

Features

Sourced from Wikipedia, the open web, and news sites.

Covers Sorani, Kurmanji, Zazaki, and Hawrami.

Every document is given a quality score from 0 to 100, so training sets can be filtered down to clean text.

Published under CC-BY 4.0.
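
The quality score and dialect coverage listed above suggest a simple filtering step before training. The following is a minimal sketch of that idea, not the dataset's actual loader: the `Document` class, the `dialect` and `quality_score` field names, and the threshold of 70 are all assumptions made for illustration, and the sample texts are placeholders.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Document:
    """Hypothetical record layout; the real column names may differ."""
    text: str
    dialect: str        # assumed label, e.g. "Sorani"
    quality_score: int  # assumed to be an integer in the range 0-100


def filter_documents(
    docs: Iterable[Document],
    min_score: int = 70,                      # illustrative threshold
    dialects: Optional[Iterable[str]] = None,
) -> List[Document]:
    """Keep documents at or above min_score, optionally limited to some dialects."""
    if not 0 <= min_score <= 100:
        raise ValueError("min_score must be between 0 and 100")
    wanted = set(dialects) if dialects is not None else None
    return [
        d for d in docs
        if d.quality_score >= min_score
        and (wanted is None or d.dialect in wanted)
    ]


# Placeholder records, not real corpus entries.
sample = [
    Document("placeholder text A", "Sorani", 92),
    Document("placeholder text B", "Kurmanji", 41),
    Document("placeholder text C", "Zazaki", 78),
    Document("placeholder text D", "Hawrami", 65),
]

kept = filter_documents(sample, min_score=70)
hawrami_only = filter_documents(sample, min_score=60, dialects=["Hawrami"])
```

A higher threshold trades corpus size for cleaner text, so the right cutoff depends on how much data a given dialect has to spare.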

Kurdish has multiple dialects, and public text data for them is scarce and fragmented. Existing datasets are either too small or cover only one dialect.

Built a collection pipeline that gathers documents from news sites, Wikipedia, and the web, with automated quality scoring.
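
The overall shape of such a pipeline, merging several sources and attaching a score to each document, can be sketched as follows. This is a toy illustration only: the page does not describe the real scoring method, so the `quality_score` heuristic here (share of letter and whitespace characters, scaled by a length factor) is an invented stand-in, and the function names, record keys, and the 200-character length target are all hypothetical.

```python
from typing import Dict, List


def quality_score(text: str) -> int:
    """Toy heuristic returning 0-100. NOT the scoring used for the corpus."""
    stripped = text.strip()
    if not stripped:
        return 0
    # Share of characters that are letters or whitespace.
    clean = sum(ch.isalpha() or ch.isspace() for ch in stripped)
    clean_ratio = clean / len(stripped)
    # Penalize very short snippets; 200 characters is an arbitrary target.
    length_factor = min(len(stripped) / 200, 1.0)
    return round(100 * clean_ratio * length_factor)


def build_corpus(sources: Dict[str, List[str]]) -> List[dict]:
    """Merge texts from several named sources and attach a score to each."""
    corpus = []
    for source, texts in sources.items():
        for text in texts:
            normalized = " ".join(text.split())  # collapse runs of whitespace
            if not normalized:
                continue  # skip empty documents
            corpus.append({
                "source": source,
                "text": normalized,
                "quality_score": quality_score(normalized),
            })
    return corpus


# Placeholder inputs standing in for scraped pages.
corpus = build_corpus({
    "wikipedia": ["a" * 200],
    "news": ["short   snippet", ""],
    "web": ["12345 !!! ###"],
})
```

A production scorer would need to handle Kurdish-specific details that this sketch ignores, such as the zero-width non-joiner used in Sorani orthography, which `str.isalpha` does not count as a letter.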
