The Semantic Web Lab (SWlab) at the University of Zakho has published a research paper on Kurdish Natural Language Processing (KNLP) in Digital Scholarship in the Humanities, a journal published by Oxford University Press.
Authored by Dastan Maulud, Karwan Jacksi, and Ismail Ali, the paper, titled “A hybrid part-of-speech tagger with annotated Kurdish corpus: advancements in POS tagging,” introduces the DASTAN corpus, the first part-of-speech (POS) annotated corpus for the Sorani Kurdish dialect.
POS tagging, a fundamental task in NLP, assigns a grammatical category (e.g., noun, verb, adjective) to each word in a sentence. It underpins a wide range of NLP applications, including speech recognition, natural language parsing, information retrieval, and multiword term extraction.
The research addresses the limitations of existing rule-based approaches to POS tagging with a hybrid tagger that combines a bigram hidden Markov model with rules designed specifically for Kurdish. The hybrid tagger targets two key challenges: misclassified words and words left unanalyzed because of ambiguity.
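To make the idea of a hybrid tagger concrete, the following is a minimal sketch and not the authors' implementation. It trains a bigram hidden Markov model on a tiny, hypothetical English toy corpus (not taken from DASTAN), decodes with the Viterbi algorithm, and falls back to a hand-written rule for words the model has never seen. The tag names, the suffix rule in `rule_guess`, and the choice to let the rule restrict the candidate tags of unknown words are all assumptions made for illustration; the paper's actual rules and tag set may differ.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data; NOT drawn from the DASTAN corpus.
TRAIN = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]

START = "<s>"  # pseudo-tag marking the start of a sentence


def train(sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    for sentence in sentences:
        prev = START
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return trans, emit


def rule_guess(word):
    """Hypothetical rule for unknown words (assumption, not from the paper)."""
    return "VERB" if word.endswith("s") else "NOUN"


def smoothed_log(counter, key, n_outcomes):
    """Add-one smoothed log probability of `key` under `counter`."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def tag(words, trans, emit):
    """Viterbi decoding; unknown words are constrained to the rule's tag."""
    if not words:
        return []
    tags = sorted(emit)
    vocab = {w for counts in emit.values() for w in counts}

    def emission(t, word):
        if word not in vocab:
            # Rule-based fallback: only the tag proposed by the rule is allowed.
            return 0.0 if t == rule_guess(word) else float("-inf")
        return smoothed_log(emit[t], word, len(vocab))

    # best[i][t] = (log score of best path ending in tag t at word i, previous tag)
    best = [{t: (smoothed_log(trans[START], t, len(tags)) + emission(t, words[0]), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            e = emission(t, words[i])
            column[t] = max(
                (best[i - 1][p][0] + smoothed_log(trans[p], t, len(tags)) + e, p)
                for p in tags
            )
        best.append(column)

    # Follow back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return list(zip(words, reversed(path)))


trans, emit = train(TRAIN)
print(tag(["the", "bird", "jumps"], trans, emit))
```

In this sketch the statistical model handles words it has seen in training, while the rule decides the tag of out-of-vocabulary words such as "bird" and "jumps", which a pure HMM could only guess from smoothing. A real system would need far richer, language-specific rules and a proper treatment of ambiguous forms.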
The DASTAN corpus contains 74,258 words annotated with 38 tags, giving the KNLP community a resource for developing and improving Kurdish NLP applications.
The publication in Digital Scholarship in the Humanities reflects the quality of the research conducted by the SWlab at the University of Zakho and its contributions to Kurdish language technologies.
For further details and access to the full paper, please refer to:
- Publication: Digital Scholarship in the Humanities
- Volume: 38
- Issue: 4
- Pages: 1604-1612
- Authors: Dastan Maulud, Karwan Jacksi, Ismail Ali
- Publisher: Oxford University Press