A Human-in-the-Loop Morphological Framework for Sindhi Language (Research Paper)
A Human-in-the-Loop Morphological Framework for Sindhi
From Imperative–Infinitive Structures to Scalable Lemmatization and Rule-Based Language Engineering
| Author | Amar Fayaz Buriro |
| Field | Computational Linguistics / Sindhi NLP |
| Language | Sindhi |
| Core Idea | Human annotation + rule-based morphology |
| Key Resources | dic.sindhila.edu.pk, sindhilanguage.org |
Abstract
Sindhi, a morphologically rich Indo-Aryan language, remains significantly underrepresented in computational linguistics due to the scarcity of structured datasets and morphology-aware processing frameworks. This article presents a human-in-the-loop morphological annotation and rule extraction framework aimed at systematically modeling Sindhi verb morphology.
The approach is grounded in the relationship between imperative forms such as لک and infinitive forms such as لکڻ. These forms function as foundational anchors for morphological abstraction, lemmatization, and rule-based generation.
The framework supports collaborative annotation, conflict detection, pattern extraction, and future integration with Universal Dependencies. It establishes a pathway toward Sindhi lemmatizers, morphological analyzers, spell checkers, OCR post-correction systems, and AI-ready linguistic datasets.
1. Introduction
Natural Language Processing has advanced rapidly for high-resource languages, yet many linguistically rich languages remain computationally underrepresented. Sindhi is one such language. It possesses a long literary tradition, a complex grammatical system, and highly productive verbal morphology, but it still lacks large-scale annotated corpora and morphology-aware computational tools.
A single Sindhi verbal root may generate hundreds or even thousands of surface forms. For example, the imperative لک and infinitive لکڻ can generate forms such as لکندومان, لکنديمانس, and لکرائينديمانس. These forms encode tense, aspect, person, number, and object relations within compact orthographic units.
2. Linguistic Foundations of Sindhi Verb Morphology
Sindhi verb morphology is built around the relationship between the imperative and the infinitive. The imperative often represents the operational root, while the infinitive functions as the lemma or dictionary form.
- لک → لکڻ
- مار → مارڻ
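For the two pairs listed above, the infinitive is the imperative followed by the letter ڻ. The following minimal Python sketch expresses that mapping in both directions. The function names are hypothetical, and the single-suffix rule is an assumption that holds for these examples; it is not a claim about every Sindhi verb.

```python
# Assumption: the infinitive is the imperative plus the suffix ڻ,
# as in the two pairs quoted in the text. Irregular verbs are not handled.
INFINITIVE_SUFFIX = "ڻ"


def to_infinitive(imperative: str) -> str:
    """Return the dictionary form (lemma) for an imperative root."""
    return imperative + INFINITIVE_SUFFIX


def to_imperative(infinitive: str) -> str:
    """Recover the imperative root from an infinitive ending in ڻ."""
    if not infinitive.endswith(INFINITIVE_SUFFIX):
        raise ValueError(f"not an infinitive ending in the expected suffix: {infinitive}")
    return infinitive[: -len(INFINITIVE_SUFFIX)]


# Build the imperative -> infinitive pairs for the two roots in the text.
PAIRS = {root: to_infinitive(root) for root in ("لک", "مار")}

for root, lemma in PAIRS.items():
    print(root, "->", lemma)
```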
The same morphological patterns can be applied across multiple roots. For example, patterns derived from لک can be applied to مار, producing comparable forms such as ماريندومان and مارينديمانس.
2.1 Suffix Chains
Sindhi verbal forms frequently contain suffix chains that encode grammatical information. In لکرائينديمانس, for example, a chain of suffixes follows the root لک.
Such structures demonstrate that Sindhi verbs are not isolated word forms but layered morphological constructions.
3. Methodology: Human-in-the-Loop Annotation
Because Sindhi lacks large pre-annotated corpora, the framework adopts a human-in-the-loop approach. Native speakers and linguistic contributors annotate words through a web-based system. Each word may be tagged for part-of-speech, gender, number, tense, case, and person.
The system supports multi-label tagging because some Sindhi words may simultaneously carry more than one grammatical role. For example, a form may function as both a verb and a pronominal expression.
3.1 Annotation Categories
- Part of Speech: verb, pronoun, noun, etc.
- Gender: masculine, feminine, neuter/not applicable
- Number: singular, plural, not applicable
- Tense: past, present, future, not applicable
- Case: nominative, accusative, genitive, vocative, etc.
- Person/Pronoun: first person, second person, third person, speaker/addressee/other
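One possible way to represent a single contributor's tags for one word is sketched below. The `Annotation` class, the `ALLOWED` table, and `validate` are hypothetical names, not the project's actual schema. Only the values named in the list above are included, part of speech is a set so that a word can carry more than one label, and `None` stands for "not applicable".

```python
from dataclasses import dataclass, field
from typing import Optional

# Controlled vocabularies taken from the categories listed in the text.
# The source lists end with "etc.", so these sets are deliberately incomplete.
ALLOWED = {
    "pos": {"verb", "pronoun", "noun"},
    "gender": {"masculine", "feminine"},
    "number": {"singular", "plural"},
    "tense": {"past", "present", "future"},
    "case": {"nominative", "accusative", "genitive", "vocative"},
    "person": {"first person", "second person", "third person"},
}


@dataclass
class Annotation:
    """One contributor's tags for one word (hypothetical schema)."""
    word: str
    annotator: str
    pos: set[str] = field(default_factory=set)  # multi-label
    gender: Optional[str] = None  # None means "not applicable"
    number: Optional[str] = None
    tense: Optional[str] = None
    case: Optional[str] = None
    person: Optional[str] = None


def validate(a: Annotation) -> list[str]:
    """Return a list of problems; an empty list means the record is acceptable."""
    problems = []
    if not a.pos:
        problems.append("pos: at least one label is required")
    for label in sorted(a.pos - ALLOWED["pos"]):
        problems.append(f"pos: unknown label {label!r}")
    for name in ("gender", "number", "tense", "case", "person"):
        value = getattr(a, name)
        if value is not None and value not in ALLOWED[name]:
            problems.append(f"{name}: unknown value {value!r}")
    return problems


# Tag values here are placeholders, not a verified linguistic analysis.
example = Annotation(word="لکندومان", annotator="user_a", pos={"verb", "pronoun"})
print(validate(example))
```

Keeping validation separate from storage means a record with an unexpected value can still be saved and flagged for review rather than rejected outright.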
4. From Annotation to Pattern Extraction
The central transition in this project is from word-level annotation to pattern-level abstraction. Once enough users annotate the same lexical items, the system can identify recurring morphological structures and convert them into reusable rules.
For example, if a suffix pattern is observed in forms derived from لک, and the same pattern appears in forms derived from مار, the system can treat it as a reusable morphological template rather than an isolated form.
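The step from annotated forms to reusable templates can be sketched as follows, using only forms quoted in this article. The function name `extract_templates` and the `min_roots` threshold are assumptions for illustration; the sketch compares suffixes by exact string match after stripping the root, which is one of several possible criteria.

```python
from collections import defaultdict

# Forms quoted in the article, grouped by imperative root.
FORMS = {
    "لک": ["لکڻ", "لکندومان", "لکنديمانس"],
    "مار": ["مارڻ", "ماريندومان", "مارينديمانس"],
}


def extract_templates(forms_by_root, min_roots=2):
    """Map each suffix to the roots it was seen with, keeping only
    suffixes attested with at least `min_roots` distinct roots."""
    roots_by_suffix = defaultdict(set)
    for root, forms in forms_by_root.items():
        for form in forms:
            # Only forms that extend the root contribute a suffix.
            if form.startswith(root) and len(form) > len(root):
                roots_by_suffix[form[len(root):]].add(root)
    return {s: r for s, r in roots_by_suffix.items() if len(r) >= min_roots}


templates = extract_templates(FORMS)                # shared across roots
candidates = extract_templates(FORMS, min_roots=1)  # everything observed

for suffix, roots in templates.items():
    print(suffix, sorted(roots))
```

On these six forms, exact matching promotes the infinitive ending, which both roots share. The longer endings do not match character for character between لک and مار, so they stay in `candidates` until more forms are attested or a normalization step, which the article does not specify, is applied.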
5. Scaling Through Combinatorial Morphology
The power of this framework lies in combinatorial expansion. If 3,000 imperative–infinitive pairs are collected and approximately 700–800 reliable patterns are identified, the system can generate millions of valid Sindhi verbal forms.
When person markers, object markers, causative forms, tense variations, and dialectal variants are added, this space may scale toward tens of millions of morphologically meaningful forms.
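The "millions" figure follows from simple multiplication. The sketch below computes an upper bound under the assumption that every pattern applies to every root, which will overcount wherever a pattern is restricted to certain verb classes. The multipliers for person markers, object markers, causatives, and dialectal variants are not stated in the text, so they are left out.

```python
# Figures from the text.
ROOT_PAIRS = 3_000
PATTERNS_LOW, PATTERNS_HIGH = 700, 800


def upper_bound(roots: int, patterns: int) -> int:
    """Number of root/pattern combinations if every pattern fits every root."""
    return roots * patterns


low = upper_bound(ROOT_PAIRS, PATTERNS_LOW)
high = upper_bound(ROOT_PAIRS, PATTERNS_HIGH)
print(f"{low:,} to {high:,} candidate forms")  # 2,100,000 to 2,400,000
```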
6. System Architecture
The framework is designed as a modular web-based architecture. It contains an annotation interface, user management system, storage layer, result engine, conflict detection mechanism, and future rule engine.
- Annotation Layer: users tag words through controlled inputs.
- Storage Layer: each user’s annotations are preserved separately.
- Result Engine: tagged words are merged and conflicts are highlighted.
- Rule Engine: future module for automatic suggestion and generation.
The design is non-destructive: raw annotations remain intact, while merged results and rule-based outputs are generated as additional layers.
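A minimal sketch of the result engine's merge step, under assumptions: raw annotations are stored as `(annotator, word_id, field, value)` tuples, and any disagreement on a field counts as a conflict. The tuple layout, the `merge` function, and the sample rows are hypothetical. The raw list is only read, never modified, which mirrors the non-destructive design described above.

```python
from collections import Counter, defaultdict

# Raw layer: one row per contributor decision. Placeholder data.
RAW = [
    ("user_a", "w1", "number", "singular"),
    ("user_b", "w1", "number", "singular"),
    ("user_c", "w1", "number", "plural"),
    ("user_a", "w1", "tense", "future"),
    ("user_b", "w1", "tense", "future"),
]


def merge(raw):
    """Return (merged, conflicts) as separate layers derived from `raw`."""
    votes = defaultdict(Counter)
    for _annotator, word, field_name, value in raw:
        votes[(word, field_name)][value] += 1

    merged, conflicts = {}, {}
    for key, counter in votes.items():
        if len(counter) == 1:
            # Every contributor chose the same value.
            merged[key] = next(iter(counter))
        else:
            # Keep the full vote breakdown so reviewers can see the split.
            conflicts[key] = dict(counter)
    return merged, conflicts


merged, conflicts = merge(RAW)
print("agreed:", merged)
print("conflicts:", conflicts)
```

A majority-vote rule could replace the unanimity rule used here; the choice depends on how much disagreement the project wants to surface for review.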
7. Applications and Implications
The framework has multiple applications in Sindhi language technology. It can support lemmatization, morphological analysis, morphological generation, spell checking, grammar correction, OCR post-processing, and AI-based language modeling.
7.1 Lemmatization
Inflected forms can be mapped back to their lemmas:
- لکرائينديمانس → لکڻ
- ماريندومان → مارڻ
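The two mappings above can be reproduced with a simple lookup that finds the longest known root at the start of a surface form. This is a sketch under assumptions: the `LEXICON` holds only the two pairs from the text, and `lemmatize` is a hypothetical helper. Prefix matching alone accepts any string that begins with a known root, so a fuller system would also check the remainder against validated suffix patterns.

```python
# Imperative root -> infinitive lemma, from the pairs given in the text.
LEXICON = {"لک": "لکڻ", "مار": "مارڻ"}


def lemmatize(form, lexicon=LEXICON):
    """Return the lemma of the longest known root that prefixes `form`,
    or None when no root matches."""
    for root in sorted(lexicon, key=len, reverse=True):
        if form.startswith(root):
            return lexicon[root]
    return None


for form in ("لکرائينديمانس", "ماريندومان"):
    print(form, "->", lemmatize(form))
```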
7.2 OCR Post-Correction
Sindhi OCR often produces merged, broken, or malformed word forms. A morphology-aware system can identify whether a word form is valid and suggest corrections based on known roots and patterns.
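As a rough illustration of validity checking and suggestion, the sketch below treats a small list of forms from this article as the set of valid words and uses Python's standard `difflib.get_close_matches` to rank candidates for a damaged input. This is a stand-in for morphology-aware correction, not the framework's method: similarity here is purely string-based, and the damaged input is simulated by deleting one character.

```python
import difflib

# Verbal forms quoted in the article, used here as the "valid" inventory.
KNOWN_FORMS = [
    "لکندومان",
    "لکنديمانس",
    "لکرائينديمانس",
    "ماريندومان",
    "مارينديمانس",
]


def is_valid(word, known=KNOWN_FORMS):
    """A word is valid if it appears in the known inventory."""
    return word in known


def suggest(word, known=KNOWN_FORMS, n=3, cutoff=0.6):
    """Return the word itself if valid, else up to `n` similar known forms."""
    if is_valid(word, known):
        return [word]
    return difflib.get_close_matches(word, known, n=n, cutoff=cutoff)


# Simulated OCR damage: drop the fourth character of a known form.
damaged = KNOWN_FORMS[0][:3] + KNOWN_FORMS[0][4:]
print(is_valid(damaged), suggest(damaged))
```

In a rule-based setting, the inventory would be generated from roots and patterns rather than listed by hand, and candidates could be filtered by whether their suffix is a known pattern.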
7.3 Universal Dependencies
The annotation categories can be aligned with Universal Dependencies features, enabling the future development of Sindhi treebanks and cross-linguistic NLP resources.
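One way such an alignment might look is sketched below. The mapping tables pair the labels from Section 3.1 with standard Universal Dependencies tags and feature values; the pairing itself and the names `UD_UPOS`, `UD_FEATURES`, and `to_ud_feats` are assumptions for illustration. Labels without a mapping, including "not applicable", produce no feature.

```python
# Assumed mapping from annotation labels to UD part-of-speech tags.
UD_UPOS = {"verb": "VERB", "pronoun": "PRON", "noun": "NOUN"}

# Assumed mapping from annotation categories to UD feature names and values.
UD_FEATURES = {
    "gender": ("Gender", {"masculine": "Masc", "feminine": "Fem"}),
    "number": ("Number", {"singular": "Sing", "plural": "Plur"}),
    "tense": ("Tense", {"past": "Past", "present": "Pres", "future": "Fut"}),
    "case": ("Case", {"nominative": "Nom", "accusative": "Acc",
                      "genitive": "Gen", "vocative": "Voc"}),
    "person": ("Person", {"first person": "1", "second person": "2",
                          "third person": "3"}),
}


def to_ud_feats(tags):
    """Build a CoNLL-U style FEATS string, or '_' when no feature applies."""
    pairs = []
    for category, label in tags.items():
        if category not in UD_FEATURES:
            continue
        name, values = UD_FEATURES[category]
        if label in values:
            pairs.append(f"{name}={values[label]}")
    # CoNLL-U lists features in alphabetical order, separated by '|'.
    return "|".join(sorted(pairs)) or "_"


# Placeholder tag set, not an analysis of a specific Sindhi word.
tags = {"gender": "masculine", "number": "singular",
        "tense": "future", "person": "first person"}
print(UD_UPOS["verb"], to_ud_feats(tags))
```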
8. Discussion
The framework represents a hybrid model in which human linguistic knowledge and computational rule extraction work together. In low-resource languages, human knowledge is not a fallback; it is the foundational resource from which reliable computational systems can grow.
Rule-based morphology is especially valuable for Sindhi because the language exhibits high regularity in verbal patterns. At the same time, the system allows for ambiguity, variation, and disagreement by preserving multiple user annotations and highlighting conflicts.
The project therefore shifts the focus from isolated dataset creation to the construction of a broader linguistic infrastructure for Sindhi.
9. References
- Beesley, K. R., & Karttunen, L. (2003). Finite State Morphology. CSLI Publications.
- Jurafsky, D., & Martin, J. H. (2023). Speech and Language Processing.
- Nivre, J., et al. (2016). Universal Dependencies v1: A multilingual treebank collection.
- Cotterell, R., et al. (2017). SIGMORPHON Shared Task on Morphological Reinflection.
- Haspelmath, M. (2002). Understanding Morphology. Arnold.
- Aronoff, M., & Fudeman, K. (2011). What is Morphology? Wiley-Blackwell.
- Sindhi Language Authority. Comprehensive Sindhi Dictionary. dic.sindhila.edu.pk
- Buriro, A. F. Sindhi Morphological Annotation Dataset and Language Engineering Resources. sindhilanguage.org