May 20, 2026 · English · Research Paper

A Human-in-the-Loop Morphological Framework for Sindhi Language (Research Paper)

A Human-in-the-Loop Morphological Framework for Sindhi

From Imperative–Infinitive Structures to Scalable Lemmatization and Rule-Based Language Engineering

Research Overview
Author: Amar Fayaz Buriro
Field: Computational Linguistics / Sindhi NLP
Language: Sindhi
Core Idea: Human annotation + rule-based morphology
Key Resources: dic.sindhila.edu.pk, sindhilanguage.org

Abstract

Sindhi, a morphologically rich Indo-Aryan language, remains significantly underrepresented in computational linguistics due to the scarcity of structured datasets and morphology-aware processing frameworks. This article presents a human-in-the-loop morphological annotation and rule extraction framework aimed at systematically modeling Sindhi verb morphology.

The approach is grounded in the relationship between imperative forms such as لک and infinitive forms such as لکڻ. These forms function as foundational anchors for morphological abstraction, lemmatization, and rule-based generation.

The framework supports collaborative annotation, conflict detection, pattern extraction, and future integration with Universal Dependencies. It establishes a pathway toward Sindhi lemmatizers, morphological analyzers, spell checkers, OCR post-correction systems, and AI-ready linguistic datasets.

1. Introduction

Natural Language Processing has advanced rapidly for high-resource languages, yet many linguistically rich languages remain computationally underrepresented. Sindhi is one such language. It possesses a long literary tradition, a complex grammatical system, and highly productive verbal morphology, but it still lacks large-scale annotated corpora and morphology-aware computational tools.

A single Sindhi verbal root may generate hundreds or even thousands of surface forms. For example, the imperative لک and infinitive لکڻ can generate forms such as لکندومان, لکنديمانس, and لکرائينديمانس. These forms encode tense, aspect, person, number, and object relations within compact orthographic units.

This project treats Sindhi morphology not merely as a dictionary problem, but as a rule-governed computational system.

2. Linguistic Foundations of Sindhi Verb Morphology

Sindhi verb morphology is built around the relationship between the imperative and the infinitive. The imperative often represents the operational root, while the infinitive functions as the lemma or dictionary form.

  • لک (imperative) → لکڻ (infinitive)
  • مار (imperative) → مارڻ (infinitive)

The same morphological patterns can be applied across multiple roots. For example, patterns derived from لک can be applied to مار, producing comparable forms such as ماريندومان and مارينديمانس.

Word Form = Root + Pattern + Suffix Set

2.1 Suffix Chains

Sindhi verbal forms frequently contain suffix chains that encode grammatical information. For example:

لکندومان = لک + ند + و + مان
لکرائينديمانس = لک + را + ئين + دي + مانس

Such structures demonstrate that Sindhi verbs are not isolated word forms but layered morphological constructions.
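
The layered structure above can be sketched as a small segmentation routine. This is a minimal illustration, not the project's implementation: the suffix inventory contains only the pieces that appear in the two segmentations quoted above, and the names SUFFIXES and segment are hypothetical.

```python
from typing import Optional

# Suffix inventory taken only from the two segmentations quoted above.
# A real inventory would be much larger (assumption).
SUFFIXES = ["ند", "و", "مان", "را", "ئين", "دي", "مانس"]


def segment(form: str, root: str, suffixes: list[str] = SUFFIXES) -> Optional[list[str]]:
    """Split `form` into [root, suffix, suffix, ...], or return None."""
    if not form.startswith(root):
        return None
    ordered = sorted(suffixes, key=len, reverse=True)  # try longest suffix first

    def split(rest: str) -> Optional[list[str]]:
        if not rest:
            return []
        for suffix in ordered:
            if rest.startswith(suffix):
                tail = split(rest[len(suffix):])
                if tail is not None:  # backtrack when the tail cannot be parsed
                    return [suffix] + tail
        return None

    chain = split(form[len(root):])
    return None if chain is None else [root] + chain


print(segment("لکندومان", "لک"))
print(segment("لکرائينديمانس", "لک"))
```

With this inventory the two calls reproduce the segmentations given above. Longest-match-first with backtracking is only one possible strategy; a finite-state analyzer would be a natural alternative once the suffix inventory grows.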

3. Methodology: Human-in-the-Loop Annotation

Because Sindhi lacks large pre-annotated corpora, the framework adopts a human-in-the-loop approach. Native speakers and linguistic contributors annotate words through a web-based system. Each word may be tagged for part of speech, gender, number, tense, case, and person.

The system supports multi-label tagging because some Sindhi words may simultaneously carry more than one grammatical role. For example, a form may function as both a verb and a pronominal expression.

3.1 Annotation Categories

  • Part of Speech: verb, pronoun, noun, etc.
  • Gender: masculine, feminine, neuter/not applicable
  • Number: singular, plural, not applicable
  • Tense: past, present, future, not applicable
  • Case: nominative, accusative, genitive, vocative, etc.
  • Person/Pronoun: first person, second person, third person, speaker/addressee/other
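
One way to hold these categories in code is a record whose fields are sets, so that multi-label tagging comes for free. The sketch below is an assumption about how such a record could look, not the schema of the actual web system. The names ALLOWED and Annotation, the placeholder word, and the annotator id are hypothetical, and the value lists are shortened where the list above says "etc.".

```python
from dataclasses import dataclass, field

# Controlled vocabularies, copied from the category list above.
ALLOWED = {
    "pos": {"verb", "pronoun", "noun"},
    "gender": {"masculine", "feminine", "neuter", "not applicable"},
    "number": {"singular", "plural", "not applicable"},
    "tense": {"past", "present", "future", "not applicable"},
    "case": {"nominative", "accusative", "genitive", "vocative"},
    "person": {"first person", "second person", "third person"},
}


@dataclass
class Annotation:
    word: str
    annotator: str
    # Each category maps to a set, so one word can carry several labels at once.
    tags: dict[str, set[str]] = field(default_factory=dict)

    def add(self, category: str, value: str) -> None:
        """Attach a label, rejecting anything outside the controlled vocabulary."""
        if value not in ALLOWED.get(category, set()):
            raise ValueError(f"unknown {category} value: {value!r}")
        self.tags.setdefault(category, set()).add(value)


# Placeholder word and annotator; the labels illustrate multi-label tagging only.
record = Annotation(word="example_form", annotator="user_1")
record.add("pos", "verb")
record.add("pos", "pronoun")
print(sorted(record.tags["pos"]))
```

Rejecting values outside the vocabulary mirrors the idea of tagging through controlled inputs, which keeps later merging and conflict detection simple.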

4. From Annotation to Pattern Extraction

The central transition in this project is from word-level annotation to pattern-level abstraction. Once enough users annotate the same lexical items, the system can identify recurring morphological structures and convert them into reusable rules.

For example, if a suffix pattern is observed in forms derived from لک, and the same pattern appears in forms derived from مار, the system can treat it as a reusable morphological template rather than an isolated form.

Root: لک   + Pattern: ندومان   = لکندومان
Root: مار   + Pattern: يندومان   = ماريندومان
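
A rough sketch of this extraction step follows, under the simplifying assumption that a pattern is whatever remains after stripping the root from the front of a form. The function names and the min_roots threshold are hypothetical. The input pairs are the four forms quoted in this article.

```python
from collections import defaultdict

# (root, surface form) pairs, using only forms quoted in this article.
PAIRS = [
    ("لک", "لکندومان"),
    ("لک", "لکنديمانس"),
    ("مار", "ماريندومان"),
    ("مار", "مارينديمانس"),
]


def extract_patterns(pairs):
    """Map each pattern (form minus root) to the set of roots it was seen with."""
    roots_by_pattern = defaultdict(set)
    for root, form in pairs:
        if form.startswith(root):
            roots_by_pattern[form[len(root):]].add(root)
    return dict(roots_by_pattern)


def reusable(roots_by_pattern, min_roots=2):
    """Keep only patterns attested with at least `min_roots` distinct roots."""
    return {p for p, roots in roots_by_pattern.items() if len(roots) >= min_roots}


patterns = extract_patterns(PAIRS)
for pattern, roots in patterns.items():
    print(pattern, sorted(roots))
print(reusable(patterns))
```

On these four forms, exact string matching finds no pattern shared by both roots, because the patterns for مار begin with ي while those for لک do not. How such variants should be grouped into one template is a design decision this sketch leaves open.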

5. Scaling Through Combinatorial Morphology

The power of this framework lies in combinatorial expansion. If 3,000 imperative–infinitive pairs are collected and approximately 700–800 reliable patterns are identified, the system can generate roughly 2.1 to 2.4 million candidate Sindhi verbal forms, assuming that each pattern can combine with each root.

3,000 roots × 800 patterns = 2.4 million base forms

When person markers, object markers, causative forms, tense variations, and dialectal variants are added, this space may scale toward tens of millions of morphologically meaningful forms.
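
The arithmetic behind these estimates, and the generation step itself, can be sketched in a few lines. The function names are hypothetical, and the sketch assumes that every pattern combines with every root, which over-generates in practice.

```python
from itertools import product


def count_base_forms(n_roots: int, n_patterns: int) -> int:
    """Upper bound on base forms when every pattern applies to every root."""
    return n_roots * n_patterns


def generate(roots, patterns):
    """Lazily yield root + pattern for every combination."""
    for root, pattern in product(roots, patterns):
        yield root + pattern


print(count_base_forms(3000, 700))  # 2100000
print(count_base_forms(3000, 800))  # 2400000
print(list(generate(["لک"], ["ندومان"])))
```

Using a generator keeps memory use flat even when the full space runs to millions of forms, since candidates can be validated or written out one at a time.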

6. System Architecture

The framework is designed as a modular web-based architecture. It comprises an annotation interface, a user management system, a storage layer, a result engine, a conflict detection mechanism, and a planned rule engine.

  • Annotation Layer: users tag words through controlled inputs.
  • Storage Layer: each user’s annotations are preserved separately.
  • Result Engine: tagged words are merged and conflicts are highlighted.
  • Rule Engine: future module for automatic suggestion and generation.

The design is non-destructive: raw annotations remain intact, while merged results and rule-based outputs are generated as additional layers.
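
The merge-and-flag behaviour of the result engine might look roughly like the sketch below. The storage layout (annotator, then word, then category) and the names merge, word_1, user_a, and user_b are assumptions made for illustration. The raw input is only read, never modified, in keeping with the non-destructive design.

```python
from collections import defaultdict


def merge(raw):
    """Merge per-user annotations and report disagreements.

    raw: {annotator: {word: {category: value}}}
    Returns (merged, conflicts) as new dictionaries; `raw` is left untouched.
    """
    votes = defaultdict(lambda: defaultdict(set))
    for words in raw.values():
        for word, tags in words.items():
            for category, value in tags.items():
                votes[word][category].add(value)

    merged, conflicts = {}, {}
    for word, categories in votes.items():
        for category, values in categories.items():
            if len(values) == 1:  # all annotators agree
                merged.setdefault(word, {})[category] = next(iter(values))
            else:  # disagreement: keep every value for review
                conflicts.setdefault(word, {})[category] = sorted(values)
    return merged, conflicts


raw = {
    "user_a": {"word_1": {"pos": "verb", "number": "singular"}},
    "user_b": {"word_1": {"pos": "verb", "number": "plural"}},
}
merged, conflicts = merge(raw)
print(merged)
print(conflicts)
```

Here the two annotators agree on part of speech and disagree on number, so the first ends up in the merged layer and the second in the conflict layer.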

7. Applications and Implications

The framework has multiple applications in Sindhi language technology. It can support lemmatization, morphological analysis, morphological generation, spell checking, grammar correction, OCR post-processing, and AI-based language modeling.

7.1 Lemmatization

Inflected forms can be mapped back to their lemmas:

  • لکرائينديمانس → لکڻ
  • ماريندومان → مارڻ
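
A minimal lookup-based lemmatizer along these lines is sketched below. It assumes a table from imperative roots to infinitive lemmas and a set of known patterns; the names LEMMA_BY_ROOT, PATTERNS, and lemmatize are hypothetical, and the data is limited to forms quoted in this article.

```python
from typing import Optional

# Imperative root -> infinitive lemma, from the pairs given in Section 2.
LEMMA_BY_ROOT = {"لک": "لکڻ", "مار": "مارڻ"}

# Patterns attested in this article (form minus root).
PATTERNS = {"ندومان", "يندومان", "رائينديمانس"}


def lemmatize(form: str,
              lemma_by_root: dict[str, str] = LEMMA_BY_ROOT,
              patterns: set[str] = PATTERNS) -> Optional[str]:
    """Return the lemma if `form` is a known root plus a known pattern."""
    # Try longer roots first so a short root does not shadow a longer one.
    for root in sorted(lemma_by_root, key=len, reverse=True):
        if form.startswith(root) and form[len(root):] in patterns:
            return lemma_by_root[root]
    return None


print(lemmatize("لکرائينديمانس"))
print(lemmatize("ماريندومان"))
```

Returning None for unanalyzable input, rather than guessing, lets a caller route unknown forms back to human annotators.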

7.2 OCR Post-Correction

Sindhi OCR often produces merged, broken, or malformed word forms. A morphology-aware system can identify whether a word form is valid and suggest corrections based on known roots and patterns.
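
As a rough illustration, validity checking and suggestion can be prototyped with Python's standard difflib module. The list of known forms, the cutoff of 0.8, and the malformed input (a quoted form with one letter dropped) are assumptions chosen for the example, not measured OCR output.

```python
import difflib

# In practice this list would come from roots x patterns; here it holds
# only forms quoted in this article.
KNOWN_FORMS = ["لکندومان", "لکنديمانس", "ماريندومان", "مارينديمانس"]


def is_valid(word: str, known=KNOWN_FORMS) -> bool:
    """A word is accepted only if it is a known generated form."""
    return word in known


def suggest(word: str, known=KNOWN_FORMS, n: int = 3, cutoff: float = 0.8):
    """Return up to `n` known forms similar to `word`, best match first."""
    return difflib.get_close_matches(word, known, n=n, cutoff=cutoff)


broken = "لکندمان"  # hypothetical OCR output: لکندومان with one letter missing
print(is_valid(broken))
print(suggest(broken))
```

String similarity alone ignores morphological structure. A fuller system could first segment the word and then correct the root and the pattern separately.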

7.3 Universal Dependencies

The annotation categories can be aligned with Universal Dependencies features, enabling the future development of Sindhi treebanks and cross-linguistic NLP resources.
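
One possible alignment is sketched below. The feature names and values (Gender=Masc, Number=Sing, Tense=Fut, Case=Nom, Person=1, and so on) are standard Universal Dependencies features, while the mapping table, the function name to_ud_feats, and the choice to drop "not applicable" are assumptions. The output follows the CoNLL-U convention of sorting features by name and joining them with "|".

```python
# Annotation category -> (UD feature name, {annotation value: UD value})
UD_FEATURES = {
    "gender": ("Gender", {"masculine": "Masc", "feminine": "Fem", "neuter": "Neut"}),
    "number": ("Number", {"singular": "Sing", "plural": "Plur"}),
    "tense": ("Tense", {"past": "Past", "present": "Pres", "future": "Fut"}),
    "case": ("Case", {"nominative": "Nom", "accusative": "Acc",
                      "genitive": "Gen", "vocative": "Voc"}),
    "person": ("Person", {"first person": "1", "second person": "2",
                          "third person": "3"}),
}


def to_ud_feats(tags: dict[str, str]) -> str:
    """Convert one annotation into a CoNLL-U FEATS string ("_" if empty)."""
    pairs = []
    for category, value in tags.items():
        if category in UD_FEATURES:
            name, values = UD_FEATURES[category]
            if value in values:  # "not applicable" and unknown values are dropped
                pairs.append((name, values[value]))
    pairs.sort(key=lambda pair: pair[0].lower())
    return "|".join(f"{name}={value}" for name, value in pairs) or "_"


example = {"tense": "future", "person": "first person",
           "number": "singular", "gender": "masculine"}
print(to_ud_feats(example))  # Gender=Masc|Number=Sing|Person=1|Tense=Fut
```

The example dictionary is an abstract annotation rather than the analysis of a particular Sindhi word.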

8. Discussion

The framework represents a hybrid model in which human linguistic knowledge and computational rule extraction work together. In low-resource languages, human knowledge is not a fallback; it is the foundational resource from which reliable computational systems can grow.

Rule-based morphology is especially valuable for Sindhi because the language exhibits high regularity in verbal patterns. At the same time, the system allows for ambiguity, variation, and disagreement by preserving multiple user annotations and highlighting conflicts.

The project therefore shifts the focus from isolated dataset creation to the construction of a broader linguistic infrastructure for Sindhi.

9. References

  1. Beesley, K. R., & Karttunen, L. (2003). Finite State Morphology. CSLI Publications.
  2. Jurafsky, D., & Martin, J. H. (2023). Speech and Language Processing.
  3. Nivre, J., et al. (2016). Universal Dependencies v1: A multilingual treebank collection.
  4. Cotterell, R., et al. (2017). SIGMORPHON Shared Task on Morphological Reinflection.
  5. Haspelmath, M. (2002). Understanding Morphology. Arnold.
  6. Aronoff, M., & Fudeman, K. (2011). What is Morphology? Wiley-Blackwell.
  7. Sindhi Language Authority. Comprehensive Sindhi Dictionary. dic.sindhila.edu.pk
  8. Buriro, A. F. Sindhi Morphological Annotation Dataset and Language Engineering Resources. sindhilanguage.org
