A RULE-BASED APPROACH TO PASHTO TEXT PREPROCESSING: A TECHNIQUE FOR NORMALIZATION, STEMMING, AND TF-IDF INDEXING (RATP-TFI)

Abdul Qadir (Achakzai); Ihsan Ullah

Keywords:

Text Preprocessing, Natural Language Processing (NLP), Low-Resource Language, Pashto Linguistics, Corpus Creation, Pashto Text Normalization, Tokenization, Stop-Word Removal, Stemming, Lemmatization, Noise Removal, Morphological Analysis, ROUGE Evaluation, TF-IDF Weighting Technique, Rule-Based Model, Pashto Language

Abstract

Text preprocessing is a fundamental step in natural language processing (NLP), particularly for low-resource languages such as Pashto, and its quality affects the overall performance of downstream tasks. This study develops preprocessing techniques adapted to the linguistic characteristics, orthographic variation, and dialectal differences of Pashto. Key components include Pashto-specific normalization rules for handling script variation, a stop-word list, and a stemming algorithm that reduces inflectional diversity. The main objective of text preprocessing is to prepare raw text for accurate and efficient processing by machine learning models. It transforms noisy, inconsistent, and unstructured text into a standardized, structured format suitable for computation and analysis, which improves computational efficiency, eases model understanding, and improves model performance.
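The components named above (normalization, stop-word removal, suffix stripping) can be sketched as a small rule-based pipeline. This is a minimal illustration and not the RATP-TFI rule set, which is not listed here: the character map, the stop-word list, the two plural suffixes, and all function names are assumptions chosen only to show the shape of such rules.

```python
import re
import unicodedata

# Illustrative rules only (assumptions, not the RATP-TFI rule set).
CHAR_MAP = {"\u0643": "\u06A9"}  # Arabic kaf -> keheh (ک)
# Harakat (tanwin, short vowels, shadda, sukun) and tatweel.
MARKS = re.compile("[\u064B-\u0652\u0640]")
STOP_WORDS = {"د", "په", "او", "چې", "ته", "له"}
SUFFIXES = ("ونه", "ونو")  # plural endings, e.g. کتابونه -> کتاب


def normalize(text):
    """Unify code points and strip marks that do not change word identity."""
    text = unicodedata.normalize("NFC", text)
    text = MARKS.sub("", text)
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)


def tokenize(text):
    """Return maximal runs of word characters (letters, digits, underscore)."""
    return re.findall(r"\w+", text)


def stem(token, min_stem=2):
    """Strip the first matching suffix if at least min_stem characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Normalize, tokenize, drop stop words, then stem the remaining tokens."""
    tokens = tokenize(normalize(text))
    return [stem(t) for t in tokens if t not in STOP_WORDS]
```

Under these assumed rules, `preprocess("د کتابونه او قلم")` drops the two function words and reduces the plural noun to its base form. A real rule set would need many more suffix patterns and exception lists to avoid over-stemming.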

The proposed model, A Rule-Based Approach to Pashto Text Preprocessing: A Technique for Normalization, Stemming, and TF-IDF Indexing (RATP-TFI), performs core NLP tasks for Pashto: normalization, tokenization, stemming, POS tagging, and stop-word removal. The Pashto Text Corpus (PTC) consists of 30k Pashto text documents collected from sources such as websites, social media, news, books, and the Pashto Academy Quetta, Pakistan.
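The TF-IDF indexing named in the model title can be illustrated over a collection of already-tokenized documents. The abstract does not say which TF-IDF variant is used, so this sketch assumes relative term frequency and an unsmoothed idf of ln(N / df). The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Weight terms per document.

    docs: list of token lists (for example, preprocessed documents).
    Returns one {term: weight} dict per document, using
    tf = count / document length and idf = ln(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1  # guard against empty documents
        weights.append(
            {t: (c / length) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights
```

With this variant, a term that occurs in every document receives weight 0, which is why stop-word removal and stemming before indexing mainly affect the mid-frequency terms that carry the ranking signal.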

A rule-based tagger assigns grammatical and semantic tags, including part of speech, to words in the Pashto input text using predefined linguistic rules. The rule-based approach offers language-specific customization, no dependency on large datasets, domain-specific adaptation (fine-tuning for specific domains), resilience under limited resources, and a foundation for further research. ROUGE metrics are used to assess the quality and effectiveness of the preprocessed text in handling Pashto’s morphology, tokenization, and orthographic variation, ensuring accurate comparison of generated and reference summaries. The proposed RATP-TFI technique achieves 93% accuracy on the PTC.
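For readers unfamiliar with ROUGE, the unigram variant (ROUGE-1) compares a generated summary with a reference summary by counting overlapping tokens. The abstract does not state which ROUGE variant is used or how the 93% figure is derived from it, so the sketch below only shows the standard ROUGE-1 precision, recall, and F1 computation on token lists. The function name `rouge_1` is hypothetical.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Return (precision, recall, f1) for unigram overlap.

    candidate, reference: lists of tokens. Overlap is clipped, so a token
    counts at most as many times as it appears in the reference.
    """
    cand_counts, ref_counts = Counter(candidate), Counter(reference)
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / max(len(candidate), 1)
    recall = overlap / max(len(reference), 1)
    if overlap == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Because the score depends on exact token matches, consistent normalization and stemming of both the generated and the reference summary directly change the measured overlap.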