ADAPTIVE MULTI MODAL ANNOTATION FOR HIGH QUALITY, SCALABLE MACHINE LEARNING DATA PIPELINES

Shah Faisal; Zahid Mehmood; Muhammad Abdul Rafay; Umama Abbasi

Keywords:

Adaptive multi-modal annotation, high-quality data, scalable machine learning, data pipelines

Abstract

The shift to data-centric artificial intelligence makes high-quality labeled data a cornerstone of machine learning model performance. Manual annotation, however, is labor-intensive, costly, and prone to inconsistency, which limits its scalability to large datasets. This paper proposes the Adaptive Multi-Modal Annotation Framework (AMAF), a novel system that integrates weak supervision, large language model-based labeling, and active learning to automate data annotation in ML pipelines. We also introduce Dynamic Synthetic Data Augmentation, a technique for generating diverse, domain-specific datasets that addresses bias and scalability issues. Implemented with Snorkel and MLflow, AMAF was evaluated in three domains: healthcare (radiology image labeling), natural language processing (intent classification), and autonomous vehicles (object detection). Results show 18–20% higher label accuracy and 20–30% faster annotation cycles than human baselines, and downstream models achieve 7–10% F1-score improvements over tools such as Label Studio and Amazon SageMaker Ground Truth. Remaining challenges include domain-specific complexities and the limitations of rule-based labeling.
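To make the combination of weak supervision and active learning concrete, the following is a minimal, illustrative sketch and not AMAF's actual implementation. It assumes each example receives integer votes from several noisy labeling sources (heuristic rules or an LLM), combines them by simple majority vote, and routes the lowest-agreement examples to a human annotator. The names `aggregate_votes`, `select_for_review`, and `ABSTAIN` are hypothetical. Snorkel, which the abstract names, provides a learned label model for this aggregation step; plain majority voting is swapped in here to keep the example dependency-free.

```python
from collections import Counter

ABSTAIN = -1  # hypothetical sentinel: a labeling source with no opinion returns this


def aggregate_votes(votes):
    """Majority vote over the non-abstaining sources for one example.

    Returns (label, agreement), where agreement is the fraction of
    non-abstaining sources that voted for the winning label.
    Returns (ABSTAIN, 0.0) when every source abstained.
    """
    active = [v for v in votes if v != ABSTAIN]
    if not active:
        return ABSTAIN, 0.0
    label, top_count = Counter(active).most_common(1)[0]
    return label, top_count / len(active)


def select_for_review(vote_matrix, budget):
    """Active-learning step: pick the `budget` examples with the lowest
    agreement so that human effort goes where the automatic labels are weakest."""
    scored = sorted(
        (aggregate_votes(votes)[1], idx) for idx, votes in enumerate(vote_matrix)
    )
    return [idx for _, idx in scored[:budget]]


# Toy data (assumed): four examples, three labeling sources each.
vote_matrix = [
    [1, 1, 1],                    # all sources agree
    [1, 0, ABSTAIN],              # sources disagree
    [ABSTAIN, ABSTAIN, ABSTAIN],  # no source fired
    [0, 0, 1],                    # two of three agree
]

auto_labels = [aggregate_votes(votes) for votes in vote_matrix]
needs_human = select_for_review(vote_matrix, budget=2)
```

Here `needs_human` holds the indices of the two examples on which the sources agree least, which is where a human label is most informative. A production pipeline would typically weight sources by estimated accuracy instead of counting votes equally.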