A HYBRID IMAGE CAPTIONING FRAMEWORK WITH EFFICIENTNETB0 AND TRANSFORMER NETWORKS

Authors

  • Yasir Afnan
  • Kifayat Ullah
  • Bilal Ur Rehman
  • Inam Ul Hassan
  • Maria Zulfiqar
  • Zawish Asif
  • Wasim Habib
  • Muhammad Amir
  • Muhammad Arshad

Keywords:

A HYBRID IMAGE CAPTIONING FRAMEWORK, WITH EFFICIENTNETB0, AND TRANSFORMER NETWORKS

Abstract

This research investigates an image captioning model that integrates EfficientNetB0 for feature extraction with a Transformer-based encoder-decoder for caption generation. Leveraging the balanced scaling of EfficientNetB0, the model efficiently captures rich visual representations, which the Transformer architecture then translates into descriptive textual captions. The proposed model is trained and evaluated on the Flickr8k dataset, using 85% of the data for training. The model achieves 58% captioning accuracy after 35 epochs and attains a BLEU score of 0.75, showing competitive performance while maintaining a relatively lightweight architecture.
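
To make the evaluation setup described in the abstract concrete, the following is a minimal sketch of an 85% training split and a sentence-level BLEU computation. It is an illustration under stated assumptions, not the authors' code: the abstract does not say how the split was drawn, which BLEU order (BLEU-1 through BLEU-4) was reported, or whether smoothing was applied. The function names `train_eval_split` and `bleu`, the fixed seed, and the unsmoothed uniform-weight BLEU variant are all hypothetical choices made for this example.

```python
import math
import random
from collections import Counter


def train_eval_split(items, train_fraction=0.85, seed=0):
    """Shuffle items deterministically and split them into train/held-out lists.

    The 0.85 default mirrors the abstract; the seed is an assumption.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights and no smoothing (assumed variant)."""
    if not candidate:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if clipped == 0:
            return 0.0
        log_precision_sum += math.log(clipped / total)
    # Brevity penalty uses the reference length closest to the candidate length.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_precision_sum / max_n)
```

In this sketch, `bleu(generated_tokens, list_of_reference_token_lists, max_n=1)` gives a unigram score, while the default `max_n=4` gives the stricter four-gram score. The two can differ substantially on short captions, which is why knowing the reported BLEU order matters when comparing against the 0.75 figure.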

Published

2025-07-23

How to Cite

Yasir Afnan, Kifayat Ullah, Bilal Ur Rehman, Inam Ul Hassan, Maria Zulfiqar, Zawish Asif, Wasim Habib, Muhammad Amir, & Muhammad Arshad. (2025). A HYBRID IMAGE CAPTIONING FRAMEWORK WITH EFFICIENTNETB0 AND TRANSFORMER NETWORKS. Spectrum of Engineering Sciences, 3(7), 888–898. Retrieved from https://sesjournal.com/index.php/1/article/view/668