A HYBRID IMAGE CAPTIONING FRAMEWORK WITH EFFICIENTNETB0 AND TRANSFORMER NETWORKS
Keywords: hybrid image captioning framework, EfficientNetB0, Transformer networks
Abstract
This research investigates an image captioning model that integrates EfficientNetB0 for feature extraction with a Transformer-based encoder-decoder for caption generation. Leveraging the balanced scaling of EfficientNetB0, the model efficiently captures rich visual representations, which the Transformer architecture then translates into descriptive textual captions. The proposed model is trained and evaluated on the Flickr8k dataset, with 85% of the data used for training. The results show that the model achieves 58% captioning accuracy after 35 epochs and attains a BLEU score of 0.75, showing competitive performance while maintaining a relatively lightweight architecture.
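For readers unfamiliar with the BLEU metric reported above, the following is a minimal sketch of a sentence-level BLEU computation in plain Python. It is illustrative only: the abstract does not state which BLEU variant (n-gram order, smoothing, sentence-level or corpus-level) was used, so the uniform weights, the absence of smoothing, and the function names `ngrams` and `sentence_bleu` are assumptions and not the authors' evaluation code.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights and no smoothing (assumed variant).

    candidate: list of tokens; references: non-empty list of token lists.
    """
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0  # candidate is shorter than n tokens
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if clipped == 0:
            return 0.0  # unsmoothed BLEU is zero when any precision is zero
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty uses the reference length closest to the candidate length.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    # Geometric mean of the modified precisions, scaled by the brevity penalty.
    return bp * math.exp(sum(log_precisions) / max_n)
```

As a usage example, `sentence_bleu("a dog runs on the grass".split(), [ref.split() for ref in reference_captions])` would score one generated caption against a hypothetical list `reference_captions` holding the human-written captions for the same image. Passing `max_n=1` restricts the score to unigram precision, a variant often reported alongside the 4-gram score.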