PAN L Y,WU C Y,AN Y Z,et al. Trans_E2N: Transformer with external attention and double layer normalization for image captioning[J]. Microelectronics & Computer,2023,40(3):1-9. doi: 10.19304/J.ISSN1000-7180.2022.0390

Trans_E2N: Transformer with external attention and double layer normalization for image captioning

  • The image captioning task aims to generate syntactically accurate and semantically coherent sentences that describe the content of a given image, and it has great practical value. The Transformer model holds a significant advantage in this task. To address the Transformer's quadratic attention complexity and the internal covariate shift that arises during training, an image captioning model based on external attention and double layer normalization is proposed. On the one hand, external attention is used on the encoder side: it adopts a learnable, external, shared memory unit, which reduces the attention complexity from quadratic to linear, and it learns prior knowledge over the whole dataset, mining potential correlations between samples so that the captions generated by the model are more accurate. Meanwhile, the rows and columns of the attention matrix are normalized to eliminate the influence of the input feature size on attention. On the other hand, the concept of double layer normalization is proposed and applied to the Transformer model; it improves the data representation ability while keeping the distribution of the input data stable. Simulation experiments on the MS COCO dataset show that, compared with Up-Down, SRT, M2, and other representative models, the improved model achieves scores of 29.3, 58.6, 131.7, and 22.7 on METEOR, ROUGE, CIDEr, and SPICE, respectively. The experimental results show that the improved model expresses semantics more fully and describes images more accurately, confirming that the improvements are effective.
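The encoder-side mechanism described above — attention against a small, learnable, shared memory instead of pairwise token interactions, with normalization over both rows and columns of the attention matrix — can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function names, the memory size `S`, and the tensor shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis):
    # numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def external_attention(F, Mk, Mv):
    """External attention with double normalization (illustrative sketch).

    F  : (N, d) input token features for one sample
    Mk : (S, d) learnable external key memory, shared across all samples
    Mv : (S, d) learnable external value memory, shared across all samples

    Cost is O(N*S*d) -- linear in the number of tokens N, unlike the
    O(N^2*d) cost of standard self-attention.
    """
    A = F @ Mk.T                          # (N, S) similarity to memory slots
    A = softmax(A, axis=0)                # column-wise softmax over tokens
    A = A / A.sum(axis=1, keepdims=True)  # row-wise l1 normalization
    return A @ Mv                         # (N, d) attended output

# Illustrative sizes: 6 tokens, 4 memory slots, feature dimension 8.
rng = np.random.default_rng(0)
N, S, d = 6, 4, 8
F = rng.standard_normal((N, d))
Mk = rng.standard_normal((S, d))
Mv = rng.standard_normal((S, d))
out = external_attention(F, Mk, Mv)
```

Because `Mk` and `Mv` are trained parameters shared by every sample rather than projections of the current input, they can encode dataset-level prior knowledge, which is the source of the cross-sample correlation the abstract refers to; the two-step normalization keeps the attention weights insensitive to the scale of the input features.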
