PAN L Y,WU C Y,AN Y Z,et al. Trans_E2N: Transformer with external attention and double layer normalization for image captioning[J]. Microelectronics & Computer,2023,40(3):1-9. doi: 10.19304/J.ISSN1000-7180.2022.0390

Trans_E2N: Transformer with external attention and double layer normalization for image captioning

  • The image captioning task aims to generate syntactically accurate and semantically coherent sentences that describe the content of a given image, and it has great practical value. The Transformer model holds a significant advantage in this task. To address the Transformer's quadratic attention complexity and the internal covariate shift that arises during training, an image captioning model based on external attention and double layer normalization is proposed. On the one hand, external attention is used on the encoder side: it adopts a learnable, external, shared memory unit, which reduces the attention complexity from quadratic to linear, and it learns prior knowledge over the whole dataset, mining potential correlations between samples so that the captions generated by the model are more accurate. Meanwhile, the rows and columns of the attention matrix are normalized to eliminate the influence of the input feature size on attention. On the other hand, the concept of double layer normalization is proposed and applied to the Transformer model; it improves the data representation ability while keeping the distribution of the input data stable. Simulation experiments on the MS COCO dataset show that, compared with Up-Down, SRT, M2, and other representative models, the improved model achieves scores of 29.3, 58.6, 131.7, and 22.7 on METEOR, ROUGE, CIDEr, and SPICE, respectively. The experimental results show that the improved model expresses semantics more fully and describes images more accurately, confirming that the improvements are effective.
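The encoder-side mechanism described above — attention against a small, learnable, shared memory instead of pairwise token interactions, with normalization over both rows and columns of the attention matrix — can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function names, the memory size `S`, and the tensor shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis):
    # numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def external_attention(F, Mk, Mv):
    """External attention with double normalization (illustrative sketch).

    F  : (N, d) input token features for one sample
    Mk : (S, d) learnable external key memory, shared across all samples
    Mv : (S, d) learnable external value memory, shared across all samples

    Cost is O(N*S*d) -- linear in the number of tokens N, unlike the
    O(N^2*d) cost of standard self-attention.
    """
    A = F @ Mk.T                          # (N, S) similarity to memory slots
    A = softmax(A, axis=0)                # column-wise softmax over tokens
    A = A / A.sum(axis=1, keepdims=True)  # row-wise l1 normalization
    return A @ Mv                         # (N, d) attended output

# Illustrative sizes: 6 tokens, 4 memory slots, feature dimension 8.
rng = np.random.default_rng(0)
N, S, d = 6, 4, 8
F = rng.standard_normal((N, d))
Mk = rng.standard_normal((S, d))
Mv = rng.standard_normal((S, d))
out = external_attention(F, Mk, Mv)
```

Because `Mk` and `Mv` are trained parameters shared by every sample rather than projections of the current input, they can encode dataset-level prior knowledge, which is the source of the cross-sample correlation the abstract refers to; the two-step normalization keeps the attention weights insensitive to the scale of the input features.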
