PAN L Y, WU C Y, AN Y Z, et al. Trans_E2N: Transformer with external attention and double layer normalization for image captioning[J]. Microelectronics & Computer, 2023, 40(3): 1-9. doi: 10.19304/J.ISSN1000-7180.2022.0390


Trans_E2N: Transformer with external attention and double layer normalization for image captioning

Abstract: The goal of the image captioning task is to generate syntactically accurate and semantically coherent sentences that describe the content of a given image, which has great practical value. Transformer models have shown a clear advantage on this task. To address two problems of the Transformer, the quadratic complexity of its attention mechanism and the internal covariate shift that arises during training, an image captioning model based on external attention and double layer normalization is proposed. On the one hand, external attention is used in the encoder: through a learnable, external memory unit shared across the whole dataset, it reduces the complexity of the attention mechanism from quadratic to linear, learns prior knowledge over the entire dataset, and mines potential correlations between samples, which makes the captions generated by the model more accurate. In addition, the rows and columns of the attention map are each normalized, eliminating the influence of the input feature size on the attention. On the other hand, double layer normalization is defined and applied in the Transformer model, improving the representational ability of the data while keeping the distribution of the input stable. Simulation experiments on the MS COCO dataset show that, compared with representative models such as Up-Down, SRT and M2, the improved model achieves scores of 29.3, 58.6, 131.7 and 22.7 on the METEOR, ROUGE, CIDEr and SPICE metrics, respectively. The experimental results show that the improved model expresses semantics more fully and produces more accurate descriptions, confirming the effectiveness of the improvements.
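To make the two mechanisms described in the abstract concrete, the following is a minimal PyTorch sketch, not taken from the paper: the module names, the number of memory slots, and the placement of the normalization layers are illustrative assumptions. The external attention module replaces the N x N self-attention map with an N x S map against a small learnable memory shared across the dataset, normalized first along the token axis (softmax) and then along the slot axis, so the cost grows linearly with the number of tokens. The "double layer normalization" wrapper is only one plausible reading of the abstract, normalizing both the sub-layer input and the residual output; the paper's exact formulation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExternalAttention(nn.Module):
    """External attention: each token attends to S learnable memory slots
    shared across the whole dataset, so the cost is O(N*S) instead of the
    O(N^2) of ordinary self-attention. (Illustrative sketch.)"""

    def __init__(self, d_model: int, num_slots: int = 64):
        super().__init__()
        self.mem_k = nn.Linear(d_model, num_slots, bias=False)  # external memory unit M_k
        self.mem_v = nn.Linear(num_slots, d_model, bias=False)  # external memory unit M_v

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, d_model) image region features
        attn = self.mem_k(x)                                    # (batch, N, S) attention map
        attn = F.softmax(attn, dim=1)                           # normalize over the token axis
        attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-9)   # then over the slot axis
        return self.mem_v(attn)                                 # (batch, N, d_model)


class EncoderLayerWithDoubleLN(nn.Module):
    """One possible reading of 'double layer normalization': LayerNorm is
    applied to the sub-layer input and again after the residual connection.
    This placement is an assumption, not the authors' stated design."""

    def __init__(self, d_model: int = 512, num_slots: int = 64):
        super().__init__()
        self.attn = ExternalAttention(d_model, num_slots)
        self.norm_in = nn.LayerNorm(d_model)
        self.norm_out = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm_out(x + self.attn(self.norm_in(x)))


if __name__ == "__main__":
    regions = torch.randn(2, 36, 512)        # e.g. 36 detected regions per image
    layer = EncoderLayerWithDoubleLN(512, 64)
    print(layer(regions).shape)              # torch.Size([2, 36, 512])
```

Because the attention map has a fixed number of columns S regardless of how many regions N are fed in, memory and compute scale linearly with N, which is the complexity reduction the abstract refers to.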

     
