WANG Zhijie, REN Jian, LIAO Lei. Visual gaze target tracking method based on spatiotemporal attention mechanism and joint attention[J]. Microelectronics & Computer, 2022, 39(11): 45-53. DOI: 10.19304/J.ISSN1000-7180.2022.0148

Visual gaze target tracking method based on spatiotemporal attention mechanism and joint attention


    Abstract: Current visual tracking techniques tend to ignore the connection between people and the scene, and lack analysis and detection of joint attention, which results in unsatisfactory detection performance. To address these problems, this paper proposes a visual gaze target tracking method based on a spatiotemporal attention mechanism and joint attention. For any given image, the method extracts a person's head features with a deep neural network and then models the interaction between the scene and the head to enhance image saliency; an enhanced attention module filters out interference from depth and field-of-view cues. In addition, the attention of the other people in the scene is incorporated into the region of interest to strengthen the standard saliency model through attention pushing. With the spatiotemporal attention mechanism, candidate targets, gaze directions, and temporal frame constraints can be effectively combined to identify shared locations, and the saliency information enables better detection and localization of joint attention. Finally, attention in the image is visualized as a heat map. Experiments show that the model effectively infers dynamic attention and joint attention in videos, with good results.
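The overall idea (combine scene saliency with a person's gaze direction, then normalize the result into a heat map) can be illustrated with a minimal numerical sketch. The cone-shaped gaze prior, the function names, and the parameters below are illustrative assumptions, not the paper's actual network:

```python
import numpy as np

def gaze_cone_prior(head_pos, gaze_dir, shape, sharpness=8.0):
    """Directional prior: pixels aligned with the gaze direction score higher.

    head_pos: (x, y) of the head; gaze_dir: unit (x, y) gaze vector (assumed given).
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    vec = np.stack([xs - head_pos[0], ys - head_pos[1]], axis=-1).astype(float)
    norm = np.linalg.norm(vec, axis=-1, keepdims=True) + 1e-8
    # Cosine between each pixel offset and the gaze direction, sharpened into a cone.
    cos = (vec / norm) @ np.asarray(gaze_dir, dtype=float)
    return np.clip(cos, 0.0, 1.0) ** sharpness

def gaze_heatmap(saliency, head_pos, gaze_dir):
    """Modulate scene saliency by the gaze prior and normalize to a heat map."""
    prior = gaze_cone_prior(head_pos, gaze_dir, saliency.shape)
    heat = saliency * prior
    total = heat.sum()
    return heat / total if total > 0 else heat
```

For example, with a uniform 64×64 saliency map, a head at the left edge, and a rightward gaze, the resulting heat map concentrates its mass along the horizontal gaze ray. In the paper's method the saliency map and gaze direction come from learned network branches, and joint attention would correspond to overlapping high-probability regions from several people's heat maps.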

     
