Abstract:
In this paper, we address video-to-video person re-identification (re-id). The proposed network consists mainly of a feature representation subnetwork and a similarity measure subnetwork. First, a residual network extracts features from each video frame, and these features are then fed into a Long Short-Term Memory (LSTM) network to obtain features that encode both spatial and temporal information. A weight module is applied on top of the LSTM network; it uses an attentive quality mechanism to assign an appropriate weight to each frame. The weighted feature of each video sequence is then passed to the similarity measure subnetwork, which measures the similarity between sequences. Fully connected layers connect the feature representation subnetwork to the similarity measure subnetwork, so that the feature representation and the similarity metric can be learned and optimized jointly. Finally, experiments on two public datasets demonstrate that the proposed model improves pedestrian recognition accuracy.
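The frame-weighting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `attentive_pool`, the softmax normalization of the scores, and the plain-list data layout are all assumptions made for illustration. In the proposed network the per-frame quality scores would come from the learned weight module; here they are simply passed in.

```python
import math


def attentive_pool(frame_features, quality_scores):
    """Pool per-frame feature vectors into one sequence-level vector.

    frame_features: list of equal-length lists of floats, one per frame
                    (standing in for the LSTM outputs).
    quality_scores: list of floats, one raw score per frame
                    (hypothetical output of the weight module).
    Returns (pooled_feature, weights).
    """
    if not frame_features or len(frame_features) != len(quality_scores):
        raise ValueError("need one quality score per frame")

    # Assumption: scores are normalized with a softmax so the weights
    # are positive and sum to 1. Subtracting the max keeps exp() stable.
    top = max(quality_scores)
    exps = [math.exp(s - top) for s in quality_scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum over frames, computed independently per dimension.
    dim = len(frame_features[0])
    pooled = [
        sum(w * feat[d] for w, feat in zip(weights, frame_features))
        for d in range(dim)
    ]
    return pooled, weights


if __name__ == "__main__":
    # Three toy frames with 2-dimensional features; the second frame
    # receives the highest score and therefore the largest weight.
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    scores = [0.1, 2.0, 0.5]
    pooled, weights = attentive_pool(feats, scores)
    print(weights)
    print(pooled)
```

Under these assumptions, a frame with a higher quality score contributes more to the sequence-level feature, which is then what the similarity measure subnetwork would compare between two video sequences.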