Video frame interpolation algorithms based on convolutional neural networks often suffer from excessive model parameters, poor real-time performance, and high memory consumption, which hinder their wide deployment. To address these problems, a lightweight cascaded inference network based on bidirectional optical flow and multi-scale feature fusion is proposed. The model decomposes video frame interpolation into two stages, inter-frame motion synthesis and texture reconstruction; it designs a lightweight bidirectional optical flow prediction network and a multi-scale spatial and texture feature fusion network, so that the multi-scale texture features and complex motion features of the video frames are fully extracted and exploited. The network takes two adjacent video frames and the temporal position of the desired intermediate frame as inputs. First, the spatial pyramid features and texture pyramid features of the two input frames are computed. Then, the spatial pyramid features are used to estimate the multi-scale bidirectional optical flow between the frames, and the spatial and texture features of the intermediate frame are synthesized at the same time. Finally, a fusion network combines the multi-scale spatial and texture features of the intermediate frame to generate the final interpolated frame. Experiments on the Vimeo90K and UCF101 datasets show that, while maintaining accuracy, the proposed algorithm achieves better performance in terms of computation speed and number of model parameters.
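To make the described pipeline concrete, the following is a minimal PyTorch sketch of the cascaded-inference idea: pyramid feature extraction, bidirectional flow estimation, warping of both frames' features to the intermediate time t, and a fusion network that synthesizes the output frame. All module names, channel widths, and the linear-motion flow approximation are illustrative assumptions rather than the paper's exact architecture; for brevity the sketch uses a single pyramid level and omits the separate texture branch and the coarse-to-fine refinement described above.

```python
# Hypothetical, simplified sketch of a cascaded flow-and-fusion interpolator.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PyramidEncoder(nn.Module):
    """Extracts a small feature pyramid from an RGB frame (illustrative widths)."""

    def __init__(self, channels=(16, 32, 64)):
        super().__init__()
        c_in, levels = 3, []
        for c_out in channels:
            levels.append(nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.PReLU()))
            c_in = c_out
        self.levels = nn.ModuleList(levels)

    def forward(self, x):
        feats = []
        for level in self.levels:
            x = level(x)
            feats.append(x)
        return feats  # ordered fine to coarse


def backward_warp(feat, flow):
    """Samples `feat` at positions displaced by `flow` (B, 2, H, W): backward warping."""
    _, _, h, w = feat.shape
    gy, gx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((gx, gy), dim=0).float().to(feat.device)  # (2, H, W)
    coords = grid.unsqueeze(0) + flow
    # Normalize sampling coordinates to [-1, 1] as required by grid_sample.
    gx_n = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    gy_n = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    return F.grid_sample(feat, torch.stack((gx_n, gy_n), dim=-1), align_corners=True)


class BidirectionalFlow(nn.Module):
    """Predicts flows 0->1 and 1->0 from the concatenated features of one level."""

    def __init__(self, c_feat):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * c_feat, 32, 3, padding=1), nn.PReLU(),
            nn.Conv2d(32, 4, 3, padding=1))  # 2 flow channels per direction

    def forward(self, f0, f1):
        flow = self.net(torch.cat((f0, f1), dim=1))
        return flow[:, :2], flow[:, 2:]


class CascadedInterpolator(nn.Module):
    """Single-level stand-in for the motion-synthesis + texture-reconstruction idea."""

    def __init__(self, channels=(16, 32, 64)):
        super().__init__()
        self.encoder = PyramidEncoder(channels)
        self.flow = BidirectionalFlow(channels[0])
        self.fusion = nn.Sequential(
            nn.Conv2d(2 * channels[0], 32, 3, padding=1), nn.PReLU(),
            nn.Conv2d(32, 3, 3, padding=1))

    def forward(self, frame0, frame1, t=0.5):
        f0, f1 = self.encoder(frame0)[0], self.encoder(frame1)[0]
        flow01, flow10 = self.flow(f0, f1)
        # Linear-motion approximation of the flows from time t to each input frame.
        warped0 = backward_warp(f0, t * flow10)          # F_{t->0} ~= t * F_{1->0}
        warped1 = backward_warp(f1, (1.0 - t) * flow01)  # F_{t->1} ~= (1-t) * F_{0->1}
        mid = self.fusion(torch.cat((warped0, warped1), dim=1))
        # Features live at stride 2, so upsample back to the input resolution.
        return F.interpolate(mid, size=frame0.shape[-2:],
                             mode="bilinear", align_corners=False)


if __name__ == "__main__":
    model = CascadedInterpolator()
    f0, f1 = torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128)
    print(model(f0, f1, t=0.5).shape)  # torch.Size([1, 3, 128, 128])
```

In a full multi-scale version, the flow estimated at a coarse pyramid level would be upsampled and refined at each finer level, and a parallel texture pyramid would be warped and fused alongside the spatial features; the sketch above keeps only the core warp-and-fuse step for readability.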