Abstract:
The YOLOv3-tiny network performs well in both accuracy and real-time performance for object detection. However, its complex network structure means that practical applications require targeted optimization at both the software and hardware levels. To meet real-time requirements, three optimization techniques are combined. At the software level, the amount of computation is reduced by fusing the batch normalization layers, and a low bit width is used to increase resource utilization. Multi-dimensional parallel FPGA computation cores are designed to match multiple convolutional layers and improve overall throughput. Fine-grained inter-layer pipelining and a ping-pong buffer design reduce data transfer time. On a ZCU104 FPGA, the design achieves a detection latency of 21 ms for 418 x 418 images, which outperforms similar accelerator designs, and improves DSP efficiency by 2.86 times or 8.81 times.
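
As an illustration of the batch normalization fusion mentioned above, the following is a minimal sketch of the standard folding identity, not the authors' implementation. It assumes per-output-channel BN parameters and an inference-time epsilon of 1e-5; the function names (`fold_batch_norm`, `conv_point`, `batch_norm`) and the flat per-channel weight layout are hypothetical and chosen only for this example.

```python
import math


def fold_batch_norm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold BN parameters into convolution weights and biases.

    weights: one flat list of kernel weights per output channel.
    bias, gamma, beta, mean, var: one value per output channel.
    eps is an assumed value, not taken from the source.
    """
    folded_w, folded_b = [], []
    for w_c, b_c, g, bt, mu, v in zip(weights, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)              # per-channel BN scale
        folded_w.append([w * scale for w in w_c])   # w' = w * scale
        folded_b.append((b_c - mu) * scale + bt)    # b' = (b - mean) * scale + beta
    return folded_w, folded_b


def conv_point(weights_c, bias_c, patch):
    """One output value of a convolution: dot product plus bias."""
    return sum(w * x for w, x in zip(weights_c, patch)) + bias_c


def batch_norm(y, g, bt, mu, v, eps=1e-5):
    """Inference-time batch normalization of a single value."""
    return g * (y - mu) / math.sqrt(v + eps) + bt
```

After folding, a single call to `conv_point` with the folded parameters gives the same value as `conv_point` followed by `batch_norm`, so the separate normalization step, with its subtraction, division and square root, no longer has to be computed at inference time.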