Abstract:
Implementing object detection algorithms such as YOLO on FPGAs requires multi-level optimization, from model quantization down to hardware design. To reduce hardware latency, three techniques are applied: (1) bit-width quantization and layer-fusion strategies minimize computational complexity; (2) a column-based pipeline architecture with a padding-skip technique reduces the pipeline start-up time; and (3) a design space exploration algorithm balances the pipeline stages and improves DSP efficiency. To demonstrate the proposed neural network accelerator architecture, a YOLO network with a 1280×384 input is implemented on a ZC706 FPGA, achieving a 1.97× latency reduction or a 1.54× DSP efficiency improvement over conventional accelerators.