QIU Yue, MA Wen-tao, CHAI Zhi-lei. Design and Implementation of a Convolutional Neural Network Accelerator Based on FPGA[J]. Microelectronics & Computer, 2018, 35(8): 68-72, 77.


Design and Implementation of a Convolutional Neural Network Accelerator Based on FPGA


    Abstract: In the existing FPGA implementation of the ZynqNet convolutional neural network, the parallelism of the convolution unit is low and the storage structure depends almost entirely on off-chip memory. An optimized FPGA accelerator design is proposed for ZynqNet that is also easy to apply to other CNN models. A double-buffering scheme keeps the network's intermediate results on chip to reduce off-chip memory accesses; the data width is reduced from 32 bits to 16 bits, which makes room for a parallel structure of 64 convolution units that raises computational parallelism. Experiments on ImageNet show that, with test accuracy unchanged, the optimized accelerator runs at 200 MHz and reaches a peak performance of 1.85 GMAC/s, a 10x speedup over the original ZynqNet implementation and a 20x speedup over an i5-5200U CPU. In terms of energy efficiency, the FPGA accelerator achieves 5.4 times that of an NVIDIA GTX 970 GPU.
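The two key optimizations in the abstract, 16-bit operands and 64 parallel convolution units, can be illustrated with a minimal sketch. This is not the paper's actual HLS/RTL code; it is a hypothetical C++ rendering of one multiply-accumulate step, where the loop over the 64 units would be fully unrolled into parallel hardware and the 16-bit products are accumulated at 32 bits to avoid overflow:

```cpp
#include <cstdint>

// Illustrative sketch only (not the paper's implementation): one MAC step
// across 64 parallel convolution units using 16-bit fixed-point operands.
constexpr int kParallelism = 64;  // number of parallel MAC units

// Each unit multiplies a 16-bit activation by a 16-bit weight; partial
// sums are widened to 32 bits. In HLS this loop would be fully unrolled
// (e.g. with an UNROLL pragma) so all 64 multiplies happen per cycle.
int32_t mac64(const int16_t act[kParallelism],
              const int16_t wgt[kParallelism]) {
    int32_t acc = 0;
    for (int i = 0; i < kParallelism; ++i) {
        acc += static_cast<int32_t>(act[i]) * static_cast<int32_t>(wgt[i]);
    }
    return acc;
}
```

Under this scheme, halving the data width from 32 to 16 bits is what frees the DSP and on-chip memory resources needed to instantiate 64 such units instead of the original design's smaller array.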

     
