RNN Acceleration Design Based on High-Bandwidth Memory

吴茜凤; 巩杰; 范军; 何虎

doi:10.19304/J.ISSN1000-7180.2021.0023

Accelerating RNNs on FPGA with HBM

Abstract

To address the memory-bandwidth bottleneck that limits the performance of recurrent neural network (RNN) algorithms, an RNN acceleration SoC based on high-bandwidth memory (HBM) is designed that supports inference for the RNN and its variants in a general way. First, the structures of the RNN and its variants are compared, and the computational and memory-bandwidth requirements of the algorithms are analyzed. Then, a high-bandwidth accelerator design based on HBM is proposed and deployed on the Xilinx VCU128 development board. Finally, guided by Roofline model analysis, the bandwidth and computational density are improved. The measured average inference performance on the DeepSpeech2 and GNMT algorithms is 61.74 GFLOP/s and 20 GFLOP/s, respectively. Compared with a design based on DDR memory, performance improves by 3.68 times. Compared with other FPGA-based 32-bit floating-point RNN accelerator designs, performance improves by 8.5 times. The design proposes a data scheduling method for multi-channel 3D high-bandwidth memory and can adapt to different RNN applications.
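The Roofline analysis named in the abstract can be illustrated with a minimal sketch. It bounds attainable performance by the smaller of the compute roof and the product of memory bandwidth and arithmetic intensity, and it estimates the intensity of the FP32 matrix-vector product that dominates RNN inference. The function names, the peak-compute figure, and both bandwidth figures are hypothetical values chosen for illustration; they are not taken from the paper or from any board specification.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gbs: float, intensity: float) -> float:
    """Roofline bound: min(compute roof, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gbs * intensity)


def matvec_intensity(rows: int, cols: int, bytes_per_value: int = 4) -> float:
    """FLOPs per byte for y = W @ x, assuming W, x, and y each cross the
    memory bus exactly once (no on-chip reuse of the weight matrix)."""
    flops = 2 * rows * cols  # one multiply and one add per weight
    bytes_moved = bytes_per_value * (rows * cols + cols + rows)
    return flops / bytes_moved


# Hypothetical platform numbers, for illustration only.
PEAK_GFLOPS = 500.0
MEMORIES = (("lower-bandwidth memory", 20.0), ("higher-bandwidth memory", 400.0))

intensity = matvec_intensity(1024, 1024)  # just under 0.5 FLOP/byte for FP32
for name, bandwidth_gbs in MEMORIES:
    bound = attainable_gflops(PEAK_GFLOPS, bandwidth_gbs, intensity)
    print(f"{name}: at most {bound:.1f} GFLOP/s at {intensity:.3f} FLOP/byte")
```

Because the intensity of an FP32 matrix-vector product stays below 0.5 FLOP/byte under these assumptions, the bound in this sketch scales with memory bandwidth until the compute roof is reached, which is the motivation for pairing RNN inference with HBM.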
