He Meng, Xu Dawen. Fault-tolerant neural network training framework based on client-server[J]. Microelectronics & Computer, 2021, 38(10): 73-78. DOI: 10.19304/J.ISSN1000-7180.2021.0035

Fault-tolerant neural network training framework based on client-server

  • Abstract: AIoT devices have been adopted in many deep learning applications in recent years to achieve low power consumption and real-time inference. However, some manufacturing processes cause soft errors in AIoT devices during inference. For neural network accelerators, which perform a large number of computations, such errors can produce many computing errors and a large loss of prediction accuracy, which is intolerable for accuracy-sensitive applications such as autonomous drones. Conventional fault-tolerance techniques such as triple modular redundancy incur considerable power and performance overhead. This paper proposes a client-server collaborative fault-tolerant neural network training framework. During training, an AIoT processor with soft errors serves as the client, and the server learns the on-site computing errors from the application data of the AIoT device. Several representative neural network models were selected for the experiments. Compared with models trained offline, models trained with this method improve the top-5 accuracy of the neural network by 2.8% on average.
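The client-server division of labor described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the single-bit-flip fault model, the int8 weight quantization with scale 1/16, the linear model, and every name below (`flip_bit_int8`, `client_forward`, `server_update`, `train`) are assumptions made for illustration only. The hypothetical client runs the forward pass on possibly faulty hardware, and the hypothetical server updates the weights from the outputs the client actually produced, so the errors seen on the device enter the training signal.

```python
import random


def flip_bit_int8(value, bit):
    """Flip one bit of an 8-bit two's-complement integer; return the signed result."""
    raw = (value & 0xFF) ^ (1 << bit)
    return raw - 256 if raw >= 128 else raw


def client_forward(weights, x, fault_rate, rng):
    """Hypothetical client: quantized dot product with simulated soft errors."""
    acc = 0
    for w, xi in zip(weights, x):
        # Assumed quantization: int8 weights with a fixed scale of 1/16.
        q = max(-128, min(127, round(w * 16)))
        if rng.random() < fault_rate:
            # Assumed fault model: one random bit flip in the stored weight.
            q = flip_bit_int8(q, rng.randrange(8))
        acc += q * xi
    return acc / 16.0


def server_update(weights, x, y_client, target, lr):
    """Hypothetical server: squared-error gradient step on the client's output."""
    err = y_client - target
    return [w - lr * err * xi for w, xi in zip(weights, x)]


def train(samples, fault_rate, epochs=200, lr=0.05, seed=0):
    """Alternate client forward passes and server updates over the samples."""
    rng = random.Random(seed)
    weights = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        for x, target in samples:
            y = client_forward(weights, x, fault_rate, rng)
            weights = server_update(weights, x, y, target, lr)
    return weights


if __name__ == "__main__":
    data = [([1, 0], 0.5), ([0, 1], -0.25)]
    print("fault-free training:", train(data, fault_rate=0.0))
    print("training with faults:", train(data, fault_rate=0.05))
```

Setting `fault_rate` to zero reduces the loop to ordinary training on a quantized model, which gives a baseline for comparing the weights learned when faults are injected.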

     
