Fault-tolerant neural network training framework based on client-server

He Meng; Xu Dawen

doi:10.19304/J.ISSN1000-7180.2021.0035

Abstract

To achieve low power consumption and real-time inference, AIoT devices have in recent years been applied in many areas of deep learning. However, certain manufacturing processes make AIoT devices prone to soft errors during inference. For a neural network accelerator that performs a large amount of computation, these soft errors can cause many computing errors and a severe loss of prediction accuracy, which is intolerable for accuracy-sensitive applications such as autonomous drones. Conventional fault-tolerance techniques such as triple modular redundancy, in turn, incur considerable power and performance penalties. This paper proposes a client-server collaborative fault-tolerant neural network training framework. During training, an AIoT processor with soft errors serves as the client, and the server learns the on-site computing errors from the application data of the AIoT device. Several representative neural network models were selected for the experiments. Compared with models trained offline, models trained with this method improve top-5 accuracy by 2.8% on average.
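The client-server split described in the abstract can be illustrated with a minimal NumPy sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the single-layer softmax model, the sign-flip fault model, and the names `client_forward`, `server_update`, and `run_rounds` are all hypothetical. The only idea carried over from the abstract is that the forward pass runs on a faulty client while the server updates the weights from what the client actually computed.

```python
import numpy as np

def client_forward(W, x, error_rate, rng):
    """Client side (hypothetical): forward pass on a simulated faulty accelerator."""
    logits = x @ W
    # Assumed fault model: each output value is corrupted with probability
    # error_rate by flipping its sign. Real soft errors may look different.
    mask = rng.random(logits.shape) < error_rate
    return np.where(mask, -logits, logits)

def server_update(W, x, faulty_logits, labels, lr):
    """Server side (hypothetical): one SGD step of softmax cross-entropy.

    The gradient is computed from the logits the client actually produced,
    treating them as if they were x @ W (a straight-through assumption).
    """
    z = faulty_logits - faulty_logits.max(axis=1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(labels)), labels] -= 1.0   # softmax output minus one-hot target
    grad = x.T @ p / len(labels)
    return W - lr * grad

def run_rounds(W, x, labels, error_rate, lr, rounds, rng):
    """Alternate client inference and server updates for a fixed number of rounds."""
    for _ in range(rounds):
        faulty_logits = client_forward(W, x, error_rate, rng)
        W = server_update(W, x, faulty_logits, labels, lr)
    return W
```

A call such as `run_rounds(np.zeros((4, 3)), x, labels, error_rate=0.01, lr=0.1, rounds=100, rng=np.random.default_rng(0))` would train a 4-feature, 3-class toy model while exposing the server to the client's corrupted outputs on every round.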
