Abstract:
AIoT devices have been deployed in many deep learning applications in recent years to achieve low power consumption and real-time inference. However, some manufacturing processes introduce soft errors on AIoT devices during inference. For a neural network accelerator that performs a large amount of computation, these soft errors can cause many computing errors and a severe loss of prediction accuracy, which is intolerable for precision-sensitive applications such as autonomous drones. Conventional fault-tolerance techniques such as triple modular redundancy, meanwhile, can incur considerable power and performance penalties. This paper proposes a client-server collaborative framework for fault-tolerant neural network training. During training, an AIoT processor with soft errors acts as the client, and the server learns the on-site computing errors from the application data of the AIoT processor. Experiments on several representative neural network models show that, compared with models trained offline, models trained with this method improve top-5 accuracy by an average of 2.8%.
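
To make the effect of a soft error concrete, the following minimal sketch flips a single bit in one weight of a small integer dot product and compares the result with the fault-free value. It is an illustration only and not the framework proposed in the paper: the 8-bit two's-complement weight format, the example values, and the helper names `flip_bit`, `dot`, and `faulty_dot` are all assumptions introduced here.

```python
def flip_bit(value: int, bit: int, width: int = 8) -> int:
    """Flip one bit of a two's-complement integer (assumed 8-bit by default)."""
    mask = (1 << width) - 1
    raw = (value & mask) ^ (1 << bit)   # unsigned view with the chosen bit toggled
    if raw >= 1 << (width - 1):         # reinterpret as a signed value
        raw -= 1 << width
    return raw


def dot(weights, activations):
    """Fault-free integer dot product."""
    return sum(w * a for w, a in zip(weights, activations))


def faulty_dot(weights, activations, fault_index, fault_bit):
    """Dot product with a single-bit soft error injected into one weight."""
    corrupted = list(weights)
    corrupted[fault_index] = flip_bit(corrupted[fault_index], fault_bit)
    return dot(corrupted, activations)


# Hypothetical example values, chosen only for illustration.
weights = [3, -2, 5, 1]
activations = [10, 4, 2, 7]

clean = dot(weights, activations)
faulty = faulty_dot(weights, activations, fault_index=0, fault_bit=6)
print(clean, faulty)  # prints "39 679"
```

In this toy case, one flipped high-order bit changes the accumulated value from 39 to 679, which suggests how a few soft errors in an accelerator can distort a layer's output far more than ordinary quantization noise would.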