Research on Fault-Tolerant GEMM Computation for GPUs
Abstract: Matrix computation is one of the workloads at which GPUs excel, and NVIDIA provides the cuBLAS linear algebra library in CUDA for matrix and vector operations. However, GPUs are susceptible to bit flips caused by electromagnetic interference or cosmic rays, which lead to silent data corruption (SDC). To address this problem, a fault-tolerant General Matrix Multiplication (GEMM) routine is proposed using algorithm-based fault tolerance (ABFT) and implemented as a CUDA library function. The paper discusses the principle of the algorithm, implements the fault-tolerant computation efficiently, and proposes a low-overhead, high-accuracy threshold calculation method for fast online error detection and correction. The fault-tolerant GEMM library function is evaluated on two embedded GPU platforms. Its error detection and correction capabilities are consistent with expectations, and in most cases the additional performance overhead is kept within 50%. The results show that the function detects and corrects errors in GEMM computation at relatively low performance cost, which gives it practical value in result-critical high-performance computing applications.
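For orientation, the classic ABFT checksum encoding for matrix multiplication can be written as follows. This is a sketch of the general technique, not necessarily the exact formulation used in the paper: the symbols $e$ (an all-ones vector of matching length), $A^{c}$, $B^{r}$, $C^{f}$ and the tolerance $\varepsilon$ are notation assumed here for illustration, and the paper's own threshold calculation is not reproduced.

```latex
% A is m x k, B is k x n, C = AB is m x n; e is an all-ones column vector.
A^{c} = \begin{bmatrix} A \\ e^{T} A \end{bmatrix}, \qquad
B^{r} = \begin{bmatrix} B & B e \end{bmatrix}, \qquad
C^{f} = A^{c} B^{r} =
\begin{bmatrix} C & C e \\ e^{T} C & e^{T} C e \end{bmatrix}

% Row i of the result is flagged as inconsistent when its recomputed sum
% differs from the stored checksum by more than a rounding tolerance:
\left| \sum_{j=1}^{n} C^{f}_{i,j} - C^{f}_{i,n+1} \right| > \varepsilon
```

The same test applied column-wise against the last row of $C^{f}$ flags inconsistent columns. In floating-point arithmetic the two sides differ slightly even without a fault, which is why a threshold is needed.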
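Table 1 Error detection and correction capabilities of the ABFT algorithm

| No. | Case | Error detection | Error correction |
| --- | --- | --- | --- |
| 1 | s=0 and t=0 | No error | No correction needed |
| 2 | s=1 and t=0, or s=0 and t=1 | Checksum vector error | No correction needed |
| 3 | s=1 and t≥1, or s≥1 and t=1 | The erroneous elements of result matrix C are C(i, j1), C(i, j2), …, C(i, jt) | Corrected according to Eq. (1) |
| 4 | s>1 and t>1 | The erroneous elements are C(im, jn), where m∈{1, 2, …, s} and n∈{1, 2, …, t} | Cannot be corrected |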
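The four cases in Table 1 can be illustrated with a small host-side sketch in plain Python. It rests on several assumptions: s and t are taken to be the numbers of result rows and columns whose checksums disagree, the correction step uses the standard checksum-difference rule as a stand-in for Eq. (1) (which is not shown here), and the function names and the fixed tolerance `tol` are hypothetical rather than the paper's CUDA interface or threshold method.

```python
def matmul(A, B):
    """Plain GEMM on lists of rows: returns A x B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def encode(A, B):
    """Append a column-checksum row to A and a row-checksum column to B."""
    Ac = A + [[sum(col) for col in zip(*A)]]
    Br = [row + [sum(row)] for row in B]
    return Ac, Br


def check(Cf, tol=1e-9):
    """Return the data rows and data columns whose checksum disagrees."""
    m, n = len(Cf) - 1, len(Cf[0]) - 1
    bad_rows = [i for i in range(m)
                if abs(sum(Cf[i][:n]) - Cf[i][n]) > tol]
    bad_cols = [j for j in range(n)
                if abs(sum(Cf[i][j] for i in range(m)) - Cf[m][j]) > tol]
    return bad_rows, bad_cols


def classify_and_correct(Cf, tol=1e-9):
    """Apply the four cases of Table 1 to Cf, correcting in place if possible."""
    rows, cols = check(Cf, tol)
    s, t = len(rows), len(cols)  # assumed meaning of s and t
    m, n = len(Cf) - 1, len(Cf[0]) - 1
    if s == 0 and t == 0:
        return "no error"
    if s + t == 1:
        # Only one checksum disagrees: the checksum element itself is wrong.
        return "checksum vector error"
    if s == 1:
        # All errors lie in one row: repair each from its column checksum.
        i = rows[0]
        for j in cols:
            Cf[i][j] += Cf[m][j] - sum(Cf[r][j] for r in range(m))
        return "corrected"
    if t == 1:
        # All errors lie in one column: repair each from its row checksum.
        j = cols[0]
        for i in rows:
            Cf[i][j] += Cf[i][n] - sum(Cf[i][:n])
        return "corrected"
    return "uncorrectable"


# Usage: multiply the encoded matrices, inject a fault, then check and repair.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
Ac, Br = encode(A, B)
Cf = matmul(Ac, Br)
Cf[0][1] += 5  # simulated bit-flip damage in one result element
print(classify_and_correct(Cf), Cf[0][1])
```

Integer inputs keep the checksums exact, so a tiny tolerance suffices here. With floating-point data the tolerance must account for rounding error, which is the role of the threshold calculation described in the abstract.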