Research on Fault Tolerance of General Matrix Multiplication Computation for GPUs

包冲; 张善从


Research on the Fault-tolerant of GEMM computation for GPU

Abstract

Matrix computation is one of the workloads GPUs handle best, and NVIDIA ships the cuBLAS linear algebra library with CUDA for matrix and vector operations. However, GPUs are susceptible to bit flips caused by electromagnetic interference or cosmic rays, which can lead to silent data corruption (SDC). To address this problem, a fault-tolerant method for general matrix multiplication (GEMM) is proposed using algorithm-based fault tolerance (ABFT) and implemented as a CUDA library function. The paper discusses the principle of the algorithm, implements the fault-tolerant computation efficiently, and proposes a low-overhead, high-accuracy threshold calculation method for fast online error detection and correction. The fault-tolerant GEMM library function is evaluated on two embedded GPU platforms. Its error detection and correction capabilities are consistent with expectations, and in most cases the additional performance overhead is kept within 50%. This shows that the function can detect and correct errors in GEMM computation at a relatively low performance cost, giving it practical value in result-critical high-performance computing applications.
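The abstract does not spell out the checksum scheme, so the following is only a minimal CPU-side sketch of the classic ABFT idea for matrix multiplication, not the paper's CUDA implementation. It assumes a column-checksum row appended to A and a row-checksum column appended to B, so that the product carries checksums that locate and repair a single corrupted element. The function names and the fixed tolerance `tol` are hypothetical; the paper derives its own threshold, which is not reproduced here.

```python
def matmul(a, b):
    """Plain triple-loop matrix product of two lists of rows."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def encode_columns(a):
    """Append a checksum row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def encode_rows(b):
    """Append a checksum column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]


def check_and_correct(cf, tol=1e-9):
    """Verify a full-checksum product in place.

    cf is (m+1) x (n+1): the last row and column hold checksums.
    tol is an assumed fixed tolerance for rounding error.
    Returns "clean", "corrected" (one data element repaired),
    or "detected" (mismatch that this sketch cannot repair).
    """
    m, n = len(cf) - 1, len(cf[0]) - 1
    bad_rows = [i for i in range(m)
                if abs(sum(cf[i][:n]) - cf[i][n]) > tol]
    bad_cols = [j for j in range(n)
                if abs(sum(cf[i][j] for i in range(m)) - cf[m][j]) > tol]
    if not bad_rows and not bad_cols:
        return "clean"
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # The row residual equals the error injected into element (i, j).
        cf[i][j] -= sum(cf[i][:n]) - cf[i][n]
        return "corrected"
    return "detected"
```

A typical use would compute `cf = matmul(encode_columns(a), encode_rows(b))` and then call `check_and_correct(cf)` before reading the top-left m x n block as the result. A single mismatching row paired with a single mismatching column pinpoints the faulty element, while any other pattern is only reported.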
