Research on Fault Tolerance of GEMM Computation for GPUs
-
Abstract
Matrix computation is one of the workloads at which GPUs excel. NVIDIA provides cuBLAS, a CUDA linear algebra library, for matrix and vector computations. However, GPUs are vulnerable to bit flips caused by electromagnetic interference or cosmic rays, which result in silent data corruption (SDC) errors. To address this problem, a general-purpose, fault-tolerant CUDA library function for General Matrix Multiplication (GEMM) is implemented using the algorithm-based fault tolerance (ABFT) method. The principle and implementation of the algorithm, as well as the decision mechanism used during error detection and correction, are discussed in detail. The fault-tolerant GEMM library function is evaluated on two embedded GPU platforms. Its error detection and correction capabilities are consistent with expectations, and the additional performance overhead is kept within 50%. The results show that this fault-tolerant GEMM function can detect and correct errors in GEMM computation at a low performance cost. The function is therefore of practical value in high-performance computing applications where the correctness of results is critical.
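To make the ABFT idea concrete, the following is a minimal CPU-side sketch in Python of the classic checksum scheme for matrix multiplication. It is an illustration under stated assumptions, not the CUDA implementation evaluated in this work: the function names `matmul` and `abft_gemm`, the `inject` hook that simulates a bit flip, and the fixed tolerance `tol` are all hypothetical. A column-checksum row is appended to A and a row-checksum column to B, so the product carries checksums against which the result can be verified.

```python
def matmul(a, b):
    """Plain triple-loop product of two list-of-lists matrices."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def abft_gemm(a, b, inject=None, tol=1e-9):
    """Checksum-protected C = A * B (illustrative sketch only).

    inject: optional (row, col, delta) added to one element of the
            checksummed product to simulate a bit flip (hypothetical hook).
    Returns (C, status), status in {"ok", "corrected", "detected"}.
    """
    n, m = len(a), len(b[0])
    # Append a column-checksum row to A and a row-checksum column to B.
    a_c = a + [[sum(col) for col in zip(*a)]]
    b_r = [row + [sum(row)] for row in b]
    full = matmul(a_c, b_r)  # (n+1) x (m+1) product with checksums

    if inject is not None:
        i, j, delta = inject
        full[i][j] += delta

    # Recompute checksums from the data part and compare with the
    # checksums carried through the multiplication.
    bad_rows = [i for i in range(n)
                if abs(sum(full[i][:m]) - full[i][m]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(full[i][j] for i in range(n)) - full[n][j]) > tol]

    c = [row[:m] for row in full[:n]]
    if not bad_rows and not bad_cols:
        return c, "ok"
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # The row discrepancy equals the error in C[i][j]; subtract it.
        c[i][j] -= sum(full[i][:m]) - full[i][m]
        return c, "corrected"
    return c, "detected"
```

In this sketch a single corrupted data element is located at the intersection of the one mismatching row and the one mismatching column and is then repaired; any other mismatch pattern, including a corrupted checksum element, is only reported as detected. With floating-point data the comparison tolerance has to account for rounding error, so a fixed `tol` as used here is a simplification.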