(RP08) Using GPU's FP16 Tensor Cores Arithmetic to Accelerate Mixed-Precision Iterative Refinement Solvers and Reduce Energy Consumption
Time: Tuesday, June 26th, 8:30am - 10am
Description: Low-precision arithmetic in mixed-precision computing methods has been a powerful tool for accelerating numerous scientific computing applications. Artificial intelligence (AI) in particular has pushed this to current extremes, making use of half-precision floating-point arithmetic (FP16) in approaches based on neural networks. We present an investigation showing that other HPC applications can harness this power too, in particular the general HPC problem of solving Ax = b, where A is a large dense matrix and the solution is needed in FP64 accuracy. Our approach is based on the mixed-precision (FP16->FP64) iterative refinement technique. We generalize and extend prior advances into a framework, for which we develop architecture-specific algorithms and highly tuned implementations, and we show how the use of FP16-TC (Tensor Cores) arithmetic can provide up to a 4X speedup and reduce energy consumption by a factor of 5. This is due to the performance boost that FP16-TC arithmetic provides and to its better accuracy compared with classical FP16, because the GEMM accumulation occurs in FP32 arithmetic. In addition, we will highlight, for the first time, that a V100 GPU is able to deliver 74 Gflops/Watt. Our results can be reproduced, as the developments will be made available through the MAGMA library.
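The mixed-precision iterative refinement idea described above can be illustrated with a minimal sketch in pure Python. This is not the MAGMA implementation and does not use a GPU or Tensor Cores. It assumes a small, well-conditioned matrix, emulates FP16 by rounding every stored factorization value through Python's half-precision `struct` format, and keeps the residual and solution update in FP64. All function names (`fp16`, `lu_factor_lowprec`, `lu_solve`, `refine`) are hypothetical and chosen only for illustration. Unlike FP16-TC arithmetic, this emulation also rounds accumulations to FP16 rather than accumulating in FP32.

```python
import struct

def fp16(x):
    """Round an FP64 value to the nearest IEEE half-precision value.
    Raises OverflowError for magnitudes beyond the FP16 range."""
    return struct.unpack('e', struct.pack('e', x))[0]

def lu_factor_lowprec(A, rnd):
    """LU factorization with partial pivoting; every stored value is
    rounded with rnd to emulate a low-precision factorization."""
    n = len(A)
    LU = [[rnd(v) for v in row] for row in A]
    piv = list(range(n))
    for k in range(n):
        # Pick the largest pivot in column k and swap rows.
        p = max(range(k, n), key=lambda i: abs(LU[i][k]))
        LU[k], LU[p] = LU[p], LU[k]
        piv[k], piv[p] = piv[p], piv[k]
        for i in range(k + 1, n):
            LU[i][k] = rnd(LU[i][k] / LU[k][k])
            for j in range(k + 1, n):
                LU[i][j] = rnd(LU[i][j] - rnd(LU[i][k] * LU[k][j]))
    return LU, piv

def lu_solve(LU, piv, b):
    """Solve with the (inexact) factors: forward then backward substitution."""
    n = len(b)
    y = [b[p] for p in piv]
    for i in range(n):
        for j in range(i):
            y[i] -= LU[i][j] * y[j]
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            y[i] -= LU[i][j] * y[j]
        y[i] /= LU[i][i]
    return y

def refine(A, b, tol=1e-12, max_iter=50):
    """Iterative refinement: low-precision factors, FP64 residuals and updates."""
    n = len(b)
    LU, piv = lu_factor_lowprec(A, fp16)   # factorize once, in emulated FP16
    x = lu_solve(LU, piv, b)               # initial, low-accuracy solution
    for it in range(max_iter):
        # Residual r = b - A x, computed in FP64 with the original matrix.
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        if max(abs(v) for v in r) <= tol * max(abs(v) for v in b):
            return x, it
        d = lu_solve(LU, piv, r)           # correction from the cheap factors
        x = [x[i] + d[i] for i in range(n)]
    return x, max_iter

# Small illustrative system (values chosen arbitrarily).
A = [[4.0, 1.0, 2.0], [1.0, 5.0, 1.0], [2.0, 1.0, 6.0]]
b = [1.0, 2.0, 3.0]
x, iters = refine(A, b)
print("solution:", x, "refinement steps:", iters)
```

The factorization, which carries the O(n^3) cost, is done once in low precision; each refinement step reuses the factors and costs only O(n^2), which is why a fast low-precision factorization can pay off when the matrix is not too ill-conditioned.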