(RP10) Deep Learning Hardware Accelerates Fused Discontinuous Galerkin Simulations
AI/Machine Learning/Deep Learning
Math Library Design
Scientific Software Development
Time: Tuesday, June 26th, 8:30am - 10am
Description: In recent years the compute/memory balance of processors has been shifting steadily towards compute. The rise of Deep Learning, which is built on matrix multiplication, has accelerated this trend, especially for single-precision (FP32) and lower-precision compute. An important research question is whether this development can be leveraged for traditional HPC. In this work we demonstrate that a high-order discontinuous Galerkin solver for the seismic wave equations can execute in single precision without any loss of accuracy in application scenarios. Additionally, we extended its kernels to support the Intel Knights Mill CPU, which offers 14 TFLOPS of FP32 deep-learning performance. This performance is achieved by chaining FMA units and implementing an M=16, N=1, K=4 matrix multiplication in hardware. This work has two main contributions: a) we executed an extensive study employing several wave propagation benchmarks of the seismic community, presenting misfit norms which show that the solver can be executed in FP32 while producing negligible differences compared to double precision (FP64); b) we sketch a runtime code generator that leverages deep-learning-oriented instructions (targeting large dense matrix multiplication) to run small sparse-times-dense matrix multiplications for several right-hand sides, called fused simulations. One Knights Mill chip obtains the same application performance as the latest top-bin dual-socket Intel Xeon Platinum HPC server while consuming significantly less power. Compared to the HPC-focused Knights Landing processor, speed-ups of up to 1.6x are possible depending on the scenario.
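The misfit comparison in contribution a) can be illustrated with a minimal sketch. The trace data and the relative-L2 choice of norm here are assumptions for illustration; the actual study uses the seismic community's benchmark misfit criteria, not this toy signal.

```python
import numpy as np

# Hypothetical receiver seismogram: an FP64 reference run compared against
# an FP32 run of the same scenario (synthesized here for illustration).
t = np.linspace(0.0, 1.0, 1000)
ref_fp64 = np.sin(40.0 * t) * np.exp(-3.0 * t)   # FP64 reference trace
run_fp32 = ref_fp64.astype(np.float32)           # FP32 solver output (assumed)

def relative_l2_misfit(u, u_ref):
    """Relative L2 misfit norm between a trace and its reference."""
    return np.linalg.norm(u - u_ref) / np.linalg.norm(u_ref)

misfit = relative_l2_misfit(run_fp32.astype(np.float64), ref_fp64)
print(f"relative L2 misfit: {misfit:.2e}")
```

A misfit on the order of FP32 machine epsilon, as this toy trace produces, is what "negligible differences compared to double precision" means quantitatively.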
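The fused-simulation idea in contribution b) can also be sketched. The sizes below (a 9x9 operator, 16 fused right-hand sides matching the M=16 hardware tile) are illustrative assumptions, and NumPy stands in for the generated Knights Mill kernels: the point is only the reshaping of many sparse matrix-vector products into one small sparse-times-dense product that dense-matmul hardware can execute.

```python
import numpy as np

# Assumed sizes: a small sparse DG operator (9x9) and F=16 fused
# simulations, chosen to match the M=16 hardware matmul tile.
M, K, F = 9, 9, 16
rng = np.random.default_rng(0)

A = rng.random((M, K))
A[A < 0.7] = 0.0                      # sparsify: DG operators are mostly zeros
A32 = A.astype(np.float32)

Q = rng.random((K, F)).astype(np.float32)  # FP32 DOFs of all fused runs, column per run

# Unfused baseline: one sparse matrix-vector product per simulation.
unfused = np.stack([A32 @ Q[:, f] for f in range(F)], axis=1)

# Fused: a single small sparse-times-dense product covering all F runs;
# this is the shape that maps onto deep-learning matmul instructions.
fused = A32 @ Q

assert np.allclose(unfused, fused, atol=1e-5)
```

Fusing trades one memory-bound matvec per simulation for a single compute-friendly matmul, which is why the deep-learning-oriented FP32 units pay off here.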