Birds of a Feather
Batched BLAS Standardization
Event Type
Birds of a Feather
HPC Accelerators
Math Library Design
Parallel Algorithms
Parallel Applications
Scientific Software Development
Time
Tuesday, June 26th, 1:45pm - 2:45pm
Description
In recent years, the state-of-the-art approaches for addressing large-scale problems have been undergoing a tremendous change. It is becoming increasingly common in many scientific fields to decompose large-scale simulation problems into very small linear algebra operations that can be computed in parallel. Representative applications that exhibit this computing pattern include tensor contraction codes for the quantum Hall effect, astrophysics calculations, CFD and the resulting PDE solvers, quantum chemistry calculations, image analysis, and signal processing. Taken individually, these problems are too small to use modern HPC systems and the associated optimized libraries at full efficiency. Nevertheless, the fact that thousands of these problems have to be solved independently suggests it is worth designing new linear algebra libraries. Consequently, batched BLAS algorithms have been introduced to solve thousands of small BLAS operations with only one function call.

The computational science community and the developers of optimized linear algebra libraries are actively working on implementations that fulfill the need for optimized batched BLAS-like kernels. However, the batched BLAS interfaces currently provided by Intel MKL, NVIDIA cuBLAS, MAGMA, and other libraries differ significantly from each other, which results in a serious portability issue. Both the calling interface and the data layout for storing a batch of small matrices that give good performance vary depending on the architecture. To propose an objective standard without a severe performance penalty for any architecture, a first attempt was made by analyzing the benefits and drawbacks of existing batched BLAS interfaces. This BOF continued these efforts.
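
As a rough illustration of the idea described above, the following pure-Python sketch computes C[s] = alpha * A[s] * B[s] + beta * C[s] for a whole batch of small matrices in a single call. The function names `gemm_batch_strided` and `gemm_batch_pointers`, their argument order, and the row-major storage are assumptions made for this example only; they do not reproduce the interface of Intel MKL, NVIDIA cuBLAS, MAGMA, or any proposed standard. The two entry points mimic two possible layouts: one contiguous buffer with a fixed stride between matrices, and a collection of separately stored matrices.

```python
def gemm_batch_strided(alpha, a, b, beta, c, m, n, k, batch_count):
    """Hypothetical strided interface: all matrices of the batch are stored
    back to back in flat row-major lists (A is m x k, B is k x n, C is m x n)."""
    for s in range(batch_count):
        # Offsets of the s-th matrix inside each flat buffer.
        a0, b0, c0 = s * m * k, s * k * n, s * m * n
        for i in range(m):
            for j in range(n):
                acc = 0.0
                for p in range(k):
                    acc += a[a0 + i * k + p] * b[b0 + p * n + j]
                c[c0 + i * n + j] = alpha * acc + beta * c[c0 + i * n + j]
    return c


def gemm_batch_pointers(alpha, a_list, b_list, beta, c_list, m, n, k):
    """Hypothetical pointer-array interface: each matrix of the batch is a
    separate flat row-major list, analogous to an array of pointers."""
    for a_s, b_s, c_s in zip(a_list, b_list, c_list):
        # Each entry is treated as a batch of size one.
        gemm_batch_strided(alpha, a_s, b_s, beta, c_s, m, n, k, 1)
    return c_list


# A batch of two 2 x 2 problems, solved with one call per layout.
a_flat = [1.0, 2.0, 3.0, 4.0,   0.0, 1.0, 1.0, 0.0]
b_flat = [1.0, 0.0, 0.0, 1.0,   5.0, 6.0, 7.0, 8.0]
c_flat = gemm_batch_strided(1.0, a_flat, b_flat, 0.0, [0.0] * 8, 2, 2, 2, 2)

c_ptrs = gemm_batch_pointers(
    1.0,
    [a_flat[:4], a_flat[4:]],
    [b_flat[:4], b_flat[4:]],
    0.0,
    [[0.0] * 4, [0.0] * 4],
    2, 2, 2,
)
```

Both calls compute the same products, but a caller written against one layout cannot be moved to the other without repacking its data, which is a small-scale version of the portability issue raised in the description. An optimized library would process the batch entries in parallel rather than in a sequential loop.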