(RP14) Automatic Generation of Full-Set Batched BLAS
Math Library Design
Scientific Software Development
Time: Tuesday, June 26th, 8:30am - 10am
Description: Batched Basic Linear Algebra Subprograms (batched BLAS) is a new BLAS interface that computes multiple independent BLAS operations in a single subroutine call. On many-core processors, a single small problem may not use the computing power of all the cores; batched BLAS addresses this by processing many such problems together. Some high-demand batched BLAS routines have been developed for CPUs/Xeon Phi and GPUs, but a full set of BLAS routines (covering levels 1, 2, and 3) has not yet been provided. This study presents the first implementation of a full-set, level 1-2-3, variable-size batched (vbatched) BLAS interface. To build it, we devised an efficient method that produces a full set of batched BLAS routines by automatic code generation on top of an existing standard BLAS implementation. In our method, batched BLAS source files are generated by an automatic code generator, implemented in Python, from three kinds of input files: routine definitions, cost definitions, and scheduling templates. Our current implementation was generated from Intel MKL’s standard BLAS implementation and supports the Intel MKL-style variable-size batched interface. Our evaluation on a Xeon Phi 7210 processor showed that the auto-generated batched BLAS routines achieved performance competitive with standard BLAS. These results suggest that automatic generation is an effective way to develop batched BLAS routines for future architectures.
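As a rough illustration of the generation idea described above, the following Python sketch fills a scheduling template from a routine definition and emits C source for a variable-size batched wrapper around a standard BLAS call. It is a minimal sketch under stated assumptions, not the authors' generator: the dictionary format, the template text, the helper names `batched_type` and `generate`, and the emitted function name `daxpy_vbatch` are all hypothetical, and the cost definition is left out. The emitted signature is also not Intel MKL's batched interface; only the wrapped `cblas_daxpy` call follows the standard CBLAS argument order.

```python
from string import Template

# Hypothetical routine definition (format assumed for illustration):
# the name of a standard BLAS routine and its C argument list.
ROUTINE = {
    "name": "daxpy",
    "args": [
        ("int", "n"), ("double", "alpha"), ("const double *", "x"),
        ("int", "incx"), ("double *", "y"), ("int", "incy"),
    ],
}

# Hypothetical scheduling template: one standard BLAS call per batch entry,
# distributed over threads with an OpenMP dynamic schedule.
SCHEDULING_TEMPLATE = Template("""\
void ${name}_vbatch(int batch_count, ${params})
{
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < batch_count; ++i) {
        cblas_${name}(${call_args});
    }
}
""")


def batched_type(ctype):
    """Turn a per-call C type into a per-batch array type."""
    if ctype.endswith("*"):
        # Pointer arguments become arrays of pointers.
        return ctype + "*"
    # Scalar arguments become read-only arrays, one value per batch entry.
    return "const " + ctype + " *"


def generate(routine):
    """Fill the scheduling template from a routine definition."""
    params = ", ".join(batched_type(t) + a for t, a in routine["args"])
    call_args = ", ".join(a + "[i]" for _, a in routine["args"])
    return SCHEDULING_TEMPLATE.substitute(
        name=routine["name"], params=params, call_args=call_args
    )


source = generate(ROUTINE)
print(source)
```

Because every argument is indexed by `i`, each batch entry can have its own vector length, which is what makes the interface variable-size. A cost definition (for example, a per-entry operation count) could be used by a different template to balance work across threads instead of relying on dynamic scheduling.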