NVIDIA Unveils Grouped GEMM APIs in cuBLAS 12.5 to Boost DL and HPC Performance

The latest release of the NVIDIA cuBLAS library, version 12.5, adds functionality and performance improvements for deep learning (DL) and high-performance computing (HPC) workloads, according to the NVIDIA Technical Blog. Key updates include new Grouped GEMM APIs, faster matrix multiplication (matmul) on NVIDIA Hopper (H100 and H200) and Ada (L40S) GPUs, and enhanced performance tuning options.

Grouped GEMM APIs

The newly introduced Grouped GEMM APIs generalize batched APIs by allowing different matrix sizes, transpositions, and scaling factors to be grouped and executed in one kernel launch. This approach has shown a 1.2x speedup in certain scenarios, such as the generation phase of a mixture-of-experts (MoE) model with batch sizes of 8 and 64 and FP16 inputs and outputs.

Two new sets of APIs support Grouped GEMM:

cublas<t>gemmGroupedBatched for FP32 (including TF32) and FP64 precisions, where <t> is S or D for single and double precision, respectively.
cublasGemmGroupedBatchedEx for FP16, BF16, FP32 (including TF32), and FP64 precisions.

Both accept per-group shapes, transpositions, and scaling factors. Examples are available in the NVIDIA/CUDALibrarySamples GitHub repository.
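To make the grouping idea concrete, here is a minimal CPU reference sketch in plain Python. It is not the cuBLAS API: the function names and the dictionary layout are hypothetical, and it assumes that every problem inside a group shares the same shape, transposition flags, and scaling factors while different groups are free to differ.

```python
# CPU reference for grouped GEMM semantics (illustrative; not the cuBLAS API).
# Matrices are row-major lists of rows; every name here is hypothetical.

def transpose(mat):
    """Return the transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*mat)]

def gemm(alpha, a, b, beta, c, trans_a=False, trans_b=False):
    """Return alpha * op(a) @ op(b) + beta * c for a single problem."""
    a = transpose(a) if trans_a else a
    b = transpose(b) if trans_b else b
    m, k, n = len(a), len(b), len(b[0])
    assert len(a[0]) == k, "inner dimensions must agree"
    return [
        [alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j]
         for j in range(n)]
        for i in range(m)
    ]

def grouped_gemm(groups):
    """Run every problem of every group and collect the results in order.

    Each group fixes the transposition flags and scaling factors; its
    "problems" list holds the (a, b, c) triples that share those settings.
    """
    results = []
    for g in groups:
        for a, b, c in g["problems"]:
            results.append(gemm(g["alpha"], a, b, g["beta"], c,
                                g["trans_a"], g["trans_b"]))
    return results

groups = [
    {   # group 0: two 2x2 problems, no transposition, alpha=1, beta=0
        "trans_a": False, "trans_b": False, "alpha": 1.0, "beta": 0.0,
        "problems": [
            ([[1, 2], [3, 4]], [[1, 0], [0, 1]], [[0, 0], [0, 0]]),
            ([[1, 1], [1, 1]], [[2, 0], [0, 2]], [[0, 0], [0, 0]]),
        ],
    },
    {   # group 1: one problem with A stored transposed, alpha=2, beta=1
        "trans_a": True, "trans_b": False, "alpha": 2.0, "beta": 1.0,
        "problems": [
            ([[1], [2], [3]], [[1], [1], [1]], [[10]]),
        ],
    },
]

outputs = grouped_gemm(groups)
```

A plain batched GEMM corresponds to the special case of a single group. On the GPU, the point of the grouped call is that all of these problems are submitted together rather than through one library call per shape.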

Latest LLM Matmul Performance on NVIDIA H100, H200, and L40S GPUs

Recent performance snapshots show significant speedups for the Llama 2 70B and GPT3 training phases on NVIDIA H100, H200, and L40S GPUs. The H200 in particular delivers nearly 3x and 5x speedups over the A100 for the Llama 2 70B and GPT3 training phases, respectively. These figures cover only the GEMM fraction of each end-to-end workload, were measured without locking GPU clocks, and account for the number of times each GEMM is repeated in the workload.

*Figure 1. Speedup of the GEMM-only fraction of e2e workloads*
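The article does not give the exact formula behind these numbers. One plausible way to account for GEMM repetition is to weight each GEMM's time by how often it runs and compare the totals, as in the sketch below. The function name and all timings are hypothetical and chosen only to illustrate the arithmetic; they are not measurements.

```python
# Repetition-weighted GEMM speedup (illustrative numbers, not measurements).

def weighted_speedup(gemms):
    """Each entry is (repeat_count, baseline_time, new_time) in the same unit."""
    baseline_total = sum(count * base for count, base, _ in gemms)
    new_total = sum(count * new for count, _, new in gemms)
    return baseline_total / new_total

# Hypothetical workload with three distinct GEMM shapes, times in microseconds.
workload = [
    (80, 4000, 1000),  # runs 80 times, 4x faster per call
    (80, 2000, 1000),  # runs 80 times, 2x faster per call
    (1, 9000, 9000),   # runs once, no change
]

speedup = weighted_speedup(workload)
```

With these made-up inputs the aggregate lands between the per-call speedups, and the GEMM that runs only once barely moves the result, which is why the repeat counts matter.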

Library Performance and Benchmarking

Several enhancements have been made to runtime performance heuristics and performance tuning APIs. The cuBLAS library uses a recommender system at runtime to dispatch the fastest available configuration for any user-requested matmuls. This system is trained on actual timing data from a wide range of problems and configurations.

*Figure 2. Sampling of various GEMMs using multiple configurations in different kernel families*

For advanced users, the cublasLtMatmulAlgoGetHeuristic API returns candidate algorithm configurations that can be timed individually to find a faster implementation than the default choice. Examples of auto-tuning in cuBLAS can be found in the NVIDIA/CUDALibrarySamples repository.

*Figure 4. An example of auto-tuning in cuBLAS*
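The general shape of an auto-tuning loop can be sketched in a few lines of Python. This is a generic pattern, not the NVIDIA sample: `autotune`, `run`, and `clock` are hypothetical names, and the toy workload below merely stands in for a matmul. In a real cuBLASLt program the candidates would come from cublasLtMatmulAlgoGetHeuristic, each trial would call cublasLtMatmul, and timing would need GPU synchronization.

```python
import time

def autotune(candidates, run, repeats=5, clock=time.perf_counter):
    """Time each candidate and return (best_candidate, best_seconds_per_run).

    candidates: iterable of opaque configuration objects.
    run: callable that executes the workload once with a given candidate.
    clock: zero-argument callable returning the current time in seconds.
    """
    best, best_time = None, float("inf")
    for cand in candidates:
        run(cand)  # warm-up run, excluded from the measurement
        start = clock()
        for _ in range(repeats):
            run(cand)
        elapsed = (clock() - start) / repeats
        if elapsed < best_time:
            best, best_time = cand, elapsed
    return best, best_time

# Toy stand-in: the "configuration" is just an amount of work to perform.
configs = [200_000, 1_000, 50_000]
best_config, best_seconds = autotune(configs, lambda n: sum(range(n)))
```

The warm-up run and the averaging over several repeats are there to reduce noise from one-time setup costs. Since the benchmarks above were taken without locked GPU clocks, repeated trials matter in practice as well.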

Better Functionality and Performance in cuBLASLt

Since cuBLAS 12.0, numerous enhancements have been introduced:

Fused epilogue support parity between BF16 and FP16 precisions on NVIDIA Ampere and Ada.
Additional fused epilogues on NVIDIA Hopper and Ampere.
Support for FP8 on Ada GPUs and performance updates on Ada L4, L40, and L40S.
Removal of M, N, and batch size limitations of the cuBLASLt matmul API.
Improved performance of the heuristics cache for workloads with a high eviction rate.
cuBLAS symbols are available in the CUDA Toolkit symbols for Linux repository.
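For readers unfamiliar with the fused epilogues mentioned in the list above, the idea is to apply a follow-up operation such as a bias add and ReLU while each output element is produced, instead of in a second pass over the result. The plain-Python sketch below only illustrates that idea. The function names are hypothetical, and applying the bias per output column is a layout choice made for this sketch, not a statement about how cuBLASLt lays out its bias vector.

```python
# Fused vs. unfused matmul + bias + ReLU (illustrative CPU sketch only).

def matmul_bias_relu_unfused(a, b, bias):
    """Two passes: materialize the full product, then apply bias and ReLU."""
    k, n = len(b), len(b[0])
    tmp = [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)]
           for row in a]
    return [[max(0.0, tmp[i][j] + bias[j]) for j in range(n)]
            for i in range(len(a))]

def matmul_bias_relu_fused(a, b, bias):
    """One pass: bias and ReLU are applied as each output element is formed."""
    k, n = len(b), len(b[0])
    return [[max(0.0, sum(row[p] * b[p][j] for p in range(k)) + bias[j])
             for j in range(n)]
            for row in a]

a = [[1.0, -2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.5, -10.0]

fused = matmul_bias_relu_fused(a, b, bias)
unfused = matmul_bias_relu_unfused(a, b, bias)
```

Both functions return the same values. The fused form simply never stores the intermediate product, which on a GPU avoids an extra round trip through memory.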

For more information on cuBLAS, see the documentation and samples.
