NVIDIA NeMo has consistently produced automatic speech recognition (ASR) models that set industry benchmarks, including several that top the Hugging Face Open ASR Leaderboard, according to the NVIDIA Technical Blog. Recent optimizations have accelerated the inference speed of these models by up to 10x.
Enhancements Driving Speed Improvements
To achieve this speed boost, NVIDIA implemented several enhancements, including autocasting tensors to bfloat16, the label-looping algorithm, and CUDA Graphs. These improvements are available in NeMo 2.0.0 and make GPU inference a fast and cost-effective alternative to CPUs.
Overcoming Speed Performance Bottlenecks
Several bottlenecks previously hindered the performance of NeMo ASR models: casting overheads, low compute intensity, and divergence during batched decoding. Full half-precision inference and batch processing optimizations have significantly reduced these bottlenecks.
Casting Overheads
Autocast behavior, parameter handling, and frequent cache clearing were major issues causing casting overheads. By shifting to full half-precision inference, NVIDIA eliminated unnecessary casting without compromising accuracy.
Optimizing Batch Processing
Moving from sequential to fully batched processing for operations like CTC greedy decoding and feature normalization increased throughput by about 10% each, resulting in an overall speedup of approximately 20%.
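For readers unfamiliar with the operation being batched, the following is a minimal sketch of CTC greedy decoding in plain Python. The function names, the blank index of 0, and the toy scores are assumptions made for illustration and are not taken from NeMo. Plain Python lists only show the decoding rule (pick the best token per frame, collapse repeats, drop blanks); the speedup described above comes from expressing the same rule as tensor operations over the whole batch, which this sketch does not reproduce.

```python
BLANK = 0  # assumed blank token index for this toy example


def ctc_greedy_decode(frame_scores, blank=BLANK):
    """Decode one utterance: argmax per frame, collapse repeats, drop blanks."""
    # Best-scoring token index for every frame.
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    decoded = []
    prev = None
    for token in best:
        # Keep a token only if it differs from the previous frame and is not blank.
        if token != prev and token != blank:
            decoded.append(token)
        prev = token
    return decoded


def ctc_greedy_decode_batch(batch_scores, lengths, blank=BLANK):
    """Decode a padded batch; `lengths` gives the number of valid frames per utterance."""
    return [
        ctc_greedy_decode(scores[:n], blank)
        for scores, n in zip(batch_scores, lengths)
    ]


# Toy per-frame scores over a 3-token vocabulary (index 0 is blank).
TOK1 = [0.1, 0.8, 0.1]
TOK2 = [0.1, 0.1, 0.8]
BLNK = [0.9, 0.05, 0.05]

utterance_a = [TOK1, TOK1, BLNK, TOK1, TOK2, TOK2]
utterance_b = [TOK2, BLNK, BLNK, TOK1, BLNK, BLNK]  # last two frames are padding

decoded = ctc_greedy_decode_batch([utterance_a, utterance_b], lengths=[6, 4])
print(decoded)
```

The blank frame between the two runs of token 1 in the first utterance is what lets the decoder emit token 1 twice instead of collapsing it into one.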
Low Compute Intensity
RNN-T and TDT models were previously seen as unsuitable for server-side GPU inference due to their autoregressive prediction and joint networks. The introduction of CUDA Graphs conditional nodes has eliminated kernel launch overhead, significantly improving performance.
Divergence in Prediction Networks
Batched inference for RNN-T and TDT models faced issues due to divergence in vanilla greedy search algorithms. The label-looping algorithm introduced by NVIDIA addresses this by swapping the roles of nested loops, resulting in much faster decoding.
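The loop swap can be illustrated with a toy, single-utterance sketch. Everything here is hypothetical: `toy_joint` stands in for the real prediction and joint networks, frames are plain integers rather than encoder outputs, and the label-looping variant omits the max-symbols guard because the toy joint always returns blank after emitting at a frame. The sketch only shows that the two loop orders produce the same hypothesis. In a batched setting, the claimed advantage of label-looping is that each outer iteration corresponds to one label for every utterance, so the batch can advance in lockstep.

```python
BLANK = 0   # assumed blank label
SOS = -1    # assumed start-of-sequence marker


def toy_joint(frame, last_label):
    """Hypothetical stand-in for prediction + joint networks.

    Emits the frame's value as a label unless the frame is blank or
    repeats the previously emitted label.
    """
    if frame == BLANK or frame == last_label:
        return BLANK
    return frame


def greedy_frame_looping(frames, joint, max_symbols=10):
    """Conventional greedy search: outer loop over frames, inner loop over labels."""
    hyp, last = [], SOS
    for frame in frames:
        for _ in range(max_symbols):
            label = joint(frame, last)
            if label == BLANK:
                break  # blank means: move on to the next frame
            hyp.append(label)
            last = label
    return hyp


def greedy_label_looping(frames, joint):
    """Swapped loops: outer loop over labels, inner loop over frames."""
    hyp, last, t = [], SOS, 0
    while t < len(frames):
        # Inner loop: skip frames until the joint predicts a non-blank label.
        while t < len(frames) and joint(frames[t], last) == BLANK:
            t += 1
        if t < len(frames):
            # Emit exactly one label per outer iteration.
            last = joint(frames[t], last)
            hyp.append(last)
    return hyp


frames = [1, 1, 0, 2, 2, 3]
frame_result = greedy_frame_looping(frames, toy_joint)
label_result = greedy_label_looping(frames, toy_joint)
print(frame_result, label_result)
```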
Performance and Cost Efficiency
The enhancements have brought transducer models’ inverse real-time factor (RTFx) closer to that of CTC models, particularly benefiting smaller models. These improvements have also resulted in substantial cost savings. For instance, using GPUs for RNN-T inference can yield up to 4.5x cost savings compared to CPU-based alternatives.
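As a reminder of what the metric measures, RTFx is the duration of the audio divided by the time taken to transcribe it, so higher is better. The short sketch below computes it; the function name and the sample figures are hypothetical and are not measurements from the article.

```python
def rtfx(audio_seconds, processing_seconds):
    """Inverse real-time factor: seconds of audio transcribed per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# Hypothetical example: one hour of audio transcribed in 1.8 seconds.
example_rtfx = rtfx(audio_seconds=3600, processing_seconds=1.8)
print(f"RTFx = {example_rtfx:.0f}")
```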
As detailed in a comparison by NVIDIA, transcribing 1 million hours of speech using the NVIDIA Parakeet RNN-T 1.1B model on AWS instances showed significant cost advantages. CPU-based transcription costs amounted to $11,410, while GPU-based transcription costs were only $2,499.
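The two totals quoted above are consistent with the "up to 4.5x" figure, as a quick calculation shows. The variable names are illustrative; the dollar amounts and the 1 million hours come from the comparison cited in the article.

```python
AUDIO_HOURS = 1_000_000   # hours of speech transcribed
CPU_COST_USD = 11_410     # CPU-based transcription cost from the article
GPU_COST_USD = 2_499      # GPU-based transcription cost from the article

savings_factor = CPU_COST_USD / GPU_COST_USD
absolute_savings = CPU_COST_USD - GPU_COST_USD

# Cost per 1,000 hours of audio, which is easier to read than cost per hour.
cpu_per_1k_hours = CPU_COST_USD / AUDIO_HOURS * 1_000
gpu_per_1k_hours = GPU_COST_USD / AUDIO_HOURS * 1_000

print(f"Savings factor: {savings_factor:.2f}x")
print(f"Absolute savings: ${absolute_savings:,}")
print(f"CPU: ${cpu_per_1k_hours:.2f} per 1,000 hours")
print(f"GPU: ${gpu_per_1k_hours:.2f} per 1,000 hours")
```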
Future Prospects
NVIDIA continues to optimize models like Canary 1B and Whisper to further reduce the cost of running attention-encoder-decoder and speech LLM-based ASR models. The integration of CUDA Graphs conditional nodes with compiler frameworks like TorchInductor is expected to provide further GPU speedups and efficiency gains.
For more information, visit the official NVIDIA blog.