Joerg Hiller
Sep 06, 2024 07:14

Together AI enhances NVIDIA H200 and H100 GPU clusters with its Together Kernel Collection, offering significant performance improvements in AI training and inference.

Together AI has announced a significant enhancement to its GPU clusters with the integration of the NVIDIA H200 Tensor Core GPU, according to together.ai. This upgrade will be accompanied by the Together Kernel Collection (TKC), a custom-built kernel stack designed to optimize AI operations, providing substantial performance boosts for both training and inference tasks.

Enhanced Performance with TKC

The Together Kernel Collection (TKC) is engineered to significantly accelerate common AI operations. Compared to standard PyTorch implementations, TKC delivers up to a 24% speedup on frequently used training operators and up to a 75% speedup on FP8 inference operations. These gains reduce the GPU hours a workload requires, lowering costs and shortening time to market.
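
As a rough illustration of how such operator-level speedups are typically measured, the generic PyTorch timing harness below uses CUDA events; it is a minimal sketch, not TKC’s own benchmark suite.

```python
import torch

def time_op(fn, *args, warmup=10, iters=100):
    """Average GPU time per call in milliseconds, measured with CUDA events.
    Generic measurement sketch; not TKC's benchmark code."""
    for _ in range(warmup):
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Example: time a bf16 matmul as a stand-in for a "frequently used operator"
x = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
w = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
print(f"baseline matmul: {time_op(torch.matmul, x, w):.3f} ms")
```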

Training and Inference Optimization

TKC’s optimized kernels, such as the multi-layer perceptron (MLP) with SwiGLU activation, are central to training large language models (LLMs) like Llama-3. These kernels are reported to run 22-24% faster than standard implementations and up to 10% faster than the best existing baselines. For inference, Together AI provides a robust stack of FP8 kernels optimized to deliver more than a 75% speedup over base PyTorch implementations.
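
For reference, the sketch below is a minimal, standard PyTorch formulation of the SwiGLU MLP block used in Llama-style models; TKC ships a fused, optimized implementation of this pattern, which is not shown here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUMLP(nn.Module):
    """Llama-style MLP: down_proj(silu(gate_proj(x)) * up_proj(x)).
    Standard unfused formulation, shown only to illustrate the operator
    that kernel collections like TKC fuse and optimize."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

mlp = SwiGLUMLP(dim=4096, hidden_dim=11008)  # Llama-2-7B-like sizes
y = mlp(torch.randn(2, 16, 4096))
print(y.shape)  # torch.Size([2, 16, 4096])
```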

Native PyTorch Compatibility

TKC is fully integrated with PyTorch, enabling AI developers to use its optimizations seamlessly within their existing frameworks. This integration simplifies adoption: switching to TKC is described as being as easy as changing import statements in PyTorch code.
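
The exact package and module names are not given in the announcement, but the described adoption pattern would look roughly like the sketch below; the commented-out import and class names are hypothetical placeholders, not TKC’s documented API.

```python
# Before: a standard PyTorch module
from torch import nn
proj = nn.Linear(4096, 11008, bias=False)

# After: per the announcement, adoption amounts to changing imports.
# "together_kernels" and "FusedLinear" are hypothetical names used purely
# for illustration; TKC's actual module paths are not published here.
# from together_kernels import FusedLinear
# proj = FusedLinear(4096, 11008, bias=False)
```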

Production-Level Testing

Together AI ensures that TKC undergoes rigorous testing to meet production-level standards, guaranteeing high performance and reliability for real-world applications. All Together GPU Clusters, whether H200 or H100, will feature TKC out of the box.

NVIDIA H200: Faster Performance and Larger Memory

The NVIDIA H200 Tensor Core GPU, built on the Hopper architecture, is designed for high-performance AI and HPC workloads. According to NVIDIA, the H200 offers 40% faster inference on Llama 2 13B and 90% faster inference on Llama 2 70B compared to its predecessor, the H100. The H200 features 141GB of HBM3e memory and 4.8TB/s of memory bandwidth: nearly double the capacity of the H100 and 1.4 times its bandwidth.
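
A back-of-envelope check of those ratios against the H100 SXM’s public specs (80GB HBM3, 3.35TB/s), plus a rough weight-footprint estimate showing why the larger memory matters for big-model inference:

```python
# H100 SXM figures are from NVIDIA's public spec sheets; H200 figures as above.
h100_mem_gb, h100_bw_tbs = 80, 3.35
h200_mem_gb, h200_bw_tbs = 141, 4.8

print(f"memory:    {h200_mem_gb / h100_mem_gb:.2f}x")   # ~1.76x, "nearly double"
print(f"bandwidth: {h200_bw_tbs / h100_bw_tbs:.2f}x")   # ~1.43x

# Rough weight-only footprint of a 70B-parameter model (ignores KV cache
# and activations, so real deployments need additional headroom):
params_billions = 70
for precision, bytes_per_param in [("fp16", 2), ("fp8", 1)]:
    print(f"{precision}: ~{params_billions * bytes_per_param} GB of weights")
```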

High-Performance Interconnectivity

Together GPU Clusters leverage the SXM form factor for high bandwidth and fast data transfer, supported by NVIDIA’s NVLink and NVSwitch technologies for ultra-high-speed communication between GPUs. Combined with NVIDIA Quantum-2 3200Gb/s InfiniBand Networking, this setup is ideal for large-scale AI training and HPC workloads.
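
In practice, a multi-node PyTorch job on such a cluster initializes the NCCL backend, which communicates over NVLink/NVSwitch within a node and InfiniBand across nodes. The following is a generic sketch, assuming the rank environment variables are set by a launcher such as torchrun.

```python
import os
import torch
import torch.distributed as dist

# Generic multi-GPU initialization sketch. NCCL picks the fastest available
# transport: NVLink/NVSwitch intra-node, InfiniBand inter-node.
# RANK, WORLD_SIZE, and LOCAL_RANK are assumed to come from the launcher.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# All-reduce a tensor across every GPU in the job
t = torch.ones(1, device="cuda") * dist.get_rank()
dist.all_reduce(t, op=dist.ReduceOp.SUM)
if dist.get_rank() == 0:
    print(f"sum of ranks across {dist.get_world_size()} GPUs: {t.item()}")
dist.destroy_process_group()
```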

Cost-Effective Infrastructure

Together AI offers significant cost savings, with infrastructure designed to be up to 75% more cost-effective compared to cloud providers like AWS. The company also provides flexible commitment options, from one month to five years, ensuring the right resources at every stage of the AI development lifecycle.

Reliability and Support

Together AI’s GPU clusters come with a 99.9% uptime SLA and are backed by rigorous acceptance testing. The company’s White Glove Service offers end-to-end support, from cluster setup to ongoing maintenance, ensuring peak performance for AI models.

Flexible Deployment Options

Together AI provides several deployment options, including Slurm for high-performance workload management, Kubernetes for containerized AI workloads, and bare metal clusters running Ubuntu for direct access and ultimate flexibility. These options cater to different AI project needs, from large-scale training to production-level inference.

Together AI continues to support the entire AI lifecycle with its high-performance NVIDIA H200 GPU Clusters and the Together Kernel Collection. The platform is designed to optimize performance, reduce costs, and ensure reliability, making it an ideal choice for accelerating AI development.

Image source: Shutterstock

