NVIDIA NeMo Curator Enhances Vietnamese Language Data Processing

James Ding
Nov 21, 2024 01:30

NVIDIA NeMo Curator aids in processing high-quality Vietnamese language data, enhancing language model training through efficient data curation techniques.

Open-source large language models (LLMs) are often proficient in English but struggle with other languages, particularly those of Southeast Asia, because training data is scarce. To address this, Viettel Solutions, a subsidiary of Viettel Corporation, has adopted NVIDIA’s NeMo Curator to curate high-quality Vietnamese language data, as reported by NVIDIA.

Challenges with Language Models

LLMs typically excel in English because training data is abundant. Languages like Vietnamese often lack sufficient data, which degrades model performance. NVIDIA’s NeMo Curator addresses this by helping teams build the high-quality datasets needed to train effective language models.

Viettel’s Collaboration with NVIDIA

Viettel Solutions has used NeMo Curator in training its Llama 3 ViettelSolution 8B model, which now ranks among the top models on the VMLU leaderboard. According to Tuan Nguyen, Head of Data Analytics at Viettel Solutions, the tool’s GPU-accelerated features, such as deduplication and filtering, increased model accuracy by 10%, cut training time threefold, and reduced dataset size by 60%.

Data Curation Pipeline

The data curation process consists of downloading datasets from various sources, reformatting Unicode, deduplicating, and applying quality filtering. The sources include the Vietnamese subsets of C4, OSCAR, and Wikipedia, which are combined into a single training dataset. NeMo Curator applies heuristic and classifier-based filtering to remove noise while preserving content diversity.
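To make the order of these steps concrete, here is a minimal sketch in plain Python. It does not use NeMo Curator's API, and it runs on the CPU rather than the GPU. The function names, the SHA-256 exact-match deduplication, and the five-word minimum are all assumptions chosen for illustration, not details reported by NVIDIA or Viettel.

```python
import hashlib
import unicodedata


def normalize_unicode(text: str) -> str:
    """Normalize to NFC so visually identical Vietnamese strings compare equal."""
    return unicodedata.normalize("NFC", text).strip()


def exact_dedup(docs):
    """Keep the first occurrence of each document, keyed by a content hash."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def passes_quality(text: str, min_words: int = 5) -> bool:
    """Hypothetical heuristic: drop very short documents."""
    return len(text.split()) >= min_words


def curate(sources):
    """Merge sources, normalize Unicode, deduplicate, then filter."""
    merged = [doc for source in sources for doc in source]
    normalized = [normalize_unicode(doc) for doc in merged]
    deduped = exact_dedup(normalized)
    return [doc for doc in deduped if passes_quality(doc)]
```

Normalization comes before deduplication on purpose: Vietnamese diacritics can be encoded either as precomposed characters or as base letters plus combining marks, and two copies of the same sentence in different encodings would otherwise hash differently and both survive.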

Advanced Filtering Techniques

Heuristic filtering removes low-quality content using predefined rules, while classifier-based filtering uses a trained model to distinguish high-quality from low-quality data. Combining the two is intended to keep the dataset both broad and high in quality, which is crucial for effective language model training.
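The difference between the two approaches can be illustrated with a small, self-contained sketch. The rules, thresholds, and the toy naive Bayes scorer below are hypothetical stand-ins, not the filters or the model that NeMo Curator or Viettel actually used; a production classifier would be trained on far more data.

```python
import math
from collections import Counter


def heuristic_ok(text, min_words=5, max_symbol_ratio=0.3):
    """Rule-based check: enough words, not dominated by symbols.

    Assumes NFC-normalized input, so accented letters count as letters.
    """
    words = text.split()
    if not text or len(words) < min_words:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio


class TinyQualityClassifier:
    """Toy naive Bayes over word counts; stands in for a real trained model."""

    def fit(self, high_quality, low_quality):
        self.high = Counter(w for doc in high_quality for w in doc.lower().split())
        self.low = Counter(w for doc in low_quality for w in doc.lower().split())
        self.vocab = set(self.high) | set(self.low)
        return self

    def score(self, text):
        """Log-odds that the text is high quality (positive means high)."""
        v = len(self.vocab)
        n_high = sum(self.high.values())
        n_low = sum(self.low.values())
        total = 0.0
        for w in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing a probability.
            p_high = (self.high[w] + 1) / (n_high + v)
            p_low = (self.low[w] + 1) / (n_low + v)
            total += math.log(p_high / p_low)
        return total


def filter_docs(docs, clf, threshold=0.0):
    """Keep documents that pass the rules and score above the threshold."""
    return [d for d in docs if heuristic_ok(d) and clf.score(d) > threshold]
```

The rules are cheap and transparent but only catch what someone thought to write down, whereas the classifier learns from labeled examples and can flag fluent-looking spam that no simple rule would catch.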

Impact on Dataset Quality

The curation process significantly reduces dataset size by removing low-quality and redundant content, with classifier-based filtering alone accounting for a 45% reduction. What remains after filtering is a smaller, cleaner dataset better suited to pretraining language models.
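A percentage reduction can be measured against the input of a single stage or against the original raw dataset, and the report does not say which basis the 45% figure uses. The short sketch below computes both for each stage. The helper names and the document counts in the usage lines are invented for illustration and are not Viettel's figures.

```python
def percent_reduction(before: int, after: int) -> float:
    """Percentage of documents removed going from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


def stage_report(counts):
    """Return (stage, reduction vs. previous stage, reduction vs. raw) tuples.

    `counts` is an ordered list of (stage_name, documents_remaining),
    with the raw dataset first.
    """
    raw = counts[0][1]
    report = []
    for (_, prev), (name, cur) in zip(counts, counts[1:]):
        report.append(
            (name, percent_reduction(prev, cur), percent_reduction(raw, cur))
        )
    return report


# Hypothetical document counts after each stage.
example = [("raw", 1000), ("dedup", 800), ("heuristic", 700), ("classifier", 385)]
for name, vs_previous, vs_raw in stage_report(example):
    print(f"{name}: {vs_previous:.1f}% of stage input, {vs_raw:.1f}% of raw total")
```

With counts like these, a stage that removes 45% of its own input shows up as a much larger cumulative reduction measured from the raw data, which is why the basis matters when comparing figures.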

Conclusion

NVIDIA’s NeMo Curator provides a robust tool for processing high-quality Vietnamese language data, enhancing the performance of language models. By improving data quality and efficiency, it supports Viettel Solutions’ goal of leading in generative AI and developing AI-powered products for the Vietnamese market.

