Iris Coleman
Jan 10, 2025 14:13
NVIDIA debuts Nemotron-CC, a 6.3-trillion-token English dataset, enhancing pretraining for large language models with innovative data curation methods.
NVIDIA has announced the release of Nemotron-CC, a 6.3-trillion-token English language dataset designed to advance the pretraining of large language models (LLMs). The dataset, derived from Common Crawl, aims to raise the accuracy and efficiency of LLMs through innovative data curation techniques, including the use of 1.9 trillion tokens of synthetically generated data, according to NVIDIA.
Enhancing LLM Pretraining
NVIDIA’s initiative addresses a critical need in LLM training, where the quality of pretraining datasets plays a pivotal role. While recent models like Meta’s Llama series have been trained on datasets comprising up to 15 trillion tokens, the exact composition of those datasets remains largely undisclosed. Nemotron-CC seeks to fill this gap by providing the wider community with a high-quality dataset capable of supporting both short and long token horizon training.
Traditional datasets often discard up to 90% of their data to improve benchmark accuracies, which limits their usefulness for long training runs. Nemotron-CC, however, demonstrates how to transform Common Crawl data into a superior dataset, yielding models that surpass even Llama 3.1 8B through advanced methods such as classifier ensembling and synthetic data rephrasing.
Significant Results
Nemotron-CC’s efficacy is evidenced by its performance on various benchmarks. When training 8B parameter models for one trillion tokens, the high-quality subset Nemotron-CC-HQ outperforms leading datasets like DCLM, increasing MMLU scores by 5.6 points. Furthermore, the complete 6.3-trillion-token dataset matches DCLM on MMLU while offering four times more unique real tokens. This enables effective training over long token horizons, with Nemotron-CC-trained models surpassing Llama 3.1 8B in multiple metrics, including a 5-point increase in MMLU and a 3.1-point rise in ARC-Challenge scores.
Innovative Data Curation Techniques
The development of Nemotron-CC involved several key insights. By ensembling different model-based classifiers, NVIDIA was able to select a broader array of high-quality tokens. Additionally, rephrasing techniques reduced noise and errors, yielding diverse and valuable data variants. The decision to disable traditional heuristic filters further boosted the dataset’s quality without compromising accuracy.
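To illustrate why ensembling can select a broader set of documents than any single classifier, here is a minimal Python sketch. It is not NVIDIA's implementation: the two scorers are toy stand-ins for model-based classifiers, the function names are hypothetical, and combining scores by taking the maximum is an assumption chosen for illustration.

```python
from typing import Callable, List

# A scorer maps a document to a quality score in [0, 1].
Scorer = Callable[[str], float]


def length_scorer(text: str) -> float:
    """Toy stand-in for a classifier: longer documents score higher."""
    return min(len(text.split()) / 50.0, 1.0)


def vocabulary_scorer(text: str) -> float:
    """Toy stand-in for a classifier: ratio of distinct words to total words."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0


def ensemble_score(text: str, scorers: List[Scorer]) -> float:
    """Combine scores with max (an assumed rule), so a document is kept
    if at least one classifier rates it highly."""
    return max(scorer(text) for scorer in scorers)


def select_high_quality(
    docs: List[str], scorers: List[Scorer], threshold: float = 0.8
) -> List[str]:
    """Keep documents whose ensemble score meets the threshold."""
    return [d for d in docs if ensemble_score(d, scorers) >= threshold]


if __name__ == "__main__":
    short_varied = "the quick brown fox jumps"
    long_repetitive = " ".join(f"token{i % 30}" for i in range(60))
    spam = "spam spam spam spam"
    docs = [short_varied, long_repetitive, spam]

    # Each single classifier keeps one document; the ensemble keeps both.
    print(len(select_high_quality(docs, [length_scorer])))
    print(len(select_high_quality(docs, [vocabulary_scorer])))
    print(len(select_high_quality(docs, [length_scorer, vocabulary_scorer])))
```

With a max rule, the selected set is the union of what each classifier would keep on its own, which is one simple way an ensemble can recover good documents that a single classifier misses.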
NVIDIA used its NeMo Curator tool to extract and refine data from Common Crawl, applying filters for language, deduplication, and quality classification. This process was complemented by synthetic data generation, which contributed approximately two trillion tokens to the dataset.
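The three curation stages named above can be sketched in plain Python. This does not use or reproduce the NeMo Curator API: every function name here is hypothetical, the language check is a crude stopword heuristic, deduplication is exact-match only, and the quality label is a word-count placeholder for a real classifier.

```python
import hashlib
from typing import Iterable, Iterator, List, Tuple

# Tiny illustrative stopword list; a real pipeline would use a language-ID model.
ENGLISH_STOPWORDS = {"the", "and", "of", "to", "in", "is", "a", "that"}


def looks_english(text: str, min_ratio: float = 0.1) -> bool:
    """Stage 1 (toy): keep documents with enough common English words."""
    words = text.lower().split()
    if not words:
        return False
    hits = sum(1 for w in words if w in ENGLISH_STOPWORDS)
    return hits / len(words) >= min_ratio


def deduplicate(docs: Iterable[str]) -> Iterator[str]:
    """Stage 2 (toy): drop exact duplicates after case and whitespace normalization."""
    seen = set()
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def quality_bucket(text: str) -> str:
    """Stage 3 (toy): placeholder for a model-based quality classifier."""
    return "high" if len(text.split()) >= 8 else "low"


def curate(docs: Iterable[str]) -> List[Tuple[str, str]]:
    """Run language filtering, deduplication, then quality labeling."""
    english = (d for d in docs if looks_english(d))
    unique = deduplicate(english)
    return [(d, quality_bucket(d)) for d in unique]


if __name__ == "__main__":
    sample = [
        "The model is trained on a large corpus of web text.",
        "the model is trained on a large corpus of web text.",
        "Das Modell wird mit Webtexten trainiert.",
        "This is a test.",
    ]
    for doc, label in curate(sample):
        print(label, "|", doc)
```

Running the stages as chained generators keeps memory use flat, which matters at web scale; a production pipeline would also add fuzzy deduplication and distributed execution, both omitted here.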
Future Prospects
Nemotron-CC is positioned as a vital resource for pretraining state-of-the-art LLMs over varying token horizons. NVIDIA plans to expand its offerings by releasing more specialized datasets, including those focused on specific domains like mathematics, to further enhance LLM capabilities.