Coin Digest Daily

NVIDIA Introduces Nemotron-CC: A Massive Dataset for LLM Pretraining

13 January 2025
in Blockchain




Iris Coleman
Jan 10, 2025 14:13

NVIDIA debuts Nemotron-CC, a 6.3-trillion-token English dataset, enhancing pretraining for large language models with innovative data curation methods.





NVIDIA has announced the release of Nemotron-CC, a 6.3-trillion-token English language dataset designed to advance the pretraining of large language models (LLMs). This dataset, derived from Common Crawl, aims to raise the accuracy and efficiency of LLMs through innovative data curation techniques, including the use of 1.9 trillion tokens of synthetically generated data, according to NVIDIA.

Enhancing LLM Pretraining

NVIDIA’s initiative addresses a critical need in LLM training, where the quality of pretraining datasets plays a pivotal role. While recent models like Meta’s Llama series have been trained on datasets comprising up to 15 trillion tokens, the exact composition of these datasets remains largely undisclosed. Nemotron-CC seeks to fill this gap by providing the wider community with a high-quality dataset capable of supporting both short and long token horizon training.

Traditional datasets often discard up to 90% of their data to improve benchmark accuracies, limiting their usefulness for long training runs. Nemotron-CC, however, demonstrates how to transform Common Crawl data into a superior dataset, one whose trained models surpass even Llama 3.1 8B, through advanced methods such as classifier ensembling and synthetic data rephrasing.

Significant Results

Nemotron-CC’s efficacy is evidenced by its performance on various benchmarks. When training 8B parameter models for one trillion tokens, the high-quality subset Nemotron-CC-HQ outperforms leading datasets like DCLM, increasing MMLU scores by 5.6 points. Furthermore, the complete 6.3-trillion-token dataset matches DCLM on MMLU while offering four times more unique real tokens. This enables effective training over long token horizons, with Nemotron-CC-trained models surpassing Llama 3.1 8B on multiple metrics, including a 5-point increase in MMLU and a 3.1-point rise in ARC-Challenge scores.

Innovative Data Curation Techniques

The development of Nemotron-CC involved several key insights. By ensembling different model-based classifiers, NVIDIA was able to select a broader array of high-quality tokens. Additionally, rephrasing techniques reduced noise and errors, yielding diverse and valuable data variants. The decision to disable traditional heuristic filters further boosted the dataset’s quality without compromising accuracy.
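
The idea of classifier ensembling can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the two scoring functions are toy stand-ins for trained model-based classifiers, and the "take the maximum score" combination rule and the 0.8 threshold are hypothetical choices, since the article does not say how the classifiers are combined.

```python
from typing import Callable, List

Scorer = Callable[[str], float]


def length_scorer(text: str) -> float:
    """Toy stand-in for a classifier: longer documents score higher (caps at 50 words)."""
    return min(len(text.split()) / 50.0, 1.0)


def vocabulary_scorer(text: str) -> float:
    """Toy stand-in for a classifier: share of distinct words in the document."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0


def ensemble_score(text: str, scorers: List[Scorer]) -> float:
    """Assumed combination rule: the maximum score across all classifiers,
    so a document passes if any single classifier rates it highly."""
    return max(scorer(text) for scorer in scorers)


def select_high_quality(docs: List[str], scorers: List[Scorer],
                        threshold: float = 0.8) -> List[str]:
    """Keep documents whose ensemble score reaches the (hypothetical) threshold."""
    return [doc for doc in docs if ensemble_score(doc, scorers) >= threshold]


if __name__ == "__main__":
    docs = ["spam spam spam spam", "The quick brown fox", "data " * 60]
    kept = select_high_quality(docs, [length_scorer, vocabulary_scorer])
    print(len(kept), "of", len(docs), "documents kept")
```

Under a max rule, the ensemble keeps every document that any individual classifier would keep, which is one way a combination of classifiers can select a broader set than each classifier alone.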

NVIDIA used its NeMo Curator tool to extract and refine data from Common Crawl, applying filters for language, deduplication, and quality classification. This process was complemented by synthetic data generation, contributing approximately two trillion tokens to the dataset.
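
The three curation stages named above (language filtering, deduplication, quality classification) can be sketched as a toy pipeline in plain Python. This does not use or reproduce the NeMo Curator API. The function names, the ASCII-ratio language heuristic, the exact-match hashing, and the word-count quality rule are all hypothetical placeholders for the real components.

```python
import hashlib
from typing import List, Tuple


def looks_english(text: str, min_ratio: float = 0.8) -> bool:
    """Toy language filter (assumption): most characters are ASCII letters or spaces."""
    if not text:
        return False
    ok = sum(1 for ch in text if ch.isascii() and (ch.isalpha() or ch.isspace()))
    return ok / len(text) >= min_ratio


def deduplicate(docs: List[str]) -> List[str]:
    """Exact deduplication on case- and whitespace-normalized text, keeping first copies."""
    seen = set()
    unique = []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


def quality_bucket(text: str) -> str:
    """Placeholder quality classifier (assumption): five or more words counts as 'high'."""
    return "high" if len(text.split()) >= 5 else "low"


def curate(docs: List[str]) -> List[Tuple[str, str]]:
    """Run the three stages in order and label each surviving document."""
    english = [doc for doc in docs if looks_english(doc)]
    unique = deduplicate(english)
    return [(doc, quality_bucket(doc)) for doc in unique]


if __name__ == "__main__":
    sample = [
        "Common Crawl is a large web corpus.",
        "common crawl  is a large web corpus.",
        "这是一个中文句子。",
        "Too short",
    ]
    for doc, label in curate(sample):
        print(label, "|", doc)
```

A production pipeline would replace each stage with a stronger component, such as a trained language identifier, fuzzy deduplication, and model-based quality classifiers, but the order of operations is the same.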

Future Prospects

Nemotron-CC is positioned as a vital resource for pretraining state-of-the-art LLMs over varying token horizons. NVIDIA plans to expand its offerings by releasing more specialized datasets, including those focused on specific domains like mathematics, to further enhance LLM capabilities.

Image source: Shutterstock


