NVIDIA Nemotron 3: Efficient and Open Intelligence

cs.CL cs.AI cs.LG
NVIDIA: Aaron Blakeman, Aaron Grattafiori, Aarti Basant, Abhibha Gupta, Abhinav Khattar, Adi Renduchintala, Aditya Vavre, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, Aleksandr Shaposhnikov, Alex Kondratenko, Alexander Bukharin, Alexandre Milesi, Ali Taghibakhshi, Alisa Liu, Amelia Barton, Ameya Sunil Mahabaleshwarkar, Amir Klein, Amit Zuker, Amnon Geifman, Amy Shen, Anahita Bhiwandiwalla, Andrew Tao, Anjulie Agrusa, Ankur Verma, Ann Guan, Anubhav Mandarwal, Arham Mehta, Ashwath Aithal, Ashwin Poojary, Asif Ahamed, Asit Mishra, Asma Kuriparambil Thekkumpate, Ayush Dattagupta, Banghua Zhu, Bardiya Sadeghi, Barnaby Simkin, Ben Lanir, Benedikt Schifferer, Besmira Nushi, Bilal Kartal, Bita Darvish Rouhani, Boris Ginsburg, Brandon Norick, Brandon Soubasis, Branislav Kisacanin, Brian Yu, Bryan Catanzaro, Carlo del Mundo, Chantal Hwang, Charles Wang, Cheng-Ping Hsieh, Chenghao Zhang, Chenhan Yu, Chetan Mungekar, Chintan Patel, Chris Alexiuk, Christopher Parisien, Collin Neale, Cyril Meurillon, Damon Mosk-Aoyama, Dan Su, Dane Corneil, Daniel Afrimi, Daniel Lo, Daniel Rohrer, Daniel Serebrenik, Daria Gitman, Daria Levy, Darko Stosic, David Mosallanezhad, Deepak Narayanan, Dhruv Nathawani, Dima Rekesh, Dina Yared, Divyanshu Kakwani, Dong Ahn, Duncan Riach, Dusan Stosic, Edgar Minasyan, Edward Lin, Eileen Long, Eileen Peters Long, Elad Segal, Elena Lantz, Ellie Evans, Elliott Ning, Eric Chung, Eric Harper, Eric Tramel, Erick Galinkin, Erik Pounds, Evan Briones, Evelina Bakhturina, Evgeny Tsykunov, Faisal Ladhak, Fay Wang, Fei Jia, Felipe Soares, Feng Chen, Ferenc Galko, Frank Sun, Frankie Siino, Gal Hubara Agam, Ganesh Ajjanagadde, Gantavya Bhatt, Gargi Prasad, George Armstrong, Gerald Shen, Gorkem Batmaz, Grigor Nalbandyan, Haifeng Qian, Harsh Sharma, Hayley Ross, Helen Ngo, Herbert Hum, Herman Sahota, Hexin Wang, Himanshu Soni, Hiren Upadhyay, Huizi Mao, Huy C Nguyen, Huy Q Nguyen, Iain Cunningham, Ido Galil, Ido Shahaf, Igor Gitman, Ilya Loshchilov, Itamar Schen, Itay Levy, Ivan Moshkov, Izik Golan, Izzy Putterman, Jan Kautz, Jane Polak Scowcroft, Jared Casper, Jatin Mitra, Jeffrey Glick, Jenny Chen, Jesse Oliver, Jian Zhang, Jiaqi Zeng, Jie Lou, Jimmy Zhang, Jinhang Choi, Jining Huang, Joey Conway, Joey Guman, John Kamalu, Johnny Greco, Jonathan Cohen, Joseph Jennings, Joyjit Daw, Julien Veron Vialard, Junkeun Yi, Jupinder Parmar, Kai Xu, Kan Zhu, Kari Briski, Katherine Cheung, Katherine Luna, Keith Wyss, Keshav Santhanam, Kevin Shih, Kezhi Kong, Khushi Bhardwaj, Kirthi Shankar, Krishna C. Puvvada, Krzysztof Pawelec, Kumar Anik, Lawrence McAfee, Laya Sleiman, Leon Derczynski, Li Ding, Lizzie Wei, Lucas Liebenwein, Luis Vega, Maanu Grover, Maarten Van Segbroeck, Maer Rodrigues de Melo, Mahdi Nazemi, Makesh Narsimhan Sreedhar, Manoj Kilaru, Maor Ashkenazi, Marc Romeijn, Marcin Chochowski, Mark Cai, Markus Kliegl, Maryam Moosaei, Matt Kulka, Matvei Novikov, Mehrzad Samadi, Melissa Corpuz, Mengru Wang, Meredith Price, Michael Andersch, Michael Boone, Michael Evans, Miguel Martinez, Mikail Khona, Mike Chrzanowski, Minseok Lee, Mohammad Dabbah, Mohammad Shoeybi, Mostofa Patwary, Nabin Mulepati, Najeeb Nabwani, Natalie Hereth, Nave Assaf, Negar Habibi, Neta Zmora, Netanel Haber, Nicola Sessions, Nidhi Bhatia, Nikhil Jukar, Nikki Pope, Nikolai Ludwig, Nima Tajbakhsh, Nir Ailon, Nirmal Juluru, Nishant Sharma, Oleksii Hrinchuk, Oleksii Kuchaiev, Olivier Delalleau, Oluwatobi Olabiyi, Omer Ullman Argov, Omri Puny, Oren Tropp, Ouye Xie, Parth Chadha, Pasha Shamis, Paul Gibbons, Pavlo Molchanov, Pawel Morkisz, Peter Dykas, Peter Jin, Pinky Xu, Piotr Januszewski, Pranav Prashant Thombre, Prasoon Varshney, Pritam Gundecha, Przemek Tredak, Qing Miao, Qiyu Wan, Rabeeh Karimi Mahabadi, Rachit Garg, Ran El-Yaniv, Ran Zilberstein, Rasoul Shafipour, Rich Harang, Rick Izzo, Rima Shahbazyan, Rishabh Garg, Ritika Borkar, Ritu Gala, Riyad Islam, Robert Hesse, Roger Waleffe, Rohit Watve, Roi Koren, Ruoxi Zhang, Russell Hewett, Russell J. Hewett, Ryan Prenger, Ryan Timbrook, Sadegh Mahdavi, Sahil Modi, Samuel Kriman, Sangkug Lim, Sanjay Kariyappa, Sanjeev Satheesh, Saori Kaji, Satish Pasumarthi, Saurav Muralidharan, Sean Narentharen, Sean Narenthiran, Seonmyeong Bak, Sergey Kashirsky, Seth Poulos, Shahar Mor, Shanmugam Ramasamy, Shantanu Acharya, Shaona Ghosh, Sharath Turuvekere Sreenivas, Shelby Thomas, Shiqing Fan, Shreya Gopal, Shrimai Prabhumoye, Shubham Pachori, Shubham Toshniwal, Shuoyang Ding, Siddharth Singh, Simeng Sun, Smita Ithape, Somshubra Majumdar, Soumye Singhal, Stas Sergienko, Stefania Alborghetti, Stephen Ge, Sugam Dipak Devare, Sumeet Kumar Barua, Suseella Panguluri, Suyog Gupta, Sweta Priyadarshi, Syeda Nahida Akter, Tan Bui, Teodor-Dumitru Ene, Terry Kong, Thanh Do, Tijmen Blankevoort, Tim Moon, Tom Balough, Tomer Asida, Tomer Bar Natan, Tomer Ronen, Tugrul Konuk, Twinkle Vashishth, Udi Karpas, Ushnish De, Vahid Noorozi, Vahid Noroozi, Venkat Srinivasan, Venmugil Elango, Victor Cui, Vijay Korthikanti, Vinay Rao, Vitaly Kurin, Vitaly Lavrukhin, Vladimir Anisimov, Wanli Jiang, Wasi Uddin Ahmad, Wei Du, Wei Ping, Wenfei Zhou, Will Jennings, William Zhang, Wojciech Prazuch, Xiaowei Ren, Yashaswi Karnati, Yejin Choi, Yev Meyer, Yi-Fu Wu, Yian Zhang, Yigong Qin, Ying Lin, Yonatan Geifman, Yonggan Fu, Yoshi Subara, Yoshi Suhara, Yubo Gao, Zach Moshe, Zhen Dong, Zhongbo Zhu, Zihan Liu, Zijia Chen, Zijie Yan · Dec 24, 2025
AI Review
Plain-language introduction

NVIDIA introduces Nemotron 3, a family of open language models (Nano, Super, Ultra) built on a hybrid Mamba-Transformer MoE architecture. The core innovation is using selective attention layers combined with Mamba-2 state space layers to achieve high throughput while maintaining accuracy. Key technical contributions include LatentMoE (dimensionality-reduced expert routing), NVFP4 training for efficiency, and multi-environment RL post-training. The paper positions these models as optimized for agentic AI with up to 1M token contexts and granular inference-time reasoning budget control.
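
The throughput argument rests on how memory grows with context length. The back-of-the-envelope sketch below contrasts an attention key/value cache, which grows with every token, against a fixed-size state space layer state. Every layer count and dimension here is a hypothetical placeholder, not a figure from the paper.

```python
def kv_cache_bytes(seq_len, n_attn_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Attention cache size: keys and values are stored for every past token."""
    return 2 * seq_len * n_attn_layers * n_kv_heads * head_dim * bytes_per_elem


def ssm_state_bytes(n_ssm_layers, state_elems_per_layer, bytes_per_elem=2):
    """State space layers carry a fixed-size recurrent state, whatever the sequence length."""
    return n_ssm_layers * state_elems_per_layer * bytes_per_elem


# Hypothetical configuration, chosen only to illustrate the scaling behaviour.
N_KV_HEADS, HEAD_DIM, STATE_ELEMS = 8, 128, 1_000_000

for seq_len in (8_192, 131_072, 1_000_000):
    # 48 attention layers versus a hybrid with 6 attention and 42 state space layers.
    full_attention = kv_cache_bytes(seq_len, 48, N_KV_HEADS, HEAD_DIM)
    hybrid = kv_cache_bytes(seq_len, 6, N_KV_HEADS, HEAD_DIM) + ssm_state_bytes(42, STATE_ELEMS)
    print(f"{seq_len:>9} tokens: full attention {full_attention / 2**30:7.2f} GiB, "
          f"hybrid {hybrid / 2**30:7.2f} GiB")
```

Under these assumptions the attention term scales linearly with sequence length while the state space term stays constant, which is why swapping attention layers for Mamba-2 layers matters most at long contexts.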

Critical review
Verdict
Bottom line

This white paper announces a promising model family with genuine architectural innovations, particularly LatentMoE and the hybrid Mamba-Transformer design. However, it reads more as a product announcement than a rigorous research evaluation. Key benchmark details are deferred to external technical reports, head-to-head comparisons are often missing or uneven, and only the smallest model (Nano) is actually released with this paper while larger variants are promised for future release. The "open and transparent" commitment is commendable but not yet fully realized.

“Nemotron 3 Nano is released along with this white paper. Super and Ultra releases will follow in the upcoming months.”
paper · Section 3
“For details, please see the Nemotron Nano 3 technical report.”
paper · Figure 2 caption
What holds up

The hybrid Mamba-Transformer MoE architecture is well-motivated for throughput-limited reasoning scenarios, replacing expensive attention layers with constant-memory Mamba-2 layers. The LatentMoE design (projecting to a latent dimension $\ell < d$ and scaling the number of experts by $d/\ell$) offers a principled, hardware-aware way to improve accuracy per byte. The NVFP4 training methodology shows thoughtful engineering, keeping sensitive layers (QKV, Mamba outputs, final 15% of the network) in higher precision to maintain stability. The multi-environment RL approach, which trains on math, code, tool use, and long-context tasks simultaneously rather than in stages, is argued to improve stability and reduce reward hacking.

“By shifting routed expert computation and all-to-all traffic to the latent space, both per-expert weight loads and communication payloads are reduced by a factor of $d/\ell$”
paper · Section 2.2
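
The $d/\ell$ factor in this quote can be checked with simple arithmetic. The sketch below assumes each routed expert is a plain two-matrix feed-forward block whose intermediate width is unchanged by the latent projection. That layout and all dimensions are illustrative assumptions, not the paper's configuration.

```python
def expert_weight_elems(width, ffn_dim):
    """Assumed expert layout: up-projection (width x ffn_dim) plus down-projection (ffn_dim x width)."""
    return 2 * width * ffn_dim


# Hypothetical sizes: model width d, latent width l < d, expert intermediate width.
d, latent_dim, ffn_dim = 4096, 1024, 2048
n_experts = 64

standard_per_expert = expert_weight_elems(d, ffn_dim)
latent_per_expert = expert_weight_elems(latent_dim, ffn_dim)

# Per-expert weights (and the per-token payload sent to an expert) shrink by d / l.
reduction = standard_per_expert // latent_per_expert

# Scaling the expert count by d / l spends the savings on more experts.
n_latent_experts = n_experts * d // latent_dim

print("per-expert reduction factor:", reduction)
print("experts after scaling:", n_latent_experts)
print("total routed weights unchanged:",
      n_latent_experts * latent_per_expert == n_experts * standard_per_expert)
```

With these placeholder sizes, the routed-expert weight budget stays fixed while the number of experts grows, which is the trade the review describes as improving accuracy per byte.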
“We kept the last 15% of the network in high precision to maintain stability”
paper · Section 2.4
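
As a rough illustration of the "last 15%" rule, the sketch below assigns a precision label to each layer by depth. The layer count, the rounding rule, and the label names are assumptions. The per-sublayer exceptions the review mentions (QKV, Mamba outputs) are not modeled.

```python
def precision_plan(n_layers, tail_percent=15):
    """Label each layer 'nvfp4' except the final tail_percent of layers, labeled 'high'."""
    # Integer ceiling of n_layers * tail_percent / 100 (rounding up is an assumption).
    n_tail = (n_layers * tail_percent + 99) // 100
    first_tail_layer = n_layers - n_tail
    return ["high" if i >= first_tail_layer else "nvfp4" for i in range(n_layers)]


plan = precision_plan(40)  # 40 layers is a hypothetical depth
print(plan.count("nvfp4"), "layers in NVFP4,", plan.count("high"), "layers in higher precision")
```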
Main concerns

The paper makes strong "best-in-class" claims but provides minimal direct comparison data within the white paper itself, repeatedly deferring to external technical reports. Table 3's comparison between Nemotron 2 Nano (12B dense hybrid) and Nemotron 3 Nano (30B MoE hybrid) is misleading: the 2.5× difference in total parameters confounds any architectural conclusions. The LatentMoE ablation (Table 1) uses only models with 8B active parameters trained to 1T tokens, far smaller than production scale, which raises questions about generalization to Ultra-scale training. Claims that multi-environment RL is superior to staged training cite prior work but provide no internal ablation. Figure 8's "accuracy-efficiency trade-off" curves lack numerical values, so they cannot be reproduced or checked. Most critically, the paper describes Super and Ultra extensively but admits they are not yet released, making many claims unverifiable.

“Nemotron-Nano-12B-v2-Base (Dense Hybrid) and Nemotron-3-Nano-30B-A3B-Base (MoE hybrid)”
paper · Table 3
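
The size mismatch between these two models can be made explicit. The sketch below takes the parameter counts from the model names and reads the "A3B" suffix as roughly 3B active parameters per token. That reading is an assumption based on the naming convention and is not stated in this review.

```python
# Sizes in billions of parameters, read off the model names quoted above.
models = {
    "Nemotron-Nano-12B-v2-Base": {"total_b": 12, "active_b": 12},   # dense: all parameters active
    "Nemotron-3-Nano-30B-A3B-Base": {"total_b": 30, "active_b": 3}, # "A3B" assumed to mean 3B active
}

old = models["Nemotron-Nano-12B-v2-Base"]
new = models["Nemotron-3-Nano-30B-A3B-Base"]

total_ratio = new["total_b"] / old["total_b"]
active_ratio = new["active_b"] / old["active_b"]

print(f"total parameters:  {total_ratio:.2f}x")
print(f"active parameters: {active_ratio:.2f}x")
```

If that reading holds, the two models differ in opposite directions on total and active parameters, so neither count gives a like-for-like baseline for attributing Table 3's differences to architecture.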
“We find such simultaneous training is more stable, less prone to reward hacking and overall better compared to previous staged approaches”
paper · Section 2.6
Evidence and comparison

Evidence quality is uneven. MTP improvements (~2.4% average) are supported by Table 2 with specific benchmarks (MMLU, MBPP, GSM8K), though the claimed 97% speculative decoding acceptance rate lacks supporting data. NVFP4's <1% loss gap vs BF16 is shown in Figure 4 for Nano and an 8B MoE model, but this needs validation at Super/Ultra scale given the 25T token claim. The RULER long-context evaluation (Table 3) compares a 12B model against a 30B model, which is not a like-for-like comparison. The paper lacks comprehensive comparisons to contemporaries like Qwen3, DeepSeek-V3, or Llama 3.1/4. Only throughput vs Qwen3-30B-A3B is shown (Figure 2), with a 3.3× claim, and no accuracy comparison accompanies it. The "state-of-the-art" claims for Ultra remain unsubstantiated in this document.

“MTP improves performance by roughly 2.4% on average across benchmarks”
paper · Section 2.3
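
To see why an acceptance rate matters for generation speed, the sketch below uses the standard speculative decoding estimate of tokens produced per verification step. It assumes each drafted token is accepted independently with the same probability. The draft length of 2 is a hypothetical choice, and the 0.97 input is the claimed rate, not a verified measurement.

```python
def expected_tokens_per_step(acceptance_rate, n_draft_tokens):
    """Expected tokens emitted per verification step, assuming i.i.d. acceptance.

    Equals 1 + a + a^2 + ... + a^k for acceptance rate a and k drafted tokens:
    the verifier always contributes one token, and the i-th drafted token
    counts only if all drafted tokens up to and including it were accepted.
    """
    return sum(acceptance_rate ** i for i in range(n_draft_tokens + 1))


for rate in (0.80, 0.90, 0.97):
    print(f"acceptance {rate:.2f}: {expected_tokens_per_step(rate, 2):.4f} tokens per step")
```

Under this simplified model the payoff is sensitive to the acceptance rate, which is why the review asks for data behind the 97% figure.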
“On Nano, we achieve a < 1% relative difference in loss between NVFP4 vs BF16”
paper · Section 2.4
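
A "< 1% relative difference in loss" can be read as the gap between the two losses divided by the BF16 loss. The sketch below uses that definition, which is an assumption since the review does not give the paper's exact formula. The loss values are made up for illustration.

```python
def relative_loss_gap(loss_low_precision, loss_reference):
    """Absolute loss gap as a fraction of the reference (e.g. BF16) loss."""
    return abs(loss_low_precision - loss_reference) / loss_reference


# Hypothetical losses, not values from the paper.
gap = relative_loss_gap(loss_low_precision=1.512, loss_reference=1.500)
print(f"relative gap: {gap:.2%}  (under the 1% threshold: {gap < 0.01})")
```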
Reproducibility

Reproducibility is mixed. NVIDIA commits to releasing "model weights, pre- and post-training software, recipes, and all data for which we hold redistribution rights", a stronger open-science stance than most industry labs. NeMo-RL and NeMo-Gym are already open-sourced under Apache 2.0. However, only Nano is currently available; Super and Ultra weights, data, and software are promised for the "coming months." Critical hyperparameters (learning rates, batch sizes, exact MoE configurations for Super/Ultra) are omitted. The hybrid architecture's layer patterns are not fully specified: Figure 1 shows Nano's pattern but not those of Super or Ultra. Training data composition, beyond the 10T+ token count, is not detailed.

“We will openly release the model weights, pre- and post-training software, recipes, and all data for which we hold redistribution rights.”
paper · Abstract
“NeMo-RL implements scalable RL training while NeMo-Gym provides collection of RL environments.”
paper · Section 2.6
Abstract

We introduce the Nemotron 3 family of models - Nano, Super, and Ultra. These models deliver strong agentic, reasoning, and conversational capabilities. The Nemotron 3 family uses a Mixture-of-Experts hybrid Mamba-Transformer architecture to provide best-in-class throughput and context lengths of up to 1M tokens. Super and Ultra models are trained with NVFP4 and incorporate LatentMoE, a novel approach that improves model quality. The two larger models also include MTP layers for faster text generation. All Nemotron 3 models are post-trained using multi-environment reinforcement learning enabling reasoning, multi-step tool use, and support granular reasoning budget control. Nano, the smallest model, outperforms comparable models in accuracy while remaining extremely cost-efficient for inference. Super is optimized for collaborative agents and high-volume workloads such as IT ticket automation. Ultra, the largest model, provides state-of-the-art accuracy and reasoning performance. Nano is released together with its technical report and this white paper, while Super and Ultra will follow in the coming months. We will openly release the model weights, pre- and post-training software, recipes, and all data for which we hold redistribution rights.
