NVIDIA Nemotron 3: Efficient and Open Intelligence
NVIDIA introduces Nemotron 3, a family of open language models (Nano, Super, Ultra) built on a hybrid Mamba-Transformer MoE architecture. The core innovation is using selective attention layers combined with Mamba-2 state space layers to achieve high throughput while maintaining accuracy. Key technical contributions include LatentMoE (dimensionality-reduced expert routing), NVFP4 training for efficiency, and multi-environment RL post-training. The paper positions these models as optimized for agentic AI with up to 1M token contexts and granular inference-time reasoning budget control.
This white paper announces a promising model family with genuine architectural innovations, particularly LatentMoE and the hybrid Mamba-Transformer design. However, it reads more as a product announcement than a rigorous research evaluation. Key benchmark details are deferred to external technical reports, head-to-head comparisons are often missing or uneven, and only the smallest model (Nano) is actually released with this paper while larger variants are promised for future release. The "open and transparent" commitment is commendable but not yet fully realized.
The hybrid Mamba-Transformer MoE architecture is well-motivated for throughput-limited reasoning scenarios, replacing expensive attention layers with Mamba-2 layers that need only constant memory during generation. The LatentMoE design (projecting to a latent dimension $\ell < d$ and scaling the number of experts by $d/\ell$) offers a principled, hardware-aware approach to improving accuracy per byte. The NVFP4 training methodology shows thoughtful engineering, keeping sensitive layers (QKV, Mamba outputs, and the final 15% of the network) in higher precision to maintain stability. The multi-environment RL approach, which trains simultaneously on math, code, tool use, and long-context tasks rather than in stages, is argued to improve stability and reduce reward hacking.
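The parameter arithmetic behind the $d/\ell$ scaling can be sketched as follows. This is a minimal illustration, not the paper's implementation: the two-matrix expert shape, the single shared down/up projection, the function names, and every number used (d=4096, latent=1024, ffn_dim=2048, 128 experts) are assumptions chosen only to show why shrinking the expert width by $d/\ell$ leaves room for $d/\ell$ times as many experts at the same expert-weight budget.

```python
def expert_params(width: int, ffn_dim: int) -> int:
    """Weights in one two-matrix FFN expert (width -> ffn_dim -> width, no biases).
    The expert shape is an assumption made for this illustration."""
    return 2 * width * ffn_dim


def latent_moe_budget(d: int, latent: int, ffn_dim: int, n_experts: int) -> dict:
    """Compare expert-weight budgets for experts of width d versus width `latent`.

    Hypothetical helper: it only does the bookkeeping implied by
    "project to latent < d, scale the number of experts by d / latent".
    """
    if not 0 < latent < d:
        raise ValueError("need 0 < latent < d")
    if d % latent != 0:
        raise ValueError("this sketch assumes d is a multiple of latent")

    scale = d // latent                        # the d / latent factor
    baseline_total = n_experts * expert_params(d, ffn_dim)
    latent_experts = n_experts * scale         # more, narrower experts
    latent_total = latent_experts * expert_params(latent, ffn_dim)
    projection = 2 * d * latent                # assumed shared down- and up-projection

    return {
        "scale": scale,
        "latent_experts": latent_experts,
        "baseline_expert_params": baseline_total,
        "latent_expert_params": latent_total,
        "projection_params": projection,
    }


# Illustrative numbers only; none of them come from the paper.
budget = latent_moe_budget(d=4096, latent=1024, ffn_dim=2048, n_experts=128)
print(budget)
```

Under these assumptions the total expert weights are identical in both configurations, and the only extra cost is the projection pair, which is small relative to the expert weights. Whether the additional experts actually improve accuracy is an empirical question that this arithmetic does not answer.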
The paper makes strong "best-in-class" claims but provides minimal direct comparison data within the white paper itself, repeatedly deferring to external technical reports. Table 3's comparison between Nemotron 2 Nano (12B dense hybrid) and Nemotron 3 Nano (30B MoE hybrid) is misleading: the 2.5× difference in parameter count confounds any architectural conclusions. The LatentMoE ablation (Table 1) uses only models with 8B active parameters trained to 1T tokens, far smaller than production scale, which raises questions about generalization to Ultra-scale training. Claims that multi-environment RL is superior to staged training cite prior work but provide no internal ablation. Figure 8's "accuracy-efficiency trade-off" curves lack numerical values, so they cannot be reproduced. Most critically, the paper describes Super and Ultra extensively but admits they are not yet released, making many claims unverifiable.
Evidence quality is uneven. MTP improvements (~2.4% average) are supported by Table 2 with specific benchmarks (MMLU, MBPP, GSM8K), though the claimed 97% speculative decoding acceptance rate lacks supporting data. NVFP4's <1% loss gap vs. BF16 is shown in Figure 4 for Nano and an 8B MoE model, but this needs validation at Super/Ultra scale given the 25T token claim. The RULER long-context evaluation (Table 3) compares a 12B model against a 30B model, which is not a like-for-like comparison. The paper lacks comprehensive comparisons to contemporaries like Qwen3, DeepSeek-V3, or Llama 3.1/4: only throughput vs. Qwen3-30B-A3B is shown (Figure 2), with a 3.3× claim, and no accuracy comparison accompanies it. The "state-of-the-art" claims for Ultra remain unsubstantiated in this document.
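To see why the acceptance-rate figure matters for the throughput argument, the sketch below computes how many tokens a speculative decoding step would emit on average. It is a simplified model, not the paper's evaluation: it assumes each drafted token is accepted independently with the same probability and that drafting stops at the first rejection. The function name and the draft length of 2 are hypothetical, since the review does not state how many tokens are drafted per step.

```python
def expected_tokens_per_step(acceptance: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes independent, identical per-token acceptance and a stop at the
    first rejected draft token. The count includes the one token the
    verifying model always contributes, so the result is at least 1.
    """
    if not 0.0 <= acceptance <= 1.0:
        raise ValueError("acceptance must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    # P(first i drafts all accepted) = acceptance**i; sum over i = 0..draft_len.
    return sum(acceptance ** i for i in range(draft_len + 1))


# Hypothetical setting: 97% acceptance with 2 drafted tokens per step.
tokens = expected_tokens_per_step(0.97, 2)
print(f"{tokens:.4f} tokens per step")
```

Under these assumptions the result is close to the upper bound of `draft_len + 1`, so the speedup depends heavily on the acceptance figure holding across workloads, which is the data the white paper does not provide.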
Reproducibility is mixed. NVIDIA commits to releasing "model weights, pre- and post-training software, training recipes, and all data for which we hold redistribution rights," a stronger open-science stance than most industry labs take. NeMo-RL and NeMo-Gym are already open-sourced under Apache 2.0. However, only Nano is currently available; Super and Ultra weights, data, and software are promised for "coming months." Critical hyperparameters (learning rates, batch sizes, exact MoE configurations for Super/Ultra) are omitted. The hybrid architecture's specific layer patterns are not fully specified (e.g., Figure 1 shows Nano's pattern but not Super's or Ultra's). Training data composition, beyond the 10T+ token count, is not detailed.
We introduce the Nemotron 3 family of models: Nano, Super, and Ultra. These models deliver strong agentic, reasoning, and conversational capabilities. The Nemotron 3 family uses a Mixture-of-Experts hybrid Mamba-Transformer architecture to provide best-in-class throughput and context lengths of up to 1M tokens. Super and Ultra models are trained with NVFP4 and incorporate LatentMoE, a novel approach that improves model quality. The two larger models also include MTP layers for faster text generation. All Nemotron 3 models are post-trained using multi-environment reinforcement learning, enabling reasoning and multi-step tool use, and support granular reasoning budget control. Nano, the smallest model, outperforms comparable models in accuracy while remaining extremely cost-efficient for inference. Super is optimized for collaborative agents and high-volume workloads such as IT ticket automation. Ultra, the largest model, provides state-of-the-art accuracy and reasoning performance. Nano is released together with its technical report and this white paper, while Super and Ultra will follow in the coming months. We will openly release the model weights, pre- and post-training software, recipes, and all data for which we hold redistribution rights.