August 20, 2025 — NVIDIA today unveiled its Nemotron Nano 2 family, a line of enterprise-ready large language models that pair a hybrid Mamba-Transformer architecture with high inference throughput and strong reasoning accuracy.
Highlights at a Glance
- Blazing throughput: tests show Nemotron Nano 2 models generate tokens up to six times faster than similarly sized models such as Qwen3-8B, particularly in complex reasoning workloads with long input and output sequences.
- Advanced architecture: following the Nemotron-H design, the series replaces most traditional self-attention layers with efficient Mamba-2 state-space layers, keeping only a few attention layers. This lets the models generate long reasoning traces at high speed and handle longer contexts more efficiently.
- 128K-token context on a single GPU: the models support context lengths of up to 128,000 tokens and can run at that length within a 22 GiB memory budget on a single NVIDIA A10G GPU, thanks to pruning and compression techniques.
- Strong reasoning, coding, and multilingual performance: benchmarks show Nemotron Nano 2 delivers accuracy on par with or better than comparable models on math, code generation, multilingual understanding, tool use, and long-context reasoning.
- Open and transparent: NVIDIA is releasing the models, including Nemotron-Nano-9B-v2 (the pruned and aligned reasoning model) along with base variants, as well as substantial pre-training and post-training datasets, under permissive licensing via Hugging Face.
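The intuition behind the single-GPU 128K-token claim can be sketched with back-of-the-envelope arithmetic. A full-attention model must cache keys and values for every attention layer at every token position, so its KV cache grows linearly with both layer count and context length; a hybrid design that keeps only a handful of attention layers shrinks that cache proportionally, while the Mamba-2 layers carry a small fixed-size state regardless of sequence length. The layer counts, head counts, and head dimension below are illustrative assumptions for a ~9B-parameter model, not Nemotron Nano 2's actual configuration:

```python
def kv_cache_bytes(attn_layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """KV-cache size in bytes: 2 tensors (K and V) per attention layer,
    each of shape [kv_heads, seq_len, head_dim], in fp16/bf16 by default."""
    return 2 * attn_layers * kv_heads * head_dim * seq_len * dtype_bytes

SEQ_LEN = 128_000  # 128K-token context

# Hypothetical all-attention baseline: 36 attention layers.
full = kv_cache_bytes(attn_layers=36, kv_heads=8, head_dim=128, seq_len=SEQ_LEN)

# Hypothetical hybrid: only 4 attention layers survive; the rest are
# Mamba-2 state-space layers with constant-size state (ignored here).
hybrid = kv_cache_bytes(attn_layers=4, kv_heads=8, head_dim=128, seq_len=SEQ_LEN)

print(f"full attention KV cache at 128K: {full / 2**30:.1f} GiB")
print(f"hybrid (4 attn layers) at 128K:  {hybrid / 2**30:.1f} GiB")
```

Under these assumed numbers the full-attention cache alone approaches the A10G's memory budget before model weights are counted, while the hybrid cache stays under 2 GiB, which is the kind of headroom that makes 128K-token inference on one GPU plausible.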
