NVIDIA Nemotron-3-Nano

1. Model Introduction

NVIDIA Nemotron-3-Nano is a 30B-parameter hybrid LLM that mixes Mixture-of-Experts (MoE) feed-forward layers, Mamba2 sequence-modeling layers, and standard self-attention layers in a single stack, rather than using the classic “attention + MLP” transformer block throughout.

The BF16 variant (nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) is designed as a high-fidelity reference model, while the FP8 variant (nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8) targets optimized inference performance on modern NVIDIA GPUs.

At a high level:

  • Hybrid layer stack (Mamba2 + MoE + attention): The network is composed of interleaved layers, each of which is a Mamba2 layer, an MoE feed-forward layer, or an attention layer.
  • Non-uniform layer ordering: The order and mix of these specialized layers do not follow a simple, rigid pattern, letting the model trade off sequence modeling, routing capacity, and expressivity across depth (see the sketch after this list).
  • Deployment-friendly precision: Use BF16 for accuracy-sensitive and evaluation workloads; use FP8 for latency- and throughput-critical serving on recent NVIDIA GPUs.
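
The snippet below is a minimal, purely illustrative sketch of what such a non-uniform hybrid stack looks like conceptually. The layer pattern, counts, and names are invented for explanation only and are not taken from the released checkpoint's configuration.

# Hypothetical illustration of a non-uniform hybrid layer stack.
# The pattern below is made up for explanation; the real model's layer
# ordering comes from its own configuration, not from this list.
HYPOTHETICAL_PATTERN = [
    "mamba2", "mamba2", "attention", "moe",
    "mamba2", "moe", "mamba2", "attention", "moe",
]

def build_layer(kind: str) -> str:
    # A real implementation would construct a Mamba2 block, an MoE
    # feed-forward block, or a self-attention block here.
    return f"<{kind} layer>"

stack = [build_layer(kind) for kind in HYPOTHETICAL_PATTERN]
print(stack)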

2. SGLang Installation

SGLang offers multiple installation methods; choose the one that best fits your hardware platform and requirements.

For a quick start, install the SGLang nightly wheel:

pip install sglang==0.5.6.post2.dev7852+g8102e36b5 --extra-index-url https://sgl-project.github.io/whl/nightly/
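
To confirm the wheel installed correctly, import the package and print its version (this checks only the Python package, not GPU availability):

import sglang

# Should print the nightly version installed above.
print(sglang.__version__)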

3. Model Deployment

This section provides a progressive guide from quick deployment to performance tuning.

3.1 Basic Configuration

The launch command is determined by a handful of knobs: hardware platform, model variant, tensor parallel (TP) size, KV cache dtype, reasoning parser, tool call parser, host, and port. A typical generated command looks like this:
python3 -m sglang.launch_server \
  --model-path nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
  --trust-remote-code \
  --tp 1 \
  --kv-cache-dtype fp8_e4m3 \
  --host 0.0.0.0 \
  --port 30000
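
Once the server reports it is ready, a quick way to verify the endpoint (assuming the default host and port from the command above, and the requests library installed) is to list the served models via the OpenAI-compatible API:

import requests

# Query the OpenAI-compatible /v1/models endpoint of the local server.
resp = requests.get("http://localhost:30000/v1/models")
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])  # should include the launched Nemotron checkpoint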

3.2 Configuration Tips

  • Attention backend:

    On H200/B200, the flashinfer attention backend is used by default.

  • TP support:

    To set the TP size, use --tp <1|2|4|8>.

  • FP8 KV cache:

    To enable the FP8 KV cache, append --kv-cache-dtype fp8_e4m3 to the launch command.


4. Model Invocation

4.1 Basic Usage (OpenAI-Compatible API)

SGLang provides an OpenAI-compatible endpoint. Example with the OpenAI Python client:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)

resp = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what MoE models are in 5 bullets."},
    ],
    temperature=0.7,
    max_tokens=256,
)

print(resp.choices[0].message.content)

Streaming chat completion

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)

stream = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "What are the first 5 prime numbers?"},
    ],
    temperature=0.7,
    max_tokens=1024,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta and delta.content:
        print(delta.content, end="", flush=True)

4.2 Reasoning

To enable reasoning, append --reasoning-parser nano_v3 to the launch command. The model supports two modes: Reasoning ON (default) and OFF. Reasoning can be disabled by setting enable_thinking to False, as shown below.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)

# Reasoning on (default)
print("Reasoning on")
resp = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about GPUs."},
    ],
    temperature=0.7,
    max_tokens=512,
)
print(resp.choices[0].message.reasoning_content)

# Reasoning off
print("Reasoning off")
resp = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about GPUs."},
    ],
    temperature=0.6,
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
# With thinking disabled there is no reasoning content, so print the answer itself.
print(resp.choices[0].message.content)

4.3 Tool calling

To enable tool calling, append --tool-call-parser qwen3_coder to the launch command. Call functions using the OpenAI tools schema and inspect the returned tool_calls.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)

# Tool calling via the OpenAI tools schema
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "calculate_tip",
            "parameters": {
                "type": "object",
                "properties": {
                    "bill_total": {
                        "type": "integer",
                        "description": "The total amount of the bill"
                    },
                    "tip_percentage": {
                        "type": "integer",
                        "description": "The percentage of tip to be applied"
                    }
                },
                "required": ["bill_total", "tip_percentage"]
            }
        }
    }
]

completion = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
    messages=[
        {"role": "system", "content": ""},
        {"role": "user", "content": "My bill is $50. What will be the amount for 15% tip?"},
    ],
    tools=TOOLS,
    temperature=0.6,
    top_p=0.95,
    max_tokens=512,
    stream=False,
)

print(completion.choices[0].message.reasoning_content)
print(completion.choices[0].message.tool_calls)
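
If the model returns a tool call, the usual next step is to execute the tool locally and send the result back as a tool message so the model can produce the final answer. The sketch below continues the example above; the local calculate_tip implementation is a hypothetical stand-in, not part of the model or SGLang.

import json

# Hypothetical local implementation of the calculate_tip tool declared above.
def calculate_tip(bill_total: int, tip_percentage: int) -> dict:
    return {"tip_amount": bill_total * tip_percentage / 100}

assistant_msg = completion.choices[0].message
tool_call = assistant_msg.tool_calls[0]
args = json.loads(tool_call.function.arguments)
result = calculate_tip(**args)

# Send the assistant turn (with its tool call) and the tool result back,
# so the model can produce the final natural-language answer.
followup = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
    messages=[
        {"role": "system", "content": ""},
        {"role": "user", "content": "My bill is $50. What will be the amount for 15% tip?"},
        {
            "role": "assistant",
            "content": assistant_msg.content or "",
            "tool_calls": [tc.model_dump() for tc in assistant_msg.tool_calls],
        },
        {"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(result)},
    ],
    tools=TOOLS,
    temperature=0.6,
    max_tokens=512,
)
print(followup.choices[0].message.content)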

5. Benchmark

5.1 Speed Benchmark

Test Environment:

  • Hardware: NVIDIA B200 GPU

FP8 variant

  • Model Deployment Command:
python3 -m sglang.launch_server \
  --model-path nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --trust-remote-code \
  --max-running-requests 1024 \
  --host 0.0.0.0 \
  --port 30000
  • Benchmark Command:
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 \
  --port 30000 \
  --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 4096 \
  --max-concurrency 256
  • Test Results:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 256
Successful requests: 4096
Benchmark duration (s): 183.18
Total input tokens: 2081726
Total input text tokens: 2081726
Total input vision tokens: 0
Total generated tokens: 2116125
Total generated tokens (retokenized): 1076256
Request throughput (req/s): 22.36
Input token throughput (tok/s): 11364.25
Output token throughput (tok/s): 11552.04
Peak output token throughput (tok/s): 24692.00
Peak concurrent requests: 294
Total token throughput (tok/s): 22916.30
Concurrency: 251.19
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 11233.74
Median E2E Latency (ms): 11142.97
---------------Time to First Token----------------
Mean TTFT (ms): 172.99
Median TTFT (ms): 116.57
P99 TTFT (ms): 1193.68
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 21.74
Median TPOT (ms): 21.14
P99 TPOT (ms): 41.12
---------------Inter-Token Latency----------------
Mean ITL (ms): 21.45
Median ITL (ms): 9.06
P95 ITL (ms): 62.59
P99 ITL (ms): 110.83
Max ITL (ms): 5368.19
==================================================

BF16 variant

  • Model Deployment Command:
python3 -m sglang.launch_server \
  --model-path nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
  --trust-remote-code \
  --max-running-requests 1024 \
  --host 0.0.0.0 \
  --port 30000
  • Benchmark Command:
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 \
  --port 30000 \
  --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 4096 \
  --max-concurrency 256
  • Test Results:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 256
Successful requests: 4096
Benchmark duration (s): 360.22
Total input tokens: 2081726
Total input text tokens: 2081726
Total input vision tokens: 0
Total generated tokens: 2087288
Total generated tokens (retokenized): 1940652
Request throughput (req/s): 11.37
Input token throughput (tok/s): 5779.10
Output token throughput (tok/s): 5794.55
Peak output token throughput (tok/s): 9169.00
Peak concurrent requests: 276
Total token throughput (tok/s): 11573.65
Concurrency: 249.76
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 21965.10
Median E2E Latency (ms): 21706.35
---------------Time to First Token----------------
Mean TTFT (ms): 211.54
Median TTFT (ms): 93.06
P99 TTFT (ms): 2637.66
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 43.27
Median TPOT (ms): 43.04
P99 TPOT (ms): 61.15
---------------Inter-Token Latency----------------
Mean ITL (ms): 42.77
Median ITL (ms): 28.46
P95 ITL (ms): 71.85
P99 ITL (ms): 113.20
Max ITL (ms): 5237.28
==================================================
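
As a rough comparison between the two runs above, the reported totals give roughly a 2x advantage for FP8 on this workload. The small script below only reproduces that arithmetic from the numbers already shown in the result blocks; it performs no new measurements.

# Throughput comparison using the FP8 and BF16 results reported above.
fp8_total_tok_s = 22916.30
bf16_total_tok_s = 11573.65
fp8_mean_tpot_ms = 21.74
bf16_mean_tpot_ms = 43.27

print(f"Total-throughput speedup (FP8 vs BF16): {fp8_total_tok_s / bf16_total_tok_s:.2f}x")
print(f"Mean TPOT reduction (FP8 vs BF16): {bf16_mean_tpot_ms / fp8_mean_tpot_ms:.2f}x")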

5.2 Accuracy Benchmark

5.2.1 GSM8K Benchmark

Environment

  • Hardware: NVIDIA B200 GPU
  • Model: BF16 checkpoint

Launch Model

python3 -m sglang.launch_server \
  --model-path nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
  --trust-remote-code \
  --reasoning-parser nano_v3

Run Benchmark with lm-eval

pip install "lm-eval[api]==0.4.9.2"

lm_eval --model local-completions \
  --tasks gsm8k \
  --model_args "model=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16,base_url=http://127.0.0.1:30000/v1/completions,num_concurrent=4,max_retries=3,tokenized_requests=False,max_lengths=16384" \
  --gen_kwargs '{"chat_template_kwargs":{"thinking":true}}' \
  --batch_size 256

Test Results:

| Tasks | Version | Filter           | n-shot | Metric      |   | Value  |   | Stderr |
|-------|--------:|------------------|-------:|-------------|---|-------:|---|-------:|
| gsm8k |       3 | flexible-extract |      5 | exact_match | ↑ | 0.5603 | ± | 0.0137 |
| gsm8k |       3 | strict-match     |      5 | exact_match | ↑ | 0.8453 | ± | 0.0100 |