
Gemma 4

1. Model Introduction

Gemma 4 is Google's next-generation family of open models, building on the Gemma 3 architecture with improved performance, MoE variants, and multimodal support for text, vision, and audio.

Key Features:

  • Hybrid Attention: Combines sliding window and full attention layers for efficient long-context processing
  • Multimodal: Supports text, image, and audio inputs via dedicated vision and audio encoders
  • MoE Variant: The 26B-A4B model uses a Mixture-of-Experts architecture for efficient inference
  • Per-Layer Embeddings (PLE): Layer-specific token embeddings for enhanced representations
  • Reasoning: Built-in thinking mode, parsed with the gemma4 reasoning parser
  • Tool Calling: Function calling with streaming support via the gemma4 tool call parser
  • Fused Operations: Triton-optimized RMSNorm + residual + scalar kernels
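To make the hybrid attention idea concrete, the sketch below builds the two causal mask shapes involved: full attention and sliding-window attention. The sequence length, window size, and function names are illustrative assumptions, not Gemma 4's actual configuration or SGLang code.

```python
def causal_mask(seq_len):
    """Full causal attention: token i may attend to every token j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def sliding_window_mask(seq_len, window):
    """Sliding-window causal attention: token i attends only to the
    most recent `window` tokens, itself included."""
    return [[i - window < j <= i for j in range(seq_len)] for i in range(seq_len)]


def attended_pairs(mask):
    """Count how many (query, key) pairs a mask allows."""
    return sum(sum(row) for row in mask)


seq_len, window = 8, 3  # illustrative values only
print("full attention pairs:", attended_pairs(causal_mask(seq_len)))
print("sliding-window pairs:", attended_pairs(sliding_window_mask(seq_len, window)))
```

The number of allowed pairs grows quadratically with sequence length for the full mask and roughly linearly for the windowed mask, which is why mixing the two layer types lowers the cost of long contexts.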

Available Models:

| Model | Architecture | Parameters |
| --- | --- | --- |
| google/gemma-4-E2B-it | Dense | ~2B |
| google/gemma-4-E4B-it | Dense | ~4B |
| google/gemma-4-31B-it | Dense | 31B |
| google/gemma-4-26B-A4B-it | MoE | 26B total / 4B active |

2. SGLang Installation

Gemma 4 support requires sgl-project/sglang#21952 and a specific transformers commit:

# Install SGLang from main branch (after sglang#21952 is merged)
pip install 'git+https://github.com/sgl-project/sglang.git#subdirectory=python'

# Install transformers with Gemma 4 support
pip install 'git+https://github.com/huggingface/transformers.git@91b1ab1fdfa81a552644a92fbe3e8d88de40e167'

For the full Docker setup and other installation methods, please refer to the official SGLang installation guide.

3. Model Deployment

3.1 Basic Configuration

To serve another variant, change --model-path and set --tp as listed in the hardware table in Section 3.2. The following command launches gemma-4-E4B-it with the reasoning and tool call parsers enabled:

sglang serve --model-path google/gemma-4-E4B-it \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --mem-fraction-static 0.85 \
  --host 0.0.0.0 --port 30000
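
Model loading can take a while, so a client script may want to wait until the server responds before sending requests. The sketch below polls the OpenAI-compatible /v1/models route on the port used above. The function names, timeout values, and the choice of route are assumptions made for illustration; they are not part of SGLang.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout_s=2.0):
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_for_server(base_url, deadline_s=600.0, interval_s=5.0, probe=http_probe):
    """Poll the model-list route until it answers or the deadline passes."""
    url = base_url.rstrip("/") + "/v1/models"
    stop_at = time.monotonic() + deadline_s
    while time.monotonic() < stop_at:
        if probe(url):
            return True
        time.sleep(interval_s)
    return False


# Usage, assuming the server was launched with the command above:
# if wait_for_server("http://localhost:30000"):
#     print("server ready")
```

The `probe` argument is injectable so the waiting logic can be exercised without a running server.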

3.2 Configuration Tips

  • SGLang automatically selects the Triton attention backend for Gemma 4 models (required for bidirectional image-token attention during prefill).
  • For the 26B-A4B MoE model, consider --tp 2 for high-throughput workloads.
  • Hardware requirements:
| Model | Hardware | TP |
| --- | --- | --- |
| gemma-4-E2B-it | 1x H200 | 1 |
| gemma-4-E4B-it | 1x H200 | 1 |
| gemma-4-31B-it | 2x H200 | 2 |
| gemma-4-26B-A4B-it | 1x H200 | 1 |
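
The hardware table can be turned into a small helper that assembles the launch command for a given variant. This is a sketch only: `TP_BY_MODEL` and `build_serve_command` are hypothetical names, and the helper uses only the flags that already appear in this guide.

```python
import shlex

# Tensor-parallel sizes copied from the H200 hardware table above.
TP_BY_MODEL = {
    "google/gemma-4-E2B-it": 1,
    "google/gemma-4-E4B-it": 1,
    "google/gemma-4-31B-it": 2,
    "google/gemma-4-26B-A4B-it": 1,
}


def build_serve_command(model, reasoning=True, tool_calls=True,
                        host="0.0.0.0", port=30000):
    """Assemble the `sglang serve` argument list for one model variant."""
    args = ["sglang", "serve", "--model-path", model]
    tp = TP_BY_MODEL[model]
    if tp > 1:
        args += ["--tp", str(tp)]
    if reasoning:
        args += ["--reasoning-parser", "gemma4"]
    if tool_calls:
        args += ["--tool-call-parser", "gemma4"]
    args += ["--host", host, "--port", str(port)]
    return args


print(shlex.join(build_serve_command("google/gemma-4-31B-it")))
```

Returning a list rather than a string keeps the result usable with `subprocess.run` without shell quoting concerns.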

3.3 AMD GPU Deployment (MI300X / MI325X)

SGLang automatically selects the correct attention backend on AMD GPUs. The same commands work on AMD. Example for MI300X:

sglang serve --model-path google/gemma-4-E4B-it \
--reasoning-parser gemma4 \
--tool-call-parser gemma4 \
--host 0.0.0.0 --port 30000

Status: AMD MI300X benchmarks are available in Section 5.1.

4. Model Invocation

Deploy gemma-4-26B-A4B-it (MoE) with all features enabled:

sglang serve --model-path google/gemma-4-26B-A4B-it \
--reasoning-parser gemma4 \
--tool-call-parser gemma4 \
--host 0.0.0.0 --port 30000

4.1 Basic Usage

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[
        {"role": "user", "content": "What are the key differences between TCP and UDP?"}
    ],
    max_tokens=1024
)

print(response.choices[0].message.content)

Example Output
The fundamental difference between **TCP (Transmission Control Protocol)** and **UDP (User Datagram
Protocol)** lies in how they prioritize data integrity versus speed.

### 1. Connection Type
* **TCP (Connection-Oriented):** Before any data is sent, TCP performs a "three-way handshake."
The sender and receiver exchange signals to establish a formal connection.
* **UDP (Connectionless):** UDP does not establish a connection. It simply starts blasting packets
to the destination IP address without checking if the receiver is ready.

### 2. Reliability and Error Checking
* **TCP (Reliable):** If a packet is lost or arrives corrupted, TCP detects the error and
retransmits the missing data.
* **UDP (Unreliable):** If a packet is lost or corrupted, it is simply discarded. There is no
mechanism to ask for a retransmission.

### 3. Ordering of Data
* **TCP (Ordered):** Segments are assigned sequence numbers and reassembled in the correct order.
* **UDP (Unordered):** Packets may arrive in a different order than sent.

### 4. Speed and Overhead
* **TCP (Slower):** Managing connections, tracking, and retransmissions adds significant overhead.
* **UDP (Faster):** No handshake, no tracking — extremely fast and ideal for real-time needs.

| Feature | TCP | UDP |
| :--- | :--- | :--- |
| **Connection** | Connection-oriented | Connectionless |
| **Reliability** | Guaranteed delivery | Best-effort |
| **Ordering** | Maintains strict order | No guaranteed order |
| **Speed** | Slower (High overhead) | Faster (Low overhead) |
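
The same request can be sent without the openai package, which helps when debugging the raw JSON exchanged with the server. The sketch below uses only the standard library and assumes the server exposes the usual OpenAI-compatible /v1/chat/completions route under the base URL used above; the helper names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=1024):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout_s=120.0):
    """Send the request and return the assistant message text."""
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage, assuming the server from this section is running locally:
# req = build_chat_request("http://localhost:30000",
#                          "google/gemma-4-26B-A4B-it",
#                          "What are the key differences between TCP and UDP?")
# print(send_chat_request(req))
```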

4.2 Vision Input

Gemma 4 multimodal variants accept images alongside text:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://farm4.staticflickr.com/3175/2653711032_804ff86d81_z.jpg"
                    }
                },
                {
                    "type": "text",
                    "text": "Describe this image in detail."
                }
            ]
        }
    ],
    max_tokens=1024
)

print(response.choices[0].message.content)

Example Output
A vertical, full shot shows a girl and a boy standing in front of a giant teddy bear. The boy, who
is on the left, is of South Asian descent, has short dark hair, and is smiling at the camera. He is
wearing a navy blue sweatshirt with a white collar, blue jeans, and white, black, and red sneakers.
The girl, on the right, is also of South Asian descent and has long, dark hair. She is smiling at
the camera and is wearing a pink t-shirt, a white long-sleeve shirt underneath, blue jeans, and pink
sneakers. The giant teddy bear is light brown and is standing behind the two children. The bear has
large, dark eyes and a black nose. In the background, on the left, there is a large wooden basket
filled with small teddy bears. To the left of the basket, an American flag is hanging on the wall.
On the right side of the image, there is a green leafy plant. The floor is a dark purple carpet. The
lighting is bright and even.
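
For images stored on disk rather than served over HTTP, a common approach with OpenAI-compatible servers is to embed the file as a base64 data URL in the same image_url field. The sketch below assumes the server accepts data URLs, which this guide does not confirm; `image_to_data_url` and `image_message` are hypothetical helper names.

```python
import base64
import mimetypes


def image_to_data_url(path):
    """Encode a local image file as a base64 `data:` URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {path!r}")
    with open(path, "rb") as fh:
        encoded = base64.b64encode(fh.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_message(path, prompt):
    """Build a user message that pairs a local image with a text prompt."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_to_data_url(path)}},
            {"type": "text", "text": prompt},
        ],
    }
```

The returned dictionary can be passed as the single entry of `messages` in the request shown above.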

4.3 Reasoning (Thinking Mode)

Gemma 4 supports hybrid reasoning. Thinking is disabled by default; to enable it, pass chat_template_kwargs: {"enable_thinking": true} via extra_body. The reasoning parser separates the thinking process from the final answer and returns the thinking process in the reasoning_content field of the streaming response.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[
        {"role": "user", "content": "Solve step by step: If a train travels at 60 km/h for 2.5 hours, how far does it go?"}
    ],
    max_tokens=4096,
    stream=True,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}}
)

thinking_started = False
has_thinking = False
has_answer = False

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Print thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
            has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        # Print answer content
        if delta.content:
            if has_thinking and not has_answer:
                print("\n=============== Content =================", flush=True)
                has_answer = True
            print(delta.content, end="", flush=True)

print()

Example Output
=============== Thinking =================
* Input: Speed = 60 km/h, Time = 2.5 hours.
* Goal: Find the distance traveled.
* Distance = Speed × Time.
* Step 1: Identify given values. Speed = 60 km/h, Time = 2.5 hours
* Step 2: Formula. Distance = Speed × Time
* Step 3: Calculation. 60 × 2.5
Mental math: 60 × 2 = 120; 60 × 0.5 = 30; 120 + 30 = 150.
* Step 4: Final Result. 150 km.

=============== Content =================
To find the distance traveled, you can follow these steps:

### 1. Identify the given information:
* **Speed:** 60 km/h
* **Time:** 2.5 hours

### 2. Use the distance formula:
Distance = Speed × Time

### 3. Substitute the values:
Distance = 60 km/h × 2.5 hours

### 4. Perform the calculation:
* 60 × 2 = 120
* 60 × 0.5 = 30
* 120 + 30 = 150

**Final Answer: The train travels 150 km.**
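
When the thinking text and the answer are needed as separate strings rather than printed live, the streamed deltas can be accumulated instead. The sketch below is one way to do that; `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake chunks only imitate the attributes read in the loop above.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Split a streamed chat completion into (reasoning, answer) strings."""
    reasoning_parts, answer_parts = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        # `reasoning_content` may be absent when no thinking is produced.
        reasoning = getattr(delta, "reasoning_content", None)
        if reasoning:
            reasoning_parts.append(reasoning)
        if getattr(delta, "content", None):
            answer_parts.append(delta.content)
    return "".join(reasoning_parts), "".join(answer_parts)


def fake_chunk(reasoning=None, content=None):
    """Hypothetical stand-in for a streamed chunk, used only for this demo."""
    delta = SimpleNamespace(reasoning_content=reasoning, content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [fake_chunk(reasoning="60 x 2.5 "), fake_chunk(reasoning="= 150"),
        fake_chunk(content="The train travels "), fake_chunk(content="150 km.")]
thinking, answer = collect_stream(demo)
print(thinking)  # 60 x 2.5 = 150
print(answer)    # The train travels 150 km.
```

With a live server, pass the `response` iterator from the request above in place of `demo`.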

4.4 Tool Calling

Gemma 4 supports function calling with the gemma4 tool call parser. Enable it during deployment with --tool-call-parser gemma4.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city name"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[
        {"role": "user", "content": "What's the weather in Tokyo?"}
    ],
    tools=tools,
    stream=True
)

thinking_started = False
tool_calls_started = False

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Print thinking process, if the model produced any
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
            print(delta.reasoning_content, end="", flush=True)

        # Print tool calls; the header is printed once, with or without prior thinking
        if hasattr(delta, 'tool_calls') and delta.tool_calls:
            if not tool_calls_started:
                print("\n=============== Tool Calls ================", flush=True)
                tool_calls_started = True
            for tool_call in delta.tool_calls:
                if tool_call.function:
                    print(f"Tool Call: {tool_call.function.name}")
                    print(f" Arguments: {tool_call.function.arguments}")

        # Print any regular answer content
        if delta.content:
            print(delta.content, end="", flush=True)

print()

Example Output
=============== Tool Calls ================
Tool Call: get_weather
Arguments: {"location": "Tokyo"}
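
After the model emits a tool call, the client still has to assemble the call and run the function. In the OpenAI streaming format, a call's name and argument string may arrive in several fragments that share an index; whether SGLang splits them this way for Gemma 4 is not stated here, so the sketch below handles both cases. The `get_weather` implementation, its placeholder return value, and the fake fragments are hypothetical.

```python
import json
from types import SimpleNamespace


def accumulate_tool_calls(deltas):
    """Merge streamed tool-call fragments into complete calls, keyed by index."""
    calls = {}
    for delta in deltas:
        for fragment in getattr(delta, "tool_calls", None) or []:
            call = calls.setdefault(fragment.index, {"name": "", "arguments": ""})
            if fragment.function.name:
                call["name"] = fragment.function.name
            if fragment.function.arguments:
                call["arguments"] += fragment.function.arguments
    return [calls[i] for i in sorted(calls)]


def get_weather(location, unit="celsius"):
    """Hypothetical local tool; the temperature is a placeholder value."""
    return {"location": location, "unit": unit, "temperature": 22}


def run_tool_calls(calls, registry):
    """Parse each call's JSON arguments and invoke the matching function."""
    return [registry[c["name"]](**json.loads(c["arguments"])) for c in calls]


def fragment(index, name=None, arguments=None):
    """Hypothetical stand-in for one streamed tool-call delta."""
    fn = SimpleNamespace(name=name, arguments=arguments)
    return SimpleNamespace(tool_calls=[SimpleNamespace(index=index, function=fn)])


stream = [fragment(0, name="get_weather"), fragment(0, arguments='{"location": '),
          fragment(0, arguments='"Tokyo"}')]
calls = accumulate_tool_calls(stream)
print(calls)
print(run_tool_calls(calls, {"get_weather": get_weather}))
```

With a live stream, collect `chunk.choices[0].delta` for each chunk and pass that list to `accumulate_tool_calls`.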

5. Benchmark

5.1 Speed Benchmark

Test Environment:

  • Hardware: H200
  • SGLang Version: gemma4 branch

gemma-4-E2B-it (1x H200, TP=1)

Server Launch Command:

sglang serve --model-path google/gemma-4-E2B-it

Latency Benchmark (Text)

python3 -m sglang.bench_serving --backend sglang \
--host 0.0.0.0 --port 30000 \
--dataset-name random --num-prompts 10 --max-concurrency 1
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 17.44
Total input tokens: 6101
Total generated tokens: 4220
Request throughput (req/s): 0.57
Output token throughput (tok/s): 242.03
Total token throughput (tok/s): 591.94
Mean TTFT (ms): 50.19
Median TTFT (ms): 54.22
Mean TPOT (ms): 3.99
Median ITL (ms): 4.05
==================================================
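
The result blocks in this section share a simple `Key: value` layout, so they can be parsed for side-by-side comparison across models. The sketch below is a hypothetical helper, not part of sglang.bench_serving; it parses an excerpt of the block above and recomputes output throughput from the printed token count and duration.

```python
def parse_benchmark_report(text):
    """Turn `Key: value` lines of a benchmark report into a dict.

    Numeric values become floats; anything else stays a string.
    """
    metrics = {}
    for line in text.splitlines():
        key, sep, value = line.rpartition(":")
        if not sep or not key.strip():
            continue  # banner lines such as "=====" carry no metric
        value = value.strip()
        try:
            metrics[key.strip()] = float(value)
        except ValueError:
            metrics[key.strip()] = value
    return metrics


report = """\
============ Serving Benchmark Result ============
Backend: sglang
Benchmark duration (s): 17.44
Total generated tokens: 4220
Output token throughput (tok/s): 242.03
Mean TPOT (ms): 3.99
==================================================
"""
m = parse_benchmark_report(report)
# Recompute throughput from the rounded duration; small drift is expected.
recomputed = m["Total generated tokens"] / m["Benchmark duration (s)"]
print(round(recomputed, 2), m["Output token throughput (tok/s)"])
```

The recomputed figure differs slightly from the reported one because the printed duration is rounded to two decimals.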

Latency Benchmark (Image)

python3 -m sglang.bench_serving --backend sglang-oai-chat \
--host 0.0.0.0 --port 30000 \
--dataset-name image --image-count 2 --image-resolution 720p \
--random-input-len 128 --random-output-len 1024 \
--num-prompts 10 --max-concurrency 1
============ Serving Benchmark Result ============
Backend: sglang-oai-chat
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 18.05
Total input tokens: 6097
Total input vision tokens: 5340
Total generated tokens: 4220
Request throughput (req/s): 0.55
Output token throughput (tok/s): 233.84
Total token throughput (tok/s): 571.69
Mean TTFT (ms): 109.59
Median TTFT (ms): 112.62
Mean TPOT (ms): 4.01
Median ITL (ms): 4.04
==================================================

Throughput Benchmark (Text)

python3 -m sglang.bench_serving --backend sglang \
--host 0.0.0.0 --port 30000 \
--dataset-name random --num-prompts 1000 --max-concurrency 100
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 100
Successful requests: 1000
Benchmark duration (s): 51.73
Total input tokens: 512842
Total generated tokens: 510855
Request throughput (req/s): 19.33
Output token throughput (tok/s): 9876.36
Peak output token throughput (tok/s): 13863.00
Total token throughput (tok/s): 19791.14
Mean TTFT (ms): 86.57
Mean TPOT (ms): 9.56
Median ITL (ms): 5.99
==================================================

Throughput Benchmark (Image)

python3 -m sglang.bench_serving --backend sglang-oai-chat \
--host 0.0.0.0 --port 30000 \
--dataset-name image --image-count 2 --image-resolution 720p \
--random-input-len 128 --random-output-len 1024 \
--num-prompts 1000 --max-concurrency 100
============ Serving Benchmark Result ============
Backend: sglang-oai-chat
Traffic request rate: inf
Max request concurrency: 100
Successful requests: 1000
Benchmark duration (s): 89.07
Total input tokens: 617799
Total input vision tokens: 534000
Total generated tokens: 510855
Request throughput (req/s): 11.23
Output token throughput (tok/s): 5735.75
Peak output token throughput (tok/s): 12823.00
Total token throughput (tok/s): 12672.23
Mean TTFT (ms): 636.46
Mean TPOT (ms): 16.34
Median ITL (ms): 5.68
==================================================

gemma-4-E4B-it (1x H200, TP=1)

Server Launch Command:

sglang serve --model-path google/gemma-4-E4B-it

Latency Benchmark (Text)

============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 24.49
Total input tokens: 6101
Total generated tokens: 4220
Request throughput (req/s): 0.41
Output token throughput (tok/s): 172.32
Total token throughput (tok/s): 421.45
Mean TTFT (ms): 52.76
Median TTFT (ms): 53.66
Mean TPOT (ms): 5.64
Median ITL (ms): 5.74
==================================================

Latency Benchmark (Image)

============ Serving Benchmark Result ============
Backend: sglang-oai-chat
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 25.04
Total input tokens: 6124
Total input vision tokens: 5340
Total generated tokens: 4220
Request throughput (req/s): 0.40
Output token throughput (tok/s): 168.54
Total token throughput (tok/s): 413.13
Mean TTFT (ms): 110.15
Median TTFT (ms): 108.24
Mean TPOT (ms): 5.66
Median ITL (ms): 5.73
==================================================

Throughput Benchmark (Text)

============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 100
Successful requests: 1000
Benchmark duration (s): 72.95
Total input tokens: 512842
Total generated tokens: 510855
Request throughput (req/s): 13.71
Output token throughput (tok/s): 7002.68
Peak output token throughput (tok/s): 9878.00
Total token throughput (tok/s): 14032.60
Mean TTFT (ms): 166.33
Mean TPOT (ms): 13.36
Median ITL (ms): 8.88
==================================================

Throughput Benchmark (Image)

============ Serving Benchmark Result ============
Backend: sglang-oai-chat
Traffic request rate: inf
Max request concurrency: 100
Successful requests: 1000
Benchmark duration (s): 108.99
Total input tokens: 616952
Total input vision tokens: 534000
Total generated tokens: 510855
Request throughput (req/s): 9.18
Output token throughput (tok/s): 4687.38
Peak output token throughput (tok/s): 9277.00
Total token throughput (tok/s): 10348.25
Mean TTFT (ms): 626.17
Mean TPOT (ms): 20.00
Median ITL (ms): 8.64
==================================================

gemma-4-31B-it (2x H200, TP=2)

Server Launch Command:

sglang serve --model-path google/gemma-4-31B-it --tp 2

Latency Benchmark (Text)

============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 53.05
Total input tokens: 6101
Total generated tokens: 4220
Request throughput (req/s): 0.19
Output token throughput (tok/s): 79.55
Total token throughput (tok/s): 194.55
Mean TTFT (ms): 72.77
Median TTFT (ms): 75.05
Mean TPOT (ms): 12.32
Median ITL (ms): 12.53
==================================================

Latency Benchmark (Image)

============ Serving Benchmark Result ============
Backend: sglang-oai-chat
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 53.78
Total input tokens: 6162
Total input vision tokens: 5340
Total generated tokens: 4220
Request throughput (req/s): 0.19
Output token throughput (tok/s): 78.46
Total token throughput (tok/s): 193.03
Mean TTFT (ms): 143.35
Median TTFT (ms): 146.85
Mean TPOT (ms): 12.37
Median ITL (ms): 12.48
==================================================

Throughput Benchmark (Text)

============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 100
Successful requests: 1000
Benchmark duration (s): 182.00
Total input tokens: 512842
Total generated tokens: 510855
Request throughput (req/s): 5.49
Output token throughput (tok/s): 2806.82
Peak output token throughput (tok/s): 3798.00
Total token throughput (tok/s): 5624.56
Mean TTFT (ms): 324.67
Mean TPOT (ms): 33.95
Median ITL (ms): 25.44
==================================================

Throughput Benchmark (Image)

============ Serving Benchmark Result ============
Backend: sglang-oai-chat
Traffic request rate: inf
Max request concurrency: 100
Successful requests: 1000
Benchmark duration (s): 236.46
Total input tokens: 621630
Total input vision tokens: 534000
Total generated tokens: 510855
Request throughput (req/s): 4.23
Output token throughput (tok/s): 2160.42
Peak output token throughput (tok/s): 3745.00
Total token throughput (tok/s): 4789.30
Mean TTFT (ms): 952.02
Mean TPOT (ms): 44.17
Median ITL (ms): 26.81
==================================================

gemma-4-26B-A4B-it (MoE, 1x H200, TP=1)

Server Launch Command:

sglang serve --model-path google/gemma-4-26B-A4B-it

Tip: Consider --tp 2 for high-throughput workloads.

Latency Benchmark (Text)

============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 25.00
Total input tokens: 6101
Total generated tokens: 4220
Request throughput (req/s): 0.40
Output token throughput (tok/s): 168.81
Total token throughput (tok/s): 412.85
Mean TTFT (ms): 103.74
Median TTFT (ms): 46.57
Mean TPOT (ms): 5.60
Median ITL (ms): 5.78
==================================================

Latency Benchmark (Image)

============ Serving Benchmark Result ============
Backend: sglang-oai-chat
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 25.31
Total input tokens: 6164
Total input vision tokens: 5340
Total generated tokens: 4220
Request throughput (req/s): 0.40
Output token throughput (tok/s): 166.70
Total token throughput (tok/s): 410.20
Mean TTFT (ms): 129.22
Median TTFT (ms): 132.54
Mean TPOT (ms): 5.68
Median ITL (ms): 5.75
==================================================

Throughput Benchmark (Text)

============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 100
Successful requests: 1000
Benchmark duration (s): 138.98
Total input tokens: 512842
Total generated tokens: 510855
Request throughput (req/s): 7.20
Output token throughput (tok/s): 3675.81
Peak output token throughput (tok/s): 4799.00
Total token throughput (tok/s): 7365.91
Mean TTFT (ms): 153.77
Mean TPOT (ms): 25.95
Median ITL (ms): 20.23
==================================================

Throughput Benchmark (Image)

============ Serving Benchmark Result ============
Backend: sglang-oai-chat
Traffic request rate: inf
Max request concurrency: 100
Successful requests: 1000
Benchmark duration (s): 186.38
Total input tokens: 621146
Total input vision tokens: 534000
Total generated tokens: 510855
Request throughput (req/s): 5.37
Output token throughput (tok/s): 2740.86
Peak output token throughput (tok/s): 4962.00
Total token throughput (tok/s): 6073.47
Mean TTFT (ms): 854.71
Mean TPOT (ms): 34.64
Median ITL (ms): 19.08
==================================================

gemma-4-31B-it (1x MI300X, TP=1)

Server Launch Command:

sglang serve --model-path google/gemma-4-31B-it

Note: The 31B dense model fits on a single MI300X (192 GB VRAM) at TP=1, unlike H200 (141 GB) which requires TP=2.

Latency Benchmark (Text)

python3 -m sglang.bench_serving --backend sglang \
--host 0.0.0.0 --port 30000 \
--dataset-name random --num-prompts 10 --max-concurrency 1
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 103.55
Total input tokens: 6101
Total generated tokens: 4220
Request throughput (req/s): 0.10
Output token throughput (tok/s): 40.75
Total token throughput (tok/s): 99.67
Mean TTFT (ms): 152.35
Median TTFT (ms): 169.66
Mean TPOT (ms): 24.13
Median ITL (ms): 24.23
==================================================

Throughput Benchmark (Text)

python3 -m sglang.bench_serving --backend sglang \
--host 0.0.0.0 --port 30000 \
--dataset-name random --num-prompts 1000 --max-concurrency 100
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 100
Successful requests: 1000
Benchmark duration (s): 441.59
Total input tokens: 512842
Total generated tokens: 510855
Request throughput (req/s): 2.26
Output token throughput (tok/s): 1156.85
Peak output token throughput (tok/s): 1759.00
Total token throughput (tok/s): 2318.19
Mean TTFT (ms): 819.22
Mean TPOT (ms): 82.51
Median ITL (ms): 63.45
==================================================

gemma-4-26B-A4B-it (MoE, 1x MI300X, TP=1)

Server Launch Command:

sglang serve --model-path google/gemma-4-26B-A4B-it

Latency Benchmark (Text)

python3 -m sglang.bench_serving --backend sglang \
--host 0.0.0.0 --port 30000 \
--dataset-name random --num-prompts 10 --max-concurrency 1
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 43.73
Total input tokens: 6101
Total generated tokens: 4220
Request throughput (req/s): 0.23
Output token throughput (tok/s): 96.49
Total token throughput (tok/s): 236.00
Mean TTFT (ms): 185.58
Median TTFT (ms): 90.18
Mean TPOT (ms): 9.78
Median ITL (ms): 9.57
==================================================

Throughput Benchmark (Text)

python3 -m sglang.bench_serving --backend sglang \
--host 0.0.0.0 --port 30000 \
--dataset-name random --num-prompts 1000 --max-concurrency 100
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 100
Successful requests: 1000
Benchmark duration (s): 219.43
Total input tokens: 512842
Total generated tokens: 510855
Request throughput (req/s): 4.56
Output token throughput (tok/s): 2328.05
Peak output token throughput (tok/s): 3500.00
Total token throughput (tok/s): 4665.16
Mean TTFT (ms): 168.44
Mean TPOT (ms): 41.23
Median ITL (ms): 29.31
==================================================

5.2 Accuracy Benchmark

Test Environment:

  • Hardware: H200
  • SGLang Version: gemma4 branch

MMLU

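| Model | Humanities | Social Sciences | STEM | Other | Overall |
| --- | --- | --- | --- | --- | --- |
| gemma-4-E2B-it | 0.621 | 0.739 | 0.830 | 0.736 | 0.720 |
| gemma-4-E4B-it | 0.703 | 0.862 | 0.902 | 0.825 | 0.810 |
| gemma-4-31B-it | 0.878 | 0.921 | 0.884 | 0.911 | 0.896 |
| gemma-4-26B-A4B-it | 0.853 | 0.906 | 0.938 | 0.886 | 0.891 |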

GSM8K

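| Model | Accuracy | Invalid | Latency (s) | Output Throughput (tok/s) |
| --- | --- | --- | --- | --- |
| gemma-4-E2B-it | 0.170 | 0.000 | 3.990 | 8041.739 |
| gemma-4-E4B-it | 0.745 | 0.000 | 4.174 | 4672.030 |
| gemma-4-31B-it | 0.805 | 0.005 | 16.148 | 1559.914 |
| gemma-4-26B-A4B-it | 0.450 | 0.010 | 13.001 | 4089.457 |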

MMMU

| Model | Overall |
| --- | --- |
| gemma-4-E2B-it | 0.307 |
| gemma-4-E4B-it | 0.396 |
| gemma-4-31B-it | 0.589 |
| gemma-4-26B-A4B-it | 0.549 |

MMMU detailed scores (per domain)

gemma-4-E2B-it

{"Overall-Art and Design": {"num": 120, "acc": 0.45}, "Art": {"num": 30, "acc": 0.5}, "Art_Theory": {"num": 30, "acc": 0.467}, "Design": {"num": 30, "acc": 0.5}, "Music": {"num": 30, "acc": 0.333}, "Overall-Business": {"num": 150, "acc": 0.26}, "Accounting": {"num": 30, "acc": 0.367}, "Economics": {"num": 30, "acc": 0.233}, "Finance": {"num": 30, "acc": 0.2}, "Manage": {"num": 30, "acc": 0.233}, "Marketing": {"num": 30, "acc": 0.267}, "Overall-Science": {"num": 150, "acc": 0.273}, "Biology": {"num": 30, "acc": 0.233}, "Chemistry": {"num": 30, "acc": 0.267}, "Geography": {"num": 30, "acc": 0.367}, "Math": {"num": 30, "acc": 0.233}, "Physics": {"num": 30, "acc": 0.267}, "Overall-Health and Medicine": {"num": 150, "acc": 0.273}, "Basic_Medical_Science": {"num": 30, "acc": 0.5}, "Clinical_Medicine": {"num": 30, "acc": 0.233}, "Diagnostics_and_Laboratory_Medicine": {"num": 30, "acc": 0.233}, "Pharmacy": {"num": 30, "acc": 0.3}, "Public_Health": {"num": 30, "acc": 0.1}, "Overall-Humanities and Social Science": {"num": 120, "acc": 0.4}, "History": {"num": 30, "acc": 0.4}, "Literature": {"num": 30, "acc": 0.567}, "Sociology": {"num": 30, "acc": 0.333}, "Psychology": {"num": 30, "acc": 0.3}, "Overall-Tech and Engineering": {"num": 210, "acc": 0.252}, "Agriculture": {"num": 30, "acc": 0.333}, "Architecture_and_Engineering": {"num": 30, "acc": 0.267}, "Computer_Science": {"num": 30, "acc": 0.233}, "Electronics": {"num": 30, "acc": 0.1}, "Energy_and_Power": {"num": 30, "acc": 0.3}, "Materials": {"num": 30, "acc": 0.2}, "Mechanical_Engineering": {"num": 30, "acc": 0.333}, "Overall": {"num": 900, "acc": 0.307}}

gemma-4-E4B-it

{"Overall-Art and Design": {"num": 120, "acc": 0.458}, "Art": {"num": 30, "acc": 0.433}, "Art_Theory": {"num": 30, "acc": 0.567}, "Design": {"num": 30, "acc": 0.667}, "Music": {"num": 30, "acc": 0.167}, "Overall-Business": {"num": 150, "acc": 0.287}, "Accounting": {"num": 30, "acc": 0.233}, "Economics": {"num": 30, "acc": 0.467}, "Finance": {"num": 30, "acc": 0.133}, "Manage": {"num": 30, "acc": 0.3}, "Marketing": {"num": 30, "acc": 0.3}, "Overall-Science": {"num": 150, "acc": 0.28}, "Biology": {"num": 30, "acc": 0.333}, "Chemistry": {"num": 30, "acc": 0.133}, "Geography": {"num": 30, "acc": 0.4}, "Math": {"num": 30, "acc": 0.2}, "Physics": {"num": 30, "acc": 0.333}, "Overall-Health and Medicine": {"num": 150, "acc": 0.427}, "Basic_Medical_Science": {"num": 30, "acc": 0.4}, "Clinical_Medicine": {"num": 30, "acc": 0.533}, "Diagnostics_and_Laboratory_Medicine": {"num": 30, "acc": 0.4}, "Pharmacy": {"num": 30, "acc": 0.4}, "Public_Health": {"num": 30, "acc": 0.4}, "Overall-Humanities and Social Science": {"num": 120, "acc": 0.7}, "History": {"num": 30, "acc": 0.633}, "Literature": {"num": 30, "acc": 0.867}, "Sociology": {"num": 30, "acc": 0.733}, "Psychology": {"num": 30, "acc": 0.567}, "Overall-Tech and Engineering": {"num": 210, "acc": 0.324}, "Agriculture": {"num": 30, "acc": 0.533}, "Architecture_and_Engineering": {"num": 30, "acc": 0.3}, "Computer_Science": {"num": 30, "acc": 0.367}, "Electronics": {"num": 30, "acc": 0.133}, "Energy_and_Power": {"num": 30, "acc": 0.4}, "Materials": {"num": 30, "acc": 0.2}, "Mechanical_Engineering": {"num": 30, "acc": 0.333}, "Overall": {"num": 900, "acc": 0.396}}

gemma-4-31B-it

{"Overall-Art and Design": {"num": 120, "acc": 0.667}, "Art": {"num": 30, "acc": 0.667}, "Art_Theory": {"num": 30, "acc": 0.867}, "Design": {"num": 30, "acc": 0.8}, "Music": {"num": 30, "acc": 0.333}, "Overall-Business": {"num": 150, "acc": 0.573}, "Accounting": {"num": 30, "acc": 0.633}, "Economics": {"num": 30, "acc": 0.733}, "Finance": {"num": 30, "acc": 0.433}, "Manage": {"num": 30, "acc": 0.533}, "Marketing": {"num": 30, "acc": 0.533}, "Overall-Science": {"num": 150, "acc": 0.527}, "Biology": {"num": 30, "acc": 0.667}, "Chemistry": {"num": 30, "acc": 0.567}, "Geography": {"num": 30, "acc": 0.5}, "Math": {"num": 30, "acc": 0.267}, "Physics": {"num": 30, "acc": 0.633}, "Overall-Health and Medicine": {"num": 150, "acc": 0.673}, "Basic_Medical_Science": {"num": 30, "acc": 0.733}, "Clinical_Medicine": {"num": 30, "acc": 0.533}, "Diagnostics_and_Laboratory_Medicine": {"num": 30, "acc": 0.467}, "Pharmacy": {"num": 30, "acc": 0.8}, "Public_Health": {"num": 30, "acc": 0.833}, "Overall-Humanities and Social Science": {"num": 120, "acc": 0.825}, "History": {"num": 30, "acc": 0.833}, "Literature": {"num": 30, "acc": 0.867}, "Sociology": {"num": 30, "acc": 0.767}, "Psychology": {"num": 30, "acc": 0.833}, "Overall-Tech and Engineering": {"num": 210, "acc": 0.405}, "Agriculture": {"num": 30, "acc": 0.667}, "Architecture_and_Engineering": {"num": 30, "acc": 0.2}, "Computer_Science": {"num": 30, "acc": 0.567}, "Electronics": {"num": 30, "acc": 0.333}, "Energy_and_Power": {"num": 30, "acc": 0.533}, "Materials": {"num": 30, "acc": 0.3}, "Mechanical_Engineering": {"num": 30, "acc": 0.233}, "Overall": {"num": 900, "acc": 0.589}}

gemma-4-26B-A4B-it

{"Overall-Art and Design": {"num": 120, "acc": 0.717}, "Art": {"num": 30, "acc": 0.733}, "Art_Theory": {"num": 30, "acc": 0.833}, "Design": {"num": 30, "acc": 0.867}, "Music": {"num": 30, "acc": 0.433}, "Overall-Business": {"num": 150, "acc": 0.493}, "Accounting": {"num": 30, "acc": 0.533}, "Economics": {"num": 30, "acc": 0.533}, "Finance": {"num": 30, "acc": 0.333}, "Manage": {"num": 30, "acc": 0.5}, "Marketing": {"num": 30, "acc": 0.567}, "Overall-Science": {"num": 150, "acc": 0.473}, "Biology": {"num": 30, "acc": 0.633}, "Chemistry": {"num": 30, "acc": 0.367}, "Geography": {"num": 30, "acc": 0.533}, "Math": {"num": 30, "acc": 0.267}, "Physics": {"num": 30, "acc": 0.567}, "Overall-Health and Medicine": {"num": 150, "acc": 0.62}, "Basic_Medical_Science": {"num": 30, "acc": 0.767}, "Clinical_Medicine": {"num": 30, "acc": 0.533}, "Diagnostics_and_Laboratory_Medicine": {"num": 30, "acc": 0.433}, "Pharmacy": {"num": 30, "acc": 0.7}, "Public_Health": {"num": 30, "acc": 0.667}, "Overall-Humanities and Social Science": {"num": 120, "acc": 0.758}, "History": {"num": 30, "acc": 0.8}, "Literature": {"num": 30, "acc": 0.833}, "Sociology": {"num": 30, "acc": 0.733}, "Psychology": {"num": 30, "acc": 0.667}, "Overall-Tech and Engineering": {"num": 210, "acc": 0.376}, "Agriculture": {"num": 30, "acc": 0.633}, "Architecture_and_Engineering": {"num": 30, "acc": 0.367}, "Computer_Science": {"num": 30, "acc": 0.533}, "Electronics": {"num": 30, "acc": 0.167}, "Energy_and_Power": {"num": 30, "acc": 0.367}, "Materials": {"num": 30, "acc": 0.367}, "Mechanical_Engineering": {"num": 30, "acc": 0.2}, "Overall": {"num": 900, "acc": 0.549}}
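
The per-domain JSON can be cross-checked against the overall score: the "Overall" accuracy should be close to the question-weighted mean of the six "Overall-*" categories. The sketch below does this for gemma-4-E2B-it using the category entries copied from its JSON above; `weighted_overall` is a hypothetical helper name.

```python
import json

# Category-level entries copied from the gemma-4-E2B-it JSON above.
e2b_scores = json.loads("""
{"Overall-Art and Design": {"num": 120, "acc": 0.45},
 "Overall-Business": {"num": 150, "acc": 0.26},
 "Overall-Science": {"num": 150, "acc": 0.273},
 "Overall-Health and Medicine": {"num": 150, "acc": 0.273},
 "Overall-Humanities and Social Science": {"num": 120, "acc": 0.4},
 "Overall-Tech and Engineering": {"num": 210, "acc": 0.252},
 "Overall": {"num": 900, "acc": 0.307}}
""")


def weighted_overall(scores):
    """Question-weighted mean accuracy over the `Overall-*` categories."""
    cats = [v for k, v in scores.items() if k.startswith("Overall-")]
    total = sum(c["num"] for c in cats)
    return sum(c["num"] * c["acc"] for c in cats) / total


recomputed = weighted_overall(e2b_scores)
print(f"recomputed={recomputed:.3f} reported={e2b_scores['Overall']['acc']}")
```

A difference in the third decimal is expected, since the category accuracies are themselves rounded to three places.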

ASR

| Model | WER | Avg Latency (s) | Throughput (req/s) |
| --- | --- | --- | --- |
| gemma-4-E2B-it | 23.86% | 0.212 | 2.99 |
| gemma-4-E4B-it | 29.55% | 0.366 | 2.46 |
| gemma-4-31B-it | Not Supported | | |
| gemma-4-26B-A4B-it | Not Supported | | |

FLEURS (EN_US)

| Model | WER | Avg Latency (s) | Throughput (req/s) |
| --- | --- | --- | --- |
| gemma-4-E2B-it | 7.37% | 0.8963 | 16.25 |
| gemma-4-E4B-it | 6.08% | 0.8707 | 16.20 |
| gemma-4-31B-it | Not Supported | | |
| gemma-4-26B-A4B-it | Not Supported | | |
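
For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the reference transcript and the model's transcript, divided by the number of reference words. The sketch below shows the standard computation; the text normalization used by the benchmark above is not stated in this guide, so scores from this function are not directly comparable to the tables.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(substitution, prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.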

5.3 Logits correctness validation

gemma-4-E2B-it

$ python -m sglang.bench_one_batch --correct --model gg-hf-gg/gemma-4-E2B-it ....
prefill logits (final): tensor([[-25.3063, -2.5718, -10.3674, ..., -25.3779, -25.5181, -25.2337]],
device='cuda:0')
....

$ python scripts/playground/reference_hf.py --model-path gg-hf-gg/gemma-4-E2B-it
....
prefill logits (final) tensor([-25.3281, -2.1367, -10.2266, ..., -25.4375, -25.5000, -25.2500],
device='cuda:0', dtype=torch.float16)
....

gemma-4-E4B-it

$ python -m sglang.bench_one_batch --correct --model gg-hf-gg/gemma-4-E4B-it ....
prefill logits (final): tensor([[-17.6478, 7.9901, -5.6505, ..., -17.5658, -17.6478, -17.7293]],
device='cuda:0')
....

$ python scripts/playground/reference_hf.py --model-path gg-hf-gg/gemma-4-E4B-it
....
prefill logits (final) tensor([-17.5625, 8.0469, -5.5742, ..., -17.4688, -17.5625, -17.6719],
device='cuda:0', dtype=torch.float16)
....

gemma-4-31B-it

$ python -m sglang.bench_one_batch --correct --model gg-hf-gg/gemma-4-31B-it ....
prefill logits (final): tensor([[-2.0748, 1.1245, -7.4356, ..., -2.1059, -2.1525, -2.2303]],
device='cuda:0')
....

$ python scripts/playground/reference_hf.py --model-path gg-hf-gg/gemma-4-31B-it
....
prefill logits (final) tensor([-2.1133, 1.2656, -7.4766, ..., -2.1523, -2.2012, -2.2695],
device='cuda:0', dtype=torch.float16)
....
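
The two logit dumps for each model can be compared numerically instead of by eye. The sketch below reports the largest absolute difference and whether both runs agree on the top-scoring token, using only the entries visible in the gemma-4-E2B-it output above. `compare_logits` is a hypothetical helper; no pass/fail threshold is implied, and some drift is expected because the reference run uses float16.

```python
def compare_logits(candidate, reference):
    """Return (max absolute difference, whether the argmax token matches)."""
    if len(candidate) != len(reference):
        raise ValueError("logit vectors must have the same length")
    max_diff = max(abs(c - r) for c, r in zip(candidate, reference))
    same_top = (max(range(len(candidate)), key=candidate.__getitem__)
                == max(range(len(reference)), key=reference.__getitem__))
    return max_diff, same_top


# Visible entries of the gemma-4-E2B-it prefill logits printed above;
# the elided middle of each tensor is not available here.
sglang_logits = [-25.3063, -2.5718, -10.3674, -25.3779, -25.5181, -25.2337]
hf_logits = [-25.3281, -2.1367, -10.2266, -25.4375, -25.5000, -25.2500]

max_diff, same_top = compare_logits(sglang_logits, hf_logits)
print(f"max |diff| = {max_diff:.4f}, same argmax = {same_top}")
```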