Ministral-3
1. Model Introduction
The largest model in the Ministral 3 family, Ministral 3 14B is a powerful and efficient language model with vision capabilities, offering frontier performance comparable to its larger Mistral Small 3.2 24B counterpart.
The Ministral 3 14B Instruct model offers the following capabilities:
- Vision: Enables the model to analyze images and provide insights based on visual content, in addition to text.
- Multilingual: Supports dozens of languages, including English, French, Spanish, German, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, and Arabic.
- System Prompt: Maintains strong adherence and support for system prompts.
- Agentic: Offers best-in-class agentic capabilities with native function calling and JSON output.
- Edge-Optimized: Delivers best-in-class performance at a small scale, deployable anywhere.
- Apache 2.0 License: Open-source license allowing usage and modification for both commercial and non-commercial purposes.
- Large Context Window: Supports a 256k context window.
For further details, please refer to the official documentation.
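The vision capability is exposed through OpenAI-style multimodal messages on SGLang's OpenAI-compatible API. A minimal sketch of such a payload (the image URL and question are placeholders, not from this document):

```python
import json

def build_vision_request(image_url: str, question: str) -> dict:
    """Build an OpenAI-compatible chat payload mixing an image and text."""
    return {
        "model": "mistralai/Ministral-3-14B-Instruct-2512",
        "messages": [
            {
                "role": "user",
                "content": [
                    # Image part first, then the text question about it.
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "max_tokens": 256,
    }

payload = build_vision_request("https://example.com/cat.png", "What is in this image?")
print(json.dumps(payload, indent=2))
```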
2. SGLang Installation
Please refer to the official SGLang installation guide for installation instructions.
3. Model Deployment
This section provides deployment configurations optimized for different hardware platforms and use cases.
3.1 Basic Configuration
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model variant, deployment strategy, and thinking capabilities.
sglang serve \
  --model mistralai/Ministral-3-8B-Instruct-2512 \
  --trust-remote-code \
  --tool-call-parser mistral
3.2 Configuration Tips
Context length vs memory: Ministral-3 advertises a long context window; if you are memory-constrained, start by lowering --context-length (for example 32768) and increase once things are stable.
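For instance, a memory-constrained launch might look like the following (the 32768 value is the illustrative starting point suggested above, not a tuned recommendation):

```shell
sglang serve \
  --model-path mistralai/Ministral-3-14B-Instruct-2512 \
  --context-length 32768 \
  --trust-remote-code
```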
Pre-installation steps: Run the following commands after launching the Docker container:
pip install mistral-common --upgrade
pip install transformers==5.0.0.rc0
4. Model Invocation
4.1 Basic Usage
For basic API usage and request examples, please refer to:
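Since the server is launched with --tool-call-parser mistral, requests can attach tools in the OpenAI-compatible format. A sketch of such a request payload (the get_weather tool schema is a hypothetical example, not part of this document):

```python
import json

# Hypothetical tool schema; the name and parameters are illustrative.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def build_tool_call_request(prompt: str) -> dict:
    """Build an OpenAI-compatible chat request with one tool attached."""
    return {
        "model": "mistralai/Ministral-3-14B-Instruct-2512",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [get_weather_tool],
        "tool_choice": "auto",
    }

payload = build_tool_call_request("What's the weather in Paris?")
print(json.dumps(payload, indent=2))
```

POSTing this body to the server's /v1/chat/completions endpoint should return a response whose message carries tool_calls when the model decides to use the tool.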
4.2 Advanced Usage
4.2.1 Launch the docker
docker pull lmsysorg/sglang:v0.5.9-rocm720-mi30x
docker run -d -it --ipc=host --network=host --privileged \
--cap-add=CAP_SYS_ADMIN \
--device=/dev/kfd --device=/dev/dri --device=/dev/mem \
--group-add video --cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
-v /:/work \
-e SHELL=/bin/bash \
--name Ministral \
lmsysorg/sglang:v0.5.9-rocm720-mi30x \
/bin/bash
4.2.2 Launch the server
sglang serve \
--model-path mistralai/Ministral-3-14B-Instruct-2512 \
--tp 1 \
--trust-remote-code
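Once the server is up, you can sanity-check it from inside the container (assuming SGLang's default port 30000; adjust if you passed a different --port):

```shell
# Liveness probe exposed by SGLang.
curl http://localhost:30000/health

# List the served models; the output should include Ministral-3-14B-Instruct-2512.
curl http://localhost:30000/v1/models
```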
5. Benchmark
This section uses industry-standard configurations for comparable benchmark results.
5.1 Speed Benchmark
Test Environment:
- Hardware: MI300X GPU (8x)
- Model: mistralai/Ministral-3-14B-Instruct-2512
- Tensor Parallelism: 1
- SGLang Version: 0.5.7
- Model Deployment Command:
sglang serve \
--model-path mistralai/Ministral-3-14B-Instruct-2512 \
--tp 1 \
--trust-remote-code
Low Concurrency
- Benchmark Command:
python3 -m sglang.bench_serving \
--backend sglang \
--model mistralai/Ministral-3-14B-Instruct-2512 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 10 \
--max-concurrency 1 \
--request-rate inf
- Test Results:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 65.08
Total input tokens: 6101
Total input text tokens: 6101
Total input vision tokens: 0
Total generated tokens: 4220
Total generated tokens (retokenized): 4218
Request throughput (req/s): 0.15
Input token throughput (tok/s): 93.75
Output token throughput (tok/s): 64.84
Peak output token throughput (tok/s): 151.00
Peak concurrent requests: 2
Total token throughput (tok/s): 158.59
Concurrency: 1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 6505.51
Median E2E Latency (ms): 3037.37
---------------Time to First Token----------------
Mean TTFT (ms): 3709.33
Median TTFT (ms): 53.72
P99 TTFT (ms): 33320.77
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 6.63
Median TPOT (ms): 6.64
P99 TPOT (ms): 6.66
---------------Inter-Token Latency----------------
Mean ITL (ms): 6.64
Median ITL (ms): 6.65
P95 ITL (ms): 6.75
P99 ITL (ms): 6.82
Max ITL (ms): 8.45
==================================================
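As a sanity check on these numbers: with a single active request, the mean inter-token latency directly bounds per-stream decode speed, and 1000 / mean ITL lines up with the reported peak output token throughput:

```python
mean_itl_ms = 6.64        # mean ITL from the report above
peak_output_tps = 151.00  # peak output token throughput from the report above

# One active request emits roughly one token per ITL interval.
tokens_per_second = 1000 / mean_itl_ms
print(round(tokens_per_second, 1))  # ~150.6 tok/s, consistent with the 151 tok/s peak
```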
Medium Concurrency
- Benchmark Command:
python3 -m sglang.bench_serving \
--backend sglang \
--model mistralai/Ministral-3-14B-Instruct-2512 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 80 \
--max-concurrency 16 \
--request-rate inf
- Test Results:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 16
Successful requests: 80
Benchmark duration (s): 31.20
Total input tokens: 39668
Total input text tokens: 39668
Total input vision tokens: 0
Total generated tokens: 40805
Total generated tokens (retokenized): 40783
Request throughput (req/s): 2.56
Input token throughput (tok/s): 1271.38
Output token throughput (tok/s): 1307.82
Peak output token throughput (tok/s): 1760.00
Peak concurrent requests: 22
Total token throughput (tok/s): 2579.20
Concurrency: 13.72
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 5351.07
Median E2E Latency (ms): 5626.45
---------------Time to First Token----------------
Mean TTFT (ms): 280.87
Median TTFT (ms): 68.16
P99 TTFT (ms): 1194.79
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 10.47
Median TPOT (ms): 10.10
P99 TPOT (ms): 20.00
---------------Inter-Token Latency----------------
Mean ITL (ms): 9.96
Median ITL (ms): 9.10
P95 ITL (ms): 9.87
P99 ITL (ms): 51.39
Max ITL (ms): 888.63
==================================================
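The reported Concurrency value follows Little's law (L = λ·W): request throughput times mean end-to-end latency reproduces the measured effective concurrency:

```python
num_requests = 80
duration_s = 31.20
mean_e2e_s = 5.35107  # mean E2E latency from the report, converted to seconds

throughput = num_requests / duration_s  # λ, requests per second
concurrency = throughput * mean_e2e_s   # L = λ · W
print(round(concurrency, 2))  # 13.72, matching the reported Concurrency
```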
High Concurrency
- Benchmark Command:
python3 -m sglang.bench_serving \
--backend sglang \
--model mistralai/Ministral-3-14B-Instruct-2512 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 500 \
--max-concurrency 100 \
--request-rate inf
- Test Results:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 100
Successful requests: 500
Benchmark duration (s): 88.75
Total input tokens: 249831
Total input text tokens: 249831
Total input vision tokens: 0
Total generated tokens: 252662
Total generated tokens (retokenized): 252547
Request throughput (req/s): 5.63
Input token throughput (tok/s): 2815.01
Output token throughput (tok/s): 2846.91
Peak output token throughput (tok/s): 4271.00
Peak concurrent requests: 110
Total token throughput (tok/s): 5661.93
Concurrency: 93.04
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 16514.45
Median E2E Latency (ms): 15834.45
---------------Time to First Token----------------
Mean TTFT (ms): 148.57
Median TTFT (ms): 99.15
P99 TTFT (ms): 455.86
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 32.93
Median TPOT (ms): 34.73
P99 TPOT (ms): 38.05
---------------Inter-Token Latency----------------
Mean ITL (ms): 32.45
Median ITL (ms): 27.30
P95 ITL (ms): 71.73
P99 ITL (ms): 73.45
Max ITL (ms): 328.10
==================================================
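The throughput figures here are internally consistent: total token throughput is simply total tokens divided by the benchmark duration:

```python
total_input_tokens = 249831
total_generated_tokens = 252662
duration_s = 88.75

total_tps = (total_input_tokens + total_generated_tokens) / duration_s
print(round(total_tps, 2))  # ~5661.89, matching the reported 5661.93 up to rounding
```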
5.2 Accuracy Benchmark
Model accuracy on standard benchmarks:
5.2.1 GSM8K Benchmark
- Benchmark Command:
python3 benchmark/gsm8k/bench_sglang.py \
--num-shots 8 \
--num-questions 1316 \
--parallel 1316
- Test Results:
Accuracy: 0.959
Invalid: 0.000
Latency: 29.185 s
Output throughput: 4854.672 token/s
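For reference, an accuracy of 0.959 over the 1316 questions above corresponds to roughly 1262 correct answers:

```python
accuracy = 0.959
num_questions = 1316
print(round(accuracy * num_questions))  # ≈ 1262 questions answered correctly
```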