MiniMax-M2.7
1. Model Introduction
MiniMax-M2.7 is MiniMax's first model deeply participating in its own evolution. Built for real-world productivity, M2.7 excels at building complex agent harnesses and completing highly elaborate productivity tasks, leveraging Agent Teams, complex Skills, and dynamic tool search.
Key highlights:
- Model Self-Evolution: During development, M2.7 updates its own memory, builds complex skills for RL experiments, and improves its own learning process. An internal version autonomously optimized a programming scaffold over 100+ rounds, achieving a 30% performance improvement. On MLE Bench Lite, M2.7 achieved a 66.6% medal rate.
- Professional Software Engineering: Delivers outstanding real-world programming capabilities. On SWE-Pro, M2.7 achieved 56.22%, with strong results on SWE Multilingual (76.5) and Multi SWE Bench (52.7). On Terminal Bench 2 (57.0%) and NL2Repo (39.8%), M2.7 demonstrates deep understanding of complex engineering systems.
- Professional Work: Achieved an ELO score of 1495 on GDPval-AA (highest among open-source models). On Toolathon, M2.7 reached 46.3% accuracy (global top tier).
- Native Agent Teams: Supports multi-agent collaboration with stable role identity and autonomous decision-making.
For more details, see the official MiniMax-M2.7 blog post.
License: Modified-MIT (MiniMax Model License)
2. SGLang Installation
SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements.
Please refer to the official SGLang installation guide for installation instructions.
Docker Images by Hardware Platform:
| Hardware Platform | Docker Image |
|---|---|
| NVIDIA A100 / H100 / H200 / B200 | lmsysorg/sglang:v0.5.10.post1 |
| NVIDIA B300 / GB300 | lmsysorg/sglang:v0.5.10.post1-cu130 |
| AMD MI300X / MI325X | lmsysorg/sglang:v0.5.10.post1-rocm720-mi30x |
| AMD MI355X | lmsysorg/sglang:v0.5.10.post1-rocm720-mi35x |
3. Model Deployment
This section provides deployment configurations optimized for different hardware platforms and use cases.
3.1 Basic Configuration
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, deployment strategy, and feature capabilities.
sglang serve \ --model-path MiniMaxAI/MiniMax-M2.7 \ --tp 4 \ --tool-call-parser minimax-m2 \ --reasoning-parser minimax-append-think \ --trust-remote-code \ --mem-fraction-static 0.85
3.2 Configuration Tips
Key Parameters:
| Parameter | Description | Recommended Value |
|---|---|---|
--tool-call-parser | Tool call parser for function calling support | minimax-m2 |
--reasoning-parser | Reasoning parser for thinking mode | minimax-append-think |
--trust-remote-code | Required for MiniMax model loading | Always enabled |
--mem-fraction-static | Static memory fraction for KV cache | 0.85 |
--tp | Tensor parallelism size | 2 / 4 / 8 depending on hardware |
--ep | Expert parallelism size | 8 (NVIDIA 8-GPU) or EP=TP (AMD) |
--kv-cache-dtype | KV cache data type (AMD only) | fp8_e4m3 |
--attention-backend | Attention backend (AMD only) | triton |
Hardware Requirements: NVIDIA
- 4-GPU deployment: Requires 4× high-memory GPUs (e.g., H200, B200, A100, H100) with TP=4
- 8-GPU deployment: Requires 8× GPUs (e.g., H200, B200, A100, H100) with TP=8 and EP=8
Hardware Requirements: NVIDIA GB300
- 2-GPU deployment: GB300 (275GB per die) can host the model with TP=2
- 4-GPU deployment: Maximum single-node TP for GB300, recommended for higher throughput
Hardware Requirements: AMD
- 2-GPU deployment: Requires 2× high-memory GPUs (e.g., MI300X, MI325X, MI355X) with TP=2, EP=2
- 4-GPU deployment: Requires 4× GPUs (e.g., MI300X, MI325X, MI355X) with TP=4, EP=4
- 8-GPU deployment: Requires 8× GPUs (e.g., MI300X, MI325X, MI355X) with TP=8, EP=8
4. Model Invocation
4.1 Basic Usage
For basic API usage and request examples, please refer to:
Deployment Command:
sglang serve \
--model-path MiniMaxAI/MiniMax-M2.7 \
--tp 4 \
--tool-call-parser minimax-m2 \
--reasoning-parser minimax-append-think \
--trust-remote-code \
--mem-fraction-static 0.85
Testing Deployment:
After startup, you can test the SGLang OpenAI-compatible API with the following command:
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MiniMaxAI/MiniMax-M2.7",
"messages": [
{"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
{"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
]
}'
Simple Completion Example:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:30000/v1",
api_key="EMPTY"
)
response = client.chat.completions.create(
model="MiniMaxAI/MiniMax-M2.7",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who won the world series in 2020?"}
],
max_tokens=1024
)
print(response.choices[0].message.content)
Example Output:
<think>The user asks: "Who won the World Series in 2020?" That's a simple factual question. The answer: the Los Angeles Dodgers won the 2020 MLB World Series, defeating the Tampa Bay Rays. So answer accordingly.
We must be mindful of policy: it's a factual question about sports. It's allowed. Provide answer with brief context.
We should answer concisely.
Hence final answer: The Los Angeles Dodgers won the 2020 World Series, defeating the Tampa Bay Rays in six games (best-of-seven series). Possibly mention it was played at a neutral site due to COVID-19, at Globe Life Field in Arlington, Texas.
We must avoid disallowed content, no issue.
Thus final.
</think>
The **Los Angeles Dodgers** won the 2020 World Series. They defeated the **Tampa Bay Rays** in six games (4‑2) in a best‑of‑seven series that was played at Globe Life Field in Arlington, Texas, under the MLB bubble‑like arrangements for the COVID‑19 pandemic.
4.2 Advanced Usage
4.2.1 Reasoning Parser
MiniMax-M2.7 supports Thinking mode. Enable the reasoning parser during deployment to separate the thinking and the content sections:
sglang serve \
--model-path MiniMaxAI/MiniMax-M2.7 \
--tp 4 \
--reasoning-parser minimax-append-think \
--trust-remote-code \
--mem-fraction-static 0.85
Streaming with Thinking Process
With minimax-append-think, the thinking content is wrapped in <think>...</think> tags within the content field. You can parse these tags on the client side to separate the thinking and content sections:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:30000/v1",
api_key="EMPTY"
)
# Enable streaming to see the thinking process in real-time
response = client.chat.completions.create(
model="MiniMaxAI/MiniMax-M2.7",
messages=[
{"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
],
max_tokens=2048,
stream=True
)
# Process the stream, separating <think>...</think> from content
in_think = False
think_printed_header = False
content_printed_header = False
buffer = ""
for chunk in response:
if chunk.choices and len(chunk.choices) > 0:
delta = chunk.choices[0].delta
if delta.content:
buffer += delta.content
while buffer:
if in_think:
# Look for closing </think> tag
end_idx = buffer.find("</think>")
if end_idx != -1:
print(buffer[:end_idx], end="", flush=True)
buffer = buffer[end_idx + len("</think>"):]
in_think = False
else:
# Still in thinking, print what we have
print(buffer, end="", flush=True)
buffer = ""
else:
# Look for opening <think> tag
start_idx = buffer.find("<think>")
if start_idx != -1:
# Print any content before <think>
before = buffer[:start_idx]
if before:
if not content_printed_header:
print("=============== Content =================", flush=True)
content_printed_header = True
print(before, end="", flush=True)
buffer = buffer[start_idx + len("<think>"):]
in_think = True
if not think_printed_header:
print("=============== Thinking =================", flush=True)
think_printed_header = True
else:
# No <think> tag, print as content
if not content_printed_header and think_printed_header:
print("\n=============== Content =================", flush=True)
content_printed_header = True
print(buffer, end="", flush=True)
buffer = ""
print()
Output Example:
=============== Thinking =================
The user asks: "Solve this problem step by step: What is 15% of 240?" Straightforward. Provide solution: 15% = 15/100 = 0.15. Multiply 240 * 0.15 = 36. Show steps. So answer: 36. Provide explanation.
But also ensure we follow any policy? No issues. Just straightforward.
I'll provide a step-by-step solution.
Also could show fraction: 15% = 15/100 = 3/20, multiply 240 * 3/20 = (240/20)*3 = 12*3 = 36.
Yes. Provide final answer. Also show verification: 10% of 240 is 24, 5% is 12, total 36.
All good.
=============== Content =================
**Step‑by‑step solution**
1. **Convert the percent to a decimal (or a fraction).**
15% = 15/100 = 0.15 = 3/20
2. **Multiply the original number (240) by this decimal/fraction.**
Using the decimal:
240 × 0.15 = 36
Or using the fraction:
240 × 3/20 = (240/20) × 3 = 12 × 3 = 36
3. **Result:**
15% of 240 = **36**
*Check:*
- 10% of 240 = 24
- 5% of 240 = 12
- Adding them: 24 + 12 = 36, which matches the calculation.
Note: The minimax-append-think reasoning parser embeds the thinking process in <think>...</think> tags within the content field. The code above parses these tags in real-time to display thinking and content separately.
4.2.2 Tool Calling
MiniMax-M2.7 supports tool calling capabilities. Enable the tool call parser:
sglang serve \
--model-path MiniMaxAI/MiniMax-M2.7 \
--tp 4 \
--tool-call-parser minimax-m2 \
--reasoning-parser minimax-append-think \
--trust-remote-code \
--mem-fraction-static 0.85
Python Example:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:30000/v1",
api_key="EMPTY"
)
# Define available tools
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city name"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "Temperature unit"
}
},
"required": ["location"]
}
}
}
]
# Non-streaming request
response = client.chat.completions.create(
model="MiniMaxAI/MiniMax-M2.7",
messages=[
{"role": "user", "content": "What's the weather in Beijing?"}
],
tools=tools
)
message = response.choices[0].message
# Check for tool calls
if message.tool_calls:
for tool_call in message.tool_calls:
print(f"Tool Call: {tool_call.function.name}")
print(f" Arguments: {tool_call.function.arguments}")
else:
print(message.content)
Output Example:
Tool Call: get_weather
Arguments: {"location": "Beijing"}
Handling Tool Call Results:
# After getting the tool call, execute the function
def get_weather(location, unit="celsius"):
# Your actual weather API call here
return f"The weather in {location} is 22°{unit[0].upper()} and sunny."
# Send tool result back to the model
messages = [
{"role": "user", "content": "What's the weather in Beijing?"},
{
"role": "assistant",
"content": None,
"tool_calls": [{
"id": "call_123",
"type": "function",
"function": {
"name": "get_weather",
"arguments": '{"location": "Beijing", "unit": "celsius"}'
}
}]
},
{
"role": "tool",
"tool_call_id": "call_123",
"content": get_weather("Beijing", "celsius")
}
]
final_response = client.chat.completions.create(
model="MiniMaxAI/MiniMax-M2.7",
messages=messages
)
print(final_response.choices[0].message.content)
Output Example:
The weather in Beijing is currently 22°C and sunny.
5. Benchmark
This section uses industry-standard configurations for comparable benchmark results.
Test Environment:
- Hardware: 2× NVIDIA GB300 (275GB per die)
- Docker Image:
lmsysorg/sglang:v0.5.10.post1-cu130 - Model: MiniMax-M2.7 (FP8)
- Tensor Parallelism: 2
- SGLang version: 0.5.10.post1
5.1 Accuracy Benchmark
Evaluation Tool: NVIDIA NeMo-Skills
Evaluation Settings: temperature=0.6, top_p=0.95, 8 seeds, max_tokens=120,000, parse_reasoning=True
5.1.1 GPQA Diamond
- Dataset: GPQA Diamond (198 questions)
- Prompt:
eval/aai/mcq-4choices(4-choice multiple choice, matching Artificial Analysis methodology) - Evaluation command:
ns prepare_data gpqa
ns eval \
--cluster=local \
--server_type=openai \
--model=MiniMaxAI/MiniMax-M2.7 \
--server_address=http://localhost:30000/v1 \
--output_dir=./m2.7-eval/ \
--benchmarks=gpqa:8 \
++prompt_config=eval/aai/mcq-4choices \
++inference.tokens_to_generate=120000 \
++inference.temperature=0.6 \
++inference.top_p=0.95 \
++parse_reasoning=True
- Test Results:
| Evaluation Mode | Accuracy | No Answer |
|---|---|---|
| pass@1 (avg-of-8) | 84.91% | 3.54% |
| majority@8 | 88.89% | 0.00% |
| pass@8 | 96.46% | 0.00% |
5.1.2 AIME 2025
- Dataset: AIME 2025 (30 problems)
- Prompt:
generic/math(boxed answer format) - Evaluation command:
ns prepare_data aime25
ns eval \
--cluster=local \
--server_type=openai \
--model=MiniMaxAI/MiniMax-M2.7 \
--server_address=http://localhost:30000/v1 \
--output_dir=./m2.7-eval/ \
--benchmarks=aime25:8 \
++inference.tokens_to_generate=120000 \
++inference.temperature=0.6 \
++inference.top_p=0.95 \
++parse_reasoning=True
- Test Results:
| Evaluation Mode | Accuracy | No Answer |
|---|---|---|
| pass@1 (avg-of-8) | 92.50% ± 5.56% | 2.92% |
| majority@8 | 97.08% | 0.00% |
| pass@8 | 100.00% | 0.00% |
5.1.3 MMLU-Pro
- Dataset: MMLU-Pro (12,032 questions, 10-choice)
- Prompt:
eval/aai/mcq-10choices(10-choice multiple choice) - Evaluation command:
ns prepare_data mmlu-pro
ns eval \
--cluster=local \
--server_type=openai \
--model=MiniMaxAI/MiniMax-M2.7 \
--server_address=http://localhost:30000/v1 \
--output_dir=./m2.7-eval/ \
--benchmarks=mmlu-pro \
++prompt_config=eval/aai/mcq-10choices \
++inference.tokens_to_generate=32768 \
++inference.temperature=0.0 \
++parse_reasoning=True
- Test Results:
| Evaluation Mode | Accuracy | No Answer |
|---|---|---|
| pass@1 (greedy) | 69.41% | 18.75% |
Note: The high no-answer rate is due to the 32K token limit being insufficient for M2.7's extended thinking on some questions. A rerun with 120K tokens is expected to improve accuracy significantly.
5.1.4 GSM8K Benchmark
- Benchmark Method: 8-shot Chain-of-Thought, evaluated via OpenAI-compatible API
- Test Results:
GSM8K Results (8-shot CoT)
Model: MiniMaxAI/MiniMax-M2.7
Total: 1319
Correct: 1218
Accuracy: 92.34%
5.2 Speed Benchmark
5.2.1 Low Concurrency
- Benchmark Command:
python3 -m sglang.bench_serving \
--backend sglang \
--model MiniMaxAI/MiniMax-M2.7 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 10 \
--max-concurrency 1
- Test Results:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 34.33
Total input tokens: 6101
Total generated tokens: 4220
Request throughput (req/s): 0.29
Input token throughput (tok/s): 177.71
Output token throughput (tok/s): 122.92
Total token throughput (tok/s): 300.63
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 3431.21
Median E2E Latency (ms): 2742.57
---------------Time to First Token----------------
Mean TTFT (ms): 50.28
Median TTFT (ms): 53.85
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 8.02
Median TPOT (ms): 8.01
---------------Inter-Token Latency----------------
Mean ITL (ms): 8.03
Median ITL (ms): 8.02
==================================================
5.2.2 High Concurrency
- Benchmark Command:
python3 -m sglang.bench_serving \
--backend sglang \
--model MiniMaxAI/MiniMax-M2.7 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 500 \
--max-concurrency 100
- Test Results:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 100
Successful requests: 500
Benchmark duration (s): 100.20
Total input tokens: 249831
Total generated tokens: 252662
Request throughput (req/s): 4.99
Input token throughput (tok/s): 2493.41
Output token throughput (tok/s): 2521.66
Total token throughput (tok/s): 5015.07
Concurrency: 90.19
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 18072.69
Median E2E Latency (ms): 17761.84
---------------Time to First Token----------------
Mean TTFT (ms): 247.94
Median TTFT (ms): 92.05
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 35.75
Median TPOT (ms): 36.67
---------------Inter-Token Latency----------------
Mean ITL (ms): 35.34
Median ITL (ms): 30.55
==================================================