Runs: 11
Best score: 98/100
Avg pass rate: 91.7%
Total cost: $1.565433
Benchmark Results
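Latency is the average per-question latency. Total Time is the duration of the whole run, estimated from each question's started_at and latency_ms. Token Usage sums the input-side, output-side, and hidden reasoning consumption of each complete run: Prompt = the input questions plus the system prompt; Completion = the output-side total; Reasoning = hidden thinking/reasoning tokens.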
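As a rough illustration of how these two timing columns could be derived from per-question records, here is a minimal Python sketch. The exact formula used by the benchmark is not stated in the source. This sketch assumes Total Time is the latest finish time minus the earliest start time, and that `started_at` is an ISO 8601 string without a trailing `Z`. The function name `estimate_run_stats` and the sample records are hypothetical.

```python
from datetime import datetime, timedelta


def estimate_run_stats(records):
    """Return (average latency in ms, estimated total run time).

    `records` is a list of dicts with 'started_at' (ISO 8601 string)
    and 'latency_ms' (number). Assumption: total time is measured from
    the earliest start to the latest finish, which may differ from the
    benchmark's own estimate.
    """
    starts = [datetime.fromisoformat(r["started_at"]) for r in records]
    ends = [
        start + timedelta(milliseconds=r["latency_ms"])
        for start, r in zip(starts, records)
    ]
    avg_latency_ms = sum(r["latency_ms"] for r in records) / len(records)
    total_time = max(ends) - min(starts)
    return avg_latency_ms, total_time


# Made-up sample records, for illustration only.
sample = [
    {"started_at": "2024-01-01T00:00:00", "latency_ms": 1000},
    {"started_at": "2024-01-01T00:00:02", "latency_ms": 3000},
]
avg_ms, total = estimate_run_stats(sample)
print(avg_ms, total)
```

Under this assumption, gaps between questions count toward Total Time but not toward Latency, which would be consistent with Total Time in the table being slightly larger than 100 times the average latency.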
| Model | Run | Mode | Score | Suite Breakdown | Latency | Total Time | Cost | Token Usage | Finish |
|---|---|---|---|---|---|---|---|---|---|
| z-ai/glm-5.2 | v0.3-challenge100-z-ai-glm-5-2-ability16384 (JSONL) | ability16384 temp=mixed | 98/100 (98.0%) | chinese_writing 4/4, coding 27/28, debugging 19/19, instruction_following 6/6, rag_long_context 17/18, reasoning 25/25 | 10217 ms | 17m 30s | $0.169658 | Prompt 10169, Completion 35323, Reasoning 29340 | stop:100 |
| moonshotai/kimi-k2.7-code | v0.3-challenge100-moonshotai-kimi-k2-7-code-ability16384-rescored (JSONL) | ability16384 temp=mixed | 98/100 (98.0%) | chinese_writing 4/4, coding 27/28, debugging 19/19, instruction_following 6/6, rag_long_context 18/18, reasoning 24/25 | 9780 ms | 16m 20s | $0.189826 | Prompt 9984, Completion 46691, Reasoning 38156 | stop:100 |
| z-ai/glm-5.1 | v0.3-challenge100-z-ai-glm-5-1-ability16384-rescored (JSONL) | ability16384 temp=mixed | 95/100 (95.0%) | chinese_writing 4/4, coding 25/28, debugging 19/19, instruction_following 6/6, rag_long_context 17/18, reasoning 24/25 | 23829 ms | 39m 45s | $0.311060 | Prompt 9373, Completion 76288, Reasoning 65021 | length:1, stop:98, unknown:1 |
| qwen/qwen3.7-max | v0.3-challenge100-qwen-qwen3-7-max-ability16384 (JSONL) | ability16384 temp=mixed | 94/100 (94.0%) | chinese_writing 4/4, coding 27/28, debugging 18/19, instruction_following 6/6, rag_long_context 14/18, reasoning 25/25 | 11653 ms | 19m 27s | $0.199764 | Prompt 10627, Completion 49728, Reasoning 41243 | stop:100 |
| qwen/qwen3.7-plus | v0.3-challenge100-qwen-qwen3-7-plus-ability16384-rescored (JSONL) | ability16384 temp=mixed | 93/100 (93.0%) | chinese_writing 4/4, coding 25/28, debugging 19/19, instruction_following 6/6, rag_long_context 15/18, reasoning 24/25 | 22070 ms | 37m 15s | $0.151311 | Prompt 10627, Completion 115555, Reasoning 108335 | stop:100 |
| xiaomi/mimo-v2.5-pro | v0.3-challenge100-xiaomi-mimo-v2-5-pro-ability16384 (JSONL) | ability16384 temp=mixed | 92/100 (92.0%) | chinese_writing 4/4, coding 24/28, debugging 19/19, instruction_following 6/6, rag_long_context 17/18, reasoning 22/25 | 26795 ms | 44m 41s | $0.087083 | Prompt 10368, Completion 83683, Reasoning 74311 | stop:100 |
| deepseek/deepseek-v4-pro | v0.3-challenge100-deepseek-deepseek-v4-pro-ability16384-rescored (JSONL) | ability16384 temp=mixed | 91/100 (91.0%) | chinese_writing 4/4, coding 26/28, debugging 19/19, instruction_following 6/6, rag_long_context 16/18, reasoning 20/25 | 11827 ms | 19m 45s | $0.110125 | Prompt 9385, Completion 44207, Reasoning 36756 | stop:100 |
| minimax/minimax-m3 | v0.3-challenge100-minimax-minimax-m3-ability16384-rescored (JSONL) | ability16384 temp=mixed | 88/100 (88.0%) | chinese_writing 4/4, coding 26/28, debugging 19/19, instruction_following 5/6, rag_long_context 17/18, reasoning 17/25 | 19630 ms | 32m 45s | $0.124947 | Prompt 25254, Completion 81004, Reasoning 65731 | length:1, stop:99 |
| stepfun/step-3.7-flash | v0.3-challenge100-stepfun-step-3-7-flash-ability16384-rescored (JSONL) | ability16384 temp=mixed | 88/100 (88.0%) | chinese_writing 4/4, coding 22/28, debugging 16/19, instruction_following 6/6, rag_long_context 15/18, reasoning 25/25 | 14435 ms | 25m 06s | $0.163351 | Prompt 10685, Completion 140186, Reasoning 0 | length:2, stop:98 |
| deepseek/deepseek-v4-flash | v0.3-challenge100-deepseek-deepseek-v4-flash-ability16384-rescored (JSONL) | ability16384 temp=mixed | 87/100 (87.0%) | chinese_writing 4/4, coding 24/28, debugging 19/19, instruction_following 6/6, rag_long_context 15/18, reasoning 19/25 | 8094 ms | 13m 32s | $0.009602 | Prompt 9572, Completion 31505, Reasoning 24626 | stop:100 |
| tencent/hy3-preview | v0.3-challenge100-tencent-hy3-preview-ability16384 (JSONL) | ability16384 temp=mixed | 85/100 (85.0%) | chinese_writing 4/4, coding 21/28, debugging 15/19, instruction_following 6/6, rag_long_context 16/18, reasoning 23/25 | 19290 ms | 32m 10s | $0.048706 | Prompt 9930, Completion 180894, Reasoning 142453 | length:1, stop:96, unknown:3 |
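
The summary figures at the top of the page (Runs, Best score, Avg pass rate, Total cost) can be cross-checked against the per-run rows. The short Python sketch below does that arithmetic. It assumes the average pass rate is the unweighted mean of the per-run scores, which the source does not state explicitly but which matches the reported 91.7%. The helper name `summarize` is hypothetical. The scores and costs are copied from the table.

```python
# (score out of 100, cost in USD) for each run, copied from the results table.
RUNS = [
    (98, 0.169658),  # z-ai/glm-5.2
    (98, 0.189826),  # moonshotai/kimi-k2.7-code
    (95, 0.311060),  # z-ai/glm-5.1
    (94, 0.199764),  # qwen/qwen3.7-max
    (93, 0.151311),  # qwen/qwen3.7-plus
    (92, 0.087083),  # xiaomi/mimo-v2.5-pro
    (91, 0.110125),  # deepseek/deepseek-v4-pro
    (88, 0.124947),  # minimax/minimax-m3
    (88, 0.163351),  # stepfun/step-3.7-flash
    (87, 0.009602),  # deepseek/deepseek-v4-flash
    (85, 0.048706),  # tencent/hy3-preview
]


def summarize(runs):
    """Aggregate per-run (score, cost) pairs into the page's summary figures.

    Assumption: every run has 100 questions, so the score equals the
    pass rate in percent, and the average is an unweighted mean.
    """
    scores = [score for score, _ in runs]
    return {
        "runs": len(runs),
        "best_score": max(scores),
        "avg_pass_rate": round(sum(scores) / len(scores), 1),
        "total_cost": round(sum(cost for _, cost in runs), 6),
    }


print(summarize(RUNS))
# {'runs': 11, 'best_score': 98, 'avg_pass_rate': 91.7, 'total_cost': 1.565433}
```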