DracoBench v0.3-challenge100 ability16384 Results

Summary of the generated DracoBench runs. Click a model name to open its per-question details. Generated: 2026-06-17 11:17
Runs: 11
Best score: 98/100
Avg pass rate: 91.7%
Total cost: $1.565433
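
The summary figures above can be reproduced from the per-run scores and costs in the results table. The sketch below assumes that the average pass rate is the unweighted mean of the per-run pass rates and that the total cost is the plain sum of the per-run costs. The page does not state either formula, but both assumptions reproduce the published numbers. The names `RUNS` and `summarize` are illustrative only.

```python
from decimal import Decimal

# (score out of 100, cost in USD) per run, copied from the results table.
RUNS = {
    "z-ai/glm-5.2": (98, "0.169658"),
    "moonshotai/kimi-k2.7-code": (98, "0.189826"),
    "z-ai/glm-5.1": (95, "0.311060"),
    "qwen/qwen3.7-max": (94, "0.199764"),
    "qwen/qwen3.7-plus": (93, "0.151311"),
    "xiaomi/mimo-v2.5-pro": (92, "0.087083"),
    "deepseek/deepseek-v4-pro": (91, "0.110125"),
    "minimax/minimax-m3": (88, "0.124947"),
    "stepfun/step-3.7-flash": (88, "0.163351"),
    "deepseek/deepseek-v4-flash": (87, "0.009602"),
    "tencent/hy3-preview": (85, "0.048706"),
}


def summarize(runs, total_questions=100):
    """Aggregate run count, best score, mean pass rate, and total cost."""
    scores = [score for score, _ in runs.values()]
    # Decimal avoids floating-point drift when summing six-decimal dollar amounts.
    costs = [Decimal(cost) for _, cost in runs.values()]
    return {
        "runs": len(runs),
        "best_score": max(scores),
        # Assumption: unweighted mean of per-run pass rates, as a percentage.
        "avg_pass_rate": round(100 * sum(scores) / (total_questions * len(scores)), 1),
        "total_cost": sum(costs, Decimal("0")),
    }


summary = summarize(RUNS)
print(summary)
```

Costs are kept as strings and parsed with `Decimal` so that the total matches the page to the last digit.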

Benchmark Results

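Latency is the mean per-question latency. Total Time is the duration of the whole run, estimated from each question's started_at and latency_ms. Token Usage aggregates the input, output-side, and hidden reasoning consumption of each complete run: Prompt = the question input plus the system prompt; Completion = total output-side tokens; Reasoning = hidden thinking/reasoning tokens.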
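
One way the Latency and Total Time columns could be derived from per-question records is sketched below. It assumes each record carries an ISO 8601 `started_at` timestamp and an integer `latency_ms`, and that Total Time is the span from the earliest start to the latest finish. The field names come from the description above. The timestamp format, the span formula, the sample records, and the function names are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def mean_latency_ms(records):
    """Mean per-question latency in milliseconds."""
    return sum(r["latency_ms"] for r in records) / len(records)


def estimate_total_time(records):
    """Span from the earliest start to the latest finish (assumed formula)."""
    starts = [datetime.fromisoformat(r["started_at"]) for r in records]
    ends = [
        start + timedelta(milliseconds=r["latency_ms"])
        for start, r in zip(starts, records)
    ]
    return max(ends) - min(starts)


def format_duration(delta):
    """Render a timedelta as 'Xm YYs', truncating fractional seconds."""
    minutes, seconds = divmod(int(delta.total_seconds()), 60)
    return f"{minutes}m {seconds:02d}s"


# Hypothetical per-question records; real runs contain 100 of these.
sample = [
    {"started_at": "2026-06-17T10:00:00+00:00", "latency_ms": 9000},
    {"started_at": "2026-06-17T10:00:10+00:00", "latency_ms": 12000},
    {"started_at": "2026-06-17T10:00:25+00:00", "latency_ms": 8500},
]

print(round(mean_latency_ms(sample)), "ms")
print(format_duration(estimate_total_time(sample)))
```

Under this reading, Total Time can exceed the question count times the mean latency whenever there are gaps between questions. For z-ai/glm-5.2, 100 questions at 10217 ms come to about 17m 02s, while the reported Total Time is 17m 30s.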

| Model | Run | Mode | Score | Suite Breakdown | Latency | Total Time | Cost | Token Usage | Finish |
|---|---|---|---|---|---|---|---|---|---|
| z-ai/glm-5.2 | v0.3-challenge100-z-ai-glm-5-2-ability16384 | JSONL ability16384, temp=mixed | 98/100 (98.0%) | chinese_writing 4/4, coding 27/28, debugging 19/19, instruction_following 6/6, rag_long_context 17/18, reasoning 25/25 | 10217 ms | 17m 30s | $0.169658 | Prompt 10169, Completion 35323, Reasoning 29340 | stop:100 |
| moonshotai/kimi-k2.7-code | v0.3-challenge100-moonshotai-kimi-k2-7-code-ability16384-rescored (rescored) | JSONL ability16384, temp=mixed | 98/100 (98.0%) | chinese_writing 4/4, coding 27/28, debugging 19/19, instruction_following 6/6, rag_long_context 18/18, reasoning 24/25 | 9780 ms | 16m 20s | $0.189826 | Prompt 9984, Completion 46691, Reasoning 38156 | stop:100 |
| z-ai/glm-5.1 | v0.3-challenge100-z-ai-glm-5-1-ability16384-rescored (rescored) | JSONL ability16384, temp=mixed | 95/100 (95.0%) | chinese_writing 4/4, coding 25/28, debugging 19/19, instruction_following 6/6, rag_long_context 17/18, reasoning 24/25 | 23829 ms | 39m 45s | $0.311060 | Prompt 9373, Completion 76288, Reasoning 65021 | length:1, stop:98, unknown:1 |
| qwen/qwen3.7-max | v0.3-challenge100-qwen-qwen3-7-max-ability16384 | JSONL ability16384, temp=mixed | 94/100 (94.0%) | chinese_writing 4/4, coding 27/28, debugging 18/19, instruction_following 6/6, rag_long_context 14/18, reasoning 25/25 | 11653 ms | 19m 27s | $0.199764 | Prompt 10627, Completion 49728, Reasoning 41243 | stop:100 |
| qwen/qwen3.7-plus | v0.3-challenge100-qwen-qwen3-7-plus-ability16384-rescored (rescored) | JSONL ability16384, temp=mixed | 93/100 (93.0%) | chinese_writing 4/4, coding 25/28, debugging 19/19, instruction_following 6/6, rag_long_context 15/18, reasoning 24/25 | 22070 ms | 37m 15s | $0.151311 | Prompt 10627, Completion 115555, Reasoning 108335 | stop:100 |
| xiaomi/mimo-v2.5-pro | v0.3-challenge100-xiaomi-mimo-v2-5-pro-ability16384 | JSONL ability16384, temp=mixed | 92/100 (92.0%) | chinese_writing 4/4, coding 24/28, debugging 19/19, instruction_following 6/6, rag_long_context 17/18, reasoning 22/25 | 26795 ms | 44m 41s | $0.087083 | Prompt 10368, Completion 83683, Reasoning 74311 | stop:100 |
| deepseek/deepseek-v4-pro | v0.3-challenge100-deepseek-deepseek-v4-pro-ability16384-rescored (rescored) | JSONL ability16384, temp=mixed | 91/100 (91.0%) | chinese_writing 4/4, coding 26/28, debugging 19/19, instruction_following 6/6, rag_long_context 16/18, reasoning 20/25 | 11827 ms | 19m 45s | $0.110125 | Prompt 9385, Completion 44207, Reasoning 36756 | stop:100 |
| minimax/minimax-m3 | v0.3-challenge100-minimax-minimax-m3-ability16384-rescored (rescored) | JSONL ability16384, temp=mixed | 88/100 (88.0%) | chinese_writing 4/4, coding 26/28, debugging 19/19, instruction_following 5/6, rag_long_context 17/18, reasoning 17/25 | 19630 ms | 32m 45s | $0.124947 | Prompt 25254, Completion 81004, Reasoning 65731 | length:1, stop:99 |
| stepfun/step-3.7-flash | v0.3-challenge100-stepfun-step-3-7-flash-ability16384-rescored (rescored) | JSONL ability16384, temp=mixed | 88/100 (88.0%) | chinese_writing 4/4, coding 22/28, debugging 16/19, instruction_following 6/6, rag_long_context 15/18, reasoning 25/25 | 14435 ms | 25m 06s | $0.163351 | Prompt 10685, Completion 140186, Reasoning 0 | length:2, stop:98 |
| deepseek/deepseek-v4-flash | v0.3-challenge100-deepseek-deepseek-v4-flash-ability16384-rescored (rescored) | JSONL ability16384, temp=mixed | 87/100 (87.0%) | chinese_writing 4/4, coding 24/28, debugging 19/19, instruction_following 6/6, rag_long_context 15/18, reasoning 19/25 | 8094 ms | 13m 32s | $0.009602 | Prompt 9572, Completion 31505, Reasoning 24626 | stop:100 |
| tencent/hy3-preview | v0.3-challenge100-tencent-hy3-preview-ability16384 | JSONL ability16384, temp=mixed | 85/100 (85.0%) | chinese_writing 4/4, coding 21/28, debugging 15/19, instruction_following 6/6, rag_long_context 16/18, reasoning 23/25 | 19290 ms | 32m 10s | $0.048706 | Prompt 9930, Completion 180894, Reasoning 142453 | length:1, stop:96, unknown:3 |
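
The Token Usage figures can be split into visible and hidden output. The sketch below assumes that Completion already includes the Reasoning tokens, so visible output is Completion minus Reasoning. That is one reading of "total output-side tokens", not something the page confirms. Only three runs are included to keep the example short, and `TOKENS` and `visible_and_share` are illustrative names.

```python
# (completion tokens, reasoning tokens) for three runs, copied from the Token Usage column.
TOKENS = {
    "z-ai/glm-5.2": (35323, 29340),
    "qwen/qwen3.7-plus": (115555, 108335),
    "stepfun/step-3.7-flash": (140186, 0),
}


def visible_and_share(completion, reasoning):
    """Return (visible output tokens, reasoning share of completion tokens).

    Assumption: completion includes reasoning tokens.
    """
    visible = completion - reasoning
    share = reasoning / completion if completion else 0.0
    return visible, share


for model, (completion, reasoning) in TOKENS.items():
    visible, share = visible_and_share(completion, reasoning)
    print(f"{model:<26} visible={visible:>7d} reasoning_share={share:.1%}")
```

A Reasoning value of 0, as in the stepfun/step-3.7-flash run, may only mean that reasoning tokens were not reported separately for that run. The page does not say which, so the computed share for that row should be read with caution.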