DracoBench v0.3-challenge100 ability16384 Results

Runs

11

Best score

98/100

Avg pass rate

91.7%

Total cost

$1.565433

Benchmark Results

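Latency is the mean per-question latency; Total Time is the duration of the full run, estimated from each question's started_at and latency_ms. Token Usage sums the input, output-side, and hidden reasoning tokens for each complete run: Prompt = the input question plus the system prompt; Completion = total output-side tokens; Reasoning = hidden thinking/reasoning tokens.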
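The Latency and Total Time columns can be reproduced from per-question records along the following lines. This is only a sketch: the source names the fields started_at and latency_ms but not their format or the exact estimation formula, so the ISO 8601 timestamps, the `summarize_run` helper, the sample records, and the "latest finish minus earliest start" rule are all assumptions.

```python
from datetime import datetime, timedelta


def summarize_run(records):
    """Return (mean_latency_ms, total_seconds) for one run.

    Assumes each record carries an ISO 8601 `started_at` string and an
    integer `latency_ms`; the real JSONL schema may differ.
    """
    starts = [datetime.fromisoformat(r["started_at"]) for r in records]
    ends = [
        start + timedelta(milliseconds=r["latency_ms"])
        for start, r in zip(starts, records)
    ]
    # Latency column: arithmetic mean of the per-question latencies.
    mean_latency_ms = sum(r["latency_ms"] for r in records) / len(records)
    # Total Time column (assumed rule): latest finish minus earliest start.
    total_seconds = (max(ends) - min(starts)).total_seconds()
    return mean_latency_ms, total_seconds


# Hypothetical records, not taken from any real run.
records = [
    {"started_at": "2025-01-01T00:00:00+00:00", "latency_ms": 8000},
    {"started_at": "2025-01-01T00:00:09+00:00", "latency_ms": 12000},
    {"started_at": "2025-01-01T00:00:22+00:00", "latency_ms": 10000},
]

mean_latency_ms, total_seconds = summarize_run(records)
print(f"Latency {mean_latency_ms:.0f} ms, Total Time {total_seconds:.0f} s")
```

Under this rule, Total Time includes any gaps between questions, so it can exceed the number of questions times the mean latency. The z-ai/glm-5.2 row fits that pattern: 100 × 10217 ms is about 17m 02s, a little under the reported 17m 30s.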

Model	Run	Mode	Score	Suite Breakdown	Latency	Total Time	Cost	Token Usage	Finish
z-ai/glm-5.2 `v0.3-challenge100-z-ai-glm-5-2-ability16384`	JSONL	ability16384 temp=mixed	98/100 (98.0%)	chinese_writing 4/4, coding 27/28, debugging 19/19, instruction_following 6/6, rag_long_context 17/18, reasoning 25/25	10217 ms	17m 30s	$0.169658	Prompt 10169, Completion 35323, Reasoning 29340	stop:100
moonshotai/kimi-k2.7-code `v0.3-challenge100-moonshotai-kimi-k2-7-code-ability16384-rescored` (rescored)	JSONL	ability16384 temp=mixed	98/100 (98.0%)	chinese_writing 4/4, coding 27/28, debugging 19/19, instruction_following 6/6, rag_long_context 18/18, reasoning 24/25	9780 ms	16m 20s	$0.189826	Prompt 9984, Completion 46691, Reasoning 38156	stop:100
z-ai/glm-5.1 `v0.3-challenge100-z-ai-glm-5-1-ability16384-rescored` (rescored)	JSONL	ability16384 temp=mixed	95/100 (95.0%)	chinese_writing 4/4, coding 25/28, debugging 19/19, instruction_following 6/6, rag_long_context 17/18, reasoning 24/25	23829 ms	39m 45s	$0.311060	Prompt 9373, Completion 76288, Reasoning 65021	length:1, stop:98, unknown:1
qwen/qwen3.7-max `v0.3-challenge100-qwen-qwen3-7-max-ability16384`	JSONL	ability16384 temp=mixed	94/100 (94.0%)	chinese_writing 4/4, coding 27/28, debugging 18/19, instruction_following 6/6, rag_long_context 14/18, reasoning 25/25	11653 ms	19m 27s	$0.199764	Prompt 10627, Completion 49728, Reasoning 41243	stop:100
qwen/qwen3.7-plus `v0.3-challenge100-qwen-qwen3-7-plus-ability16384-rescored` (rescored)	JSONL	ability16384 temp=mixed	93/100 (93.0%)	chinese_writing 4/4, coding 25/28, debugging 19/19, instruction_following 6/6, rag_long_context 15/18, reasoning 24/25	22070 ms	37m 15s	$0.151311	Prompt 10627, Completion 115555, Reasoning 108335	stop:100
xiaomi/mimo-v2.5-pro `v0.3-challenge100-xiaomi-mimo-v2-5-pro-ability16384`	JSONL	ability16384 temp=mixed	92/100 (92.0%)	chinese_writing 4/4, coding 24/28, debugging 19/19, instruction_following 6/6, rag_long_context 17/18, reasoning 22/25	26795 ms	44m 41s	$0.087083	Prompt 10368, Completion 83683, Reasoning 74311	stop:100
deepseek/deepseek-v4-pro `v0.3-challenge100-deepseek-deepseek-v4-pro-ability16384-rescored` (rescored)	JSONL	ability16384 temp=mixed	91/100 (91.0%)	chinese_writing 4/4, coding 26/28, debugging 19/19, instruction_following 6/6, rag_long_context 16/18, reasoning 20/25	11827 ms	19m 45s	$0.110125	Prompt 9385, Completion 44207, Reasoning 36756	stop:100
minimax/minimax-m3 `v0.3-challenge100-minimax-minimax-m3-ability16384-rescored` (rescored)	JSONL	ability16384 temp=mixed	88/100 (88.0%)	chinese_writing 4/4, coding 26/28, debugging 19/19, instruction_following 5/6, rag_long_context 17/18, reasoning 17/25	19630 ms	32m 45s	$0.124947	Prompt 25254, Completion 81004, Reasoning 65731	length:1, stop:99
stepfun/step-3.7-flash `v0.3-challenge100-stepfun-step-3-7-flash-ability16384-rescored` (rescored)	JSONL	ability16384 temp=mixed	88/100 (88.0%)	chinese_writing 4/4, coding 22/28, debugging 16/19, instruction_following 6/6, rag_long_context 15/18, reasoning 25/25	14435 ms	25m 06s	$0.163351	Prompt 10685, Completion 140186, Reasoning 0	length:2, stop:98
deepseek/deepseek-v4-flash `v0.3-challenge100-deepseek-deepseek-v4-flash-ability16384-rescored` (rescored)	JSONL	ability16384 temp=mixed	87/100 (87.0%)	chinese_writing 4/4, coding 24/28, debugging 19/19, instruction_following 6/6, rag_long_context 15/18, reasoning 19/25	8094 ms	13m 32s	$0.009602	Prompt 9572, Completion 31505, Reasoning 24626	stop:100
tencent/hy3-preview `v0.3-challenge100-tencent-hy3-preview-ability16384`	JSONL	ability16384 temp=mixed	85/100 (85.0%)	chinese_writing 4/4, coding 21/28, debugging 15/19, instruction_following 6/6, rag_long_context 16/18, reasoning 23/25	19290 ms	32m 10s	$0.048706	Prompt 9930, Completion 180894, Reasoning 142453	length:1, stop:96, unknown:3
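The summary figures at the top of the page (best score, average pass rate, total cost) can be cross-checked against the rows of this table. The sketch below assumes they are a plain maximum, arithmetic mean, and sum over the listed runs; the source does not state the aggregation rules, and the variable names are illustrative only. The scores and costs are copied from the table in row order.

```python
# Scores (out of 100) and costs (USD) from the results table, in row order.
scores = [98, 98, 95, 94, 93, 92, 91, 88, 88, 87, 85]
costs = [
    0.169658, 0.189826, 0.311060, 0.199764, 0.151311, 0.087083,
    0.110125, 0.124947, 0.163351, 0.009602, 0.048706,
]

runs = len(scores)
best_score = max(scores)
# Every run has 100 questions, so the mean score is also the mean pass rate in percent.
avg_pass_rate = sum(scores) / runs
total_cost = sum(costs)

print(
    f"{runs} runs, best {best_score}/100, "
    f"avg {avg_pass_rate:.1f}%, total ${total_cost:.6f}"
)
```

With these inputs the recomputed values agree with the reported 98/100, 91.7%, and $1.565433.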