independent · reproducible · skeptical

Somebody should check the numbers.

Every open model comes with benchmark scores. Most of those scores came from the people who built it. SanityBench reruns the tests, publishes the methodology, and uploads the logs. The weird stuff gets extra attention: MoE, Mamba, hybrids, diffusion models, whatever somebody decided to build at 3 A.M. If a result can't be reproduced, it doesn't belong here.

Leaderboard

Independent reproductions of open-weight language models · 5 models evaluated · last updated May 2026

Model	Vendor	Arch	Total params	Active params	GPQA	GSM8K	MMLU-Pro	HumanEval	AIME
DeepSeek-V2-Lite-Base	DeepSeek	MoE	15.7B	2.4B	29.80	37.45	25.48	22.56	0.00
Ling-mini-2.0	Ant Group / InclusionAI	MoE	16.0B	1.4B	37.88	80.89	53.34	72.56	16.70
Moonlight-16B-A3B-Base	Moonshot AI	MoE	16.0B	3.0B	30.30	73.92	blocked	blocked	blocked
Ring-mini-2.0	Ant Group / InclusionAI	MoE	16.0B	1.4B	37.88	79.76	54.52	65.24	10.00
SmolLM3-3B-Base	Hugging Face	dense	3.0B	3.0B	35.35	67.48	35.10	29.27	0.00
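
For anyone who wants to work with these rows directly, here is a minimal sketch of loading the table and handling the `blocked` cells. It is not SanityBench's own tooling. The field names, the helper functions, and the choice to treat `blocked` as "no score available" (rather than as zero) are all assumptions made for illustration.

```python
import csv
import io

# Column order follows the leaderboard table; the field names are illustrative labels.
FIELDS = ["model", "vendor", "arch", "total", "active",
          "gpqa", "gsm8k", "mmlu_pro", "humaneval", "aime"]
BENCHMARKS = FIELDS[5:]

# Data rows copied from the table above, tab-separated.
ROWS_TSV = (
    "DeepSeek-V2-Lite-Base\tDeepSeek\tMoE\t15.7B\t2.4B\t29.80\t37.45\t25.48\t22.56\t0.00\n"
    "Ling-mini-2.0\tAnt Group / InclusionAI\tMoE\t16.0B\t1.4B\t37.88\t80.89\t53.34\t72.56\t16.70\n"
    "Moonlight-16B-A3B-Base\tMoonshot AI\tMoE\t16.0B\t3.0B\t30.30\t73.92\tblocked\tblocked\tblocked\n"
    "Ring-mini-2.0\tAnt Group / InclusionAI\tMoE\t16.0B\t1.4B\t37.88\t79.76\t54.52\t65.24\t10.00\n"
    "SmolLM3-3B-Base\tHugging Face\tdense\t3.0B\t3.0B\t35.35\t67.48\t35.10\t29.27\t0.00\n"
)


def parse_params(text):
    """Convert a size such as '15.7B' to billions of parameters as a float."""
    if not text.endswith("B"):
        raise ValueError(f"unexpected parameter size: {text!r}")
    return float(text[:-1])


def parse_score(text):
    """Return the score as a float, or None for a cell marked 'blocked'.

    Assumption: 'blocked' means no score is available, so it must not be
    read as 0.00 (which the table uses for real zero scores).
    """
    return None if text == "blocked" else float(text)


def load_leaderboard(tsv_text):
    """Parse the tab-separated rows into a list of dicts with numeric fields."""
    reader = csv.DictReader(io.StringIO(tsv_text), fieldnames=FIELDS, delimiter="\t")
    rows = []
    for raw in reader:
        row = {"model": raw["model"], "vendor": raw["vendor"], "arch": raw["arch"]}
        row["total"] = parse_params(raw["total"])
        row["active"] = parse_params(raw["active"])
        # Share of parameters active per token; 1.0 for a dense model.
        row["active_fraction"] = row["active"] / row["total"]
        for name in BENCHMARKS:
            row[name] = parse_score(raw[name])
        rows.append(row)
    return rows


rows = load_leaderboard(ROWS_TSV)
for row in rows:
    missing = [name for name in BENCHMARKS if row[name] is None]
    print(f"{row['model']}: active fraction {row['active_fraction']:.3f}, "
          f"missing: {', '.join(missing) or 'none'}")
```

Keeping `blocked` as `None` instead of a number means any average or ranking computed from these rows has to decide explicitly what to do with the missing cells, which seems safer than letting them silently count as zeros.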