sanity·bench
independent · reproducible · skeptical

Somebody should check the numbers.

Every open model comes with benchmark scores. Most of those scores come from the people who built the model. SanityBench reruns the tests, publishes the methodology, and uploads the logs. The weird stuff gets extra attention: MoE, Mamba, hybrids, diffusion models, whatever somebody decided to build at 3 A.M. If a result can't be reproduced, it doesn't belong here.

Leaderboard

Independent reproductions of open-weight language models · 5 models evaluated · last updated May 2026

Model                   Vendor                   Arch   Total params  Active params  GPQA   GSM8K  MMLU-Pro  HumanEval  AIME
DeepSeek-V2-Lite-Base   DeepSeek                 MoE    15.7B         2.4B           29.80  37.45  25.48     22.56      0.00
Ling-mini-2.0           Ant Group / InclusionAI  MoE    16.0B         1.4B           37.88  80.89  53.34     72.56      16.70
Moonlight-16B-A3B-Base  Moonshot AI              MoE    16.0B         3.0B           30.30  73.92  blocked   blocked    blocked
Ring-mini-2.0           Ant Group / InclusionAI  MoE    16.0B         1.4B           37.88  79.76  54.52     65.24      10.00
SmolLM3-3B-Base         Hugging Face             dense  3.0B          3.0B           35.35  67.48  35.10     29.27      0.00
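
For readers who want to work with these rows programmatically, here is a minimal Python sketch. It copies a few columns from the table above and shows two things: the share of parameters that are active per token, and one way to handle the `blocked` cells. The names `ROWS`, `parse_score`, and `active_fraction` are hypothetical and not part of any SanityBench tooling. Treating `blocked` as a missing value rather than as zero is an assumption made here for illustration.

```python
from typing import Optional

# Values copied from the leaderboard table: model name, total parameters (B),
# active parameters (B), and the GSM8K and AIME cells as raw strings.
ROWS = [
    ("DeepSeek-V2-Lite-Base", 15.7, 2.4, "37.45", "0.00"),
    ("Ling-mini-2.0", 16.0, 1.4, "80.89", "16.70"),
    ("Moonlight-16B-A3B-Base", 16.0, 3.0, "73.92", "blocked"),
    ("Ring-mini-2.0", 16.0, 1.4, "79.76", "10.00"),
    ("SmolLM3-3B-Base", 3.0, 3.0, "67.48", "0.00"),
]


def parse_score(cell: str) -> Optional[float]:
    """Return the score as a float, or None for a 'blocked' cell.

    Assumption: 'blocked' means no score was obtained, so it is kept
    distinct from a genuine 0.00 result.
    """
    cell = cell.strip()
    return None if cell.lower() == "blocked" else float(cell)


def active_fraction(total_b: float, active_b: float) -> float:
    """Fraction of parameters active per token (1.0 for a dense model)."""
    return active_b / total_b


if __name__ == "__main__":
    for name, total_b, active_b, gsm8k, aime in ROWS:
        aime_score = parse_score(aime)
        aime_text = "n/a" if aime_score is None else f"{aime_score:.2f}"
        print(
            f"{name}: {active_fraction(total_b, active_b):.1%} active, "
            f"GSM8K {parse_score(gsm8k):.2f}, AIME {aime_text}"
        )
```

Keeping `blocked` as `None` means a missing run is never averaged in as a zero, which matters when comparing a model like Moonlight-16B-A3B-Base against models with complete rows.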