Moonlight-16B-A3B-Base

Moonshot AI

Apache-2.0 MoE MLA Muon-optimizer base

Type

MoE

Total params

16.0B

Active params

3.0B

Sparsity

Context

8,192

Train tokens

5.7T

Benchmarks

Benchmark	Category	Measured	Claimed	Setup
GPQA Diamond	reasoning	30.30	—	0-shot
0-shot, loglikelihood scoring. Reproduced across 3 configs. Loglikelihood scoring never generates tokens, so it is structurally immune to the degeneration that blocked the generative benchmarks. The score sits only ~1.6 standard errors above the 25% random-guess floor, so it is barely distinguishable from chance.
GSM8K	math	73.92	77.40	5-shot
5-shot, strict-match. Reproduced across 3 configs; the number shown is from the verified-good fp16 config. GSM8K is scored on generated answers, but its prompt distribution does not trigger the degeneration.
MMLU-Pro	knowledge	blocked	—	5-shot
Blocked by input-dependent numerical degeneration under vLLM: many prompts produced garbage output, with whole MMLU-Pro categories collapsing to zero. This is a deployment-stability failure, not a model-capability score.
HumanEval+	code	blocked	—	0-shot
Blocked by the same vLLM numerical degeneration affecting generative benchmarks.
AIME 2024	math	blocked	—	0-shot
Blocked by the same vLLM numerical degeneration affecting generative benchmarks.
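
The "~1.6 stderr above the 25% floor" remark on the GPQA Diamond row can be checked by hand. The sketch below is only a sanity check under two assumptions that are not stated in this card: that GPQA Diamond contains 198 four-choice questions, and that the standard error is the sample standard error of a mean of 0/1 scores (n - 1 denominator).

```python
import math


def binomial_stderr(accuracy: float, n_items: int) -> float:
    """Sample standard error of a mean of 0/1 scores (n - 1 denominator)."""
    return math.sqrt(accuracy * (1.0 - accuracy) / (n_items - 1))


# Assumption (not from this card): 198 questions, four choices each.
N_ITEMS = 198
CHANCE = 0.25          # random-guess floor for four-choice questions
measured = 0.3030      # GPQA Diamond "Measured" value from the table

stderr = binomial_stderr(measured, N_ITEMS)
distance = (measured - CHANCE) / stderr

print(f"stderr = {stderr:.4f}")
print(f"distance above chance = {distance:.2f} standard errors")
```

Under those assumptions the distance comes out at roughly 1.6 standard errors, consistent with the note on the GPQA Diamond row.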

Claimed vs measured

This chart compares the vendor's published numbers with what I measured. A red bar extending left means the model card over-claimed; a blue bar extending right means the model beat its own claim.
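
The comparison reduces to a signed difference per benchmark. As a minimal sketch, using only the measured and claimed values from the table above (the `claim_delta` helper and the row layout are hypothetical, introduced here for illustration):

```python
from typing import Optional

# (benchmark, measured, claimed) taken from the Benchmarks table.
# None marks a benchmark with no vendor claim.
ROWS = [
    ("GPQA Diamond", 30.30, None),
    ("GSM8K", 73.92, 77.40),
]


def claim_delta(measured: float, claimed: Optional[float]) -> Optional[float]:
    """Measured minus claimed, in points; None when there is no claim."""
    if claimed is None:
        return None
    return round(measured - claimed, 2)


for name, measured, claimed in ROWS:
    delta = claim_delta(measured, claimed)
    if delta is None:
        print(f"{name}: no vendor claim to compare")
    elif delta < 0:
        print(f"{name}: {delta:+.2f} points (over-claimed)")
    else:
        print(f"{name}: {delta:+.2f} points (met or beat its claim)")
```

A negative delta corresponds to over-claiming and a positive delta to beating the claim. GSM8K is the only completed benchmark with a vendor number to compare against.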

This is a partial evaluation with an important reproducibility finding. Two benchmarks (GPQA Diamond and GSM8K) completed cleanly and were reproduced across three independent configurations. The remaining generative benchmarks (MMLU-Pro, HumanEval+, AIME 2024) could not be measured reliably: under the standard vLLM attention path on this hardware, the base model exhibits input-dependent numerical degeneration and produces garbage output (repeated '!') on a large subset of prompts. That instability is itself a documented result of this evaluation, and the vendor's 'deploys easily on vLLM' claim does not mention it. The model is also notable for being trained with the Muon optimizer rather than Adam.
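
One way to screen generated outputs for the repeated-'!' failure before scoring is a simple character-dominance check. The following is a hypothetical sketch, not the procedure used for this evaluation: the function names, the 0.9 share threshold, and the 16-character minimum are all assumptions chosen for illustration.

```python
from collections import Counter
from typing import Sequence


def is_degenerate(text: str, max_char_share: float = 0.9, min_length: int = 16) -> bool:
    """Flag a completion dominated by one repeated character, e.g. '!!!!!!'.

    Thresholds are illustrative assumptions, not tuned values.
    """
    stripped = "".join(text.split())  # ignore whitespace between characters
    if len(stripped) < min_length:
        return False  # too short to judge reliably
    _, top_count = Counter(stripped).most_common(1)[0]
    return top_count / len(stripped) >= max_char_share


def degenerate_rate(completions: Sequence[str]) -> float:
    """Fraction of completions flagged as degenerate (0.0 for an empty batch)."""
    if not completions:
        return 0.0
    return sum(is_degenerate(c) for c in completions) / len(completions)


samples = [
    "!" * 64,
    "The answer is 42 because 6 times 7 equals 42.",
]
print(degenerate_rate(samples))
```

Reporting this rate per benchmark category would show where the degeneration concentrates, for example in the MMLU-Pro categories that collapsed to zero.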