| Benchmark | Category | Measured | Claimed | Setup |
|---|---|---|---|---|
| GPQA Diamond | reasoning | 30.30 | — | 0-shot |
| 0-shot, loglikelihood. Reproduced across 3 configs. Loglikelihood scoring never generates tokens, so it is structurally immune to the degeneration that blocked the generative benchmarks. The score is only ~1.6 standard errors above the 25% random-guess floor, so it is barely distinguishable from chance. | ||||
| GSM8K | math | 73.92 | 77.40 | 5-shot |
| 5-shot, strict-match. Reproduced across 3 configs (the fp16 verified-good config is the one reported). GSM8K's prompt distribution does not trigger the degeneration that blocked the other generative benchmarks. | ||||
| MMLU-Pro | knowledge | blocked | — | 5-shot |
| Blocked by input-dependent numerical degeneration under vLLM: many prompts produced garbage output, with whole MMLU-Pro categories collapsing to zero. This is a deployment-stability failure, not a model-capability score. | ||||
| HumanEval+ | code | blocked | — | 0-shot |
| Blocked by the same vLLM numerical degeneration affecting generative benchmarks. | ||||
| AIME 2024 | math | blocked | — | 0-shot |
| Blocked by the same vLLM numerical degeneration affecting generative benchmarks. | ||||
How the vendor's published numbers compare to what I measured. Red bars extending left mean the model card over-claimed; blue bars extending right mean the model beat its own claim.
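
The "~1.6 stderr above the 25% floor" remark in the GPQA Diamond note can be checked with a few lines of arithmetic. The sketch below is an illustration, not the harness's own code: it assumes a plain binomial standard error, a set size of 198 questions for GPQA Diamond, and a 25% floor for 4-option multiple choice. The helper name `stderr_above_floor` is hypothetical.

```python
import math


def stderr_above_floor(accuracy_pct, floor_pct, n_items):
    """Return (stderr in points, margin over the floor in stderr units)."""
    p = accuracy_pct / 100.0
    # Binomial standard error of a proportion, expressed in percentage points.
    stderr_pct = 100.0 * math.sqrt(p * (1.0 - p) / n_items)
    margin = (accuracy_pct - floor_pct) / stderr_pct
    return stderr_pct, margin


# Assumptions (not stated in the table): 198 questions, 25% chance floor.
stderr_pct, z = stderr_above_floor(30.30, 25.0, 198)
print(f"stderr = {stderr_pct:.2f} points, margin = {z:.2f} stderr")
```

Under those assumptions the margin comes out at roughly 1.6 standard errors, which matches the note. An evaluation harness may use a slightly different estimator (for example a sample variance with `n - 1`), which shifts the result only in the second decimal place at this sample size.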