| Benchmark | Category | Measured | Claimed | Setup |
|---|---|---|---|---|
| GPQA Diamond | reasoning | 37.88 | not reported | 0-shot |
| GSM8K | math | 80.89 | not reported | 5-shot |
| MMLU-Pro | knowledge | 53.34 | 65.10 | 5-shot |
| HumanEval+ | code | 72.56 | not reported | 0-shot |
| AIME 2024 | math | 16.70 | not reported | 0-shot |
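
How the vendor's published numbers compare to what I measured. A red bar extending left means the model card over-claimed; a blue bar extending right means the model beat its own claim. MMLU-Pro is the only benchmark in the table with a published figure, and the measured score falls 11.76 points short of it.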
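
The direction of each bar follows from the sign of measured minus claimed. Below is a minimal sketch of that arithmetic using the rows from the table above. The `claim_gap` helper and the `rows` layout are hypothetical names chosen for illustration, not the script that produced the chart, and treating an exact tie as "met or beat" is my assumption.

```python
# Rows copied from the table above: (benchmark, measured, claimed).
# None marks a score the vendor did not report.
rows = [
    ("GPQA Diamond", 37.88, None),
    ("GSM8K", 80.89, None),
    ("MMLU-Pro", 53.34, 65.10),
    ("HumanEval+", 72.56, None),
    ("AIME 2024", 16.70, None),
]


def claim_gap(measured, claimed):
    """Return measured - claimed in points, or None when no claim exists."""
    if claimed is None:
        return None
    return round(measured - claimed, 2)


gaps = {name: claim_gap(measured, claimed) for name, measured, claimed in rows}

for name, gap in gaps.items():
    if gap is None:
        print(f"{name}: no published claim to compare")
    elif gap < 0:
        # Negative gap: bar points left (red), the card over-claimed.
        print(f"{name}: {gap:+.2f} points (over-claimed)")
    else:
        # Zero or positive gap: bar points right (blue). Counting a tie
        # as "met or beat" is an assumption of this sketch.
        print(f"{name}: {gap:+.2f} points (met or beat claim)")
```

Benchmarks without a published claim are kept as `None` rather than zero, so a missing claim is never confused with a model that exactly matched its claim.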