sanity·bench

DeepSeek-V2-Lite-Base

DeepSeek
DeepSeek License · MoE · MLA · base
Type: MoE
Total params: 15.7B
Active params: 2.4B
Sparsity: -
Context: 32,768 tokens
Train tokens: 5.7T

Benchmarks

Benchmark | Category | Measured | Claimed | Setup
GPQA Diamond | reasoning | 29.80 | - | 0-shot
GSM8K | math | 37.45 | 41.10 | 5-shot
MMLU-Pro | knowledge | 25.48 | - | 5-shot
HumanEval+ | code | 22.56 | 29.90 | 0-shot
AIME 2024 | math | 0.00 | - | 0-shot

GPQA Diamond: 0-shot, loglikelihood.
GSM8K: 5-shot, strict-match. Comes in under the vendor number; methodology/prompting gap.
MMLU-Pro: 5-shot generative CoT, custom-extract. Above the ~10% floor but well below SmolLM3-3B-Base on the identical task.
HumanEval+: 0-shot pass@1. Vendor number is on the original (easier) HumanEval, not HumanEval+.
AIME 2024: 0-shot greedy. Non-termination finding: the base model generates unbounded reasoning without converging on a boxed answer. The failing fraction did not improve from 6k to 16k tokens, ruling out "needed more tokens". A real result, not a truncation artifact.
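The AIME note rests on counting completions that never reach a boxed answer and checking whether that count changes with a larger token budget. A minimal sketch of such a check follows. The regex, the helper names `has_boxed_answer` and `failing_fraction`, and the toy strings are all illustrative assumptions; this is not the harness actually used for the numbers above.

```python
import re

# Assumption: a "converged" completion contains \boxed{...} with a simple,
# brace-free body (AIME answers are integers, so nested braces are not handled).
BOXED = re.compile(r"\\boxed\{([^{}]+)\}")


def has_boxed_answer(completion: str) -> bool:
    """True if the completion contains at least one \\boxed{...} answer."""
    return BOXED.search(completion) is not None


def failing_fraction(completions: list[str]) -> float:
    """Fraction of completions that never produce a boxed answer."""
    if not completions:
        raise ValueError("need at least one completion")
    failures = sum(1 for c in completions if not has_boxed_answer(c))
    return failures / len(completions)


# Toy strings only (NOT real model output): the same two prompts at a
# short and a long generation budget.
short_budget = ["Let x = 3. Then ... and so on", "So the answer is \\boxed{204}"]
long_budget = ["Let x = 3. Then ... and so on ... and so on", "So the answer is \\boxed{204}"]

print(failing_fraction(short_budget), failing_fraction(long_budget))
```

If the failing fraction stays flat as the budget grows, as in the toy data here, truncation alone is an unlikely explanation for the missing answers.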

Claimed vs measured

How the vendor's published numbers compare to what I measured. Bars to the left (red) mean the model card over-claimed; bars to the right (blue) mean the model beat its own claim.

measured − claimed: GSM8K −3.65, HumanEval+ −7.34
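The deltas can be reproduced from the Benchmarks table. A minimal sketch, with scores copied from that table; the `SCORES` layout and the `delta` helper are illustrative names, not part of the site's tooling.

```python
# (measured, claimed) pairs copied from the Benchmarks table.
# Only benchmarks with a vendor-claimed number are listed.
# Caveat from the table notes: the HumanEval+ claim is on the original
# HumanEval, so that delta is not a like-for-like comparison.
SCORES = {
    "GSM8K": (37.45, 41.10),
    "HumanEval+": (22.56, 29.90),
}


def delta(measured: float, claimed: float) -> float:
    """Return measured - claimed, rounded to two decimals."""
    return round(measured - claimed, 2)


deltas = {name: delta(m, c) for name, (m, c) in SCORES.items()}

for name, d in deltas.items():
    side = "below claim" if d < 0 else "at or above claim"
    print(f"{name}: {d:+.2f} ({side})")
```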
Independent benchmark of the base checkpoint. Knowledge-heavy MMLU-Pro categories hold up best; reasoning-heavy ones are weakest, consistent with a 5.7T-token base model. The AIME 0/60 is a documented non-termination finding, not an artifact (see AIME note).