LLM Benchmark 2026

Measuring learning, not memorisation.

RosettaBench evaluates whether LLMs can acquire new knowledge in context: it requires them to solve AtCoder problems in procedurally generated synthetic programming languages they have never seen before.

16 Models Tested · 150 Problems · 88% Top Score · 0% LLM Judges
[Illustration: Python source on one stone tablet, its synthetic equivalent on another]

Python → Synthetic language remapping. Every problem gets a unique vocabulary.

01 · Only 2 of 16 models achieve a Learning Tax under 15%, the threshold for genuine in-context learning.

02 · PYTHON_LEAK dominates 14 of 16 models, the clearest signal of memorisation over learning.

03 · High Python baseline ≠ low Learning Tax. GPT-5.4 Nano scores 70% in Python but just 0.7% on core.

§ 1 — Overview

Why RosettaBench?

F1 · Only Gemini 3.1 Pro Preview (7.0%) and Gemini 3 Flash Preview (10.3%) clear the 15% learning threshold. Every other model loses at least 28 percentage points.

F2 · PYTHON_LEAK is the dominant failure mode in 14 of 16 models: models don't simply fail, they revert to pre-trained vocabulary.

F3 · Hard problems (60 of 150) expose the real gap: 11 of 16 models score under 10% on them, because these tasks demand simultaneous language acquisition and algorithmic reasoning.

F4 · No LLM judge is used anywhere. Correctness is determined exclusively by test case pass/fail, making the benchmark fully objective and reproducible.

LLMs are increasingly evaluated on coding and reasoning tasks, yet existing benchmarks primarily measure memorised knowledge — not the ability to learn. Per Google DeepMind's cognitive framework, learning is the ability to acquire new knowledge or skills through experience, study, or instruction. Current evaluations cannot distinguish a model that truly learns from one that merely recalls.

Without benchmarks that isolate learning from recall, progress toward AGI cannot be meaningfully measured.

"Memorisers assume. Learners adapt. Problem abc357_b required uppercase conversion, but the few-shot examples only demonstrated .lower(). Every model except Gemini 3.1 Pro Preview and Gemini 3 Flash Preview hallucinated a synthetic .upper() token — reverting to Python muscle memory. The two exceptions used .lower() with a character map, staying within the demonstrated vocabulary."

— Key finding, RosettaBench core evaluation

RosettaBench addresses this directly. Each problem is presented in a unique synthetic programming language — a fully remapped variant of Python where all keywords, builtins, methods, operators, and delimiters are replaced with invented, pronounceable tokens: def → spimgleox, if → vraith, print → kroumspor.

No two problems share a language, eliminating cross-problem memorisation. The model receives 6 few-shot example pairs showing the synthetic language alongside its Python equivalent, then must produce a correct solution in the synthetic language.
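To make this concrete, here is a toy Python snippet alongside a hypothetical synthetic rendering. Only spimgleox, vraith, and kroumspor come from the example mapping above; every other synthetic detail is invented for this sketch, and the operators and delimiters are left unmapped for readability even though the benchmark remaps those as well.

```python
# Python original
def check(n):
    if n % 2 == 0:
        print("Even")

# Hypothetical synthetic rendering (as comments, since it is not valid Python):
#   spimgleox check(n):
#       vraith n % 2 == 0:
#           kroumspor("Even")
```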

The Learning Tax, defined as baseline score minus core score, quantifies the performance lost when familiar Python vocabulary is replaced with a synthetic language. A low Learning Tax signals genuine in-context language acquisition; a high one signals pre-trained pattern matching. For example, Claude Sonnet 4.6 scores 85.0% on the Python baseline but 44.7% on core, a Learning Tax of 40.3 percentage points (see § 4).

150 AtCoder problems from LiveCodeBench: 40 easy, 50 medium, 60 hard. Only STDIN/STDOUT problems with ≥ 3 test cases. Sampling uses random_state=42; language generation is fully deterministic given a problem seed.
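A minimal sketch of how such a stratified, reproducible sample could be drawn; the file name and column names are assumptions for illustration, not the benchmark's actual schema.

```python
import pandas as pd

# Hypothetical input: one LiveCodeBench AtCoder problem per row.
problems = pd.read_json("livecodebench_atcoder.jsonl", lines=True)

# Keep only STDIN/STDOUT problems with at least 3 test cases.
eligible = problems[(problems["io_mode"] == "stdin_stdout")
                    & (problems["num_tests"] >= 3)]

# Stratified sample: 40 easy, 50 medium, 60 hard, reproducible via random_state=42.
sample = pd.concat(
    eligible[eligible["difficulty"] == d].sample(n, random_state=42)
    for d, n in [("easy", 40), ("medium", 50), ("hard", 60)]
)
```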

§ 2 — Methodology

How the benchmark works

Each problem gets its own unique synthetic language, covering all 196 Python vocabulary tokens across 5 categories. Scoring is fully automated — no LLM judge at any stage.

Step 01 · Language Generation
Per-problem vocabulary seeding

A syllable-based RNG seeded per-problem produces structured, pronounceable synthetic tokens. The remapping covers 35 keywords, 77 builtins, 55 methods, 25 operators, and 4 delimiters. No two problems share any token — preventing cross-problem memorisation.
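A minimal sketch of what such a generator could look like, assuming a simple consonant-vowel syllable scheme; the actual syllable inventory is not published, and the cross-problem uniqueness guarantee is elided here.

```python
import random

ONSETS = ["b", "d", "fr", "gl", "kr", "sp", "spr", "th", "vr"]
VOWELS = ["a", "e", "i", "o", "ai", "ou"]

def synthetic_token(rng: random.Random, syllables: int = 3) -> str:
    # Produces pronounceable strings in the spirit of "spimgleox" or "vraith".
    return "".join(rng.choice(ONSETS) + rng.choice(VOWELS) for _ in range(syllables))

def generate_language(problem_seed: int, python_tokens: list[str]) -> dict[str, str]:
    rng = random.Random(problem_seed)   # deterministic given the problem seed
    mapping: dict[str, str] = {}
    used: set[str] = set()
    for tok in python_tokens:           # 196 tokens across the 5 categories
        candidate = synthetic_token(rng)
        while candidate in used:        # no duplicate tokens within a problem
            candidate = synthetic_token(rng)
        used.add(candidate)
        mapping[tok] = candidate
    return mapping

# Same seed, same language:
assert generate_language(42, ["def", "if", "print"]) == \
       generate_language(42, ["def", "if", "print"])
```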

Step 02 · Prompt Construction
6-shot in-context language pairs

The model receives 6 few-shot example pairs (synthetic ↔ Python), then the problem statement. It must produce a <solution> block in the synthetic language. The Python Baseline prompt is identical but requests Python, with no few-shot pairs.
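The layout this implies might look roughly like the following; the exact instructions and separators are assumptions.

```python
def build_core_prompt(pairs: list[tuple[str, str]], statement: str) -> str:
    # pairs: the six (synthetic_code, python_code) few-shot examples
    shots = "\n\n".join(
        f"Synthetic:\n{syn}\n\nPython equivalent:\n{py}" for syn, py in pairs
    )
    return (
        f"{shots}\n\nProblem:\n{statement}\n\n"
        "Reply with your program inside a <solution> block, written in the "
        "synthetic language demonstrated above."
    )
```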

Step 03 · Extraction & Validation
Deterministic back-translation pipeline

Responses are checked for <solution> tags, falling back to the last fenced code block. Extracted code is scanned for Python token leaks, back-translated to Python, compiled, and executed against all test cases with a 30-second timeout.
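A sketch of the extraction and back-translation steps; the regexes and the longest-token-first replacement order are assumptions about how a deterministic pipeline of this shape could be built.

```python
import re

def extract_code(response: str) -> str | None:
    # Prefer an explicit <solution> block ...
    m = re.search(r"<solution>(.*?)</solution>", response, re.DOTALL)
    if m:
        return m.group(1).strip()
    # ... otherwise fall back to the last fenced code block.
    blocks = re.findall(r"```[^\n]*\n(.*?)```", response, re.DOTALL)
    return blocks[-1].strip() if blocks else None  # None is reported as NO_CODE

def back_translate(code: str, mapping: dict[str, str]) -> str:
    # Invert the per-problem mapping (synthetic -> Python), replacing longer
    # tokens first so a token is never clobbered by one of its own substrings.
    inverse = {syn: py for py, syn in mapping.items()}
    pattern = re.compile("|".join(
        re.escape(t) for t in sorted(inverse, key=len, reverse=True)))
    return pattern.sub(lambda match: inverse[match.group(0)], code)
```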

Step 04 · Scoring
All-or-nothing, no LLM judge

A problem is correct only if all test cases pass. Final score = passed / total ∈ [0, 1]. Learning Tax = baseline − core. No LLM judge is used anywhere — the benchmark is fully objective.
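The scoring rule itself is small enough to state as code; a sketch, using fractions rather than percentages:

```python
def problem_passed(case_results: list[bool]) -> bool:
    return all(case_results)                     # all-or-nothing per problem

def suite_score(per_problem: list[bool]) -> float:
    return sum(per_problem) / len(per_problem)   # passed / total, in [0, 1]

def learning_tax(baseline: float, core: float) -> float:
    return baseline - core                       # e.g. 0.850 - 0.447 = 0.403
```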

Outcome Codes

PASS · All test cases passed
PYTHON_LEAK · Python tokens in synthetic output
SYNTAX_ERROR · Back-translated Python won't compile
RUNTIME_ERROR · Compiled but crashed at runtime
WRONG_ANSWER · Ran but produced wrong output
NO_CODE · No extractable code in response
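Because the codes are mutually exclusive, they imply an evaluation order. A sketch of that precedence, with booleans standing in for the real pipeline stages:

```python
def classify(code_found: bool, leaked: bool, compiles: bool,
             crashed: bool, all_cases_pass: bool) -> str:
    # Earlier checks short-circuit later ones, so every response
    # receives exactly one outcome code.
    if not code_found:
        return "NO_CODE"
    if leaked:
        return "PYTHON_LEAK"
    if not compiles:
        return "SYNTAX_ERROR"
    if crashed:
        return "RUNTIME_ERROR"
    if not all_cases_pass:
        return "WRONG_ANSWER"
    return "PASS"
```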
§ 3 — Results

Performance across 16 models

Core scores range from 88% to 0.7%. A clear two-tier split exists between models that learn and models that memorise.

Figure 1 · Core Score vs Python Baseline
Green = Python Baseline; Blue = Synthetic Core. The gap between them is the Learning Tax. Ranked by Core score, descending.

Figure 2 · Learning Tax per Model
Lower is better. Green (<15%) marks a strong learner. Only two models clear this threshold.

Figure 3 · Per-Difficulty Breakdown (Core)
11 of 16 models score under 10% on hard problems, the most discriminating tier.

Figure 4 · Failure Mode Breakdown
PYTHON_LEAK (red) dominates 14 of 16 models. SYNTAX_ERROR is prominent in Gemini 2.5 Pro and Claude Haiku 4.5, indicating partial but failed language internalisation.

Figure 5 · Core Score vs Cost
Bubble size = USD cost. Gemini 3 Flash Preview offers the best cost-performance of any competitive model.

Figure 6 · Failure Mode Counts (Raw)
GPT-5.4 Nano: 130 PYTHON_LEAK outcomes out of 150 problems; 87% of its responses revert to Python.
§ 4 — Leaderboard

Full Rankings

Ranked by Core score. Learning Tax = Baseline − Core. Green <15% · Orange 15–50% · Red >50%

| # | Model | Core ↓ | Baseline | Learning Tax | Core Cost | Baseline Cost |
|---|-------|--------|----------|--------------|-----------|---------------|
| 1 | Gemini 3.1 Pro Preview | 88.0% | 95.0% | 7.0% | $21.26 | $13.05 |
| 2 | Gemini 3 Flash Preview | 80.7% | 91.0% | 10.3% | $9.20 | $8.11 |
| 3 | Claude Sonnet 4.6 | 44.7% | 85.0% | 40.3% | $4.20 | $3.50 |
| 4 | Claude Opus 4.6 | 40.7% | 91.0% | 50.3% | $6.14 | $4.31 |
| 5 | GPT-5.4 | 29.3% | 75.0% | 45.7% | $1.71 | $0.90 |
| 6 | Gemini 2.5 Flash | 26.0% | 67.0% | 41.0% | $3.96 | $4.35 |
| 7 | Claude Haiku 4.5 | 25.3% | 79.0% | 53.7% | $1.06 | $0.72 |
| 8 | Gemini 2.5 Pro | 24.0% | 78.0% | 54.0% | $17.58 | $23.79 |
| 9 | DeepSeek R1 | 15.3% | 77.0% | 61.7% | $11.64 | $10.57 |
| 10 | DeepSeek V3 | 15.3% | 74.0% | 58.7% | $0.31 | $0.28 |
| 11 | Qwen3 Coder 480B | 13.3% | 69.0% | 55.7% | $0.17 | $0.20 |
| 12 | Gemini 3.1 Flash Lite Preview | 8.7% | 80.0% | 71.3% | $0.28 | $0.15 |
| 13 | Qwen3 235B | 8.0% | 68.0% | 60.0% | $0.13 | $0.23 |
| 14 | GPT-5.4 Mini | 4.7% | 63.0% | 58.3% | $0.50 | $0.32 |
| 15 | Gemma 3 27B | 3.3% | 32.0% | 28.7% | $0.03 | $0.01 |
| 16 | GPT-5.4 Nano | 0.7% | 70.0% | 69.3% | $0.15 | $0.16 |
§ 5 — Insights

What the data reveals

★ Central Finding
Memorisers assume. Learners adapt.

Problem abc357_b required uppercase conversion — but the few-shot examples only demonstrated .lower(). Every model except Gemini 3.1 Pro Preview and Gemini 3 Flash Preview hallucinated a synthetic .upper() token, reverting to Python muscle memory rather than reasoning from demonstrated vocabulary. The two exceptions used .lower() with a character map — adapting within the given language instead of assuming beyond it. This wasn't an algorithmic failure. It was the inability to suppress memorised patterns in favour of what was actually taught in context.
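In Python terms (the actual solutions were written in each problem's synthetic language, so this is an illustrative reconstruction), the adaptive strategy looks roughly like this:

```python
import string

# Uppercase a string using only .lower() and an explicit character map,
# never touching the undemonstrated .upper() method.
TO_UPPER = dict(zip(string.ascii_lowercase, string.ascii_uppercase))

def to_upper(s: str) -> str:
    return "".join(TO_UPPER.get(c.lower(), c) for c in s)

print(to_upper("AtCoder"))  # ATCODER
```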

Insight 01
A genuine two-tier split

Gemini 3.1 Pro Preview (7.0% tax) and Gemini 3 Flash Preview (10.3%) are in a class of their own. Every other model loses at least 28 percentage points — most exceed 50%. The gap is not a continuum; it is a discontinuity between models that learn and models that memorise.

Insight 02
Baseline score is misleading

High Python performance does not imply learning ability. GPT-5.4 Nano: 70% on Python, 0.7% on core. Gemini 3.1 Flash Lite: 80% baseline, 8.7% core — a 71.3% tax. Existing benchmarks systematically overrate models that are merely pattern-matching.

Insight 03
Hard problems are the discriminator

11 of 16 models score under 10% on hard problems. Hard tasks demand simultaneous language acquisition and algorithmic reasoning — a double bind that exposes the true boundary between cognition and recall.

Insight 04
PYTHON_LEAK reveals the mechanism

Models don't simply fail — they revert. PYTHON_LEAK's dominance shows that pre-trained token distributions overpower in-context instruction. SYNTAX_ERROR in Gemini 2.5 Pro and Claude Haiku 4.5 tells a different story: partial acquisition without grammatical internalisation.