LLM Benchmark 2026

Measuring learning, not memorisation.

RosettaBench evaluates whether LLMs can acquire new knowledge in context: it requires them to solve AtCoder problems in procedurally generated synthetic programming languages they have never seen before.

16 Models Tested · 150 Problems · 88% Top Score · 0% LLM Judges
[Illustration: Python source on one stone tablet, its synthetic equivalent on another]

Python → Synthetic language remapping. Every problem gets a unique vocabulary.

01 · Only 2 of 16 models achieve a Learning Tax under 15%, the threshold for genuine in-context learning.

02 · PYTHON_LEAK dominates 14 of 16 models, the clearest signal of memorisation over learning.

03 · High Python baseline ≠ low Learning Tax. GPT-5.4 Nano scores 70% in Python but just 0.7% on core.

§ 1 — Overview

Why RosettaBench?

F1 · Only Gemini 3.1 Pro Preview (7.0%) and Gemini 3 Flash Preview (10.3%) clear the 15% learning threshold. Every other model loses at least 28 percentage points.

F2 · PYTHON_LEAK is the dominant failure mode in 14 of 16 models: models don't simply fail, they revert to pre-trained vocabulary.

F3 · Hard problems (60 of 150) expose the real gap: 11 of 16 models score under 10% on them, because these tasks demand simultaneous language acquisition and algorithmic reasoning.

F4 · No LLM judge is used anywhere. Correctness is determined exclusively by test case pass/fail, making the benchmark fully objective and reproducible.

LLMs are increasingly evaluated on coding and reasoning tasks, yet existing benchmarks primarily measure memorised knowledge — not the ability to learn. Per Google DeepMind's cognitive framework, learning is the ability to acquire new knowledge or skills through experience, study, or instruction. Current evaluations cannot distinguish a model that truly learns from one that merely recalls.

Without benchmarks that isolate learning from recall, progress toward AGI cannot be meaningfully measured.

"Memorisers assume. Learners adapt. Problem abc357_b required uppercase conversion, but the few-shot examples only demonstrated .lower(). Every model except Gemini 3.1 Pro Preview and Gemini 3 Flash Preview hallucinated a synthetic .upper() token — reverting to Python muscle memory. The two exceptions used .lower() with a character map, staying within the demonstrated vocabulary."

— Key finding, RosettaBench core evaluation

RosettaBench addresses this directly. Each problem is presented in a unique synthetic programming language — a fully remapped variant of Python where all keywords, builtins, methods, operators, and delimiters are replaced with invented, pronounceable tokens: def → spimgleox, if → vraith, print → kroumspor.

No two problems share a language, eliminating cross-problem memorisation. The model receives 6 few-shot example pairs showing the synthetic language alongside its Python equivalent, then must produce a correct solution in the synthetic language.
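To make this concrete, here is a toy Python snippet alongside a hypothetical synthetic rendering. Only spimgleox, vraith, and kroumspor come from the example mapping above; every other synthetic detail is invented for this sketch, and the operators and delimiters are left unmapped for readability even though the benchmark remaps those as well.

```python
# Python original
def check(n):
    if n % 2 == 0:
        print("Even")

# Hypothetical synthetic rendering (as comments, since it is not valid Python):
#   spimgleox check(n):
#       vraith n % 2 == 0:
#           kroumspor("Even")
```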

The Learning Tax, defined as baseline score minus core score, quantifies the performance lost when familiar Python vocabulary is replaced with a synthetic language. A low Learning Tax signals genuine in-context language acquisition; a high one signals pre-trained pattern matching. For example, Claude Sonnet 4.6 scores 85.0% on the Python baseline but 44.7% on core, a Learning Tax of 40.3 percentage points (see § 4).

150 AtCoder problems from LiveCodeBench: 40 easy, 50 medium, 60 hard. Only STDIN/STDOUT problems with ≥ 3 test cases. Sampling uses random_state=42; language generation is fully deterministic given a problem seed.
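A minimal sketch of how such a stratified, reproducible sample could be drawn; the file name and column names are assumptions for illustration, not the benchmark's actual schema.

```python
import pandas as pd

# Hypothetical input: one LiveCodeBench AtCoder problem per row.
problems = pd.read_json("livecodebench_atcoder.jsonl", lines=True)

# Keep only STDIN/STDOUT problems with at least 3 test cases.
eligible = problems[(problems["io_mode"] == "stdin_stdout")
                    & (problems["num_tests"] >= 3)]

# Stratified sample: 40 easy, 50 medium, 60 hard, reproducible via random_state=42.
sample = pd.concat(
    eligible[eligible["difficulty"] == d].sample(n, random_state=42)
    for d, n in [("easy", 40), ("medium", 50), ("hard", 60)]
)
```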

§ 2 — Methodology

How the benchmark works

Each problem gets its own unique synthetic language, covering all 196 Python vocabulary tokens across 5 categories. Scoring is fully automated — no LLM judge at any stage.

Step 01 · Language Generation
Per-problem vocabulary seeding

A syllable-based RNG seeded per-problem produces structured, pronounceable synthetic tokens. The remapping covers 35 keywords, 77 builtins, 55 methods, 25 operators, and 4 delimiters. No two problems share any token — preventing cross-problem memorisation.
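A minimal sketch of what such a generator could look like, assuming a simple consonant-vowel syllable scheme; the actual syllable inventory is not published, and the cross-problem uniqueness guarantee is elided here.

```python
import random

ONSETS = ["b", "d", "fr", "gl", "kr", "sp", "spr", "th", "vr"]
VOWELS = ["a", "e", "i", "o", "ai", "ou"]

def synthetic_token(rng: random.Random, syllables: int = 3) -> str:
    # Produces pronounceable strings in the spirit of "spimgleox" or "vraith".
    return "".join(rng.choice(ONSETS) + rng.choice(VOWELS) for _ in range(syllables))

def generate_language(problem_seed: int, python_tokens: list[str]) -> dict[str, str]:
    rng = random.Random(problem_seed)   # deterministic given the problem seed
    mapping: dict[str, str] = {}
    used: set[str] = set()
    for tok in python_tokens:           # 196 tokens across the 5 categories
        candidate = synthetic_token(rng)
        while candidate in used:        # no duplicate tokens within a problem
            candidate = synthetic_token(rng)
        used.add(candidate)
        mapping[tok] = candidate
    return mapping

# Same seed, same language:
assert generate_language(42, ["def", "if", "print"]) == \
       generate_language(42, ["def", "if", "print"])
```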

Step 02 · Prompt Construction
6-shot in-context language pairs

The model receives 6 few-shot example pairs (synthetic ↔ Python), then the problem statement. It must produce a <solution> block in the synthetic language. The Python Baseline prompt is identical but requests Python, with no few-shot pairs.
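The layout this implies might look roughly like the following; the exact instructions and separators are assumptions.

```python
def build_core_prompt(pairs: list[tuple[str, str]], statement: str) -> str:
    # pairs: the six (synthetic_code, python_code) few-shot examples
    shots = "\n\n".join(
        f"Synthetic:\n{syn}\n\nPython equivalent:\n{py}" for syn, py in pairs
    )
    return (
        f"{shots}\n\nProblem:\n{statement}\n\n"
        "Reply with your program inside a <solution> block, written in the "
        "synthetic language demonstrated above."
    )
```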

Step 03 · Extraction & Validation
Deterministic back-translation pipeline

Responses are checked for <solution> tags, falling back to the last fenced code block. Extracted code is scanned for Python token leaks, back-translated to Python, compiled, and executed against all test cases with a 30-second timeout.
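A sketch of the extraction and back-translation steps; the regexes and the longest-token-first replacement order are assumptions about how a deterministic pipeline of this shape could be built.

```python
import re

def extract_code(response: str) -> str | None:
    # Prefer an explicit <solution> block ...
    m = re.search(r"<solution>(.*?)</solution>", response, re.DOTALL)
    if m:
        return m.group(1).strip()
    # ... otherwise fall back to the last fenced code block.
    blocks = re.findall(r"```[^\n]*\n(.*?)```", response, re.DOTALL)
    return blocks[-1].strip() if blocks else None  # None is reported as NO_CODE

def back_translate(code: str, mapping: dict[str, str]) -> str:
    # Invert the per-problem mapping (synthetic -> Python), replacing longer
    # tokens first so a token is never clobbered by one of its own substrings.
    inverse = {syn: py for py, syn in mapping.items()}
    pattern = re.compile("|".join(
        re.escape(t) for t in sorted(inverse, key=len, reverse=True)))
    return pattern.sub(lambda match: inverse[match.group(0)], code)
```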

Step 04 · Scoring
All-or-nothing, no LLM judge

A problem is correct only if all test cases pass. Final score = passed / total ∈ [0, 1]. Learning Tax = baseline − core. No LLM judge is used anywhere — the benchmark is fully objective.
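The scoring rule itself is small enough to state as code; a sketch, using fractions rather than percentages:

```python
def problem_passed(case_results: list[bool]) -> bool:
    return all(case_results)                     # all-or-nothing per problem

def suite_score(per_problem: list[bool]) -> float:
    return sum(per_problem) / len(per_problem)   # passed / total, in [0, 1]

def learning_tax(baseline: float, core: float) -> float:
    return baseline - core                       # e.g. 0.850 - 0.447 = 0.403
```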

Outcome Codes

PASS · All test cases passed
PYTHON_LEAK · Python tokens in synthetic output
SYNTAX_ERROR · Back-translated Python won't compile
RUNTIME_ERROR · Compiled but crashed at runtime
WRONG_ANSWER · Ran but produced wrong output
NO_CODE · No extractable code in response
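Because the codes are mutually exclusive, they imply an evaluation order. A sketch of that precedence, with booleans standing in for the real pipeline stages:

```python
def classify(code_found: bool, leaked: bool, compiles: bool,
             crashed: bool, all_cases_pass: bool) -> str:
    # Earlier checks short-circuit later ones, so every response
    # receives exactly one outcome code.
    if not code_found:
        return "NO_CODE"
    if leaked:
        return "PYTHON_LEAK"
    if not compiles:
        return "SYNTAX_ERROR"
    if crashed:
        return "RUNTIME_ERROR"
    if not all_cases_pass:
        return "WRONG_ANSWER"
    return "PASS"
```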
§ 3 — Results

Performance across 16 models

Core scores range from 88% to 0.7%. A clear two-tier split exists between models that learn and models that memorise.

Figure 1 · Core Score vs Python Baseline
Green = Python Baseline; Blue = Synthetic Core. The gap between them is the Learning Tax. Ranked by Core score, descending.

Figure 2 · Learning Tax per Model
Lower is better. Green (<15%) marks a strong learner. Only two models clear this threshold.

Figure 3 · Per-Difficulty Breakdown (Core)
11 of 16 models score under 10% on hard problems, the most discriminating tier.

Figure 4 · Failure Mode Breakdown
PYTHON_LEAK (red) dominates 14 of 16 models. SYNTAX_ERROR is prominent in Gemini 2.5 Pro and Claude Haiku 4.5, indicating partial but failed language internalisation.

Figure 5 · Core Score vs Cost
Bubble size = USD cost. Gemini 3 Flash Preview offers the best cost-performance of any competitive model.

Figure 6 · Failure Mode Counts (Raw)
GPT-5.4 Nano: 130 PYTHON_LEAK outcomes out of 150 problems; 87% of its responses revert to Python.
§ 4 — Leaderboard

Full Rankings

Ranked by Core score. Learning Tax = Baseline − Core. Green <15% · Orange 15–50% · Red >50%

| # | Model | Core ↓ | Baseline | Learning Tax | Core Cost | Baseline Cost |
|---|-------|--------|----------|--------------|-----------|---------------|
| 1 | Gemini 3.1 Pro Preview | 88.0% | 95.0% | 7.0% | $21.26 | $13.05 |
| 2 | Gemini 3 Flash Preview | 80.7% | 91.0% | 10.3% | $9.20 | $8.11 |
| 3 | Claude Sonnet 4.6 | 44.7% | 85.0% | 40.3% | $4.20 | $3.50 |
| 4 | Claude Opus 4.6 | 40.7% | 91.0% | 50.3% | $6.14 | $4.31 |
| 5 | GPT-5.4 | 29.3% | 75.0% | 45.7% | $1.71 | $0.90 |
| 6 | Gemini 2.5 Flash | 26.0% | 67.0% | 41.0% | $3.96 | $4.35 |
| 7 | Claude Haiku 4.5 | 25.3% | 79.0% | 53.7% | $1.06 | $0.72 |
| 8 | Gemini 2.5 Pro | 24.0% | 78.0% | 54.0% | $17.58 | $23.79 |
| 9 | DeepSeek R1 | 15.3% | 77.0% | 61.7% | $11.64 | $10.57 |
| 10 | DeepSeek V3 | 15.3% | 74.0% | 58.7% | $0.31 | $0.28 |
| 11 | Qwen3 Coder 480B | 13.3% | 69.0% | 55.7% | $0.17 | $0.20 |
| 12 | Gemini 3.1 Flash Lite Preview | 8.7% | 80.0% | 71.3% | $0.28 | $0.15 |
| 13 | Qwen3 235B | 8.0% | 68.0% | 60.0% | $0.13 | $0.23 |
| 14 | GPT-5.4 Mini | 4.7% | 63.0% | 58.3% | $0.50 | $0.32 |
| 15 | Gemma 3 27B | 3.3% | 32.0% | 28.7% | $0.03 | $0.01 |
| 16 | GPT-5.4 Nano | 0.7% | 70.0% | 69.3% | $0.15 | $0.16 |
§ 5 — Insights

What the data reveals

★ Central Finding
Memorisers assume. Learners adapt.

Problem abc357_b required uppercase conversion — but the few-shot examples only demonstrated .lower(). Every model except Gemini 3.1 Pro Preview and Gemini 3 Flash Preview hallucinated a synthetic .upper() token, reverting to Python muscle memory rather than reasoning from demonstrated vocabulary. The two exceptions used .lower() with a character map — adapting within the given language instead of assuming beyond it. This wasn't an algorithmic failure. It was the inability to suppress memorised patterns in favour of what was actually taught in context.
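In Python terms (the actual solutions were written in each problem's synthetic language, so this is an illustrative reconstruction), the adaptive strategy looks roughly like this:

```python
import string

# Uppercase a string using only .lower() and an explicit character map,
# never touching the undemonstrated .upper() method.
TO_UPPER = dict(zip(string.ascii_lowercase, string.ascii_uppercase))

def to_upper(s: str) -> str:
    return "".join(TO_UPPER.get(c.lower(), c) for c in s)

print(to_upper("AtCoder"))  # ATCODER
```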

Insight 01
A genuine two-tier split

Gemini 3.1 Pro Preview (7.0% tax) and Gemini 3 Flash Preview (10.3%) are in a class of their own. Every other model loses at least 28 percentage points — most exceed 50%. The gap is not a continuum; it is a discontinuity between models that learn and models that memorise.

Insight 02
Baseline score is misleading

High Python performance does not imply learning ability. GPT-5.4 Nano: 70% on Python, 0.7% on core. Gemini 3.1 Flash Lite: 80% baseline, 8.7% core — a 71.3% tax. Existing benchmarks systematically overrate models that are merely pattern-matching.

Insight 03
Hard problems are the discriminator

11 of 16 models score under 10% on hard problems. Hard tasks demand simultaneous language acquisition and algorithmic reasoning — a double bind that exposes the true boundary between cognition and recall.

Insight 04
PYTHON_LEAK reveals the mechanism

Models don't simply fail — they revert. PYTHON_LEAK's dominance shows that pre-trained token distributions overpower in-context instruction. SYNTAX_ERROR in Gemini 2.5 Pro and Claude Haiku 4.5 tells a different story: partial acquisition without grammatical internalisation.