Context-Compiler · Public Benchmark v1

We measured it on repos you already trust

8 public TypeScript repositories. 40 real coding tasks. The merged product CLI. Here's exactly what we found — including the two repos that didn't hit our target.

Run date: 2026-05-05  ·  CLI: v0.10.17  ·  PR #564 (merged)

28.7% · weighted avg token reduction · across 250,942 baseline tokens
100% · top-5 file stability · across all 40 tasks
0 · omission incidents · zero files silently dropped
8 repos · public TypeScript repos · 40 tasks, 1 merged CLI command

What we can say, precisely

"In public TypeScript/JavaScript repo benchmarks, CodeLedger's Context-Compiler reduced context tokens by a weighted average of 28.7% while preserving 100% top-ranked file stability and producing zero omission incidents."

We are not claiming it works on all languages, that it's safe to enable by default everywhere, or that it guarantees 30% savings. Those claims are not supported by this run. The ones above are.

Methodology

How we ran this

Four decisions shaped the benchmark. We're sharing them because reproducibility is the product.

01

We had a real PR to work with

PR #564 merged the context-quality reporting loop into the CodeLedger CLI. Before this, any benchmark used a standalone pre-release runner. The merged command is the product — so this is the first run where the evidence is trustworthy.

Command used in every run:

codeledger bench context-quality --repo /path/to/repo --tasks "task1|task2|..." --format json
02

We picked repos we couldn't control

Eight major public repos on GitHub, biased toward TypeScript because our context compiler's trim rules are tuned for TS/JS monorepos. Shallow clones only — no cherry-picked branches, no prepared state.

Repos selected before running any benchmarks.

vercel/next.js · prisma/prisma · remix-run/react-router · supabase/supabase
nestjs/nest · vitejs/vite · storybookjs/storybook · vuejs/core
03

We used five realistic coding tasks per repo

Same task wording across most repos. For NestJS and Supabase we adapted to their actual domain (guards, RLS policies, edge functions) so the tasks would surface real code, not dead ends.

Generic task set:

fix API route error handling
update authentication flow validation
refactor configuration loading
debug test failure in package resolution
improve build pipeline cache behavior
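
A minimal sketch of how the command from step 01 could be assembled for this task set. It assumes the tasks are joined with "|" into a single --tasks argument, as the command template suggests. The helper name build_bench_argv is hypothetical and not part of the CLI.

```python
import shlex

# The five generic task descriptions listed above.
GENERIC_TASKS = [
    "fix API route error handling",
    "update authentication flow validation",
    "refactor configuration loading",
    "debug test failure in package resolution",
    "improve build pipeline cache behavior",
]


def build_bench_argv(repo_path: str, tasks: list[str]) -> list[str]:
    """Return the argv for one benchmark run (hypothetical helper)."""
    return [
        "codeledger", "bench", "context-quality",
        "--repo", repo_path,
        "--tasks", "|".join(tasks),  # pipe-separated, passed as one argument
        "--format", "json",
    ]


argv = build_bench_argv("/path/to/repo", GENERIC_TASKS)
print(shlex.join(argv))  # shell-quoted command line, ready to paste
```

Building an argv list rather than a raw string keeps the spaces inside each task description from being split by the shell.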
04

We recorded everything, including the WATCH results

react-router came in at 23% — just below the 25% threshold. Supabase came in at 9.1%. We did not adjust weights to make the numbers look better. Both repos are safe: zero omission risk, 100% top-5 stable.

The WATCH verdict means: safe, but the feature should stay behind its flag.
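
The PASS/WATCH split can be sketched as a single comparison against the 25% threshold. This is an illustration, not the CLI's implementation. It assumes the comparison is inclusive (>=), and it does not model unsafe results, because this run produced none.

```python
THRESHOLD_PCT = 25.0  # token-reduction threshold stated above


def verdict(reduction_pct: float, top5_stable: bool, omissions: int) -> str:
    """Classify one repo result as PASS or WATCH.

    Assumptions: the threshold comparison is inclusive, and unsafe results
    (unstable top-5 or any omission) fall outside this sketch.
    """
    if not top5_stable or omissions > 0:
        raise ValueError("unsafe result: not covered by PASS/WATCH in this sketch")
    return "PASS" if reduction_pct >= THRESHOLD_PCT else "WATCH"


print(verdict(25.3, True, 0))  # prisma/prisma
print(verdict(23.0, True, 0))  # remix-run/react-router
print(verdict(9.1, True, 0))   # supabase/supabase
```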

Results

Per-repo results

All 8 repos had 100% top-5 file stability. The PASS threshold for token reduction is 25%.

Repository                Files indexed   Token reduction   Verdict
vitejs/vite                       2,518             45.8%   PASS
nestjs/nest                       2,092             37.4%   PASS
vercel/next.js                   27,144             33.0%   PASS
vuejs/core                          677             28.1%   PASS
storybookjs/storybook             5,381             27.7%   PASS
prisma/prisma                     4,506             25.3%   PASS
remix-run/react-router            1,461             23.0%   WATCH
supabase/supabase                10,874              9.1%   WATCH

Aggregate (40 tasks · 250,942 baseline tokens): 28.7% · PASS
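
A token-weighted average divides total tokens saved by total baseline tokens, so repos with larger bundles count for more. The sketch below shows that calculation. Per-repo baseline token counts are not published on this page, so the equal 10,000-token baselines are HYPOTHETICAL and the real weighting may differ. With equal weights, the per-repo percentages above average to 28.675%, which is consistent with the reported 28.7%.

```python
# Per-repo token reduction (%) from the results above.
REDUCTION_PCT = {
    "vitejs/vite": 45.8,
    "nestjs/nest": 37.4,
    "vercel/next.js": 33.0,
    "vuejs/core": 28.1,
    "storybookjs/storybook": 27.7,
    "prisma/prisma": 25.3,
    "remix-run/react-router": 23.0,
    "supabase/supabase": 9.1,
}


def weighted_reduction(baseline: dict[str, int], compiled: dict[str, int]) -> float:
    """Token-weighted reduction in percent: tokens saved / baseline tokens."""
    total_baseline = sum(baseline.values())
    total_saved = sum(baseline[repo] - compiled[repo] for repo in baseline)
    return 100.0 * total_saved / total_baseline


# HYPOTHETICAL baselines: equal 10,000 tokens per repo, for illustration only.
baseline = {repo: 10_000 for repo in REDUCTION_PCT}
compiled = {
    repo: round(10_000 * (1 - pct / 100)) for repo, pct in REDUCTION_PCT.items()
}

print(f"{weighted_reduction(baseline, compiled):.3f}")
```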

About the WATCH verdicts

react-router (23%) — narrowly below threshold. Most selected files are integration tests and markdown docs that don't have much trim surface. Top-5 stable. No omission risk. The fix is symbol-slice support (not yet in V1).

supabase (9.1%) — expected behavior, not a bug. Supabase is a mixed monorepo: half its top-selected files are short .mdx troubleshooting guides. There's nothing meaningful to trim without losing information. Safety retention did its job.

Under the hood

How the compiler spent its operations

Every file that passes through the context compiler gets tagged with an operation. These are the aggregate counts across all 40 tasks and 8 repos.

The most important number here is the last one: prune = 0. The V1 compiler never removes a file from the selected bundle. It only shortens. This is a deliberate safety constraint — we'd rather leave tokens on the table than silently drop a file your agent needed.

711 · retain · Files kept at full weight: high-coupling, high-churn, safety-protected nodes
87 · trim · Files whose excerpts were shortened: still in the bundle, just compressed
34 · hoist · Guidance files surfaced to the bundle head to aid agent orientation
88 · skip · Files already below minimum viable length, so no trimming was needed
0 · prune · Files removed from the baseline selection: none, by design in V1
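
The counts above can be checked with a few lines of arithmetic. The sketch assumes each file entry carries exactly one operation tag, so the counts sum to the number of tagged entries.

```python
# Aggregate operation counts reported above (40 tasks, 8 repos).
OPERATIONS = {"retain": 711, "trim": 87, "hoist": 34, "skip": 88, "prune": 0}

total = sum(OPERATIONS.values())
shares = {op: 100.0 * count / total for op, count in OPERATIONS.items()}

# V1 safety constraint: nothing is removed from the selected bundle.
assert OPERATIONS["prune"] == 0

print(total)
for op, pct in shares.items():
    print(f"{op:>6}: {pct:5.1f}%")
```

Under that assumption, roughly three quarters of the tagged entries were retained untouched, and fewer than one in ten were trimmed.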

Interpretation

What this means for your team

For your AI coding agent

It sees the same files, with the top-5 ranking unchanged, but the content is tighter. At the 28.7% average, a bundle that used to be 8,000 tokens arrives at ~5,700. That's more room for your task description and less noise from boilerplate.
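
The arithmetic behind that example, as a sketch. It applies the run's 28.7% weighted average to a bundle size. Individual repos ranged from 9.1% to 45.8%, so a given bundle may land well above or below this figure.

```python
AVG_REDUCTION = 0.287  # weighted average from this run


def compiled_size(baseline_tokens: int, reduction: float = AVG_REDUCTION) -> int:
    """Expected bundle size after compilation at a given fractional reduction."""
    return round(baseline_tokens * (1 - reduction))


print(compiled_size(8_000))                      # the 8,000-token example
print(250_942 - compiled_size(250_942))          # tokens saved across the run
```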

For your engineering team

No files disappear from the bundle. The optimizer only shortens excerpts — it never drops a file the selector chose. The top-5 file ranking is identical before and after compilation across every task we measured.
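
Both safety checks can be expressed as simple comparisons between the bundle before and after compilation. This is a sketch of the idea, not the CLI's implementation. The function names and the file paths are hypothetical.

```python
def top5_stable(baseline_ranked: list[str], compiled_ranked: list[str]) -> bool:
    """True when the five highest-ranked files match, in the same order."""
    return baseline_ranked[:5] == compiled_ranked[:5]


def omitted_files(baseline_ranked: list[str], compiled_ranked: list[str]) -> set[str]:
    """Files the selector chose that are missing after compilation."""
    return set(baseline_ranked) - set(compiled_ranked)


# Hypothetical bundle for one task; the file names are made up.
before = [
    "src/router.ts", "src/auth.ts", "src/config.ts",
    "src/cache.ts", "src/build.ts", "README.md",
]
after = list(before)  # the compiler shortens excerpts but keeps every file

print(top5_stable(before, after), omitted_files(before, after))
```

An omission incident, in these terms, is any task where omitted_files returns a non-empty set.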

For your CTO or platform team

This benchmark used the merged product command against public repos with pre-registered task sets. The numbers weren't produced by an internal fixture — they reflect real repo structure, real churn patterns, and real dependency graphs.

For doc-heavy or multi-language monorepos

Supabase shows 9.1% savings. That's not a bug — it's correct behavior. Most of its selected files are short .mdx docs and SQL migrations that don't compress well. The compiler is being conservative, which is the right call.

Our current posture

What we're doing with this data

One benchmark run is not a default rollout. Here's what we decided, and why.

Right now

Feature stays behind the flag

Available as an opt-in for TS/JS-primary repos. Default stays off. One run is not enough to flip a default that affects every user.

In 4 weeks

Second benchmark pass

Same repos, same tasks. If results are stable within ±3%, we'll expand the set to 10+ repos and move toward an opt-in announcement.

The gate

10+ repos, stable × 2

We will not enable by default until the benchmark set reaches 10+ repos and two consecutive runs produce consistent results.
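
The gate can be sketched as a check over the two most recent runs. Two details are ASSUMPTIONS, because the text does not pin them down: "±3%" is read as three percentage points, and consistency is checked per repo rather than on the aggregate.

```python
MIN_REPOS = 10          # "10+ repos"
TOLERANCE_POINTS = 3.0  # ASSUMPTION: "±3%" read as percentage points, per repo


def runs_consistent(run_a: dict[str, float], run_b: dict[str, float]) -> bool:
    """True when both runs cover the same repos and no repo moved past the tolerance."""
    if run_a.keys() != run_b.keys():
        return False
    return all(abs(run_a[repo] - run_b[repo]) <= TOLERANCE_POINTS for repo in run_a)


def default_gate_open(runs: list[dict[str, float]]) -> bool:
    """Check the two most recent runs against the stated gate."""
    if len(runs) < 2:
        return False
    previous, latest = runs[-2], runs[-1]
    return len(latest) >= MIN_REPOS and runs_consistent(previous, latest)


# This run: eight repos, one pass, so the gate stays closed.
this_run = {
    "vite": 45.8, "nest": 37.4, "next.js": 33.0, "vue-core": 28.1,
    "storybook": 27.7, "prisma": 25.3, "react-router": 23.0, "supabase": 9.1,
}
print(default_gate_open([this_run]))
```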

What we won't claim from this run

"Works across all languages"
"Safe to enable by default everywhere"
"DCE removes dead files effectively"
"Guaranteed 30% savings"

Reproducibility

The full artifact set

Every JSON report from this benchmark run is committed to the CodeLedger repo under .codeledger/reports/public-context-quality/. The aggregate report covers all 8 repos and 40 tasks in a single file.

.codeledger/reports/public-context-quality/
  next.js/report.json        5 tasks · 33.0% · PASS
  prisma/report.json         5 tasks · 25.3% · PASS
  react-router/report.json   5 tasks · 23.0% · WATCH
  supabase/report.json       5 tasks · 9.1% · WATCH
  nest/report.json           5 tasks · 37.4% · PASS
  vite/report.json           5 tasks · 45.8% · PASS
  storybook/report.json      5 tasks · 27.7% · PASS
  vue-core/report.json       5 tasks · 28.1% · PASS
  aggregate/report.json      40 tasks · 28.7% · PASS (schema: context-quality-report/v1)
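
A minimal sketch for loading these reports from a checkout of the repo. The field names inside each report are not documented on this page, so the sketch only parses the JSON and prints top-level keys. The helper name load_reports is hypothetical.

```python
import json
from pathlib import Path

REPORT_ROOT = Path(".codeledger/reports/public-context-quality")


def load_reports(root: Path = REPORT_ROOT) -> dict[str, dict]:
    """Map each subdirectory name (e.g. 'vite', 'aggregate') to its parsed report.json."""
    reports = {}
    for path in sorted(root.glob("*/report.json")):
        reports[path.parent.name] = json.loads(path.read_text(encoding="utf-8"))
    return reports


if __name__ == "__main__":
    for name, report in load_reports().items():
        # Report fields are not documented here, so only top-level keys are shown.
        print(name, sorted(report))
```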
Try it on your repo

See your own token reduction number

The benchmark ran in under 10 minutes per repo. Expect a similar runtime on yours, depending on repo size. Start the trial, run codeledger bench context-quality, and compare.

No credit card · Free tier always available · Opt-in context compiler