We measured it on repos you already trust
8 public TypeScript repositories. 40 real coding tasks. The merged product CLI. Here's exactly what we found — including the two repos that didn't hit our target.
Run date: 2026-05-05 · CLI: v0.10.17 · PR #564 (merged)
What we can say, precisely
"In public TypeScript/JavaScript repo benchmarks, CodeLedger's Context-Compiler reduced context tokens by a weighted average of 28.7% while preserving 100% top-ranked file stability and producing zero omission incidents."
We are not claiming it works on all languages, that it's safe to enable by default everywhere, or that it guarantees 30% savings. Those claims are not supported by this run. The ones above are.
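To make "weighted average" concrete, here is a minimal sketch of a token-weighted reduction. It assumes each repo is weighted by its baseline token count; this page does not state the weighting the run used, and the RepoResult fields and sample numbers below are illustrative, not taken from the CLI's JSON reports.

```typescript
// Hypothetical per-repo measurement; field names are illustrative,
// not the schema of the CLI's JSON output.
interface RepoResult {
  repo: string;
  baselineTokens: number; // bundle size before compilation
  compiledTokens: number; // bundle size after compilation
}

// Token-weighted reduction: total tokens saved / total baseline tokens.
// Assumption: the weight is the baseline token count.
function weightedReduction(results: RepoResult[]): number {
  const baseline = results.reduce((sum, r) => sum + r.baselineTokens, 0);
  const compiled = results.reduce((sum, r) => sum + r.compiledTokens, 0);
  if (baseline === 0) return 0; // nothing to reduce
  return (baseline - compiled) / baseline;
}

// Unweighted mean of per-repo reductions, shown only for contrast.
function meanReduction(results: RepoResult[]): number {
  if (results.length === 0) return 0;
  const perRepo = results.map(
    (r) => (r.baselineTokens - r.compiledTokens) / r.baselineTokens,
  );
  return perRepo.reduce((sum, x) => sum + x, 0) / perRepo.length;
}

// Made-up numbers for illustration only; these are not benchmark results.
const sample: RepoResult[] = [
  { repo: "example/large", baselineTokens: 30000, compiledTokens: 21000 },
  { repo: "example/small", baselineTokens: 10000, compiledTokens: 9000 },
];

const weighted = weightedReduction(sample); // larger bundles count for more
const unweighted = meanReduction(sample);   // every repo counts equally
```

Under this weighting, a large repo with strong savings pulls the headline figure up more than a small repo with weak savings pulls it down, which is why the two numbers differ on the sample data.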
Methodology
How we ran this
Four decisions shaped the benchmark. We're sharing them because reproducibility is the product.
We had a real PR to work with
PR #564 merged the context-quality reporting loop into the CodeLedger CLI. Before this, every benchmark used a standalone pre-release runner. Because the merged command is what ships, this is the first run whose evidence reflects the product itself.
Command used in every run:
codeledger bench context-quality --repo /path/to/repo --tasks "task1|task2|..." --format json
We picked repos we couldn't control
Eight major public repos on GitHub, biased toward TypeScript because our context compiler's trim rules are tuned for TS/JS monorepos. Shallow clones only — no cherry-picked branches, no prepared state.
Repos selected before running any benchmarks.
vercel/next.js · prisma/prisma · remix-run/react-router · supabase/supabase · nestjs/nest · vitejs/vite · storybookjs/storybook · vuejs/core
We used five realistic coding tasks per repo
Same task wording across most repos. For NestJS and Supabase we adapted to their actual domain (guards, RLS policies, edge functions) so the tasks would surface real code, not dead ends.
Generic task set:
fix API route error handling · update authentication flow validation · refactor configuration loading · debug test failure in package resolution · improve build pipeline cache behavior
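As a sketch of how the generic task set maps onto the command shown earlier, the snippet below reads the set as five tasks and joins them with "|" for the --tasks argument. buildBenchArgs is a hypothetical helper written for this example; only the subcommand and the --repo, --tasks and --format flags come from the command above.

```typescript
// The generic task set, read as five separate tasks.
const genericTasks: string[] = [
  "fix API route error handling",
  "update authentication flow validation",
  "refactor configuration loading",
  "debug test failure in package resolution",
  "improve build pipeline cache behavior",
];

// Builds the argument list for `codeledger bench context-quality`.
// Hypothetical helper, not part of the CLI.
function buildBenchArgs(repoPath: string, tasks: string[]): string[] {
  // "|" separates tasks, so it cannot appear inside a task description.
  if (tasks.some((t) => t.includes("|"))) {
    throw new Error("task text must not contain the '|' separator");
  }
  return [
    "bench",
    "context-quality",
    "--repo",
    repoPath,
    "--tasks",
    tasks.join("|"),
    "--format",
    "json",
  ];
}

const args = buildBenchArgs("/path/to/repo", genericTasks);
```

Passing the arguments as an array to a process launcher, rather than interpolating them into a shell string, avoids having to quote the "|" characters.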
We recorded everything, including the WATCH results
react-router came in at 23% — just below the 25% threshold. Supabase came in at 9.1%. We did not adjust weights to make the numbers look better. Both repos are safe: zero omission risk, 100% top-5 stable.
The WATCH verdict means the compiler is safe on that repo, but the feature should stay behind its flag rather than be enabled by default.
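The verdict logic described above can be sketched roughly as follows. Only the WATCH label, the 25% threshold, and the two safety checks come from this page; the "PASS" and "UNSAFE" labels, the RepoMetrics shape, and treating exactly 25% as meeting the threshold are assumptions made for illustration.

```typescript
// Only "WATCH" is named in this report; "PASS" and "UNSAFE" are placeholders.
type Verdict = "PASS" | "WATCH" | "UNSAFE";

// Hypothetical shape; not the CLI's report schema.
interface RepoMetrics {
  tokenReduction: number;    // fraction, e.g. 0.23 for 23%
  top5Stable: boolean;       // top-5 ranking unchanged by compilation
  omissionIncidents: number; // files the agent needed that went missing
}

const REDUCTION_THRESHOLD = 0.25;

function classify(m: RepoMetrics): Verdict {
  // Safety is checked first: savings never excuse an unstable bundle.
  const safe = m.top5Stable && m.omissionIncidents === 0;
  if (!safe) return "UNSAFE";
  // Assumption: a reduction of exactly 25% counts as meeting the threshold.
  return m.tokenReduction >= REDUCTION_THRESHOLD ? "PASS" : "WATCH";
}

// The two below-threshold repos from this run.
const reactRouter = classify({ tokenReduction: 0.23, top5Stable: true, omissionIncidents: 0 });
const supabase = classify({ tokenReduction: 0.091, top5Stable: true, omissionIncidents: 0 });
```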
Results
Per-repo results
All 8 repos had 100% top-5 file stability. The bar shows token reduction; the threshold is 25%.
About the WATCH verdicts
react-router (23%) — narrowly below threshold. Most selected files are integration tests and markdown docs that don't have much trim surface. Top-5 stable. No omission risk. The fix is symbol-slice support (not yet in V1).
supabase (9.1%) — expected behavior, not a bug. Supabase is a mixed monorepo: half its top-selected files are short .mdx troubleshooting guides. There's nothing meaningful to trim without losing information. Safety retention did its job.
Under the hood
How the compiler spent its operations
Every file that passes through the context compiler gets tagged with an operation. These are the aggregate counts across all 40 tasks and 8 repos.
The most important number here is the last one: prune = 0. The V1 compiler never removes a file from the selected bundle. It only shortens. This is a deliberate safety constraint — we'd rather leave tokens on the table than silently drop a file your agent needed.
Files kept at full weight — high-coupling, high-churn, safety-protected nodes
Files whose excerpts were shortened — still in bundle, just compressed
Guidance files surfaced to bundle head — aids agent orientation
Files already below minimum viable length — no trimming needed
Files removed from baseline selection — none, by design in V1
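A minimal sketch of how the per-file tags could be tallied and the prune = 0 constraint checked. Only "prune" is named on this page; "keep", "trim", "promote" and "skip" are placeholder names for the other four categories described above, and the sample tags are made up rather than taken from the run.

```typescript
// Only "prune" is named in the report; the other four labels are
// placeholder names for the categories described above.
type Operation = "keep" | "trim" | "promote" | "skip" | "prune";

// Counts how many files received each operation tag.
function tallyOperations(ops: Operation[]): Record<Operation, number> {
  const counts: Record<Operation, number> = {
    keep: 0,
    trim: 0,
    promote: 0,
    skip: 0,
    prune: 0,
  };
  for (const op of ops) counts[op] += 1;
  return counts;
}

// V1 safety constraint: the compiler shortens files but never removes one.
function assertNoPrune(counts: Record<Operation, number>): void {
  if (counts.prune !== 0) {
    throw new Error(`expected prune = 0, got ${counts.prune}`);
  }
}

// Made-up tags for illustration; not the benchmark's aggregate counts.
const counts = tallyOperations(["keep", "trim", "trim", "promote", "skip"]);
assertNoPrune(counts);
```

Checking the constraint on the aggregate tally means a single pruned file anywhere in the run fails the whole report, which matches the intent of treating prune as a hard limit rather than a metric to minimize.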
Interpretation
What this means for your team
Your agent sees the same files, in the same ranked order, but the content is tighter. At the 28.7% weighted average, a bundle that used to be 8,000 tokens arrives at about 5,700. That's more room for your task description and less noise from boilerplate.
No files disappear from the bundle. The optimizer only shortens excerpts — it never drops a file the selector chose. The top-5 file ranking is identical before and after compilation across every task we measured.
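The two guarantees in this paragraph can be expressed as simple checks. This is a sketch under the assumption that a bundle is an array of entries ordered by rank; BundleEntry, topKStable and omittedFiles are illustrative names, and the sample bundles are made up.

```typescript
// Hypothetical bundle entry; bundles are assumed to be ordered by rank.
interface BundleEntry {
  path: string;
  tokens: number;
}

// True when the first k paths are identical, in the same order.
function topKStable(before: BundleEntry[], after: BundleEntry[], k = 5): boolean {
  const a = before.slice(0, k).map((e) => e.path);
  const b = after.slice(0, k).map((e) => e.path);
  return a.length === b.length && a.every((p, i) => p === b[i]);
}

// Paths present in the baseline bundle but missing after compilation.
function omittedFiles(before: BundleEntry[], after: BundleEntry[]): string[] {
  const kept = new Set(after.map((e) => e.path));
  return before.filter((e) => !kept.has(e.path)).map((e) => e.path);
}

// Made-up bundles: same files and order, shorter excerpts after compilation.
const baselineBundle: BundleEntry[] = [
  { path: "src/router.ts", tokens: 1200 },
  { path: "src/auth.ts", tokens: 900 },
  { path: "README.md", tokens: 400 },
];
const compiledBundle: BundleEntry[] = [
  { path: "src/router.ts", tokens: 800 },
  { path: "src/auth.ts", tokens: 650 },
  { path: "README.md", tokens: 400 },
];

const stable = topKStable(baselineBundle, compiledBundle);
const omitted = omittedFiles(baselineBundle, compiledBundle);
```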
This benchmark used the merged product command against public repos with pre-registered task sets. The numbers weren't produced by an internal fixture — they reflect real repo structure, real churn patterns, and real dependency graphs.
Supabase shows 9.1% savings. That's not a bug — it's correct behavior. Most of its selected files are short .mdx docs and SQL migrations that don't compress well. The compiler is being conservative, which is the right call.
Our current posture
What we're doing with this data
One benchmark run is not a default rollout. Here's what we decided, and why.
Feature stays behind the flag
Enabled on opt-in for TS/JS-primary repos. Default stays off. One run is not enough to flip a default that affects every user.
Second benchmark pass
Same repos, same tasks. If results are stable within ±3%, we'll expand the set to 10+ repos and move toward opt-in announcement.
10+ repos, stable × 2
We will not enable by default until the benchmark set reaches 10+ repos and two consecutive runs produce consistent results.
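The rollout rule above amounts to a small gate. The sketch below assumes "±3%" means three percentage points of weighted reduction and that both of the two most recent runs must cover at least 10 repos; BenchRun and readyForDefault are illustrative names, and every run other than the first is hypothetical.

```typescript
// Hypothetical summary of one benchmark run.
interface BenchRun {
  repoCount: number;
  weightedReduction: number; // percent, e.g. 28.7
}

// Assumption: "±3%" is read as three percentage points, not a relative change.
const STABILITY_TOLERANCE = 3;
const MIN_REPOS = 10;

function consistent(a: BenchRun, b: BenchRun): boolean {
  return Math.abs(a.weightedReduction - b.weightedReduction) <= STABILITY_TOLERANCE;
}

// Default-on requires two consecutive consistent runs on 10+ repos each.
function readyForDefault(runs: BenchRun[]): boolean {
  if (runs.length < 2) return false;
  const [prev, last] = runs.slice(-2);
  return (
    prev.repoCount >= MIN_REPOS &&
    last.repoCount >= MIN_REPOS &&
    consistent(prev, last)
  );
}

// The run reported on this page: 8 repos, 28.7% weighted reduction.
const firstRun: BenchRun = { repoCount: 8, weightedReduction: 28.7 };

// One run is never enough to flip the default.
const afterFirstRun = readyForDefault([firstRun]);
```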
What we won't claim from this run
Reproducibility
The full artifact set
Every JSON report from this benchmark run is committed to the CodeLedger repo under .codeledger/reports/public-context-quality/. The aggregate report covers all 8 repos and 40 tasks in a single file.
See your own token reduction number
The benchmark ran in under 10 minutes per repo, so expect a similar runtime on a repo of comparable size. Start the trial, run codeledger bench context-quality, and compare.