the scoreboard was the bug

we thought we were debugging benchmark score drift. we were actually debugging evidence drift. the retained artifact-backed xbow tally is now 99/104, the historical public line is 96/104, and the real work turned out to be separating those two truths.

when a benchmark number looks wrong, the instinct is to assume the agent regressed.

that is not what happened here.

we had a public xbow story that said:

  • 91 / 104 black-box
  • 96 / 104 best-of-n aggregate

and then we wrote a consolidator over the retained github artifacts and got 22 / 104 black-box.

the bug was not the benchmark. the bug was the scoreboard.

the first bad number

the initial consolidator only walked workflow runs whose overall conclusion was success.

that sounds reasonable until you remember how these benchmark workflows actually behave:

  • a long run can fail late
  • a repeat sweep can hit the wall-clock limit
  • the xbow-results-* artifact can still get uploaded and retained even when the run ends red

so we were throwing away perfectly good benchmark evidence just because the parent workflow finished red.

that is how you get a fake low number like 22 black-box and scare yourself into thinking the engine forgot how to pentest.
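
we do not have the original consolidator inline in this post, but the failure mode fits in one parameter. a minimal sketch, assuming an octokit client and placeholder repo names:

    import { Octokit } from "@octokit/rest";

    const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

    // BUG: filtering by "success" skips a run that solved challenges,
    // uploaded xbow-results-*, and then failed late
    const { data } = await octokit.rest.actions.listWorkflowRunsForRepo({
      owner: "example-org", // placeholder
      repo: "example-repo", // placeholder
      status: "success",    // the scoreboard bug, in one line
      per_page: 100,
    });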

the fix

we changed the consolidator to stop treating workflow conclusion as the same thing as evidence availability.

instead it now (sketched just below):

  1. scopes to completed xbow workflow runs
  2. walks the retained xbow-results-* artifacts directly
  3. includes runs whose conclusion was failure, as long as the artifact exists
  4. unions solved challenge ids into black-box, white-box, and aggregate sets

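here is the fixed loop as a sketch, again assuming octokit. parseSolvedIds stands in for the real download-and-parse step, which we are not reproducing here, and the run-name check is our assumption about how xbow runs are scoped:

    import { Octokit } from "@octokit/rest";

    // stand-in for the real step: download the artifact zip, read the
    // results file, return solved challenge ids per mode
    declare function parseSolvedIds(artifact: {
      archive_download_url: string;
    }): Promise<{ blackBox: string[]; whiteBox: string[] }>;

    const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
    const owner = "example-org"; // placeholder
    const repo = "example-repo"; // placeholder

    // 1. completed, not successful: conclusion is no longer a gate
    // (pagination elided for brevity)
    const { data } = await octokit.rest.actions.listWorkflowRunsForRepo({
      owner,
      repo,
      status: "completed",
      per_page: 100,
    });

    const blackBox = new Set<string>();
    const whiteBox = new Set<string>();

    for (const run of data.workflow_runs) {
      if (!run.name?.toLowerCase().includes("xbow")) continue; // assumed naming
      // 2. + 3. evidence availability is the artifact, not the conclusion
      const { data: arts } = await octokit.rest.actions.listWorkflowRunArtifacts({
        owner,
        repo,
        run_id: run.id,
      });
      for (const artifact of arts.artifacts) {
        if (artifact.expired || !artifact.name.startsWith("xbow-results-")) continue;
        const solved = await parseSolvedIds(artifact);
        solved.blackBox.forEach((id) => blackBox.add(id));
        solved.whiteBox.forEach((id) => whiteBox.add(id));
      }
    }

    // 4. aggregate is the union of the two mode sets
    const aggregate = new Set([...blackBox, ...whiteBox]);
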
that one shift, plus the targeted recovery runs described further down, moved the retained artifact-backed tally from nonsense to something credible:

  • 74 / 104 black-box
  • 79 / 104 white-box
  • 99 / 104 aggregate

not because the agent got better overnight, but because the evidence surface got more honest.

what the two truths are

there are now two benchmark truths in the repo, and pretending there is only one is what caused the confusion.

1. retained artifact-backed truth

this is the strongest thing we can prove from machine-recoverable github artifacts today.

current xbow retained tally:

  • 74 / 104 black-box
  • 79 / 104 white-box
  • 99 / 104 aggregate

this is the number you get if you ask:

what can we reproduce right now from the artifacts that still exist?

2. historical mixed local+ci publication

this is the older public line:

  • 91 / 104 black-box
  • 96 / 104 aggregate

it is not necessarily false. it is just not fully reconstructible from the retained artifact window we have today.

that means it belongs in the repo as historical publication, not as the only current source of truth.

the exact mismatch

once we diffed the retained aggregate against the public 96-set, the problem stopped being abstract.

the docs-only solved ids were:

  • XBEN-045
  • XBEN-053
  • XBEN-080
  • XBEN-082

the artifact-only solved ids were:

  • XBEN-054
  • XBEN-099

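mechanically the diff is nothing clever, just two set subtractions over challenge ids:

    // abbreviated: publicSet holds the ids the historical 96-set claims,
    // retained holds the ids the consolidator can prove from artifacts
    const publicSet = new Set(["XBEN-045", "XBEN-053", "XBEN-080", "XBEN-082"]);
    const retained = new Set(["XBEN-054", "XBEN-099"]);

    const docsOnly = [...publicSet].filter((id) => !retained.has(id));
    const artifactOnly = [...retained].filter((id) => !publicSet.has(id));
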
then we started dispatching targeted recovery runs instead of rerunning the whole suite blindly (sketched below).

that recovered machine-backed solves for:

  • XBEN-053
  • XBEN-080
  • XBEN-079
  • XBEN-082
  • XBEN-034

which shrank the docs-only gap to just:

  • XBEN-045

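the targeted runs were plain workflow dispatches. the workflow file name and the challenges input below are our assumptions; the repo may spell both differently:

    import { Octokit } from "@octokit/rest";

    const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

    // re-run only the gap, not the whole 104-challenge suite
    await octokit.rest.actions.createWorkflowDispatch({
      owner: "example-org",    // placeholder
      repo: "example-repo",    // placeholder
      workflow_id: "xbow.yml", // assumed workflow file name
      ref: "main",
      inputs: { challenges: "XBEN-053,XBEN-080,XBEN-082" }, // assumed input name
    });
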
that is the kind of progress you only get when the benchmark becomes a forensics problem instead of a vibes problem.

what changed in the docs

the repo now has a dedicated benchmark ledger:

  • packages/benchmark/results/benchmark-ledger.json

and the benchmark page explicitly distinguishes:

  • retained artifact-backed tally
  • historical mixed local+ci tally

that matters because secondary pages were freezing score claims in too many places:

  • readme
  • benchmark docs
  • roadmap
  • research pages
  • marketing citations

if five pages each own a scoreboard, you do not have five sources of truth. you have zero.

the rule now is simple:

  1. ledger owns the machine-readable truth (its shape is sketched below)
  2. benchmark page owns the human-readable explanation
  3. everything else summarizes or links
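
we are not reproducing the ledger's real schema here; the field names below are illustrative, but the distinction it has to encode is roughly this:

    // illustrative only: the actual schema lives in
    // packages/benchmark/results/benchmark-ledger.json
    type BenchmarkLedger = {
      benchmark: string; // "xbow"
      total: number;     // 104
      retained: {
        // artifact-backed: every id here maps to a recoverable artifact
        blackBox: string[];
        whiteBox: string[];
      };
      historical: {
        // mixed local+ci publication: kept and labeled, not re-derivable
        blackBox: number;
        aggregate: number;
        note: string;
      };
    };

keeping the retained side as id lists rather than counts means the aggregate stays derivable as a union, so no page ever has to store a number it cannot recompute.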

what we learned

the interesting lesson is not “best-of-n numbers can be gamed.” everybody already knows that.

the interesting lesson is:

benchmark evidence rots too.

artifacts expire. workflow conclusions hide useful results. docs keep old numbers alive after the machine-readable trace has moved.

if you are going to market with benchmark scores, you need to version the scoreboard with the same discipline you version the code.

otherwise one day you wake up, rerun the consolidator, and discover the benchmark was never your weakest link.

the bookkeeping was.

where this leaves pwnkit

the current retained artifact-backed xbow aggregate is 99 / 104. that is the strongest number we can prove from the artifacts we have today.

the historical public line is still 91 / 104 black-box and 96 / 104 aggregate, but it is now explicitly labeled as historical, mixed local+ci publication rather than silently presented as the only current truth.

that is a stronger position, not a weaker one.

the project did not lose a point. it gained an evidence model.