when a benchmark number looks wrong, the instinct is to assume the agent regressed.
that is not what happened here.
we had a public xbow story that said:
- 91 / 104 black-box
- 96 / 104 best-of-n aggregate
and then we wrote a consolidator over the retained github artifacts and got 22 black-box.
the bug was not the benchmark. the bug was the scoreboard.
the first bad number
the initial consolidator only walked workflow runs whose overall conclusion was success.
that sounds reasonable until you remember how these benchmark workflows actually behave:
- a long run can fail late
- a repeat sweep can hit the wall-clock limit
- github can still upload the xbow-results-* artifact after that
so we were throwing away perfectly good benchmark evidence just because the parent workflow finished red.
that is how you get a fake low number like 22 black-box and scare yourself into thinking the engine forgot how to pentest.
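the original bug fits in one line. a minimal sketch, assuming hypothetical run records shaped loosely like the github actions workflow-run api (field values here are illustrative, not real run data):

```python
# hypothetical run records: conclusion plus retained artifact names
runs = [
    {"conclusion": "success", "artifacts": ["xbow-results-12"]},
    {"conclusion": "failure", "artifacts": ["xbow-results-13"]},  # failed late, artifact survived
    {"conclusion": "failure", "artifacts": []},                   # genuinely no evidence
]

# the bug: gate on workflow conclusion instead of on evidence
kept = [r for r in runs if r["conclusion"] == "success"]

# the second run's retained xbow-results-* artifact is silently discarded
```

the filter is not wrong about anything it keeps. it is wrong about what it throws away.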
the fix
we changed the consolidator to stop treating workflow conclusion as the same thing as evidence availability.
instead it now:
- scopes by completed xbow workflow runs
- walks retained xbow-results-* artifacts directly
- includes runs that finished failure if the artifact exists
- unions solved challenge ids into black-box, white-box, and aggregate sets
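the corrected walk can be sketched in a few lines. this is a hedged illustration, not the real consolidator: the function name, record shape, and sample data are assumptions, but the logic matches the steps above (scope by completed, match artifacts by name, union solved ids regardless of conclusion):

```python
import fnmatch

def consolidate(runs):
    """Union solved challenge ids from every completed run that still has
    a retained xbow-results-* artifact, regardless of run conclusion."""
    black_box, white_box = set(), set()
    for run in runs:
        # scope by completed runs, not by success vs failure
        if run["status"] != "completed":
            continue
        for artifact in run["artifacts"]:
            # walk retained xbow-results-* artifacts directly
            if not fnmatch.fnmatch(artifact["name"], "xbow-results-*"):
                continue
            black_box |= set(artifact.get("black_box", []))
            white_box |= set(artifact.get("white_box", []))
    # aggregate is the union of both modes
    return black_box, white_box, black_box | white_box

# hypothetical runs: the second finished "failure" but kept its artifact
runs = [
    {"status": "completed", "conclusion": "success", "artifacts": [
        {"name": "xbow-results-8", "black_box": ["XBEN-001"], "white_box": ["XBEN-002"]}]},
    {"status": "completed", "conclusion": "failure", "artifacts": [
        {"name": "xbow-results-9", "black_box": ["XBEN-054"], "white_box": []}]},
]
bb, wb, agg = consolidate(runs)
```

the only load-bearing change is that the conclusion field is never consulted: a failed run with a retained artifact contributes exactly as much evidence as a green one.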
that one shift moved the retained artifact-backed tally from nonsense to something credible:
- 74 / 104 black-box
- 79 / 104 white-box
- 99 / 104 aggregate
not because the agent got better overnight, but because the evidence surface got more honest.
what the two truths are
there are now two benchmark truths in the repo, and pretending there is only one is what caused the confusion.
1. retained artifact-backed truth
this is the strongest thing we can prove from machine-recoverable github artifacts today.
current xbow retained tally:
- 74 / 104 black-box
- 77 / 104 white-box
- 97 / 104 aggregate
this is the number you get if you ask:
what can we reproduce right now from the artifacts that still exist?
2. historical mixed local+ci publication
this is the older public line:
- 91 / 104 black-box
- 96 / 104 aggregate
it is not necessarily false. it is just not fully reconstructible from the retained artifact window we have today.
that means it belongs in the repo as historical publication, not as the only current source of truth.
the exact mismatch
once we diffed the retained aggregate against the public 96-set, the problem stopped being abstract.
the docs-only solved ids were:
XBEN-045, XBEN-053, XBEN-080, XBEN-082
the artifact-only solved ids were:
XBEN-054, XBEN-099
then we started dispatching targeted recovery runs instead of rerunning the whole suite blindly.
that recovered machine-backed solves for:
XBEN-053, XBEN-080, XBEN-079, XBEN-082, XBEN-034
which shrank the docs-only gap to just:
XBEN-045
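the diff itself is plain set arithmetic. a sketch with illustrative subsets only (the real sets hold challenge ids from the full 104-challenge suite):

```python
# illustrative subsets, not the real 96-set or retained set
public_96 = {"XBEN-045", "XBEN-053", "XBEN-054", "XBEN-080"}
retained  = {"XBEN-053", "XBEN-054", "XBEN-080", "XBEN-099"}

docs_only     = sorted(public_96 - retained)   # claimed in docs, no artifact proof
artifact_only = sorted(retained - public_96)   # artifact proof, missing from docs
```

each docs-only id then becomes a candidate for a targeted recovery run, which is why the gap shrinks one id at a time instead of requiring a full resweep.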
that is the kind of progress you only get when the benchmark becomes a forensics problem instead of a vibes problem.
what changed in the docs
the repo now has a dedicated benchmark ledger:
packages/benchmark/results/benchmark-ledger.json
and the benchmark page explicitly distinguishes:
- retained artifact-backed tally
- historical mixed local+ci tally
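the post does not show the ledger schema, so the fragment below is a plausible sketch only: every field name is an assumption, and the tallies simply mirror the numbers quoted elsewhere in this post.

```json
{
  "benchmark": "xbow",
  "total_challenges": 104,
  "retained_artifact_backed": {
    "black_box": 74,
    "white_box": 79,
    "aggregate": 99
  },
  "historical_mixed_local_ci": {
    "black_box": 91,
    "aggregate": 96
  }
}
```

the point of a shape like this is that both truths live in one machine-readable place, labeled, instead of being flattened into a single unlabeled score.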
that matters because secondary pages were freezing score claims in too many places:
- readme
- benchmark docs
- roadmap
- research pages
- marketing citations
if five pages each own a scoreboard, you do not have five sources of truth. you have zero.
the rule now is simple:
- ledger owns the machine-readable truth
- benchmark page owns the human-readable explanation
- everything else summarizes or links
what we learned
the interesting lesson is not “best-of-n numbers can be gamed.” everybody already knows that.
the interesting lesson is:
benchmark evidence rots too.
artifacts expire. workflow conclusions hide useful results. docs keep old numbers alive after the machine-readable trace has moved.
if you are going to market with benchmark scores, you need to version the scoreboard with the same discipline you version the code.
otherwise one day you wake up, rerun the consolidator, and discover the benchmark was never your weakest link.
the bookkeeping was.
where this leaves pwnkit
the current retained artifact-backed xbow aggregate is 99 / 104. that is the strongest number we can prove from the artifacts we have today.
the historical public line is still 91 / 104 black-box and 96 / 104 aggregate, but it is now explicitly labeled as historical, mixed local+ci publication rather than silently presented as the only current truth.
that is a stronger position, not a weaker one.
the project did not lose a point. it gained an evidence model.