Update — 2026-04-12. Batch 2 landed overnight. The original batch 1 framing — “moat costs 2 flags on white-box” and “stable features cause npm-bench FPR” — did not replicate. Removing the one broken layer (egats) closed the flag gap to noise. The npm-bench FPR swing was a 2-package flip on 27 safe packages. Corrected numbers inline below; original framing preserved so the trail stays honest.
v0.6.0 shipped on April 6 with an 11-layer triage pipeline we called “the moat.” the marketing claim was “FPR down from ~50% to under 5%, matching Endor Labs’ 95% and Semgrep Assistant’s 96%.” we pointed at it when people asked what our moat was. it was the literal name of the directory.
then we ran the ablation.
the prior data that was wrong twice
before the ablation we had two internal data points.
npm-bench F1 = 0.444. filed as a “recall problem.” cited in internal docs for a week. the score was from a 30-package slice. the current 81-package set scores F1 = 0.973 with 100% recall across every profile. the recall problem does not exist on the live test set.
stubborn-14 ablation: 4 flags baseline → 0 flags with moat. we almost turned the moat off based on this. stubborn-14 is the 14 hardest challenges. n=14 with LLM non-determinism. each flag is a 7.1pp swing. the limit=50 result told the opposite story. stubborn-slice evaluations lie — they diagnose failure modes, they don’t measure whether to ship.
21 runs, $300, 6 hours
four profiles. two modes. five npm-bench configs. seven single-feature isolations. everything feature-flagged and reproducible via `gh workflow run xbow-bench.yml -f features=<profile>`.
profiles: none (zero triage), no-triage (stable features on, triage off), moat-only (triage on, stable off), moat (everything).
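the profile definitions above can be read as feature-flag sets. a minimal sketch, assuming the flag tokens below (the real token list may differ; only the seven isolated triage layers from the ablation are shown):

```python
# illustrative flag tokens; the real pipeline's names may differ.
STABLE = {"early-stop", "script-templates", "progress-handoff"}
TRIAGE = {"pov", "reachability", "multimodal", "debate",
          "memories", "egats", "consensus"}  # the 7 isolated layers

PROFILES = {
    "none":      set(),            # zero triage, zero stable features
    "no-triage": STABLE,           # stable features on, triage off
    "moat-only": TRIAGE,           # triage on, stable features off
    "moat":      STABLE | TRIAGE,  # everything
}

def enabled(profile: str, flag: str) -> bool:
    """True if the given feature flag is on under this profile."""
    return flag in PROFILES[profile]
```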
xbow white-box, limit=50
| profile | flags | findings | cost | $/flag |
|---|---|---|---|---|
| none | 43/50 | 67 | $14 | $0.33 |
| no-triage | 44/50 | 67 | $17 | $0.39 |
| moat-only | 41/50 | 25 | $27 | $0.66 |
| moat | 41/50 | 25 | $22 | $0.53 |
the moat cuts findings 63% (67 → 25) and costs 2 flags (44 → 41). that is a pareto tradeoff. whether it is a good one depends on whether the downstream user wants fewer findings or more flags.
Batch 2 correction. After removing egats (see below), these numbers became 42/50 on both moat profiles at $16/run. all four profiles are 42–44. the 2-flag gap was egats-specific. with egats off, the moat is effectively free precision.
xbow black-box, limit=25
| profile | flags | findings | cost | $/flag |
|---|---|---|---|---|
| none | 18/25 | 27 | $14 | $0.76 |
| no-triage | 19/25 | 34 | $10 | $0.55 |
| moat-only | 18/25 | 13 | $11 | $0.62 |
| moat | 19/25 | 14 | $10 | $0.53 |
strict pareto dominance. more flags, fewer findings, cheaper per flag. the moat works in black-box. the hypothesis: when the agent has no source code, the triage layers add value by re-checking noisy external-signal findings. when it has source code, it generates high-confidence exploits and the triage layers second-guess a confident agent.
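the dominance claim is mechanical to check. a sketch over the three axes in the table (flags up, findings down, $/flag down):

```python
def dominates(a: dict, b: dict) -> bool:
    """True if profile a strictly Pareto-dominates b: at least as good
    on every axis, strictly better on at least one.
    Axes: flags (higher better), findings (lower better),
    cost_per_flag (lower better)."""
    better = (a["flags"] >= b["flags"],
              a["findings"] <= b["findings"],
              a["cost_per_flag"] <= b["cost_per_flag"])
    strict = (a["flags"] > b["flags"],
              a["findings"] < b["findings"],
              a["cost_per_flag"] < b["cost_per_flag"])
    return all(better) and any(strict)

# black-box table rows: moat beats the zero-triage baseline on all three axes
none_bb = {"flags": 18, "findings": 27, "cost_per_flag": 0.76}
moat_bb = {"flags": 19, "findings": 14, "cost_per_flag": 0.53}
assert dominates(moat_bb, none_bb)
```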
npm-bench, 81 packages
| profile | F1 | TPR | FPR | safe correct |
|---|---|---|---|---|
| none | 0.973 | 1.00 | 0.11 | 24/27 |
| no-triage | 0.964 | 1.00 | 0.15 | 23/27 |
| moat-only | 0.964 | 1.00 | 0.15 | 23/27 |
| moat | 0.956 | 1.00 | 0.19 | 22/27 |
| default | 0.956 | 1.00 | 0.19 | 22/27 |
moat and default are identical. the 11-layer triage stack contributes nothing to FPR on npm-bench. the FPR increase from none (0.11) to default (0.19) comes from the stable features — early-stop, script templates, progress handoff — making the agent more productive on safe packages.
Batch 2 correction. The 0.19 did not replicate. batch 2's `default` got 0.11. the swing is ±2 packages on 27 safe samples. need `repeat=3` before drawing FPR conclusions at this sample size.
100% recall on every profile. every malicious package, every vulnerable package caught.
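the F1/FPR columns follow directly from the confusion counts. a sketch, assuming 27 safe and 54 malicious-or-vulnerable packages (81 total) and the 100% recall shown above:

```python
def npm_bench_metrics(safe_correct: int, n_safe: int = 27, n_bad: int = 54):
    """Derive (F1, FPR) at 100% recall from the safe-package count.
    Assumes every bad package is flagged (TP = n_bad, FN = 0)."""
    fp = n_safe - safe_correct
    precision = n_bad / (n_bad + fp)
    recall = 1.0                       # 100% recall on every profile
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / n_safe
    return round(f1, 3), round(fpr, 2)

# matches the table: none (24/27) and moat (22/27)
assert npm_bench_metrics(24) == (0.973, 0.11)
assert npm_bench_metrics(22) == (0.956, 0.19)
```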
the one broken layer
seven profiles against the stubborn-14, each adding exactly one moat layer to default:
| layer | flags | delta | $/flag |
|---|---|---|---|
| default | 2/14 | — | $3.62 |
| +pov | 4/14 | +2 | $2.39 |
| +reachability | 5/14 | +3 | $1.61 |
| +multimodal | 3/14 | +1 | $2.52 |
| +debate | 5/14 | +3 | $2.65 |
| +memories | 4/14 | +2 | $3.35 |
| +egats | 1/14 | −1 | $15.93 |
| +consensus | 3/14 | +1 | $2.67 |
six of seven help. egats loses a flag at $15.93 per flag, nearly 5x the next-most-expensive layer. when the full moat runs, egats prunes exploration branches the other layers would have used. the interaction is multiplicatively destructive on hard challenges.
this is the full explanation of the original “moat catastrophically regresses” finding. it was not the moat. it was one layer of the moat. we disabled egats in the default profile the same afternoon. the code stays in the tree, the default is off.
reachability is the standout: +3 flags at $1.61 each. best cost-per-flag of any layer.
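a crude sanity filter over the isolation table surfaces egats without eyeballing. the thresholds here are illustrative, not part of the pipeline:

```python
ISOLATION = {  # layer: (flags out of 14, $/flag) from the table above
    "pov": (4, 2.39), "reachability": (5, 1.61), "multimodal": (3, 2.52),
    "debate": (5, 2.65), "memories": (4, 3.35), "egats": (1, 15.93),
    "consensus": (3, 2.67),
}
BASELINE_FLAGS = 2  # default profile on the stubborn-14

def broken_layers(isolation: dict, baseline: int) -> list:
    """A layer is suspect if it loses flags versus baseline, or its
    $/flag is a gross outlier (illustrative: >3x the median)."""
    costs = sorted(c for _, c in isolation.values())
    median = costs[len(costs) // 2]
    return [name for name, (flags, cost) in isolation.items()
            if flags < baseline or cost > 3 * median]

assert broken_layers(ISOLATION, BASELINE_FLAGS) == ["egats"]
```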
batch 2: the correction of the correction
overnight we re-ran the full white-box matrix against the commit that disabled egats.
| profile | batch 1 | batch 2 | delta |
|---|---|---|---|
| none | 43/50 | 44/50 | +1 |
| no-triage | 44/50 | 43/50 | −1 |
| moat-only | 41/50 | 42/50 | +1 |
| moat | 41/50 | 42/50 | +1 |
removing egats: +1 flag, −25% cost on the moat profiles. all four are 42–44. the gap is noise.
we also ran single-feature isolation on npm-bench stable features:
| profile | F1 | TPR | FPR |
|---|---|---|---|
| default (batch 2) | 0.973 | 1.00 | 0.11 |
| no-script-templates | 0.964 | 0.98 | 0.11 |
| no-handoff | 0.973 | 1.00 | 0.11 |
the batch 1 “stable features cause FPR” finding did not replicate. batch 2 default matched batch 1 none. the 0.19 was a 2-package noise swing. no-script-templates loses a detection — script templates help recall. do not disable them.
a finding that does not replicate 12 hours later on the same code is noise, not signal. 27 safe packages is not enough for single-run FPR conclusions.
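the noise claim can be made concrete. a normal-approximation 95% interval on 27 safe packages (crude at this n, which is the point) easily covers both batches:

```python
import math

def fpr_interval(fp: int, n: int, z: float = 1.96):
    """Approximate 95% CI for an FPR estimate with fp false positives
    out of n safe packages (normal approximation; it is deliberately
    wide at n=27)."""
    p = fp / n
    half = z * math.sqrt(p * (1 - p) / n)
    return round(p - half, 2), round(p + half, 2)

lo, hi = fpr_interval(5, 27)   # batch 1 default: FPR 0.19
assert lo <= 3 / 27 <= hi      # batch 2's 0.11 sits inside batch 1's interval
```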
overnight cross-benchmark
while the ablation was running, benchmark suites across five other domains completed:
| benchmark | score | notes |
|---|---|---|
| picoctf | 8/10 (80%) | client-side, robots.txt, login bypass, sqli |
| hackbench | 3/5 (60%) | subset of the 16-challenge suite |
| argus | 2/5 (40%) | multi-step APT scenarios |
| portswigger | 0/10 | zero findings — likely expired lab session, not agent limitation |
| bountybench | 0/3 | different attack domain (patch/detect/exploit) |
portswigger at 0/10 with zero findings on basic sqli and reflected xss is suspicious in a “the session expired” way, not a “the agent can’t do xss” way. investigating.
what replicated across both batches
- egats is the one broken layer. removing it improved moat +1 flag, −25% cost.
- the moat cuts findings ~60%. batch 1: 67→25. batch 2: 72→27.
- 100% recall on npm-bench. every profile, both batches.
- black-box moat strictly dominates. 37/50 at limit=50.
what did not replicate
- “the moat costs 2 flags on white-box” — after egats, the gap is 0–2. noise.
- “stable features cause npm-bench FPR” — batch 2 default got 0.11, matching batch 1 none.
what shipped in 48 hours
- per-finding layer telemetry. every finding logs which triage layer touched it, what verdict, the cost, the duration. commit `6f1a889`.
- npm-bench feature-flag support. mirrors xbow-bench. enabled the ablation matrix.
- `triage-dataset-v2.jsonl`. 1514 labeled rows from 32 results files. 163 with per-layer verdicts.
- egats disabled in all moat profiles. commit `aadcf32`.
- stable-feature isolation tokens: `no-script-templates`, `no-handoff`, `no-early-stop`.
- `fp-reduction-moat.md` rewritten with measured numbers. seven other docs updated.
- dynamic routing design doc. the architecture for the learned per-finding router.
where this lands
with egats disabled:
- xbow white-box: moat reduces findings 60% at 0–2 flag cost. free precision.
- xbow black-box: moat strictly dominates the baseline.
- npm-bench: moat is a no-op. FPR swings are noise. 100% recall preserved.
- picoctf: 80%.
- argus: 40%.
the marketing claim was wrong. the engineering was closer to right than the first data showed. and the first data almost made us turn off something that works.
no single static triage policy wins on all three slices. no-triage wins white-box by raw flags. moat wins black-box in strict pareto. none wins npm-bench on FPR. that is the direct motivation for learned per-finding routing — a classifier that picks which layers to run based on the finding, not a flag the operator sets once.
the per-layer telemetry is the training data. the v2 dataset has 1514 rows. the architecture is inspired by VulnBERT (hybrid handcrafted features + neural embeddings, 91.4% recall at 5.9% FPR on linux kernel commits). the design doc is live, starting with XGBoost on the 45-feature vector as the simplest thing that could work.
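for concreteness, the router's decision boundary can be sketched with placeholder logic. the feature names and thresholds below are hypothetical stand-ins for the learned model, wired to the ablation's three results (black-box benefits, confident white-box exploits don't, reachability is the cheapest win):

```python
# hypothetical routing sketch; the real router is the learned classifier
# from the design doc, not these hand-picked rules.
TRIAGE_LAYERS = ["pov", "reachability", "multimodal", "debate",
                 "memories", "consensus"]  # egats stays disabled

def route(finding: dict) -> list:
    """Pick which triage layers to run for one finding.
    Stand-in logic: black-box findings (no source) get the full stack,
    where the moat strictly dominates; confident white-box exploits
    skip triage rather than second-guess a confident agent."""
    if not finding.get("has_source", False):
        return TRIAGE_LAYERS
    if finding.get("exploit_confidence", 0.0) >= 0.9:
        return []                  # don't second-guess a confident agent
    return ["reachability"]        # best $/flag layer as the cheap default
```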