
the moat that wasn't (and then was)

we A/B tested every layer of our own 11-layer triage stack. the first result said it was broken. the second said it was a tradeoff. the third said one layer was broken and the rest were fine. 39 benchmark runs, $300 in model spend, and three corrections in 48 hours.

Update — 2026-04-12. Batch 2 landed overnight. The original batch 1 framing — “moat costs 2 flags on white-box” and “stable features cause npm-bench FPR” — did not replicate. Removing the one broken layer (egats) closed the flag gap to noise. The npm-bench FPR swing was a 2-package flip on 27 safe packages. Corrected numbers inline below; original framing preserved so the trail stays honest.

v0.6.0 shipped on April 6 with an 11-layer triage pipeline we called “the moat.” the marketing claim was “FPR down from ~50% to under 5%, matching Endor Labs’ 95% and Semgrep Assistant’s 96%.” we pointed at it when people asked what our moat was. it was the literal name of the directory.

then we ran the ablation.

the prior data that was wrong twice

before the ablation we had two internal data points.

npm-bench F1 = 0.444. filed as a “recall problem.” cited in internal docs for a week. the score was from a 30-package slice. the current 81-package set scores F1 = 0.973 with 100% recall across every profile. the recall problem does not exist on the live test set.

stubborn-14 ablation: 4 flags baseline → 0 flags with moat. we almost turned the moat off based on this. but stubborn-14 is the 14 hardest challenges: at n=14 with LLM non-determinism, each flag is a 7.1pp swing. the limit=50 result told the opposite story. stubborn-slice evaluations lie: they diagnose failure modes, they don't measure whether to ship.

21 runs, $300, 6 hours

four profiles. two modes. five npm-bench configs. seven single-feature isolations. everything feature-flagged and reproducible via `gh workflow run xbow-bench.yml -f features=<profile>`.

profiles: none (zero triage), no-triage (stable features on, triage off), moat-only (triage on, stable off), moat (everything).
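
as a sketch, the four profiles are just flag sets. a minimal illustration in Python, using the seven isolated layers below to stand in for the full 11-layer stack and the stable-feature names from the npm-bench section; the real config format is an assumption:

```python
# Illustrative composition of the four ablation profiles as flag sets.
# Names follow the post; the actual config format is an assumption.
STABLE_FEATURES = {"early-stop", "script-templates", "progress-handoff"}
TRIAGE_LAYERS = {"pov", "reachability", "multimodal", "debate",
                 "memories", "egats", "consensus"}   # the 7 isolated layers

PROFILES = {
    "none":      set(),                            # everything off
    "no-triage": STABLE_FEATURES,                  # stable on, triage off
    "moat-only": TRIAGE_LAYERS,                    # triage on, stable off
    "moat":      STABLE_FEATURES | TRIAGE_LAYERS,  # everything on
}

def enabled(profile: str, flag: str) -> bool:
    return flag in PROFILES[profile]
```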

xbow white-box, limit=50

| profile | flags | findings | cost | $/flag |
|---|---|---|---|---|
| none | 43/50 | 67 | $14 | $0.33 |
| no-triage | 44/50 | 67 | $17 | $0.39 |
| moat-only | 41/50 | 25 | $27 | $0.66 |
| moat | 41/50 | 25 | $22 | $0.53 |

the moat cuts findings 63% (67 → 25) and costs 2 flags (44 → 41). that is a pareto tradeoff. whether it is a good one depends on whether the downstream user wants fewer findings or more flags.

Batch 2 correction. After removing egats (see below), these numbers became 42/50 on both moat profiles at $16/run. all four profiles are 42–44. the 2-flag gap was egats-specific. with egats off, the moat is effectively free precision.

xbow black-box, limit=25

| profile | flags | findings | cost | $/flag |
|---|---|---|---|---|
| none | 18/25 | 27 | $14 | $0.76 |
| no-triage | 19/25 | 34 | $10 | $0.55 |
| moat-only | 18/25 | 13 | $11 | $0.62 |
| moat | 19/25 | 14 | $10 | $0.53 |

strict pareto dominance over the zero-triage baseline: more flags, fewer findings, cheaper per flag. the moat works in black-box. the hypothesis: when the agent has no source code, the triage layers add value by re-checking noisy external-signal findings. when it has source code, it generates high-confidence exploits and the triage layers second-guess a confident agent.

npm-bench, 81 packages

| profile | F1 | TPR | FPR | safe correct |
|---|---|---|---|---|
| none | 0.973 | 1.00 | 0.11 | 24/27 |
| no-triage | 0.964 | 1.00 | 0.15 | 23/27 |
| moat-only | 0.964 | 1.00 | 0.15 | 23/27 |
| moat | 0.956 | 1.00 | 0.19 | 22/27 |
| default | 0.956 | 1.00 | 0.19 | 22/27 |

moat and default are identical. the 11-layer triage stack contributes nothing to FPR on npm-bench. the FPR increase from none (0.11) to default (0.19) comes from the stable features — early-stop, script templates, progress handoff — making the agent more productive on safe packages.

Batch 2 correction. The 0.19 did not replicate. batch 2’s default got 0.11. the swing is ±2 packages on 27 safe samples. need repeat=3 before drawing FPR conclusions at this sample size.

100% recall on every profile. every malicious package, every vulnerable package caught.
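
the table reduces to two counts per run. a worked check in Python, assuming the 81 packages split 54 malicious-or-vulnerable / 27 safe, the split the safe-correct column implies:

```python
# Reconstruct the npm-bench metrics from raw counts.
# Assumes 81 packages = 54 unsafe + 27 safe, as the tables imply.
def npm_metrics(safe_correct: int, unsafe_total: int = 54, safe_total: int = 27):
    tp = unsafe_total               # 100% recall on every profile, so fn = 0
    fp = safe_total - safe_correct  # safe packages wrongly flagged
    f1 = 2 * tp / (2 * tp + fp)     # fn = 0 drops out of the denominator
    fpr = fp / safe_total
    return round(f1, 3), round(fpr, 2)

print(npm_metrics(24))  # none:                 (0.973, 0.11)
print(npm_metrics(23))  # no-triage, moat-only: (0.964, 0.15)
print(npm_metrics(22))  # moat, default:        (0.956, 0.19)
```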

the one broken layer

seven profiles against the stubborn-14, each adding exactly one moat layer to default:

| layer | flags | delta | $/flag |
|---|---|---|---|
| default | 2/14 | | $3.62 |
| +pov | 4/14 | +2 | $2.39 |
| +reachability | 5/14 | +3 | $1.61 |
| +multimodal | 3/14 | +1 | $2.52 |
| +debate | 5/14 | +3 | $2.65 |
| +memories | 4/14 | +2 | $3.35 |
| +egats | 1/14 | −1 | $15.93 |
| +consensus | 3/14 | +1 | $2.67 |

six of seven help. egats loses a flag and, at $15.93/flag, costs roughly 10x the best layer and more than 4x the next-most-expensive one. when the full moat runs, egats prunes exploration branches the other layers would have used; it doesn't just fail to add, it undoes their gains on hard challenges.

this is the full explanation of the original “moat catastrophically regresses” finding. it was not the moat. it was one layer of the moat. we disabled egats in the default profile the same afternoon. the code stays in the tree, the default is off.

reachability is the standout: +3 flags at $1.61 each. best cost-per-flag of any layer.
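
the ranking falls out of the table directly. a small sketch that recomputes the deltas and sorts by cost per flag, numbers copied from the table above:

```python
# Single-layer isolation results, as data. delta = flags vs the 2/14 default.
ISOLATION = {  # layer: (flags out of 14, $ per flag)
    "pov": (4, 2.39), "reachability": (5, 1.61), "multimodal": (3, 2.52),
    "debate": (5, 2.65), "memories": (4, 3.35), "egats": (1, 15.93),
    "consensus": (3, 2.67),
}
BASELINE = 2  # default profile: 2/14

for layer, (flags, cpf) in sorted(ISOLATION.items(), key=lambda kv: kv[1][1]):
    print(f"{layer:13s} {flags - BASELINE:+d} flags  ${cpf:6.2f}/flag")
# reachability tops the list (+3 at $1.61); egats is the only negative delta.
```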

batch 2: the correction of the correction

overnight we re-ran the full white-box matrix against the commit that disabled egats.

| profile | batch 1 | batch 2 | delta |
|---|---|---|---|
| none | 43/50 | 44/50 | +1 |
| no-triage | 44/50 | 43/50 | −1 |
| moat-only | 41/50 | 42/50 | +1 |
| moat | 41/50 | 42/50 | +1 |

removing egats: +1 flag, −25% cost on the moat profiles. all four are 42–44. the gap is noise.

we also ran single-feature isolation on npm-bench stable features:

| profile | F1 | TPR | FPR |
|---|---|---|---|
| default (batch 2) | 0.973 | 1.00 | 0.11 |
| no-script-templates | 0.964 | 0.98 | 0.11 |
| no-handoff | 0.973 | 1.00 | 0.11 |

the batch 1 “stable features cause FPR” finding did not replicate. batch 2 default matched batch 1 none. the 0.19 was a 2-package noise swing. no-script-templates loses a detection — script templates help recall. do not disable them.

a finding that does not replicate 12 hours later on the same code is noise, not signal. 27 safe packages is not enough for single-run FPR conclusions.
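
to put numbers on "not enough": a 95% Wilson interval on a single-run FPR at n=27, plain stdlib:

```python
# 95% Wilson score interval for a single-run FPR on n = 27 safe packages.
from math import sqrt

def wilson(fp: int, n: int, z: float = 1.96) -> tuple[float, float]:
    p = fp / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

print(wilson(3, 27))  # FPR 0.11 (batch 2 default) -> ~(0.04, 0.28)
print(wilson(5, 27))  # FPR 0.19 (batch 1 default) -> ~(0.08, 0.37)
# the intervals overlap almost entirely: a 2-package flip is inside the noise.
```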

overnight cross-benchmark

while the ablation was running, benchmark suites across five other domains completed:

| benchmark | score | notes |
|---|---|---|
| picoctf | 8/10 (80%) | client-side, robots.txt, login bypass, sqli |
| hackbench | 3/5 (60%) | subset of the 16-challenge suite |
| argus | 2/5 (40%) | multi-step APT scenarios |
| portswigger | 0/10 | zero findings; likely expired lab session, not agent limitation |
| bountybench | 0/3 | different attack domain (patch/detect/exploit) |

portswigger at 0/10 with zero findings on basic sqli and reflected xss is suspicious in a “the session expired” way, not a “the agent can’t do xss” way. investigating.

what replicated across both batches

  • egats is the one broken layer. removing it improved moat +1 flag, −25% cost.
  • the moat cuts findings ~60%. batch 1: 67→25. batch 2: 72→27.
  • 100% recall on npm-bench. every profile, both batches.
  • black-box moat strictly dominates. 37/50 at limit=50.

what did not replicate

  • “the moat costs 2 flags on white-box” — after egats, the gap is 0–2. noise.
  • “stable features cause npm-bench FPR” — batch 2 default got 0.11, matching batch 1 none.

what shipped in 48 hours

  1. per-finding layer telemetry. every finding logs which triage layer touched it, what verdict, the cost, the duration. commit 6f1a889.
  2. npm-bench feature-flag support. mirrors xbow-bench. enabled the ablation matrix.
  3. triage-dataset-v2.jsonl. 1514 labeled rows from 32 results files. 163 with per-layer verdicts. a sketch of the row shape follows this list.
  4. egats disabled in all moat profiles. commit aadcf32.
  5. stable-feature isolation tokens. no-script-templates, no-handoff, no-early-stop.
  6. fp-reduction-moat.md rewritten with measured numbers. seven other docs updated.
  7. dynamic routing design doc. the architecture for the learned per-finding router.
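
what one of those 163 per-layer rows might carry. every field name here is a guess at the shape, not the actual schema:

```python
# Illustrative shape of one triage-dataset-v2.jsonl row with per-layer
# verdicts. All field names and values are assumptions, not the real schema.
row = {
    "finding_id": "xbow-whitebox-017",        # hypothetical id
    "benchmark": "xbow",
    "mode": "white-box",
    "layer_verdicts": {                       # from the per-finding telemetry
        "reachability": {"verdict": "confirm", "cost_usd": 0.04, "ms": 2100},
        "debate":       {"verdict": "confirm", "cost_usd": 0.11, "ms": 5400},
    },
    "final_verdict": "flag",
    "label": True,                            # ground truth: real finding
}
```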

where this lands

with egats disabled:

  • xbow white-box: moat reduces findings 60% at 0–2 flag cost. free precision.
  • xbow black-box: moat strictly dominates the baseline.
  • npm-bench: moat is a no-op. FPR swings are noise. 100% recall preserved.
  • picoctf: 80%.
  • argus: 40%.

the marketing claim was wrong. the engineering was closer to right than the first data showed. and the first data almost made us turn off something that works.

no single static triage policy wins on all three slices. no-triage wins white-box by raw flags. moat wins black-box in strict pareto. none wins npm-bench on FPR. that is the direct motivation for learned per-finding routing — a classifier that picks which layers to run based on the finding, not a flag the operator sets once.

the per-layer telemetry is the training data. the v2 dataset has 1514 rows. the architecture is inspired by VulnBERT (hybrid handcrafted features + neural embeddings, 91.4% recall at 5.9% FPR on linux kernel commits). the design doc is live, starting with XGBoost on the 45-feature vector as the simplest thing that could work.
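
a minimal sketch of that starting point: one binary classifier per layer, trained to predict whether running that layer helps a given finding. it assumes each row carries the 45-dim feature vector and a per-layer "helped" label derived from the telemetry; every field name below is an assumption, only XGBoost and the dataset filename come from the post:

```python
# Per-finding layer router: one binary XGBoost classifier per triage layer.
# Trains on triage-dataset-v2.jsonl; feature/label field names are assumed.
import json
import numpy as np
import xgboost as xgb

LAYERS = ["pov", "reachability", "multimodal", "debate", "memories",
          "consensus"]  # egats omitted: disabled by default after the ablation

def load(path="triage-dataset-v2.jsonl"):
    rows = [json.loads(line) for line in open(path)]
    X = np.array([r["features"] for r in rows])  # assumed 45-dim vectors
    y = {l: np.array([int(r["layer_helped"].get(l, False)) for r in rows])
         for l in LAYERS}
    return X, y

def train_routers(X, y):
    return {l: xgb.XGBClassifier(n_estimators=200, max_depth=4,
                                 eval_metric="logloss").fit(X, y[l])
            for l in LAYERS}

def route(routers, features, threshold=0.5):
    """Pick which triage layers to run for one finding."""
    x = np.asarray(features, dtype=float).reshape(1, -1)
    return [l for l, m in routers.items()
            if m.predict_proba(x)[0, 1] >= threshold]
```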