on April 29 Niels Provos published “Finding Zero-Days with Any Model.” the headline thesis, in his own words: “vulnerability discovery is an orchestration problem, not a frontier-model problem.”

he replicates the 1998 OpenBSD TCP SACK bug — the one he committed himself, 27 years ago — using Sonnet 4.6 and Opus 4.6 driven by his open-source IronCurtain framework. then he points the same workflow at a foundational library, swaps the model out for Z.AI’s GLM 5.1 over a LiteLLM gateway, and finds an integer-truncation flaw that’s been sitting on a memory-allocation path for 18 years. the orchestration layer doesn’t change. only the model does.

this is the same bet pwnkit is built on. our XBOW black-box record of 97/104 (93.3%) is on Sonnet 4.6, not on a frontier-restricted preview model. our white-box 101/104 is on the same provider mix anyone with an OpenRouter key can use. the IronCurtain post validates the macro story: orchestration scaffolding extracts capability that vendors gate behind embargoed releases. the floor for what commodity models can do is now high enough to reach it.

so I read his post line-by-line and asked: where is pwnkit ahead, where is pwnkit behind, and what do we have to borrow?

where pwnkit already maps to his design

hypothesize statically, validate by execution. he calls it the central FSM discipline. we call it blind verification: every finding gets independently re-exploited before it shows up in a report. same idea, different name. the disclose pipeline that shipped in PR #206 enforces this at the advisory layer too — no PoC, no advisory.
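a minimal sketch of that gate, to make the rule concrete. every type and function name here is hypothetical and not taken from pwnkit's codebase; the verifier is a stand-in for whatever actually re-runs the exploit.

```typescript
// Hypothetical shapes: none of these names come from pwnkit.
type Finding = { id: string; title: string; poc: string };
type Verdict = { findingId: string; reproduced: boolean };

// A verifier re-runs the PoC without seeing the reasoning that produced the finding.
type Verifier = (poc: string) => boolean;

function blindVerify(findings: Finding[], verify: Verifier): Verdict[] {
  return findings.map((f) => ({
    findingId: f.id,
    // An empty PoC can never be re-exploited, so it fails verification outright.
    reproduced: f.poc.trim().length > 0 && verify(f.poc),
  }));
}

// Only findings that were independently reproduced reach the report.
function reportable(findings: Finding[], verdicts: Verdict[]): Finding[] {
  const ok = verdicts.filter((v) => v.reproduced).map((v) => v.findingId);
  return findings.filter((f) => ok.indexOf(f.id) !== -1);
}
```

the design choice worth noting: the report is built from verdicts, not from findings, so an unverified finding has no path into the output.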

tiered harness construction. Provos describes his harness ladder as: single-function isolation harness → multi-component harness → full end-to-end VM validation. pwnkit already runs tier 3 for kernel crashes — the QEMU validator inside pwnkit ingest --verify compiles reproducers inside the guest and watches for KASAN/UBSAN oopses. that path was built for kernel work. the same plumbing now needs to extend into source-code review of foundational C libraries (more on that below).

model-agnostic. he reroutes Anthropic identifiers to Z.AI through a LiteLLM gateway and ships GLM 5.1 end-to-end without changing IronCurtain. pwnkit already routes through OpenRouter and Azure OpenAI. running GLM 5.1 against XBOW is a config flag away — and we should do it, because “orchestration > frontier-model” is a thesis you have to keep proving.

where pwnkit is genuinely behind

four gaps. all four are now filed as issues on the repo.

1. append-only execution journal as source of truth (#224)

IronCurtain’s central architectural choice is that the Orchestrator agent does not read source code. it routes off an append-only execution journal. every specialist agent gets a fresh context window and rehydrates the slice of journal it needs. the journal is the source of truth; the model’s working memory is disposable.
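the shape described above can be sketched in a few lines. this is an illustration under assumptions, not IronCurtain's schema: the entry fields, the agent names, and the routing rule are all made up for the example.

```typescript
// Hypothetical entry shape, not IronCurtain's or pwnkit's actual schema.
type JournalEntry = {
  seq: number;
  agent: string;
  kind: "hypothesis" | "harness" | "result";
  body: string;
};

class Journal {
  private entries: JournalEntry[] = [];

  // Append is the only mutation: entries are frozen and never edited or removed.
  append(agent: string, kind: JournalEntry["kind"], body: string): JournalEntry {
    const entry: JournalEntry = Object.freeze({ seq: this.entries.length, agent, kind, body });
    this.entries.push(entry);
    return entry;
  }

  // A specialist rehydrates only the slice it needs, into a fresh context.
  slice(filter: (e: JournalEntry) => boolean): JournalEntry[] {
    return this.entries.filter(filter);
  }
}

// The router picks the next step from journal entries alone, never from source code.
function nextAgent(journal: Journal): string {
  const hypotheses = journal.slice((e) => e.kind === "hypothesis");
  const results = journal.slice((e) => e.kind === "result");
  if (hypotheses.length === 0) return "analyst";
  if (results.length < hypotheses.length) return "validator";
  return "reporter";
}
```

because the router's only input is the journal, a crashed run can resume by replaying the same entries, and two specialists can work from disjoint slices without sharing a context window.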

pwnkit’s loop in packages/core/src/agent/native-loop.ts carries investigation state in the conversation window. when context fills up, a summarizer kicks in and lossy-compresses. this caps the size of investigation we can run, makes recovery from mid-run failures lossy, and prevents clean parallelization.

this is the same architectural shape that drove BoxPwnr (0ca’s framework) to a 97.1% XBOW score: durable journal, fresh contexts per dispatch, strategic router that never reads the artifact directly. two independent groups have now validated the same answer. it’s the technique.

filed as #224. it’s a 3-5 day refactor with a benchmark validation cycle, gated behind features=journal-loop, and it has to clear our current XBOW BB tally on a 30-run pilot before flipping default. moving slowly here is the point.
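one way to express that gate, as a sketch. it assumes “clear” means matching or beating the baseline solve rate, which is my reading rather than something the issue specifies, and the type and function names are hypothetical.

```typescript
// Hypothetical gate; only the 30-run pilot size comes from the plan above.
type Pilot = { runs: number; solved: number };

function mayFlipDefault(baseline: Pilot, candidate: Pilot, minRuns = 30): boolean {
  // Refuse to decide on too little data or on an empty baseline.
  if (candidate.runs < minRuns || baseline.runs === 0) return false;
  // Compare solve rates rather than raw counts so unequal run counts stay comparable.
  return candidate.solved / candidate.runs >= baseline.solved / baseline.runs;
}
```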

2. YAML-defined FSM workflows (#225)

IronCurtain ships its workflows as plain YAML FSM definitions. one Orchestrator interprets the FSM. new workflows are contributed as YAML, not as TypeScript PRs against the loop driver.
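to show why that matters, here is a sketch of a workflow as data plus one generic interpreter. the object literal stands in for what a parsed YAML file might look like; the state names, event names, and field layout are assumptions, not IronCurtain's format.

```typescript
// Hypothetical workflow shape: roughly what a parsed YAML definition could look like.
type State = { agent: string; on: Record<string, string> };
type Workflow = { start: string; terminal: string[]; states: Record<string, State> };

const vulnDiscovery: Workflow = {
  start: "hypothesize",
  terminal: ["report", "discard"],
  states: {
    hypothesize: { agent: "analyst", on: { candidate: "validate", none: "discard" } },
    validate: { agent: "harness", on: { crash: "report", clean: "hypothesize" } },
  },
};

// Generic interpreter. `dispatch` stands in for invoking a specialist agent
// and returns the name of the event that agent produced.
function run(
  wf: Workflow,
  dispatch: (agent: string, state: string) => string,
  maxSteps = 20,
): string[] {
  const path = [wf.start];
  let current = wf.start;
  for (let i = 0; i < maxSteps && wf.terminal.indexOf(current) === -1; i++) {
    const state = wf.states[current];
    if (!state) throw new Error("undefined state: " + current);
    const event = dispatch(state.agent, current);
    const next = state.on[event];
    if (!next) throw new Error("no transition from " + current + " on " + event);
    current = next;
    path.push(current);
  }
  return path;
}
```

adding a second workflow means adding a second data object. the interpreter does not change, which is the property that lets third parties contribute workflows without touching the loop driver.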

pwnkit’s investigation flow is hard-coded in TS across playbooks.ts, prompts.ts, and the loop drivers. third parties can’t fork a workflow without forking the repo. that’s fine while the workflow is one thing (web pentest). it stops scaling the moment you want vuln-discovery, kernel-crash-triage, package-audit, code-review-c-cpp to live as four artifacts side-by-side.

filed as #225. depends on #224 — the Orchestrator from #224 is the FSM interpreter, the journal is its observation surface.

3. C/C++ source-code-review workflow (#226)

this is the one that actually moves the needle on coverage. Provos’s post is explicitly about a media framework and an integer-truncation flaw on a memory-allocation path in a foundational library. both are C/C++ memory-safety problems. neither is in scope for pwnkit scan (web/LLM) or pwnkit audit (npm/pypi/cargo packages).
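for readers who have not met the bug class: a generic illustration of integer truncation on an allocation path, simulated in TypeScript. this is not the flaw from Provos's post, whose details are not reproduced here. it only mimics C code that computes a size at full width and then passes it through a 32-bit parameter.

```typescript
// Generic illustration of the bug class, not the flaw from the post.
// Keeps only the low 32 bits, like an implicit cast to uint32_t in C.
function truncateToU32(n: number): number {
  return n >>> 0;
}

function allocationPlan(count: number, elemSize: number) {
  const requested = count * elemSize; // bytes the caller intends to write
  const allocated = truncateToU32(requested); // bytes a 32-bit size parameter actually receives
  return { requested, allocated, undersized: allocated < requested };
}
```

with count = 0x10000001 and 16-byte elements, the requested size is 2^32 + 16 bytes but the truncated value is 16. the buffer is far smaller than the write that follows, which is why a harness that drives the allocation path with large counts is the natural tier-1 target.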

pwnkit review exists, but it’s positioned and prompt-tuned for application-layer code review — type-safety bugs, auth gaps, business-logic flaws. it is not positioned as “give me a CVE in this widely-deployed C library.” the gap is positioning + workflow, not model capability.

filed as #226. delivers a new CLI route (pwnkit review --target c-library), a tier-1 libFuzzer harness scaffolder, a tier-2 multi-component linker helper, and tier-3 validation that reuses the kernel-crash QEMU plumbing. one synthetic reference target so the test suite can prove the workflow finds a bug autonomously. ships as a YAML-defined FSM workflow once #225 lands.

4. per-investigation cost reporting (#227)

Provos publishes hard $/investigation numbers on his marketing surface. ~$30 on Sonnet 4.6, ~$150 on Opus 4.6, ~$30-equivalent on GLM 5.1 at higher token volume. that number is now the metric defenders compare on. “how many libraries can I afford to audit per year” is a budget conversation, not a benchmark conversation.

pwnkit’s README leads with flag counts. the numbers are defensible (XBOW 97/104 BB, 101/104 WB, retained-artifact-backed), but flag counts don’t give a buyer the math they need to size a contract. we already track tokens and per-run cost in benchmark artifacts. we just don’t surface a single headline $/flag in the docs.

filed as #227. small, independent, lands first. centralizes pricing in one source-of-truth file (we already have a pricing table in packages/core/src/agent/cost.ts, just not surfaced consistently), extends the benchmark consolidator to emit $/run and $/flag per profile, adds a comparison table to the benchmark docs page.
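the arithmetic is simple enough to sketch. the record shape and the price fields below are hypothetical, and pwnkit's real benchmark artifacts and the table in cost.ts may look different; any prices plugged in are placeholders.

```typescript
// Hypothetical record shape; real benchmark artifacts may differ.
type RunRecord = { profile: string; inputTokens: number; outputTokens: number; flagCaptured: boolean };
// Prices in USD per million tokens.
type Price = { inputPerMTok: number; outputPerMTok: number };

function runCost(r: RunRecord, p: Price): number {
  return (r.inputTokens / 1e6) * p.inputPerMTok + (r.outputTokens / 1e6) * p.outputPerMTok;
}

function summarize(runs: RunRecord[], p: Price) {
  const total = runs.reduce((sum, r) => sum + runCost(r, p), 0);
  const flags = runs.filter((r) => r.flagCaptured).length;
  return {
    runs: runs.length,
    flags,
    costPerRun: runs.length > 0 ? total / runs.length : 0,
    // $/flag is undefined when nothing was captured; null avoids reporting Infinity.
    costPerFlag: flags > 0 ? total / flags : null,
  };
}
```

note that $/flag divides total spend, failed runs included, by captured flags. that is the number a buyer pays, so failed runs should not be dropped from the numerator.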

one note on the responsibility framing

Provos closes the post with an argument that’s worth quoting directly:

“Every defensive tool of the past 25 years (Metasploit, nmap, Burp Suite, AFL) faced the same debate, and the historical answer has been to put the tools in defender hands. On a local model, accountability rests directly with the researcher, as it has for those tools all along.”

we agree. pwnkit’s H1-readiness work in PR #206 (auth-header redaction, scope allowlist on PoC runtime, refusal to render advisories with empty PoCs, code-verified-by-pwnkit footer that only renders on reverify+canary success) is the same bet from the disclosure-pipeline side: defenders ship vulnerability tooling under accountability, not under a permission gradient gated by the vendor.
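a sketch of three of those checks working together, as an illustration rather than the code from PR #206. the header list, the field names, and the exact footer string are assumptions.

```typescript
// Hypothetical helpers; names and behavior are illustrative, not pwnkit's API.
const SENSITIVE_HEADERS = ["authorization", "cookie", "x-api-key"];

function redactHeaders(headers: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  Object.keys(headers).forEach((name) => {
    const sensitive = SENSITIVE_HEADERS.indexOf(name.toLowerCase()) !== -1;
    out[name] = sensitive ? "[REDACTED]" : headers[name];
  });
  return out;
}

type Advisory = {
  title: string;
  poc: string;
  headers: Record<string, string>;
  reverified: boolean;
  canaryHit: boolean;
};

function renderAdvisory(a: Advisory): string {
  // No PoC, no advisory.
  if (a.poc.trim().length === 0) throw new Error("refusing to render an advisory without a PoC");
  const safe = redactHeaders(a.headers);
  const lines = [a.title, ""];
  Object.keys(safe).forEach((name) => lines.push(name + ": " + safe[name]));
  lines.push("", a.poc);
  // The footer is earned: it renders only when both reverify and canary succeeded.
  if (a.reverified && a.canaryHit) lines.push("", "code-verified-by-pwnkit");
  return lines.join("\n");
}
```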

the H1 Code of Conduct already takes a hard line on AI-generated low-quality submissions — Final Warning on first offense, 12-month ban on second, permanent ban on third. that’s not a problem if your tooling produces verified PoCs and refuses to ship advisories without them. it’s a problem if your tooling auto-submits static-analysis output. pwnkit is built for the former. the orchestration layer that finds the bug and the disclosure layer that filters before submission are the same pipeline.

what ships first

#227 (cost telemetry) is independent and small. lands this week.

#226 (C/C++ workflow scaffold) ships in two slices: the new CLI route + tier-1 harness scaffolder land this week as a TypeScript playbook. once #224 + #225 land, it gets reframed as a YAML FSM workflow.

#224 (journal/orchestrator refactor) gets a design doc on the issue this week. implementation lands behind a feature flag with a 30-run pilot benchmark before the default flips. moving slowly here is the point — this is the load-bearing piece.

#225 (YAML FSM workflows) ships on top of #224 once the orchestrator is in place.

if you’ve been watching the IronCurtain release and wondering whether pwnkit is on the right track: yes, and the gap is execution, not architecture. the four issues above are the punch list.

—

pwnkit is open source under Apache 2.0. issues, PRs, and benchmark contributions are welcome at github.com/PwnKit-Labs/pwnkit.