The pwnkit blog.
Field notes on AI pentesting, agentic security, the XBOW benchmark, and the vulnerabilities autonomous AI agents find when you point them at real code. Built and written by the team behind pwnkit, the leading open-source AI pentest agent.
-
orchestration, not frontier — what the IronCurtain post means for pwnkit
Niels Provos shipped a vuln-discovery framework that replicates Mythos-class findings on commercial models — and one autonomous CVE on an open-weight model. it's the same bet pwnkit is built on. here's what we already do, what we have to borrow, and the four issues we just filed.
-
the moat that wasn't (and then was)
we A/B tested every layer of our own 11-layer triage stack. the first result said it was broken. the second said it was a tradeoff. the third said one layer was broken and the rest were fine. 39 benchmark runs, $300 in model spend, and three corrections in 48 hours.
-
the scoreboard was the bug
we thought we were debugging benchmark score drift. we were actually debugging evidence drift. the retained artifact-backed XBOW tally is now 99/104, the historical public line is 96/104, and the real work turned out to be separating those two truths.
-
deleting better-sqlite3 from pwnkit, and what it cost us
pwnkit 0.7.1 ships with zero native modules. the persistence layer was migrated from better-sqlite3 to a pure-wasm sqlite implementation. here's what broke, what we kept, and why every npx install on every node version now just works.
-
how to read an ai pentest benchmark leaderboard
the public xbow benchmark has 39 challenges that no longer build on upstream because of image rot. every ai pentest score you've seen is on a patched substrate. here's what 'we scored 96% on xbow' actually means, and what to ask before you trust the number.
-
the unsolved nine: one win, one honeypot, and a regression test that killed our hypothesis
an A/B sweep over the 9 challenges keeping pwnkit off 100% on XBOW. one new flag, one honeypot, and a same-day regression test that killed our 'lean scaffolding wins' hypothesis. why a single XBOW solve is an anecdote, not a benchmark.
-
introducing pwnkit cloud
an autonomous ai attacker on contract, pointed at your product. closed beta. by application only. founder-led from zürich.
-
pwnkit oss now outperforms commercial pentest teams
the open-source agent just crossed the line where it finds more real bugs on the public xbow benchmark than the pentest engagement you were about to pay for.
-
the attack surface XBOW and KinoSec don't test
traditional web vuln benchmarks miss the entire AI/LLM security attack surface. prompt injection, jailbreaks, MCP tool abuse — none of it shows up in XBOW's 104 challenges.
-
we built benchmarks for everything pwnkit does
five benchmark suites across web pentesting, LLM security, LLM safety, npm auditing, and network pentesting. here's what we learned.
-
pwnkit v0.4: shell-first pentesting, 27 XBOW flags, and the bug that broke everything
rebuilding pwnkit's agent architecture from structured tools to shell-first, cracking 27 XBOW benchmark challenges, and the serialization bug that was crashing the agent after 3 turns.
-
why we gave our agent a terminal instead of tools
we built 10 structured tools for web pentesting. then we gave the agent just curl and it outperformed everything.
-
what we learned running pwnkit against 104 CTF challenges
29 flags, a serialization bug, a 770-line prompt that didn't help, and why the model matters more than the framework.
-
running pwnkit against the XBOW benchmark
XBOW has 104 Docker CTF challenges covering traditional web vulns. here's how pwnkit performs against it.
-
100% on our AI security benchmark
10 challenges. 10 flags extracted. zero false positives. how pwnkit's agentic pipeline handles prompt injection, jailbreaks, SSRF, and multi-turn escalation.
-
why i built blind verification
every security scanner drowns you in false positives. it took three approaches before one of them actually worked.
-
why i built pwnkit
from disclosed vulnerabilities and manual pentesting to autonomous AI agents that re-exploit every finding to kill false positives.
-
how ai agents found vulnerabilities in popular npm packages
three weeks of pointing Claude Opus at npm packages produced 73 findings and disclosed vulnerabilities across widely-used packages with 55M+ combined weekly downloads. here's how the workflow actually works.
-
the age of agentic security
if AI agents can write 1,000 pull requests a week, AI agents should be testing 1,000 pull requests a week. the asymmetry is about to collapse.