The pwnkit blog.
Field notes on AI pentesting, agentic security, the XBOW benchmark, and the vulnerabilities autonomous AI agents find when you point them at real code. Built and written by the team behind pwnkit, the leading open-source AI pentest agent.
-
orchestration, not frontier — what the IronCurtain post means for pwnkit
Niels Provos shipped a vuln-discovery framework that replicates Mythos-class findings on commercial models — and one autonomous CVE on an open-weight model. it's the same bet pwnkit is built on. here's what we already do, what we have to borrow, and the four issues we just filed.
-
the moat that wasn't (and then was)
we A/B tested every layer of our own 11-layer triage stack. the first result said it was broken. the second said it was a tradeoff. the third said one layer was broken and the rest were fine. 39 benchmark runs, $300 in model spend, and three corrections in 48 hours.
-
the scoreboard was the bug
we thought we were debugging benchmark score drift. we were actually debugging evidence drift. the retained artifact-backed XBOW tally is now 99/104, the historical public line is 96/104, and the real work turned out to be separating those two truths.
-
deleting better-sqlite3 from pwnkit, and what it cost us
pwnkit 0.7.1 ships with zero native modules. the persistence layer was migrated from better-sqlite3 to a pure-wasm sqlite implementation. here's what broke, what we kept, and why every npx install on every node version now just works.
-
how to read an ai pentest benchmark leaderboard
the public xbow benchmark has 39 challenges that no longer build on upstream because of image rot. every ai pentest score you've seen is on a patched substrate. here's what 'we scored 96% on xbow' actually means, and what to ask before you trust the number.
-
the unsolved nine: one win, one honeypot, and a regression test that killed our hypothesis
an A/B sweep over the 9 challenges keeping pwnkit off 100% on XBOW. one new flag, one honeypot, and a same-day regression test that killed our 'lean scaffolding wins' hypothesis. why a single XBOW solve is an anecdote, not a benchmark.
-
introducing pwnkit cloud
an autonomous ai attacker on contract, pointed at your product. closed beta. by application only. founder-led from zürich.
-
pwnkit oss now outperforms commercial pentest teams
the open-source agent just crossed the line where it finds more real bugs on the public xbow benchmark than the pentest engagement you were about to pay for.
-
the attack surface XBOW and KinoSec don't test
traditional web vuln benchmarks miss the entire AI/LLM security attack surface. prompt injection, jailbreaks, MCP tool abuse — none of it shows up in XBOW's 104 challenges.
-
we built benchmarks for everything pwnkit does
five benchmark suites across web pentesting, LLM security, LLM safety, npm auditing, and network pentesting. here's what we learned.
-
pwnkit v0.4: shell-first pentesting, 27 XBOW flags, and the bug that broke everything
rebuilding pwnkit's agent architecture from structured tools to shell-first, cracking 27 XBOW benchmark challenges, and the serialization bug that was crashing the agent after 3 turns.
-
why we gave our agent a terminal instead of tools
we built 10 structured tools for web pentesting. then we gave the agent just curl and it outperformed everything.
-
what we learned running pwnkit against 104 CTF challenges
29 flags, a serialization bug, a 770-line prompt that didn't help, and why the model matters more than the framework.
-
running pwnkit against the XBOW benchmark
XBOW has 104 Docker CTF challenges covering traditional web vulns. here's how pwnkit performs against it.
-
100% on our AI security benchmark
10 challenges. 10 flags extracted. zero false positives. how pwnkit's agentic pipeline handles prompt injection, jailbreaks, SSRF, and multi-turn escalation.
-
why i built blind verification
every security scanner drowns you in false positives. it took three approaches before one of them actually worked.
-
why i built pwnkit
from disclosed vulnerabilities and manual pentesting to autonomous AI agents that re-exploit every finding to kill false positives.
-
how ai agents found vulnerabilities in popular npm packages
three weeks of pointing Claude Opus at npm packages produced 73 findings and disclosed vulnerabilities across widely-used packages with 55M+ combined weekly downloads. here's how the workflow actually works.
-
the age of agentic security
if AI agents can write 1,000 pull requests a week, AI agents should be testing 1,000 pull requests a week. the asymmetry is about to collapse.