
pwnkit oss now outperforms commercial pentest teams

the open-source agent just crossed the line where it finds more real bugs on the public xbow benchmark than the pentest engagement you were about to pay for.

picture a founder two weeks after a series A. the product is shipping, the SOC2 auditor is asking pointed questions, and the only security budget item so far was a burp scan a contractor ran six months ago. the quotes coming back from pentest shops are $25k for a week of work, a pdf at the end, and a three-month lead time before anyone even starts. meanwhile the product keeps shipping every day.

or picture a bug-bounty hunter on a sunday night staring at a scope of 200 npm packages, a private program in beta, and no idea which target to pull on first.

both of these people now have a tool that didn’t exist a month ago.

what pwnkit is

pwnkit is an open-source, autonomous AI pentesting agent. point it at a url, an api, an npm package, or a source tree — it runs reconnaissance, attacks, and validates every finding with a working proof of concept. no signatures, no regex rules, no dashboard tax.

the milestone

as of this week, pwnkit oss outperforms commercial pentest teams on the public xbow benchmark. xbow is the hardest open benchmark for ai pentesting agents: 104 real exploitation challenges across sqli, ssti, idor, ssrf, lfi, auth bypass, deserialization, and the rest of the owasp top 10 plus a long tail. not detection. actual flag extraction, end-to-end.

until recently, the top of that leaderboard was held by closed-source commercial stacks. pwnkit is now competitive with the best published single-model results, as an mit-licensed cli anyone can npm install tonight. the live numbers are maintained at docs.pwnkit.com/benchmark and update on every ci run.

the shorthand: the median mid-market pentest engagement is a week of human hours, a pdf, and a five-figure invoice. pwnkit gives an engineering team the same class of output — exploit-validated findings, not “possible issues” — on demand, for the cost of a few api calls. a category of work that used to be gate-kept behind procurement is now a command you type.

what you can actually scan with it

pwnkit runs across four surfaces out of the box:

  • web apps and apis: pwnkit scan --target https://your-app. handles auth, openapi specs, and stateful flows. the same engine that crosses xbow runs against production targets.
  • ai and llm apps: prompt injection, jailbreak probes, system-prompt extraction, pii leakage, mcp-based ssrf. the agent understands these because it is one.
  • npm packages: pwnkit audit --package <name>. the workflow behind the 7 cves published in popular npm packages earlier this year is the same engine, packaged.
  • source code: point it at a repo and it reads the code the way a human researcher would, except it doesn’t get tired on file 400.

these aren’t demo targets. the engine has produced cves in node-forge (32m weekly downloads), mysql2, jspdf, and liquidjs, all from the oss code path. the repo is at github.com/PwnKit-Labs/pwnkit and the docs are at docs.pwnkit.com.

how it actually works

most “ai security” tools are a wrapper around the scanners you already have, with a chatbot stapled on for the report. pwnkit is not that. it’s an agent loop that drives a shell directly.

the design choice that matters: one primary tool, bash. the model already knows curl, sqlmap, nmap, jq, and every other tool in a kali rootfs from its training data — there’s no schema to learn, no parameter translation, no tool-selection overhead. the agent reasons about a target, runs a command, reads the output, and iterates. exactly how a human would, except methodically, at every endpoint, without getting bored.
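
a minimal sketch of that reason, run, read, iterate loop, written in python for illustration. this is not pwnkit's actual code: `choose_next` is a hypothetical stand-in for the model call, `scripted_policy` is a toy replacement so the example runs offline, and the sketch uses the system default shell rather than a kali rootfs.

```python
import subprocess
from typing import Callable, List, Optional, Tuple

# One step of history: the command that ran and the text it produced.
Step = Tuple[str, str]


def run_shell(command: str, timeout: float = 30.0) -> str:
    """Run one shell command and return stdout followed by stderr."""
    try:
        proc = subprocess.run(
            command,
            shell=True,  # assumption: default /bin/sh; a real agent would pin bash
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return f"[timed out after {timeout}s]"
    return proc.stdout + proc.stderr


def agent_loop(
    choose_next: Callable[[List[Step]], Optional[str]],
    max_steps: int = 20,
) -> List[Step]:
    """Ask the policy for a command, run it, record the output, repeat.

    The loop stops when the policy returns None or max_steps is reached.
    """
    history: List[Step] = []
    for _ in range(max_steps):
        command = choose_next(history)
        if command is None:
            break
        history.append((command, run_shell(command)))
    return history


def scripted_policy(history: List[Step]) -> Optional[str]:
    """Hypothetical stand-in for the model: replays a fixed list of commands."""
    script = ["echo recon", "echo exploit"]
    return script[len(history)] if len(history) < len(script) else None
```

calling `agent_loop(scripted_policy)` returns the full transcript of commands and outputs. in a real agent the policy would be a model call that reads that transcript and decides what to try next, and `max_steps` is one simple way to keep a run bounded.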

the second design choice: every finding gets re-exploited in a separate blind verification pass before it’s reported. if the verifier can’t reproduce the bug from scratch, the finding is killed. no theoretical risks, no “possible sqli”, no vendor-style false-positive spam. working exploit or it didn’t happen.
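
the verification gate can be sketched the same way. this is an illustration under assumptions, not pwnkit's implementation: `Finding`, `verify_findings`, and `fake_verifier` are hypothetical names, and the verifier here is a toy function rather than a second agent run. the point it shows is that the verifier receives only the title and target, never the evidence from the attack pass, and anything it cannot reproduce is dropped.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Finding:
    title: str
    target: str
    claimed_evidence: str  # seen by the attack pass, deliberately hidden from the verifier


# A verifier gets only (title, target) and returns True if it re-exploited the bug.
Verifier = Callable[[str, str], bool]


def verify_findings(findings: List[Finding], verifier: Verifier) -> List[Finding]:
    """Keep only the findings the verifier reproduces on its own."""
    confirmed: List[Finding] = []
    for finding in findings:
        try:
            reproduced = verifier(finding.title, finding.target)
        except Exception:
            # Assumption: a verifier crash counts as "could not reproduce".
            reproduced = False
        if reproduced:
            confirmed.append(finding)
    return confirmed


def fake_verifier(title: str, target: str) -> bool:
    """Toy stand-in: pretend only the login endpoint is really exploitable."""
    return target == "https://app.example/login"
```

passing a confirmed sqli on the login endpoint and a "possible sqli" on a search endpoint through `verify_findings` with `fake_verifier` keeps the first and drops the second, which is the behavior described above: working exploit or it is not reported.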

the result is an agent that finds the bugs scanners miss, because it actually tries them.

what’s coming

the methodology is open and the roadmap is in the public repo. the near-term work is closing the remaining gap on the harder xbow categories (stateful multi-step chains, environment-dependent exploits), expanding the kali tool surface the agent can drive, and tightening cost controls so long runs stay predictable. none of that requires a new model. it’s engineering, and it’s happening in the open.

the longer-term direction is simple: if an open-source agent can already match commercial pentest output on a public benchmark, the gap closes further every month. the question is how much of traditional vendor security ends up as a command-line tool.

try it

npm install -g pwnkit-cli
pwnkit scan --target https://your-app

or clone the repo, read the source, file an issue, open a pr. full docs at docs.pwnkit.com. the benchmark numbers live at docs.pwnkit.com/benchmark and the source lives at github.com/PwnKit-Labs/pwnkit.

the week that used to cost $25k is a command now. that’s the update.