Let AI agents pentest your product
before attackers do.
The leading open-source adversarial testing tool for AI systems, web apps, code, and packages.
npx pwnkit-cli
Shell-first. Minimal tools. Real exploits.
Web Apps
SQLi, IDOR, SSTI, XSS, auth bypass, SSRF — 99/104 on XBOW
AI/LLM Apps
Prompt injection, jailbreaks, PII leakage, MCP tool abuse
npm Packages
Supply chain attacks, malware, CVEs, typosquatting
Source Code
White-box mode reads code before attacking
95.2% on retained XBOW artifacts.
99 of 104 challenges in the current retained artifact-backed aggregate on the community-patched fork. Full methodology and historical context live on the benchmark page.
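The headline percentage is plain arithmetic: solved challenges divided by total challenges. A minimal sketch, assuming one-decimal rounding; `passRate` is a hypothetical helper for illustration, not part of pwnkit:

```typescript
// Hypothetical helper: share of solved challenges as a percentage,
// rounded to one decimal place.
function passRate(solved: number, total: number): string {
  if (total <= 0 || solved < 0 || solved > total) {
    throw new RangeError("expected total > 0 and 0 <= solved <= total");
  }
  return ((solved / total) * 100).toFixed(1) + "%";
}

console.log(passRate(99, 104)); // 95.2%
```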
XBOW · 104 challenges
Published benchmark headline scores. The pwnkit row is the current retained artifact-backed aggregate, with the older historical publication line documented separately on the benchmark page.
- BoxPwnr (best-of-N): 97.1% (101 / 104). Best-of-N, 10+ configs.
- Shannon: 96.2% (100 / 104). Modified fork, white-box.
- pwnkit (retained artifacts): 95.2% (99 / 104). Open source, artifact-backed.
- KinoSec: 92.3% (96 / 104). Closed source.
- XBOW (own agent): 85.0% (88 / 104). Built for own benchmark.
| System | XBOW score | Maintained? | Comparable? | Notes |
|---|---|---|---|---|
| BoxPwnr (best-of-N) | 97.1% | Yes | No | Best-of-N across 10+ model+solver configs |
| Shannon | 96.15% | Yes | No | Modified hint-free fork + white-box source access |
| pwnkit (retained artifact aggregate) | 95.2% | Yes | No | 99/104 retained artifact-backed aggregate on 0ca/xbow-validation-benchmarks-patched · historical 91/96 publication line tracked separately · open source |
| KinoSec | 92.3% | Yes | No | Proprietary, closed source |
| XBOW (own agent) | 85% | Yes | No | Built by XBOW for their own benchmark |
| Cyber-AutoAgent | 85% | Archived Nov 2025 | Yes | Repo archived 2025-11-29; project is dead |
| BoxPwnr (single config) | ~80-82% | Yes | Yes | Apples-to-apples single-config baseline |
| deadend-cli | ~80% | Yes | Yes | Open source agent |
| MAPTA | 76.9% | Yes | Yes | Academic agent (arXiv:2508.20816) |
Comparable = standard 104-challenge XBOW with methodology stated explicitly in the row. Source access, best-of-N aggregation, modified forks, and closed-source constraints are called out directly so black-box and white-box results are not silently blended.
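To read the table as a like-for-like comparison, filter on the Comparable column before ranking anything. A small sketch, with the rows transcribed from the table; the `Row` shape and `comparableOnly` are hypothetical names used only for illustration:

```typescript
// Rows transcribed from the table; the Row shape itself is an assumption.
interface Row {
  system: string;
  score: string;
  comparable: boolean;
}

const rows: Row[] = [
  { system: "BoxPwnr (best-of-N)", score: "97.1%", comparable: false },
  { system: "Shannon", score: "96.15%", comparable: false },
  { system: "pwnkit (retained artifact aggregate)", score: "95.2%", comparable: false },
  { system: "KinoSec", score: "92.3%", comparable: false },
  { system: "XBOW (own agent)", score: "85%", comparable: false },
  { system: "Cyber-AutoAgent", score: "85%", comparable: true },
  { system: "BoxPwnr (single config)", score: "~80-82%", comparable: true },
  { system: "deadend-cli", score: "~80%", comparable: true },
  { system: "MAPTA", score: "76.9%", comparable: true },
];

// Keep only the rows the table marks as Comparable.
function comparableOnly(all: Row[]): Row[] {
  return all.filter((row) => row.comparable);
}

console.log(comparableOnly(rows).map((row) => row.system));
```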
Run it yourself: pnpm bench --agentic · Full benchmark writeup · Source
Built for builders.
One model. One command. Every layer open and inspectable.
Real exploits, not pattern matching
Every finding is independently re-exploited by a blind verify agent that never sees the original reasoning. If it can't be proven, it doesn't ship.
11-layer triage
Holding-it-wrong filter, per-class oracles, reachability gate, multi-modal cross-validation, adversarial debate. Every finding survives the gauntlet before you see it.
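One way to picture layered triage is as an ordered list of checks where a finding is dropped at the first stage that rejects it. The sketch below is an illustration only: the `TriageFinding` fields, the `Stage` shape, and the two example stages are assumptions named after layers listed above, not pwnkit's actual implementation.

```typescript
// Hypothetical shapes for illustration; pwnkit's real types may differ.
interface TriageFinding {
  id: string;
  oracleConfirmed: boolean; // assumed flag: a per-class oracle observed the effect
  reachable: boolean;       // assumed flag: the vulnerable path is reachable
}

interface Stage {
  name: string;
  pass: (finding: TriageFinding) => boolean;
}

// Two illustrative stages; a real pipeline would have more.
const stages: Stage[] = [
  { name: "per-class oracle", pass: (f) => f.oracleConfirmed },
  { name: "reachability gate", pass: (f) => f.reachable },
];

// Returns the name of the first stage that rejects the finding,
// or null if the finding survives every stage.
function triage(finding: TriageFinding, pipeline: Stage[]): string | null {
  for (const stage of pipeline) {
    if (!stage.pass(finding)) return stage.name;
  }
  return null;
}
```

Returning the rejecting stage's name, rather than a bare boolean, keeps an audit trail of why each finding was dropped.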
Apache 2.0
Read every line. Fork it. Vendor it. 421 tests, 44k lines of TypeScript, daily releases. No SaaS lock-in, no per-finding billing, no asterisks.
Target → Scan → Triage → Verify → Report
The same loop a human pentester runs — recon, attack, verify, report — except every finding is independently re-exploited before you see it.
- Target: npm package, or repo
- Scan: with bash, curl, sqlmap
- Triage: kills false positives
- Verify: by a second agent
- Report: or GitHub Issues
Just give it a target.
pwnkit-cli express                             # Audit an npm package
pwnkit-cli ./my-repo                           # Review source code
pwnkit-cli https://api.com/chat                # Scan an LLM API
pwnkit-cli https://example.com --mode web      # Pentest a web app
pwnkit-cli dashboard                           # Local mission control
pwnkit-cli findings list --severity critical   # Triage across scans
Auto-detects target type. No subcommands needed for most targets.
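Auto-detection of this kind can be pictured as a small classifier over the target string. The sketch below is a guess at the general idea, not pwnkit's actual logic; `detectTarget` and `TargetKind` are hypothetical names. Telling an LLM API apart from a web app would need more than the string itself, such as the --mode flag shown above.

```typescript
type TargetKind = "url" | "path" | "package";

// Hypothetical classifier: decides what a bare CLI argument most likely is.
function detectTarget(target: string): TargetKind {
  // Anything with an http(s) scheme is treated as a remote target.
  if (/^https?:\/\//i.test(target)) return "url";
  // Relative or absolute filesystem paths are treated as local source code.
  if (target.startsWith("./") || target.startsWith("../") || target.startsWith("/")) {
    return "path";
  }
  // Everything else is assumed to be an npm package name.
  return "package";
}

console.log(detectTarget("express"));              // package
console.log(detectTarget("./my-repo"));            // path
console.log(detectTarget("https://api.com/chat")); // url
```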
The CLI runs the scans. pwnkit-cli dashboard opens a local web UI for triage and evidence review.
Drop the GitHub Action into CI to push findings into GitHub's Security tab as SARIF.
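SARIF is a JSON format, so a finding ends up as one entry in a `results` array. The sketch below shows roughly what such a conversion could look like. The `SarifFinding` input shape and the severity-to-level mapping are assumptions; only the output field names follow SARIF 2.1.0.

```typescript
// Hypothetical input shape; not pwnkit's real finding type.
interface SarifFinding {
  ruleId: string;
  severity: "critical" | "high" | "medium" | "low";
  message: string;
  file: string;
  line: number;
}

// "error", "warning", and "note" are SARIF result levels;
// which severity maps to which level is an assumption.
function toLevel(severity: SarifFinding["severity"]): "error" | "warning" | "note" {
  if (severity === "critical" || severity === "high") return "error";
  if (severity === "medium") return "warning";
  return "note";
}

// Builds a minimal SARIF 2.1.0 log with a single run.
function toSarif(findings: SarifFinding[]) {
  return {
    version: "2.1.0",
    $schema: "https://json.schemastore.org/sarif-2.1.0.json",
    runs: [
      {
        tool: { driver: { name: "pwnkit" } },
        results: findings.map((f) => ({
          ruleId: f.ruleId,
          level: toLevel(f.severity),
          message: { text: f.message },
          locations: [
            {
              physicalLocation: {
                artifactLocation: { uri: f.file },
                region: { startLine: f.line },
              },
            },
          ],
        })),
      },
    ],
  };
}
```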
Multiple agents. One target.
Discover, Attack, Verify, and Report agents run in parallel against the same target. The Verify agent independently re-exploits every finding from scratch — if it can't reproduce the bug, the finding is killed before it reaches your inbox.
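The blind-verification idea can be sketched as a gate that hands the verifier only the claim, never the attacker's reasoning, and keeps a finding only if re-exploitation succeeds. All names here (`AttackFinding`, `BlindClaim`, `Verifier`, `verifyAll`) are hypothetical; how the real agents exchange data is not described on this page.

```typescript
// Hypothetical shapes; the real agents and their interfaces may differ.
interface AttackFinding {
  target: string;
  vulnClass: string;
  reasoning: string; // the attack agent's notes, withheld from the verifier
}

// What the verifier is allowed to see: the claim without the reasoning.
type BlindClaim = Omit<AttackFinding, "reasoning">;

// A verifier tries to reproduce the bug from the claim alone.
type Verifier = (claim: BlindClaim) => Promise<boolean>;

// Keeps only the findings the verifier could reproduce.
async function verifyAll(
  findings: AttackFinding[],
  verify: Verifier,
): Promise<AttackFinding[]> {
  const kept: AttackFinding[] = [];
  for (const finding of findings) {
    // Build the claim explicitly so the reasoning field never leaves this function.
    const claim: BlindClaim = { target: finding.target, vulnClass: finding.vulnClass };
    if (await verify(claim)) kept.push(finding);
  }
  return kept;
}
```

Constructing the claim field by field, instead of passing the finding through, makes it hard to leak the original reasoning by accident.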
Research first.
Receipts attached.
7 CVEs found in packages with 40M+ weekly downloads — including node-forge, mysql2, Uptime Kuma, LiquidJS, jsPDF, and picomatch. The product is downstream of that work, not the other way around.
One stack.
Four layers.
Detect, run continuously, prevent, respond. Each layer is independent and open source. Use one or use all four.
Start locally.
Scale when it matters.
Run pwnkit on your laptop or in CI today. When your system needs continuous adversarial testing on a schedule, upgrade to pwnkit cloud.
npx pwnkit-cli