Let AI agents pentest your product
before attackers do.
Open-source AI pentesting. Real exploits, no false positives.
npx pwnkit-cli
Built to attack four ecosystems.
One harness. Web, AI, packages, code.
Web apps
SQLi, IDOR, XSS, SSRF, auth bypass.
AI / LLM apps
Prompt injection, jailbreaks, PII leakage, MCP tool abuse.
npm packages
Supply chain, malware, CVE replay, typo-squats.
Source code
White-box analysis reads the repo before attacking.
95.2% on retained XBOW artifacts.
99 of 104 challenges. Methodology public.
XBOW · 104 challenges · published headline scores
- BoxPwnr (best-of-N) · 97.1% · 101 / 104
  best-of-N · 10+ configs
- Shannon · 96.2% · 100 / 104
  modified fork · white-box
- pwnkit (retained artifacts, open source) · 95.2% · 99 / 104
  open source · artifact-backed
- KinoSec · 92.3% · 96 / 104
  closed source
- XBOW (own agent) · 85% · 88 / 104
  built for own benchmark
| System | XBOW score | Maintained | Comparable | Notes |
|---|---|---|---|---|
| BoxPwnr (best-of-N) | 97.1% | Yes | No | Best-of-N across 10+ model+solver configs |
| Shannon | 96.15% | Yes | No | Modified hint-free fork + white-box source access |
| pwnkit (retained artifact aggregate) | 95.2% | Yes | No | 99/104 retained artifact-backed aggregate on 0ca/xbow-validation-benchmarks-patched · historical 91/96 publication line tracked separately · open source |
| KinoSec | 92.3% | Yes | No | Proprietary, closed source |
| XBOW (own agent) | 85% | Yes | No | Built by XBOW for their own benchmark |
| Cyber-AutoAgent | 85% | Archived Nov 2025 | Yes | Repo archived 2025-11-29; project is dead |
| BoxPwnr (single config) | ~80–82% | Yes | Yes | Apples-to-apples single-config baseline |
| deadend-cli | ~80% | Yes | Yes | Open source agent |
| MAPTA | 76.9% | Yes | Yes | Academic agent (arXiv:2508.20816) |
Comparable = standard 104-challenge XBOW, methodology stated. Modified forks and best-of-N runs are called out so results aren't silently blended.
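The headline percentages are solved challenges divided by the total. Here is a minimal sketch of that arithmetic, using the solved counts from the headline cards above. The helper names `pct` and `ranked` are illustrative and not part of pwnkit.

```python
def pct(solved: int, total: int) -> float:
    """Percentage of challenges solved, rounded to one decimal place."""
    return round(100 * solved / total, 1)


# (system, solved, total) as listed in the headline cards.
HEADLINE = [
    ("BoxPwnr (best-of-N)", 101, 104),
    ("Shannon", 100, 104),
    ("pwnkit (retained artifacts)", 99, 104),
    ("KinoSec", 96, 104),
    ("XBOW (own agent)", 88, 104),
]


def ranked(rows):
    """Return (system, percent) pairs sorted from highest to lowest score."""
    scored = [(name, pct(solved, total)) for name, solved, total in rows]
    return sorted(scored, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for name, score in ranked(HEADLINE):
        print(f"{name:<30} {score:5.1f}%")
```

Run this way, 88 / 104 works out to 84.6%, which the table shows rounded to 85%. The historical 91 / 96 line mentioned in the notes works out to 94.8%.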
Five phases. Same loop a human runs.
Recon to receipt. Every finding re-exploited before it ships.
Target
URL, package, or repo.
Scan
Shell-first agent loop.
Triage
11 layers kill false positives.
Verify
A second agent re-exploits, blind.
Report
SARIF, JSON, GitHub Security.
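The five phases above can be sketched as a small pipeline in which only findings that survive both triage and an independent re-exploit reach the report. This is an illustration under stated assumptions, not pwnkit's implementation: `Finding`, `run_pipeline`, `to_sarif`, and the `demo_*` stubs are hypothetical names, and the report is a minimal SARIF 2.1.0 document.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    rule_id: str    # hypothetical identifier, e.g. "sqli"
    message: str
    uri: str        # where the issue was observed
    verified: bool = False


def to_sarif(findings, tool_name="example-scanner"):
    """Build a minimal SARIF 2.1.0 log from verified findings."""
    rule_ids = sorted({f.rule_id for f in findings})
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [{
            "tool": {"driver": {"name": tool_name,
                                "rules": [{"id": r} for r in rule_ids]}},
            "results": [{
                "ruleId": f.rule_id,
                "level": "error",
                "message": {"text": f.message},
                "locations": [{"physicalLocation": {
                    "artifactLocation": {"uri": f.uri}}}],
            } for f in findings],
        }],
    }


def run_pipeline(target, scan, triage, verify):
    """Target -> Scan -> Triage -> Verify -> Report."""
    candidates = scan(target)                        # Scan
    plausible = [f for f in candidates if triage(f)]  # Triage
    confirmed = []
    for finding in plausible:                        # Verify: re-exploit
        if verify(target, finding):
            finding.verified = True
            confirmed.append(finding)
    return to_sarif(confirmed)                       # Report


# Hypothetical stubs standing in for real agents.
def demo_scan(target):
    return [Finding("sqli", "SQL injection in id parameter", target + "/item"),
            Finding("xss", "Reflected XSS in q parameter", target + "/search")]


def demo_triage(finding):
    return True  # keep everything in this toy example


def demo_verify(target, finding):
    return finding.rule_id == "sqli"  # pretend only one exploit reproduces


report = run_pipeline("https://example.test", demo_scan, demo_triage, demo_verify)
```

In this toy run the scan produces two candidates, but only the one that reproduces during verification appears in `report["runs"][0]["results"]`.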
Just give it a target.
Audit an npm package
Review source code
Scan an LLM API
Pentest a web app
Local mission control
Triage across scans
Auto-detects target type. Drop the GitHub Action into CI for SARIF output.
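As a rough idea of what target auto-detection could look like, here is a heuristic sketch. It is an assumption for illustration only: the function name, the returned labels, and the simplified npm name pattern are hypothetical, and pwnkit's real detection logic may differ. Telling an LLM API apart from an ordinary web app would need probing the endpoint, which this sketch does not attempt.

```python
import re
from urllib.parse import urlparse

# Simplified approximation of npm package naming rules (optionally scoped).
NPM_NAME = re.compile(r"^(?:@[a-z0-9~-][a-z0-9._~-]*/)?[a-z0-9~-][a-z0-9._~-]*$")


def detect_target_type(target: str) -> str:
    """Guess whether a target string is a web URL, a repo, or an npm package."""
    parsed = urlparse(target)
    if parsed.scheme in ("http", "https"):
        host = parsed.netloc.lower()
        if host in ("github.com", "www.github.com"):
            return "repo"           # hosted source repository
        return "web"                # any other HTTP(S) endpoint
    if target.startswith(("./", "../", "/")):
        return "repo"               # local checkout on disk
    if NPM_NAME.match(target):
        return "npm-package"
    return "unknown"


if __name__ == "__main__":
    for t in ("https://example.com/login", "https://github.com/owner/repo",
              "express", "@scope/pkg", "./my-app"):
        print(f"{t:<35} -> {detect_target_type(t)}")
```

A bare word such as `express` is ambiguous in practice (it could be a package name or a relative directory), so a real tool would likely check the filesystem or registry before deciding.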
Multiple agents. One target.
Discover, Attack, Verify, Report — running in parallel. Verify re-exploits blind.
deep · started 18m ago · gpt-5.4
Illustration. Real scans land in the dashboard.
Research first. Receipts attached.
The tool is downstream of the research.
One stack.
Four layers.
Detect, run, prevent, respond. Each layer independent, all open source.
Start locally.
Scale when it matters.
Local today. Cloud when it needs to run on a schedule.
npx pwnkit-cli