Let AI agents pentest your product
before attackers do.
Open-source AI pentesting. Real exploits, no false positives.
npx pwnkit-cli
Scan four ecosystems with one loop.
Web apps, LLM APIs, npm packages, and source — in parallel, every commit.
Illustration. Real scans land in the dashboard.
Find real CVEs in code that runs the internet.
Seven public advisories. The agent did the work.
95.2% on retained XBOW artifacts.
99 of 104 challenges. Methodology public.
XBOW · 104 challenges · published headline scores
- BoxPwnr · best-of-N · 97.1% · 101 / 104
  best-of-N · 10+ configs
- Shannon · 96.2% · 100 / 104
  modified fork · white-box
- pwnkit · retained artifacts · open source · 95.2% · 99 / 104
  open source · artifact-backed
- KinoSec · 92.3% · 96 / 104
  closed source
- XBOW · own agent · 85.0% · 88 / 104
  built for own benchmark
| System | XBOW score | Maintained | Comparable | Notes |
|---|---|---|---|---|
| BoxPwnr (best-of-N) | 97.1% | Yes | No | Best-of-N across 10+ model+solver configs |
| Shannon | 96.15% | Yes | No | Modified hint-free fork + white-box source access |
| pwnkit (retained artifact aggregate) | 95.2% | Yes | No | 99/104 retained artifact-backed aggregate on 0ca/xbow-validation-benchmarks-patched · historical 91/96 publication line tracked separately · open source |
| KinoSec | 92.3% | Yes | No | Proprietary, closed source |
| XBOW (own agent) | 85% | Yes | No | Built by XBOW for their own benchmark |
| Cyber-AutoAgent | 85% | Archived Nov 2025 | Yes | Repo archived 2025-11-29; project is dead |
| BoxPwnr (single config) | ~80–82% | Yes | Yes | Apples-to-apples single-config baseline |
| deadend-cli | ~80% | Yes | Yes | Open source agent |
| MAPTA | 76.9% | Yes | Yes | Academic agent (arXiv:2508.20816) |
Comparable = standard 104-challenge XBOW, methodology stated. Modified forks and best-of-N called out so results aren't silently blended.
Run the same loop a human pentester runs.
Recon to receipt. Every finding re-exploited before it ships.
Aim
URL, package, or repo.
Scan
Shell-first agent loop.
Triage
11 layers kill false positives.
Verify
A second agent re-exploits, blind.
Ship
SARIF, JSON, GitHub Security.
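For readers who have not seen SARIF, here is a minimal sketch of what the Ship step could emit. The `Finding` shape, the `toSarif` helper, and the tool name `example-scanner` are hypothetical and not taken from pwnkit; only the SARIF 2.1.0 field names (`version`, `runs`, `tool.driver`, `results`, `ruleId`, `level`, `message`, `locations`) come from the standard. The sketch assumes unverified findings are dropped before export, in line with "every finding re-exploited before it ships".

```typescript
// Hypothetical finding shape, for illustration only (not pwnkit's real schema).
interface Finding {
  id: string;
  title: string;
  severity: "low" | "medium" | "high";
  file: string;
  line: number;
  verified: boolean; // assumed to be set by the Verify step
}

// Map the assumed severities onto SARIF result levels.
const SARIF_LEVEL = { low: "note", medium: "warning", high: "error" } as const;

// Build a minimal SARIF 2.1.0 log containing only verified findings.
function toSarif(findings: Finding[]) {
  return {
    $schema: "https://json.schemastore.org/sarif-2.1.0.json",
    version: "2.1.0",
    runs: [
      {
        tool: { driver: { name: "example-scanner" } }, // placeholder tool name
        results: findings
          .filter((f) => f.verified)
          .map((f) => ({
            ruleId: f.id,
            level: SARIF_LEVEL[f.severity],
            message: { text: f.title },
            locations: [
              {
                physicalLocation: {
                  artifactLocation: { uri: f.file },
                  region: { startLine: f.line },
                },
              },
            ],
          })),
      },
    ],
  };
}

const sarifLog = toSarif([
  { id: "SQLI-1", title: "SQL injection in login", severity: "high", file: "src/auth.ts", line: 42, verified: true },
  { id: "XSS-1", title: "Possible reflected XSS", severity: "medium", file: "src/view.ts", line: 7, verified: false },
]);
console.log(JSON.stringify(sarifLog, null, 2));
```

Written to a `.sarif` file, a log of this shape is what GitHub code scanning accepts as an upload.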
Just give it a target.
Audit an npm package
Review source code
Scan an LLM API
Pentest a web app
Local mission control
Triage across scans
Auto-detects target type. Drop the GitHub Action into CI for SARIF output.
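Target auto-detection might look something like the sketch below. The function name `detectTargetType`, the `TargetType` labels, and every heuristic in it are assumptions made for illustration; pwnkit's actual detection logic is not described here and may differ.

```typescript
// Hypothetical sketch: names and heuristics are illustrative, not pwnkit's real logic.
type TargetType = "web-app" | "llm-api" | "npm-package" | "source";

// npm package names: optional @scope/ prefix, lowercase URL-safe characters.
const NPM_NAME = /^(@[a-z0-9][a-z0-9._-]*\/)?[a-z0-9][a-z0-9._-]*$/;

function detectTargetType(target: string): TargetType {
  const t = target.trim();

  if (/^https?:\/\//i.test(t)) {
    const url = new URL(t);
    // Assumption: a GitHub URL means "review this repository's source".
    if (url.hostname === "github.com") return "source";
    // Assumption: completion-style endpoints indicate an LLM API.
    if (/\/(chat\/)?completions\/?$/.test(url.pathname)) return "llm-api";
    return "web-app";
  }

  // Local paths are treated as source trees.
  if (t.startsWith("./") || t.startsWith("../") || t.startsWith("/")) return "source";

  if (NPM_NAME.test(t)) return "npm-package";

  throw new Error(`Cannot classify target: ${t}`);
}

console.log(detectTargetType("express"));
console.log(detectTargetType("https://example.com"));
console.log(detectTargetType("https://api.example.com/v1/chat/completions"));
console.log(detectTargetType("./my-repo"));
```

URL checks run first because a URL can never be a valid npm package name, so the cases stay unambiguous.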
Start locally.
Scale when it matters.
Local today. Cloud when it needs to run on a schedule.
npx pwnkit-cli