Let AI agents pentest your product
before attackers do.

Open-source AI pentesting. Real exploits, no false positives.

95.2% retained XBOW artifacts · 99 of 104
npx pwnkit-cli
GitHub

Built to attack four ecosystems.

One harness. Web, AI, packages, code.

web

Web apps

SQLi, IDOR, XSS, SSRF, auth bypass.

supported · 99 / 104 on XBOW
ai/llm

AI / LLM apps

Prompt injection, jailbreaks, PII leakage, MCP tool abuse.

supported · 6 attack categories
npm

npm packages

Supply chain, malware, CVE replay, typo-squats.

supported · package + transitive
code

Source code

White-box reads the repo before attacking.

supported · diff-aware on PRs

95.2% on retained XBOW artifacts.

99 of 104 challenges. Methodology public.

XBOW · 104 challenges · published headline scores

pwnkit (asterisked)
  • BoxPwnr (best-of-N)
    97.1% · 101 / 104

    best-of-N · 10+ configs

  • Shannon
    96.2% · 100 / 104

    modified fork · white-box

  • pwnkit (retained artifacts · open source)
    95.2% · 99 / 104

    open source · artifact-backed

  • KinoSec
    92.3% · 96 / 104

    closed source

  • XBOW (own agent)
    85.0% · 88 / 104

    built for own benchmark

Scroll to compare →

System · XBOW score (0–104) · Maintained · Comparable · Notes

BoxPwnr (best-of-N) · 97.1% · Maintained: Yes · Comparable: No · Best-of-N across 10+ model+solver configs
Shannon · 96.15% · Maintained: Yes · Comparable: No · Modified hint-free fork + white-box source access
pwnkit (retained artifact aggregate) · 95.2% · Maintained: Yes · Comparable: No · 99/104 retained artifact-backed aggregate on 0ca/xbow-validation-benchmarks-patched · historical 91/96 publication line tracked separately · open source
KinoSec · 92.3% · Maintained: Yes · Comparable: No · Proprietary, closed source
XBOW (own agent) · 85% · Maintained: Yes · Comparable: No · Built by XBOW for their own benchmark
Cyber-AutoAgent · 85% · Maintained: Archived Nov 2025 · Comparable: Yes · Repo archived 2025-11-29; project is dead
BoxPwnr (single config) · ~80–82% · Maintained: Yes · Comparable: Yes · Apples-to-apples single-config baseline
deadend-cli · ~80% · Maintained: Yes · Comparable: Yes · Open source agent
MAPTA · 76.9% · Maintained: Yes · Comparable: Yes · Academic agent (arXiv:2508.20816)

Comparable = standard 104-challenge XBOW, methodology stated. Modified forks and best-of-N called out so results aren't silently blended.

pnpm bench --agentic · Writeup · Source

Five phases. Same loop a human runs.

Recon to receipt. Every finding re-exploited before it ships.

01

Target

URL, package, or repo.

02

Scan

Shell-first agent loop.

03

Triage

11 layers kill false positives.

04

Verify

A second agent re-exploits, blind.

05

Report

SARIF, JSON, GitHub Security.
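A sketch of what a verified finding could look like as SARIF. The envelope follows the SARIF 2.1.0 spec; the rule ID, message text, and location are illustrative assumptions, not pwnkit's actual output.

```json
{
  "version": "2.1.0",
  "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
  "runs": [
    {
      "tool": {
        "driver": {
          "name": "pwnkit",
          "rules": [
            { "id": "idor", "shortDescription": { "text": "Insecure direct object reference" } }
          ]
        }
      },
      "results": [
        {
          "ruleId": "idor",
          "level": "error",
          "message": { "text": "Org-scoped IDOR on /api/v1/admin; re-exploited blind by the Verify agent" },
          "locations": [
            { "physicalLocation": { "artifactLocation": { "uri": "api/v1/admin" } } }
          ]
        }
      ]
    }
  ]
}
```

GitHub code scanning ingests this format directly, which is what makes the GitHub Security integration possible.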

Architecture

Just give it a target.

pwnkit-cli express

Audit an npm package

pwnkit-cli ./my-repo

Review source code

pwnkit-cli https://api.com/chat

Scan an LLM API

pwnkit-cli https://example.com --mode web

Pentest a web app

pwnkit-cli dashboard

Local mission control

pwnkit-cli findings list --severity critical

Triage across scans

Auto-detects target type. Drop the GitHub Action into CI for SARIF output.
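A minimal CI sketch of the SARIF flow, assuming the CLI can be invoked via npx in a workflow. The `--format`/`--output` flags are assumptions; check the repo for the real Action and its inputs. The upload step uses GitHub's standard `upload-sarif` action.

```yaml
# Illustrative workflow; CLI flags are assumptions, not documented options.
name: pwnkit
on: [pull_request]
jobs:
  scan:
    runs-on: ubuntu-latest
    permissions:
      security-events: write   # required to upload SARIF to GitHub Security
    steps:
      - uses: actions/checkout@v4
      - name: Run pwnkit
        run: npx pwnkit-cli . --format sarif --output pwnkit.sarif
      - name: Upload SARIF
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: pwnkit.sarif
```

Findings then appear in the repo's Security tab alongside other code-scanning alerts.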

Multiple agents. One target.

Discover, Attack, Verify, Report — running in parallel. Verify re-exploits blind.

target.example.com/api · web

deep · started 18m ago · gpt-5.4

running
  • enumerating /api/v1/* (auth scope: bearer) · 30s
  • httpx -path /api/v1/admin -mc 200,401,403 · 24s
  • /api/v1/admin · 401 (consistent with org-scoped IDOR) · 18s
  • drafting save_finding(severity=high, kind=idor) · 12s
  • pwnkit-verify --blind --finding f-2026-04-29-7c3e · 6s
  • verify · re-exploit succeeded · finding promoted · 0s
18m 42s · $0.082 · turn 12 of 16

Illustration. Real scans land in the dashboard.

Research first. Receipts attached.

The tool is downstream of the research.

400M+
weekly attack surface · 8 CVEs disclosed
node-forge 34M weekly · mysql2 9.5M weekly · Uptime Kuma 152M pulls · LiquidJS 1.6M weekly · jsPDF 13M weekly · picomatch 365M weekly (transitive) · paperclip 60k ★
Structural signals

Start locally.
Scale when it matters.

Local today. Cloud when it needs to run on a schedule.

npx pwnkit-cli