Let AI agents pentest your product
before attackers do.

Open-source AI pentesting. Real exploits, no false positives.

npx pwnkit-cli
GitHub

Scan four ecosystems with one loop.

Web apps, LLM APIs, npm packages, and source code, scanned in parallel on every commit.

Illustration. Real scans land in the dashboard.

Find real CVEs in code that runs the internet.

Seven public advisories. The agent did the work.

400M+
weekly downloads · all found by the agent
  • paperclip · 60k ★ · GHSA-47wq-cj9q-wpmp · critical
  • jsPDF · 13M weekly · CVE-2026-31938 · critical
  • node-forge · 34M weekly · CVE-2026-33896 · high
  • mysql2 · 9.5M weekly · high
  • LiquidJS · 1.6M weekly · CVE-2026-30952 · high
  • picomatch · 365M weekly · high
  • Uptime Kuma · 152M pulls · CVE-2026-33130 · medium

95.2% on retained XBOW artifacts.

99 of 104 challenges. Methodology public.

XBOW · 104 challenges · published headline scores

Legend: pwnkit · asterisked

  • BoxPwnr (best-of-N): 97.1% (101 / 104). Best-of-N, 10+ configs.
  • Shannon: 96.2% (100 / 104). Modified fork, white-box.
  • pwnkit (retained artifacts, open source): 95.2% (99 / 104). Open source, artifact-backed.
  • KinoSec: 92.3% (96 / 104). Closed source.
  • XBOW (own agent): 85.0% (88 / 104). Built for own benchmark.


| System | XBOW score | Maintained | Comparable | Notes |
|---|---|---|---|---|
| BoxPwnr (best-of-N) | 97.1% | Yes | No | Best-of-N across 10+ model+solver configs |
| Shannon | 96.15% | Yes | No | Modified hint-free fork + white-box source access |
| pwnkit (retained artifact aggregate) | 95.2% | Yes | No | 99/104 retained artifact-backed aggregate on 0ca/xbow-validation-benchmarks-patched · historical 91/96 publication line tracked separately · open source |
| KinoSec | 92.3% | Yes | No | Proprietary, closed source |
| XBOW (own agent) | 85% | Yes | No | Built by XBOW for their own benchmark |
| Cyber-AutoAgent | 85% | Archived Nov 2025 | Yes | Repo archived 2025-11-29; project is dead |
| BoxPwnr (single config) | ~80–82% | Yes | Yes | Apples-to-apples single-config baseline |
| deadend-cli | ~80% | Yes | Yes | Open source agent |
| MAPTA | 76.9% | Yes | Yes | Academic agent (arXiv:2508.20816) |

Comparable = standard 104-challenge XBOW, methodology stated. Modified forks and best-of-N called out so results aren't silently blended.
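
The headline percentages follow directly from the solve counts over the 104-challenge set. A minimal Python sketch for checking them; the counts are copied from the chart above, while `SOLVED` and `score_percent` are illustrative names and not part of pwnkit or its benchmark harness.

```python
TOTAL_CHALLENGES = 104  # size of the standard XBOW set, per the text above

# Solve counts as published in the chart above.
SOLVED = {
    "BoxPwnr (best-of-N)": 101,
    "Shannon": 100,
    "pwnkit (retained artifacts)": 99,
    "KinoSec": 96,
}


def score_percent(solved: int, total: int = TOTAL_CHALLENGES) -> float:
    """Return the solve rate as a percentage rounded to one decimal place."""
    if not 0 <= solved <= total:
        raise ValueError(f"solved must be between 0 and {total}, got {solved}")
    return round(100 * solved / total, 1)


for system, solved in SOLVED.items():
    print(f"{system}: {score_percent(solved)}% ({solved}/{TOTAL_CHALLENGES})")
```

Rounding to one decimal is why Shannon appears as 96.2% in the chart and 96.15% in the table: both come from 100 / 104.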

pnpm bench --agentic · Writeup · Source

Run the same loop a human pentester runs.

Recon to receipt. Every finding re-exploited before it ships.

1. Aim: URL, package, or repo.
2. Scan: Shell-first agent loop.
3. Triage: 11 layers kill false positives.
4. Verify: A second agent re-exploits, blind.
5. Ship: SARIF, JSON, GitHub Security.
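
The five steps above can be sketched as a small filter pipeline. This is an illustration under stated assumptions, not pwnkit's implementation: `Finding`, `run_loop`, `to_sarif`, and the stub scan, triage, and verify functions are all hypothetical, and "blind" is modeled here as the verifier receiving only the target and the claimed finding. The output shape follows the public SARIF 2.1.0 format.

```python
import json
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Finding:
    rule_id: str
    severity: str  # "critical", "high", "medium", or "low"
    message: str


# SARIF result levels are "error", "warning", "note", or "none".
SARIF_LEVEL = {"critical": "error", "high": "error", "medium": "warning", "low": "note"}


def run_loop(
    target: str,                                    # 1. Aim
    scan: Callable[[str], Iterable[Finding]],
    triage_layers: List[Callable[[Finding], bool]],
    reexploit: Callable[[str, Finding], bool],
) -> List[Finding]:
    candidates = list(scan(target))                 # 2. Scan
    for layer in triage_layers:                     # 3. Triage: each layer may drop findings
        candidates = [f for f in candidates if layer(f)]
    # 4. Verify: keep only findings an independent check can reproduce.
    return [f for f in candidates if reexploit(target, f)]


def to_sarif(findings: List[Finding], tool_name: str = "example-scanner") -> dict:
    """5. Ship: wrap confirmed findings in a minimal SARIF 2.1.0 document."""
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [{
            "tool": {"driver": {"name": tool_name}},
            "results": [
                {
                    "ruleId": f.rule_id,
                    "level": SARIF_LEVEL[f.severity],
                    "message": {"text": f.message},
                }
                for f in findings
            ],
        }],
    }


# Stub stages standing in for real agents (made-up data, for illustration only).
def demo_scan(target: str) -> List[Finding]:
    return [
        Finding("sqli-login", "high", f"SQL injection in login form at {target}"),
        Finding("banner-leak", "low", "Server banner disclosed"),
    ]


def not_informational(finding: Finding) -> bool:
    return finding.severity != "low"


def demo_reexploit(target: str, finding: Finding) -> bool:
    return finding.rule_id == "sqli-login"


confirmed = run_loop("https://example.com", demo_scan, [not_informational], demo_reexploit)
report = to_sarif(confirmed)
print(json.dumps(report, indent=2))
```

Keeping verification as a separate callable makes the design point explicit: a finding reaches the report only if something other than the original scanner can reproduce it.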

Architecture

Just give it a target.

pwnkit-cli express

Audit an npm package

pwnkit-cli ./my-repo

Review source code

pwnkit-cli https://api.com/chat

Scan an LLM API

pwnkit-cli https://example.com --mode web

Pentest a web app

pwnkit-cli dashboard

Local mission control

pwnkit-cli findings list --severity critical

Triage across scans

Auto-detects target type. Drop the GitHub Action into CI for SARIF output.
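
How the CLI detects the target type is not described here. One plausible heuristic, shown purely as a sketch, is to classify by the shape of the argument; `classify_target` and its category names are hypothetical and not pwnkit's API. A URL alone cannot distinguish a web app from an LLM API, which is presumably why the web example above passes `--mode web` explicitly.

```python
from urllib.parse import urlparse


def classify_target(target: str) -> str:
    """Guess a target category from the argument's shape (assumed heuristic)."""
    parsed = urlparse(target)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"            # web app or LLM API; needs a probe or a mode flag
    if target.startswith(("./", "../", "/")):
        return "source"         # local path to a repository
    return "npm-package"        # bare name such as "express"


for example in ("express", "./my-repo", "https://api.com/chat"):
    print(f"{example} -> {classify_target(example)}")
```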

Start locally.
Scale when it matters.

Local today. Cloud when it needs to run on a schedule.

npx pwnkit-cli