Predator: An Autonomous Offensive-Security Harness for Reaper

01 Abstract

Predator is the autonomous orchestration harness that drives Reaper. It runs black-box reconnaissance, vulnerability detection, exploitation, multi-step chaining, and flag capture with no human in the loop. This report documents Predator's final run against every public XBOW validation benchmark: a single autonomous pass over all 104 apps, scored under deliberately conservative rules.

On that single pass Predator captured 85 of 104 flags, 81.7%. Counting any benchmark Predator has ever captured across runs, the best-of figure is 96 of 104, 92.3%. We lead with the single-pass number on purpose, because it is what one autonomous run actually produces; best-of is a more generous measure and is labeled as such throughout. Of the benchmarks that built and ran on this pass, Predator captured 85 of 87, which is 97.7%.

This pass ran the full Predator system: the orchestration harness driving the model powering Ares. That pairing is the subject of the design thesis below, where the harness is the constant and the model is swappable. A flag counts only when the real planted string is exfiltrated into a response or finding, never on a reflected payload, so detection is never scored as success.

XBOW publishes the benchmark as 104 original web-security challenges built to evaluate offensive tools and agents across realistic vulnerability classes. Each challenge carries a hidden flag objective rather than a theoretical, finding-only score, and XBOW's own release notes emphasize reproducibility, novelty, and exploit validation as the purpose of the suite.

The harness accrues value; the model is swappable. Predator is that harness.

That thesis is why the harness, not any single model, is the durable asset. In this run the harness drove the model powering Ares through planning, chaining, and converting findings into captured flags across the suite. Holding the harness fixed and changing the model underneath it is what isolates the contribution of each layer, and is the basis for the runs that follow this one.

02 What the harness does

An autonomous capture loop, not a scanner

A scanner reports a finding and stops. Predator treats the finding as the start of an exploit path. For each target it performs reconnaissance, infers authentication state, exercises the vulnerability, chains steps where a single bug is not enough, and proves impact by extracting the planted flag. Findings that cannot be converted into a captured flag are not counted as wins.

03 Methodology

How the run was scored

Predator completed one full-suite pass across all 104 public XBOW validation benchmarks using its standard xbow_runner.py --redo harness, with out-of-band callback support enabled. Each benchmark is a CTF-style docker-compose web app that plants a real FLAG{...} string. This pass ran the full system, the harness driving the model powering Ares, so the numbers reflect the two together.

All 104 benchmark apps were attempted; each was built and run through Docker Compose, then taken through recon, browser crawl, detection, exploitation, chaining, and generic per-class exfil playbooks.
A benchmark counts as captured only when the actual planted flag string, read from the container's on-disk flag, env, and sha256, appears in a response or finding. A reflected payload earns no credit.
Single-pass is one --redo run, the headline number. Best-of is any capture for a benchmark across all runs, deduplicated, and is always labeled as the more generous measure.
Build and run errors on this pass stay in the all-possible denominator. They split into transient docker networking and port-bind failures, scan timeouts beyond 900 seconds on slow endpoints, and two legacy images with rotted build environments.
All playbooks are generic techniques audited for hard-coded paths or per-target answers. Build-environment repairs, such as apt or base-image bumps, are kept separate from the vulnerability surface.
The scorecard groups each benchmark under a single primary tag, so category totals sum to exactly 104 and nothing is counted twice.

Verify it yourself

The suite is public. The benchmarks are XBOW's open validation set at github.com/xbow-engineering/validation-benchmarks, 104 docker-compose apps anyone can clone and run.

The scoring rule is one line, and it is strict: a benchmark counts only when the real planted FLAG{...} string, read from the container's on-disk flag, env, and sha256, is exfiltrated into a response or finding. A reflected payload, a detected vulnerability, or a plausible-looking guess earns nothing.

Every one of the 104 outcomes, captured, errored, or uncaptured, is listed by ID in the appendix at the end of this report. Clone the suite, run it, and check the rows against your own results.

04 Overall results

From suite to capture

The path from the full suite to a captured flag has three gates on a single pass. Predator clears them without benchmark-specific answers hard-coded into the harness.

Total benchmarks104

The complete public XBOW validation suite.

Built and ran this pass87

17 errored on this pass: 8 transient docker, 7 scan timeouts past 900s, 2 legacy images that no longer build.

Flags captured, single pass85

85 of the 87 that ran, which is 97.7%. A flag counts only when the real planted string is exfiltrated.

81.7%

Single pass, all possible

85 of 104 on one autonomous pass, with every error left in the denominator. The headline number.

97.7%

Of what ran this pass

85 of the 87 benchmarks that actually built and ran. Pure capability with transient infrastructure removed.

92.3%

Best-of, ever captured

96 of 104 captured in some run, deduplicated. A more generous measure, shown for context.

Of the 17 errors on this pass, 11 were benchmarks Predator had captured on other runs, so they are transient rather than capability gaps. Setting aside the 2 legacy images that no longer build, the capture rate on buildable targets is 85 of 102, which is 83.3% on this pass and 96 of 102, which is 94.1% best-of. The single-pass figure stays the headline; the rest is context.

The figures above reflect one representative single pass.

05 Category scorecard

Where the flags came from

Every benchmark, grouped under a single primary vulnerability tag and ordered by volume, so the rows sum to exactly 104. Each bar's full width is every flag in that category. The crimson share is what Predator captured on this single pass, the amber share is captured on another run but missed this pass to a transient error, and the gray share was never captured. Counts for this pass, best-of, and the category total are printed beside every bar.

Captured this pass Captured on another run Never captured

Capture share →

Pass

Best

Total

06 By difficulty

It holds up at the hard end

XBOW labels each challenge level 1 through 3. A tool that only clears level 1 is pattern matching; one that clears level 3 is reasoning through a real exploit path. Predator's capture rate stays roughly flat across all three levels, and the hardest tier is captured at the same rate as the easiest.

Level 1

38 / 45

single pass · 41 best-of

Level 2

40 / 51

single pass · 48 best-of

Level 3

7 / 8

single pass and best-of

Level 3 best-of is 7 of 8, which is 87.5%, essentially the same as level 1. That is the answer to "you just farmed the easy ones": the hardest tier is captured at the same rate as the easiest. The single level-3 benchmark that errored this pass is a transient docker failure, not a capability ceiling.

07 Capability highlights

Strong where chaining matters

Predator performed best on categories that require more than detection: an end-to-end path from a reachable surface to a proven, flag-bearing exploit. Several categories were captured in full. The single-pass figure is lower than best-of only where a transient error blocked a benchmark Predator captures reliably.

Cross-site scripting · 23 / 23 best-of, 17 this pass
Every XSS benchmark captured across runs. The six not captured this pass were transient docker failures and timeouts, not misses.
IDOR · 12 / 12 best-of, 11 this pass
Object-reference abuse across authenticated state, including MongoDB ObjectId forging and authenticated walks.
Privilege escalation · 9 / 9, every pass
Login-bypass battery, mass assignment, and broken-authorization flows, captured on the single pass as well.
Command injection · 8 / 8 best-of, 7 this pass
Output-shaping, Struts OGNL and SpEL injection, and blind out-of-band execution.
SQL injection · 4 / 4, plus 3 / 3 blind best-of
Error-based, boolean and time-blind, UNION, and second-order auth bypass into a login chain.
XXE · 3 / 3
Including authenticated SOAP and upload-to-XXE paths.
SSTI · 11 / 13 best-of
Multi-engine template injection across Jinja2, Twig, Smarty, FreeMarker, Velocity, and ERB, with two cases still open.

The playbooks behind these captures are generic techniques, audited for hard-coded paths and per-target answers. They include a JSFuck-style XSS encoder with onfocus bypasses and an out-of-band victim bot; a php filter-chain generator with log poisoning and alias traversal for file read; JWT tampering, TOCTOU race handling, and a crypto cookie oracle for authentication and privilege escalation; and out-of-band collaborator and phar deserialization for SSRF, XXE, and object injection. Predator chains these without looking up any benchmark's intended solution.

08 Room to improve

Where flags are still on the table

Best-of leaves eight benchmarks uncaptured. They separate cleanly into genuine capability gaps, a few remaining targets, and pure infrastructure, and we keep them apart rather than blurring them together.

Built, ran, no flag · 2 · XBEN-092-24, XBEN-098-24
The clearest true gaps. XBEN-092-24 is a phar deserialization whose gopher trigger is blocked in this environment and needs an alternate path; XBEN-098-24 is a layered LFI and disclosure chain that did not convert.
Never captured across any run · SSTI x2, smuggling_desync, nosqli
Two SSTI variants, the single request-smuggling benchmark, and the single NoSQL injection benchmark remain open targets for new generic playbooks.
Legacy images that no longer build · 2 · XBEN-026-24, XBEN-100-24
Environment rot in old base images, not a capability question. Fixed by repairing the build environment, which we keep separate from the vulnerability surface.
Single-pass flakiness · 8 docker, 7 timeouts
Eleven of the 17 errors this pass were benchmarks captured on other runs. They cost single-pass score, not capability. Lower docker flakiness and a tighter timeout budget recover most of them.

The roadmap is specific: add an alternate trigger for phar and object deserialization when gopher is unavailable, build playbooks for the open SSTI, smuggling, and NoSQL cases, repair the two legacy images, and reduce docker and timeout flakiness so the single-pass number converges on best-of. None of this involves benchmark-specific answers.

09 Public comparison context

Context, not a head-to-head

Read this first These figures are not an apples-to-apples comparison. Public reports differ in runtime limits, infrastructure, allowed tooling, retries, and whether build failures are counted at all. Predator's primary bar is the single-pass, all-possible figure, with every error left in the denominator. The dashed marker is best-of, a more generous metric. Treat the other systems as context, not as a controlled measurement against Predator.

Xfenser AI

88.5%

XBOW official baseline

85%

Cyber-AutoAgent

84.6%

Predator, single pass

81.7%

MAPTA

76.9%

On a single autonomous pass Predator's 81.7% sits above MAPTA's reported 76.9% and a few points under the XBOW baseline and the two highest community reports. On buildable targets the single-pass figure is 83.3%, which is roughly level with those baselines. The dashed marker shows best-of at 92.3%, which would top the table; we do not claim a win there, because best-of counts the best result per benchmark across multiple runs and is more generous than what most of these public numbers appear to report. The honest read is that Predator is now competitive with the strongest public figures on a single pass and ahead of the closest peer system. None of these comparisons are controlled, and the public methodologies around retries and build handling are not fully disclosed, so the single-pass number is the one to hold onto.

Of these systems, MAPTA is the closest comparator: same 104-scenario benchmark, a multi-agent design, and the same refusal to score detection as success. It also publishes per-category results, which makes a real head-to-head possible. The next section does that, including the one category where Predator still trails.

10 Predator vs MAPTA

The closest comparator, category by category

MAPTA is a multi-agent autonomous web-penetration system published in August 2025 (David et al., arXiv:2508.20816). It separates a coordinator agent, sandbox executors that share a per-assessment Docker container, and a dedicated validation agent that confirms exploits before anything is reported. On the 104-scenario XBOW benchmark it reported a 76.9% overall success rate, and it published per-category figures and operational cost data.

Shared design philosophy

Both systems reject detection as a result. MAPTA enforces this with a validation agent that turns a proof-of-concept into a repeatable exploit before reporting it. Predator enforces it with capture-only scoring: a benchmark counts only when the planted flag is exfiltrated. Two teams, the same conclusion about what a real result is.

Where they differ

MAPTA disclosed cost and efficiency that this Predator run did not: roughly $0.21 per challenge on average and a 143.2-second median solve time, with early-stopping heuristics tied to tool-call budgets. Predator's run instead used a fixed 900-second cap and counted every timeout as a miss. Different priorities, both defensible.

The chart below lines up the six categories both systems report cleanly. Each Predator bar is shown two ways: the solid bar is the single-pass capture rate for that category, on the same all-challenges basis MAPTA reports; the faded extension shows best-of, the rate once any capture across runs is counted. MAPTA appears as a single steel bar. Denominators and run conditions are not identical, so read this as the fairest available alignment, not a controlled match.

Predator, single pass Predator, best-of MAPTA, reported success

What the head-to-head shows

This run flips the picture from earlier comparisons. On a single pass Predator now leads MAPTA on four of the six shared categories, ties on SSTI, and trails on only one. The honest reading has four parts.

XSS reverses. · Predator 73.9% single, 100% best-of, MAPTA 57%
Once a shared weakness, XSS is now a Predator strength. Every XSS benchmark is captured best-of, and the single-pass figure already clears MAPTA's reported rate.
Command injection and SQL injection lead. · cmd 87.5% single vs 75%, SQLi 100% vs 83%
Both ahead of MAPTA on the single pass, and both reach full capture best-of.
Blind SQL injection stays a standout. · Predator 66.7% single, 100% best-of, MAPTA 0%
MAPTA's own paper records 0% on blind SQLi and names it a weakness. Predator captures all three best-of.
SSRF is the one trailing category. · Predator 33.3% single, 66.7% best-of, MAPTA 100%
MAPTA captures every SSRF case; Predator reaches two of three best-of and one on this pass. This is the clearest place MAPTA is still ahead, and we say so.

MAPTA also reports 83% on broken authorization as a single grouped class. Predator's nearest primary tags are reported separately: IDOR at 12 of 12 best-of, privilege escalation at 9 of 9, so they are noted here rather than forced onto the same axis. Predator's leads are real on this run, but all of these comparisons share the same caveat: different denominators, undisclosed retry and build policies, and no controlled head-to-head.

11 Predator and the frontier models

Models, not rival harnesses

XBOW is model-agnostic by design and evaluates frontier models by running them inside its own agents. In April 2026 it published its read on OpenAI's GPT-5.5, and in May 2026 its read on Anthropic's Mythos Preview. Both belong next to this report, but as a layer comparison, not a scoreboard. XBOW measured models; this report measures a harness, and they were run on different benchmarks.

Read this first XBOW evaluated GPT-5.5 and Mythos Preview as models on its internal benchmark of frozen open-source applications, where the primary metric is miss rate, the share of known vulnerabilities the model fails to find while driving XBOW's agents. Predator's numbers in this report come from the public 104-challenge XBOW validation suite, scored on flag exfiltration. Different benchmarks, different metrics, and models versus a harness. A model's miss rate is not Predator's capture rate; we do not place them on the same axis or invent a shared number.

What XBOW measured

Frontier models as engines inside XBOW's agents. On their internal benchmark XBOW reported GPT-5.5 cutting the miss rate to 10%, down from GPT-5 at 40% and Opus 4.6 at 18%, the best they had seen. Mythos Preview cut missed vulnerabilities by about 42% against Opus 4.6, and about 55% when handed source code. Both are strongest at reading code.

What this report measures

Predator as a complete harness driving the model powering Ares. It captured 85 of 104 public benchmarks on a single pass by orchestrating recon, exploitation, chaining, and flag capture against live targets. The two sit at different layers. A model like GPT-5.5 or Mythos is something a harness like Predator mounts, not something it races.

What XBOW reported about GPT-5.5

On XBOW's internal benchmark, GPT-5.5 reached a 10% miss rate, their best result to date, versus GPT-5 at 40% and Opus 4.6 at 18%. Read carefully: that is find-or-miss of a known vulnerability inside XBOW's harness, not Predator's flag-capture rate on the public suite.
Black box without source code already beat GPT-5 working with source code, and in white box testing XBOW described the model as effectively saturating their benchmark.
On computer-use tasks it scored 97.5% on visual acuity, near the best XBOW had seen, logged into targets in roughly half the iterations of the next best model, and failed faster when blocked.
It also over-persisted on hopeless paths about half as often as previous GPT versions or Opus, which XBOW framed as a practical, not just capability, gain.

What XBOW reported about Mythos Preview

It reduced missed vulnerabilities by roughly 42% versus Opus 4.6, and roughly 55% with source code supplied, with its standout strength in reading code, native-code discovery, and reverse engineering.
On XBOW's command-safety judgment test it scored 77.8%, below Opus 4.6 at 81.2% and Haiku 4.5 at 90.1%, favoring the letter of a rule over its spirit.
Anthropic indicated Mythos would price around five times an Opus model. XBOW found it powerful but not best-in-class once accuracy is normalized by cost, and an early-access preview not yet on public APIs.

XBOW runs every one of these models inside its own harness and picks the best one per job. Its verdict on Anthropic's most capable model is that it is a brain without a body. That is exactly Predator's architecture: the harness is the body, the model is swappable, and the harness is the constant that turns model capability into proven, flag-bearing exploits.

The accurate reading is complementary, not competitive. A stronger model, GPT-5.5 or a Mythos-class model, raises the ceiling on candidate discovery; a harness like Predator converts candidates into validated, flag-bearing exploits against live systems, with a model that is swappable underneath it. Both XBOW evaluations are independent third-party support for the case this report makes: the harness layer is where durable capability accrues. A natural next step is to hold the harness fixed and change the model underneath it, and we will report any such run the same way, single pass first.

12 Fairness notes

Why this score is conservative on purpose

Several choices in this run push the headline down rather than up. We list them so the score can be read for what it is.

The headline is single-pass, one autonomous run. Best-of, which counts any capture across runs, is always labeled as the more generous measure and never used as the headline.
Every build and run error stays in the all-possible denominator, including scan timeouts past 900 seconds and the two legacy images that no longer build.
Eleven of the 17 errors on this pass were benchmarks captured on other runs, so they are transient. We report them as errors anyway rather than quietly removing them.
Detection is never treated as success. A flag counts only when the real planted string is exfiltrated, never on a reflected payload.
Categories use a single primary tag, so the rows sum to exactly 104 and no benchmark is counted twice. This is a deliberate change from looser multi-tag accounting that can overstate totals.
Public competitor scores are included only as context, never as controlled head-to-head measurements.

A more permissive scoring policy would raise every figure in this report. We chose not to apply one, because an offensive-security harness sold into federal and enterprise programs is judged on what it proves, not on what it claims.

13 What we claim, and what we don't

Reading this before you argue with it

Benchmark reports invite a predictable set of objections. Rather than wait for them, here is the boundary in plain terms: the claims this report stands behind, and the claims it deliberately does not make.

What we are not claiming

Don't read these in

That Predator beats GPT-5.5, Opus 4.8, MAPTA, or any model or system. Those were measured on different benchmarks, and a model and a harness are different layers.
That 92.3% is the headline. Best-of is the generous figure and is labeled as such everywhere. The single-pass 81.7% leads.
A controlled model-versus-model result. We did not run that comparison, and we do not imply one.
That the suite is solved. Eight benchmarks are uncaptured, named by ID in the appendix.
That the frontier models discussed are shipping products. They were preview or early-access at the time of XBOW's evaluations.
That the harness ran without a model. This run drove the model powering Ares; a model-free run is separate work and is not reported here.

What this report claims

On the record

On one fully autonomous pass, with no human in the loop, Predator captured 85 of 104, and 85 of the 87 that built and ran, which is 97.7%.
Every build and run error stays in the all-possible denominator; the headline is not inflated by dropping failures.
The playbooks are generic across 22 vulnerability classes, audited against hard-coded paths and per-target answers.
Capture rate holds across all three difficulty levels, the hardest tier included.
Scoring is flag-string exfiltration, fully automated, on a public suite anyone can clone and rerun.

If a critique lands in the left column, we agree with it in advance. If it targets the right column, the suite is public and the per-benchmark results are printed below; the fastest rebuttal is to run it.

14 Bottom line

A strong result, honestly bounded

On a single autonomous pass, Predator captured 85 of 104 flags, which is 81.7%, and 85 of the 87 that built and ran, which is 97.7%. Counting any capture across runs, best-of reaches 96 of 104, which is 92.3%. The strongest categories are the ones that demand an end-to-end exploit path rather than a signature: cross-site scripting, command injection, SQL injection and blind SQL injection, IDOR, privilege escalation, and XXE, several of them captured in full. Capture rate holds across all three difficulty levels, including the hardest.

The remaining gap is narrow and named. Two benchmarks built and ran without yielding a flag, a handful of categories have open targets in SSTI, request smuggling, and NoSQL injection, two legacy images no longer build, and single-pass flakiness costs a few points that best-of recovers. That gives Assail a precise roadmap: alternate deserialization triggers, new playbooks for the open classes, repaired build environments, and lower docker and timeout flakiness so the single-pass figure converges on best-of. None of it requires benchmark-specific answers.

The harness drove the model powering Ares to capture most of the suite on one pass. The model is swappable; the harness is the constant. That is where the value accrues.

15 Appendix: per-benchmark results

Every benchmark, on the record

All 104 benchmarks, with difficulty, tags, and outcome. The status dot is this single pass: crimson captured, amber a build or run error, gray ran without a captured flag. The best-of column marks any capture across runs. Full disclosure is part of the point; nothing is hidden behind an aggregate.

Captured this pass Build or run error Ran, no capture

ID	Lvl	Tags	This pass	Best-of