Assail Reaper, autonomous offensive-security agent
ASSAIL, INC. / TECHNICAL REPORT / XBOW VALIDATION SUITE / JUNE 2026

The harness for Reaper

Predator

An autonomous offensive-security harness, measured end to end against all 104 public XBOW web-security benchmarks.

85 / 104
Flags captured, single pass
81.7%
Single-pass capture rate
92.3%
Best-of, ever captured
The result to sit with

Predator captured 85 of 104 flags on a single fully autonomous pass, with no human in the loop.

That is one run end to end: reconnaissance, exploitation, multi-step chaining, and flag capture, driven by the harness and the model powering Ares with no operator steering it. Of the benchmarks that built and ran on this pass, Predator captured 85 of 87, which is 97.7%, and the capture rate holds across every difficulty tier, the hardest included. A flag counts only when the real planted string is exfiltrated, so detection is never scored as a win.

01  Abstract

Predator is the autonomous orchestration harness that drives Reaper. It runs black-box reconnaissance, vulnerability detection, exploitation, multi-step chaining, and flag capture with no human in the loop. This report documents Predator's final run against every public XBOW validation benchmark: a single autonomous pass over all 104 apps, scored under deliberately conservative rules.

On that single pass Predator captured 85 of 104 flags, 81.7%. Counting any benchmark Predator has ever captured across runs, the best-of figure is 96 of 104, 92.3%. We lead with the single-pass number on purpose, because it is what one autonomous run actually produces; best-of is a more generous measure and is labeled as such throughout. Of the benchmarks that built and ran on this pass, Predator captured 85 of 87, which is 97.7%.

This pass ran the full Predator system: the orchestration harness driving the model powering Ares. That pairing is the subject of the design thesis below, where the harness is the constant and the model is swappable. A flag counts only when the real planted string is exfiltrated into a response or finding, never on a reflected payload, so detection is never scored as success.

XBOW publishes the benchmark as 104 original web-security challenges built to evaluate offensive tools and agents across realistic vulnerability classes. Each challenge carries a hidden flag objective rather than a theoretical, finding-only score, and XBOW's own release notes emphasize reproducibility, novelty, and exploit validation as the purpose of the suite.

The harness accrues value; the model is swappable. Predator is that harness.

That thesis is why the harness, not any single model, is the durable asset. In this run the harness drove the model powering Ares through planning, chaining, and converting findings into captured flags across the suite. Holding the harness fixed and changing the model underneath it is what isolates the contribution of each layer, and is the basis for the runs that follow this one.
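The "harness fixed, model swappable" idea can be pictured as an interface boundary. The sketch below is illustrative only: `Model`, `Harness`, `EchoModel`, `UpperModel`, and `propose_next_step` are hypothetical names invented for this example, not Predator's actual code or API.

```python
from dataclasses import dataclass
from typing import Protocol


class Model(Protocol):
    """Hypothetical interface: anything that can propose the next action."""

    def propose_next_step(self, observation: str) -> str: ...


@dataclass
class Harness:
    """Hypothetical harness: owns the loop and treats the model as a plug-in."""

    model: Model
    max_steps: int = 3

    def run(self, target: str) -> list[str]:
        # The loop structure is fixed; only the model's proposals vary.
        observation = f"start:{target}"
        trace: list[str] = []
        for _ in range(self.max_steps):
            action = self.model.propose_next_step(observation)
            trace.append(action)
            observation = f"result-of:{action}"
        return trace


class EchoModel:
    """Stand-in model A (toy behavior, for illustration only)."""

    def propose_next_step(self, observation: str) -> str:
        return f"probe({observation})"


class UpperModel:
    """Stand-in model B (toy behavior, for illustration only)."""

    def propose_next_step(self, observation: str) -> str:
        return observation.upper()
```

Running `Harness(EchoModel()).run("t")` and `Harness(UpperModel()).run("t")` exercises the same loop with two different models, which is the kind of comparison the paragraph above describes.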

02  What the harness does

An autonomous capture loop, not a scanner

A scanner reports a finding and stops. Predator treats the finding as the start of an exploit path. For each target it performs reconnaissance, infers authentication state, exercises the vulnerability, chains steps where a single bug is not enough, and proves impact by extracting the planted flag. Findings that cannot be converted into a captured flag are not counted as wins.

03  Methodology

How the run was scored

Predator completed one full-suite pass across all 104 public XBOW validation benchmarks using its standard xbow_runner.py --redo harness, with out-of-band callback support enabled. Each benchmark is a CTF-style docker-compose web app that plants a real FLAG{...} string. This pass ran the full system, the harness driving the model powering Ares, so the numbers reflect the two together.

Verify it yourself

The suite is public. The benchmarks are XBOW's open validation set at github.com/xbow-engineering/validation-benchmarks, 104 docker-compose apps anyone can clone and run.

The scoring rule is one line, and it is strict: a benchmark counts only when the real planted FLAG{...} string, as read from the container's on-disk flag, environment, and sha256, is exfiltrated into a response or finding. A reflected payload, a detected vulnerability, or a plausible-looking guess earns nothing.

Every one of the 104 outcomes, captured, errored, or uncaptured, is listed by ID in the appendix at the end of this report. Clone the suite, run it, and check the rows against your own results.
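The scoring rule above can be sketched as an exact-match check. This is a minimal illustration, not Predator's scorer: the function name `is_capture` and the `FLAG_PATTERN` sanity check are assumptions made for the example, and it compares the string only, without the on-disk, env, or sha256 lookup.

```python
import re

# Assumed shape of a planted flag, based on the FLAG{...} form named in the text.
FLAG_PATTERN = re.compile(r"FLAG\{[^}]+\}")


def is_capture(planted_flag: str, evidence: str) -> bool:
    """Return True only if the exact planted flag appears in the evidence.

    `evidence` stands for a response body or finding text (hypothetical input).
    """
    # Refuse to score against a malformed or empty reference value.
    if not FLAG_PATTERN.fullmatch(planted_flag):
        raise ValueError("planted flag must look like FLAG{...}")
    # Exact substring match: a reflected payload or a guessed flag does not pass.
    return planted_flag in evidence
```

With this check, a reflected payload such as `<script>alert(1)</script>` or a guessed `FLAG{guess}` scores nothing; only the exact planted string counts.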

04  Overall results

From suite to capture

The path from the full suite to a captured flag has three gates on a single pass. Predator clears them without benchmark-specific answers hard-coded into the harness.

81.7%
Single pass, all possible
85 of 104 on one autonomous pass, with every error left in the denominator. The headline number.
97.7%
Of what ran this pass
85 of the 87 benchmarks that actually built and ran. Pure capability with transient infrastructure removed.
92.3%
Best-of, ever captured
96 of 104 captured in some run, deduplicated. A more generous measure, shown for context.

Of the 17 errors on this pass, 11 were benchmarks Predator had captured on other runs, so they are transient rather than capability gaps. Setting aside the 2 legacy images that no longer build, the capture rate on buildable targets is 85 of 102, which is 83.3% on this pass and 96 of 102, which is 94.1% best-of. The single-pass figure stays the headline; the rest is context.
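The denominators above are easy to mix up, so here is the arithmetic spelled out. The input counts are taken from this report; the variable names are ours, and the best-of sum assumes every capture recovered from another run falls among this pass's errored benchmarks, which is how the text describes them.

```python
TOTAL = 104                      # public XBOW validation benchmarks
CAPTURED_THIS_PASS = 85          # flags captured on the single pass
ERRORS_THIS_PASS = 17            # build or run errors on the single pass
ERRORS_CAPTURED_ELSEWHERE = 11   # errored here, captured on another run
LEGACY_UNBUILDABLE = 2           # legacy images that no longer build


def pct(numerator: int, denominator: int) -> float:
    """Percentage rounded to one decimal place, as printed in the report."""
    return round(100 * numerator / denominator, 1)


ran = TOTAL - ERRORS_THIS_PASS                            # 87 built and ran
best_of = CAPTURED_THIS_PASS + ERRORS_CAPTURED_ELSEWHERE  # 96 ever captured
buildable = TOTAL - LEGACY_UNBUILDABLE                    # 102 buildable targets
never_captured = TOTAL - best_of                          # 8 still open

rates = {
    "single_pass_all": pct(CAPTURED_THIS_PASS, TOTAL),
    "single_pass_ran": pct(CAPTURED_THIS_PASS, ran),
    "best_of_all": pct(best_of, TOTAL),
    "single_pass_buildable": pct(CAPTURED_THIS_PASS, buildable),
    "best_of_buildable": pct(best_of, buildable),
}
```

Keeping every error in the denominator is what separates `single_pass_all` from `single_pass_ran`; the other figures only change which benchmarks are counted as possible.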

The figures above reflect one representative single pass.

05  Category scorecard

Where the flags came from

Every benchmark, grouped under a single primary vulnerability tag and ordered by volume, so the rows sum to exactly 104. Each bar's full width is every flag in that category. The crimson share is what Predator captured on this single pass, the amber share is captured on another run but missed this pass to a transient error, and the gray share was never captured. Counts for this pass, best-of, and the category total are printed beside every bar.

[Figure: capture share by category. Legend: captured this pass, captured on another run, never captured. Columns beside each bar: Pass, Best, Total.]

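The scorecard's grouping can be sketched as a tally over one primary tag per benchmark, which is what makes the rows sum to the suite total. This is an assumed reconstruction for illustration: the function name, the row format, and the sample rows in the usage note are made up and are not the report's data.

```python
from collections import defaultdict


def scorecard(rows):
    """Tally pass, best-of, and total counts per primary tag.

    rows: iterable of (primary_tag, captured_this_pass, captured_on_any_run).
    Returns a list of (tag, counts) ordered by category volume, largest first.
    """
    tally = defaultdict(lambda: {"pass": 0, "best": 0, "total": 0})
    for tag, this_pass, ever in rows:
        entry = tally[tag]
        entry["total"] += 1
        entry["pass"] += int(this_pass)
        # Best-of counts a benchmark captured on this pass or on any other run.
        entry["best"] += int(this_pass or ever)
    return sorted(tally.items(), key=lambda kv: kv[1]["total"], reverse=True)
```

For example, `scorecard([("xss", True, True), ("xss", False, True), ("ssti", False, False)])` yields an `xss` row with pass 1, best 2, total 2, followed by an `ssti` row with all-zero captures and total 1. Because each benchmark carries exactly one primary tag, the totals always add up to the number of rows.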
06  By difficulty

It holds up at the hard end

XBOW labels each challenge level 1 through 3. A tool that only clears level 1 is pattern matching; one that clears level 3 is reasoning through a real exploit path. Predator's capture rate stays roughly flat across all three levels.

Level 3 best-of is 7 of 8, which is 87.5%, essentially the same as level 1. That is the answer to "you just farmed the easy ones": the hardest tier is captured at the same rate as the easiest. The single level-3 benchmark that errored this pass is a transient docker failure, not a capability ceiling.

07  Capability highlights

Strong where chaining matters

Predator performed best on categories that require more than detection: an end-to-end path from a reachable surface to a proven, flag-bearing exploit. Several categories were captured in full. The single-pass figure is lower than best-of only where a transient error blocked a benchmark Predator captures reliably.

The playbooks behind these captures are generic techniques, audited for hard-coded paths and per-target answers. They include a JSFuck-style XSS encoder with onfocus bypasses and an out-of-band victim bot; a PHP filter-chain generator with log poisoning and alias traversal for file read; JWT tampering, TOCTOU race handling, and a crypto cookie oracle for authentication and privilege escalation; and out-of-band collaborator and phar deserialization for SSRF, XXE, and object injection. Predator chains these without looking up any benchmark's intended solution.

08  Room to improve

Where flags are still on the table

Best-of leaves eight benchmarks uncaptured. They separate cleanly into genuine capability gaps, a few remaining targets, and pure infrastructure, and we keep them apart rather than blurring them together.

The roadmap is specific: add an alternate trigger for phar and object deserialization when gopher is unavailable, build playbooks for the open SSTI, smuggling, and NoSQL cases, repair the two legacy images, and reduce docker and timeout flakiness so the single-pass number converges on best-of. None of this involves benchmark-specific answers.

09  Public comparison context

Context, not a head-to-head

Read this first

These figures are not an apples-to-apples comparison. Public reports differ in runtime limits, infrastructure, allowed tooling, retries, and whether build failures are counted at all. Predator's primary bar is the single-pass, all-possible figure, with every error left in the denominator. The dashed marker is best-of, a more generous metric. Treat the other systems as context, not as a controlled measurement against Predator.

On a single autonomous pass Predator's 81.7% sits above MAPTA's reported 76.9% and a few points under the XBOW baseline and the two highest community reports. On buildable targets the single-pass figure is 83.3%, which is roughly level with those baselines. The dashed marker shows best-of at 92.3%, which would top the table; we do not claim a win there, because best-of counts the best result per benchmark across multiple runs and is more generous than what most of these public numbers appear to report. The honest read is that Predator is now competitive with the strongest public figures on a single pass and above the reported figure of the closest peer system. None of these comparisons is controlled, and the public methodologies around retries and build handling are not fully disclosed, so the single-pass number is the one to hold onto.

Of these systems, MAPTA is the closest comparator: same 104-scenario benchmark, a multi-agent design, and the same refusal to score detection as success. It also publishes per-category results, which makes a real head-to-head possible. The next section does that, including the one category where Predator still trails.

10  Predator vs MAPTA

The closest comparator, category by category

MAPTA is a multi-agent autonomous web-penetration system published in August 2025 (David et al., arXiv:2508.20816). It separates a coordinator agent, sandbox executors that share a per-assessment Docker container, and a dedicated validation agent that confirms exploits before anything is reported. On the 104-scenario XBOW benchmark it reported a 76.9% overall success rate, and it published per-category figures and operational cost data.

Shared design philosophy

Both systems reject detection as a result. MAPTA enforces this with a validation agent that turns a proof-of-concept into a repeatable exploit before reporting it. Predator enforces it with capture-only scoring: a benchmark counts only when the planted flag is exfiltrated. Two teams, the same conclusion about what a real result is.

Where they differ

MAPTA disclosed cost and efficiency that this Predator run did not: roughly $0.21 per challenge on average and a 143.2-second median solve time, with early-stopping heuristics tied to tool-call budgets. Predator's run instead used a fixed 900-second cap and counted every timeout as a miss. Different priorities, both defensible.
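The fixed cap with timeouts counted as misses can be sketched with a subprocess timeout. This is a simplified illustration, not the harness's runner: `run_with_cap`, the `Outcome` enum, and scoring on the child's stdout are assumptions made for the example; only the 900-second default comes from the text.

```python
import subprocess
from enum import Enum


class Outcome(Enum):
    CAPTURED = "captured"
    MISS = "miss"


def run_with_cap(cmd, planted_flag, cap_seconds=900.0):
    """Run one attempt under a hard time cap and score it.

    cmd: argument list for the attempt process (hypothetical stand-in for a run).
    A timeout is scored as a miss rather than being dropped or retried.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=cap_seconds
        )
    except subprocess.TimeoutExpired:
        # The cap was hit: counted as a miss, never excluded from the total.
        return Outcome.MISS
    # Only the exact planted flag in the output counts as a capture.
    return Outcome.CAPTURED if planted_flag in proc.stdout else Outcome.MISS
```

Scoring a timeout as `Outcome.MISS` keeps slow attempts in the denominator, which is the conservative choice described above.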

The chart below lines up the six categories both systems report cleanly. Each Predator bar is shown two ways: the solid bar is the single-pass capture rate for that category, on the same all-challenges basis MAPTA reports; the faded extension shows best-of, the rate once any capture across runs is counted. MAPTA appears as a single steel bar. Denominators and run conditions are not identical, so read this as the fairest available alignment, not a controlled match.

[Figure: capture rate across the six shared categories. Legend: Predator single pass, Predator best-of, MAPTA reported success.]

What the head-to-head shows

This run flips the picture from earlier comparisons. On a single pass Predator now leads MAPTA on four of the six shared categories, ties on SSTI, and trails on only one. The honest reading has four parts.

MAPTA also reports 83% on broken authorization as a single grouped class. Predator's nearest primary tags are reported separately: IDOR at 12 of 12 best-of, privilege escalation at 9 of 9, so they are noted here rather than forced onto the same axis. Predator's leads are real on this run, but all of these comparisons share the same caveat: different denominators, undisclosed retry and build policies, and no controlled head-to-head.

11  Predator and the frontier models

Models, not rival harnesses

XBOW is model-agnostic by design and evaluates frontier models by running them inside its own agents. In April 2026 it published its read on OpenAI's GPT-5.5, and in May 2026 its read on Anthropic's Mythos Preview. Both belong next to this report, but as a layer comparison, not a scoreboard. XBOW measured models; this report measures a harness, and they were run on different benchmarks.

Read this first

XBOW evaluated GPT-5.5 and Mythos Preview as models on its internal benchmark of frozen open-source applications, where the primary metric is miss rate, the share of known vulnerabilities the model fails to find while driving XBOW's agents. Predator's numbers in this report come from the public 104-challenge XBOW validation suite, scored on flag exfiltration. Different benchmarks, different metrics, and models versus a harness. A model's miss rate is not Predator's capture rate; we do not place them on the same axis or invent a shared number.

What XBOW measured

Frontier models as engines inside XBOW's agents. On their internal benchmark XBOW reported GPT-5.5 cutting the miss rate to 10%, down from GPT-5 at 40% and Opus 4.6 at 18%, the best they had seen. Mythos Preview cut missed vulnerabilities by about 42% against Opus 4.6, and about 55% when handed source code. Both are strongest at reading code.

What this report measures

Predator as a complete harness driving the model powering Ares. It captured 85 of 104 public benchmarks on a single pass by orchestrating recon, exploitation, chaining, and flag capture against live targets. The two sit at different layers. A model like GPT-5.5 or Mythos is something a harness like Predator mounts, not something it races.

What XBOW reported about GPT-5.5

What XBOW reported about Mythos Preview

XBOW runs every one of these models inside its own harness and picks the best one per job. Its verdict on Anthropic's most capable model is that it is a brain without a body. That is exactly Predator's architecture: the harness is the body, the model is swappable, and the harness is the constant that turns model capability into proven, flag-bearing exploits.

The accurate reading is complementary, not competitive. A stronger model, GPT-5.5 or a Mythos-class model, raises the ceiling on candidate discovery; a harness like Predator converts candidates into validated, flag-bearing exploits against live systems, with a model that is swappable underneath it. Both XBOW evaluations are independent third-party support for the case this report makes: the harness layer is where durable capability accrues. A natural next step is to hold the harness fixed and change the model underneath it, and we will report any such run the same way, single pass first.

12  Fairness notes

Why this score is conservative on purpose

Several choices in this run push the headline down rather than up. We list them so the score can be read for what it is.

A more permissive scoring policy would raise every figure in this report. We chose not to apply one, because an offensive-security harness sold into federal and enterprise programs is judged on what it proves, not on what it claims.

13  What we claim, and what we don't

Read this before you argue with it

Benchmark reports invite a predictable set of objections. Rather than wait for them, here is the boundary in plain terms: the claims this report stands behind, and the claims it deliberately does not make.

What we are not claiming

Don't read these in
  • That Predator beats GPT-5.5, Opus 4.8, MAPTA, or any model or system. The models were measured on a different benchmark, MAPTA's figures come from an uncontrolled comparison under different run conditions, and a model and a harness are different layers.
  • That 92.3% is the headline. Best-of is the generous figure and is labeled as such everywhere. The single-pass 81.7% leads.
  • A controlled model-versus-model result. We did not run that comparison, and we do not imply one.
  • That the suite is solved. Eight benchmarks are uncaptured, named by ID in the appendix.
  • That the frontier models discussed are shipping products. They were preview or early-access at the time of XBOW's evaluations.
  • That the harness ran without a model. This run drove the model powering Ares; a model-free run is separate work and is not reported here.

What this report claims

On the record
  • On one fully autonomous pass, with no human in the loop, Predator captured 85 of 104, and 85 of the 87 that built and ran, which is 97.7%.
  • Every build and run error stays in the all-possible denominator; the headline is not inflated by dropping failures.
  • The playbooks are generic across 22 vulnerability classes, audited against hard-coded paths and per-target answers.
  • Capture rate holds across all three difficulty levels, the hardest tier included.
  • Scoring is flag-string exfiltration, fully automated, on a public suite anyone can clone and rerun.

If a critique targets something under "What we are not claiming", we agree with it in advance. If it targets something under "What this report claims", the suite is public and the per-benchmark results are printed in the appendix; the fastest rebuttal is to run it.

14  Bottom line

A strong result, honestly bounded

On a single autonomous pass, Predator captured 85 of 104 flags, which is 81.7%, and 85 of the 87 that built and ran, which is 97.7%. Counting any capture across runs, best-of reaches 96 of 104, which is 92.3%. The strongest categories are the ones that demand an end-to-end exploit path rather than a signature: cross-site scripting, command injection, SQL injection and blind SQL injection, IDOR, privilege escalation, and XXE, several of them captured in full. Capture rate holds across all three difficulty levels, including the hardest.

The remaining gap is narrow and named. Two benchmarks built and ran without yielding a flag, a handful of categories have open targets in SSTI, request smuggling, and NoSQL injection, two legacy images no longer build, and single-pass flakiness costs a few points that best-of recovers. That gives Assail a precise roadmap: alternate deserialization triggers, new playbooks for the open classes, repaired build environments, and lower docker and timeout flakiness so the single-pass figure converges on best-of. None of it requires benchmark-specific answers.

The harness drove the model powering Ares to capture most of the suite on one pass. The model is swappable; the harness is the constant. That is where the value accrues.

15  Appendix: per-benchmark results

Every benchmark, on the record

All 104 benchmarks, with difficulty, tags, and outcome. The status dot is this single pass: crimson captured, amber a build or run error, gray ran without a captured flag. The best-of column marks any capture across runs. Full disclosure is part of the point; nothing is hidden behind an aggregate.

Legend: captured this pass; build or run error; ran, no capture
ID   Lvl   Tags   This pass   Best-of