The harness for Reaper
An autonomous offensive-security harness, measured end to end against all 104 public XBOW web-security benchmarks.
Predator captured 85 of 104 flags on a single fully autonomous pass, with no human in the loop.
That is one run end to end: reconnaissance, exploitation, multi-step chaining, and flag capture, driven by the harness and the model powering Ares with no operator steering it. Of the benchmarks that built and ran on this pass, Predator captured 85 of 87, which is 97.7%, and the capture rate holds across every difficulty tier, the hardest included. A flag counts only when the real planted string is exfiltrated, so detection is never scored as a win.
Predator is the autonomous orchestration harness that drives Reaper. It runs black-box reconnaissance, vulnerability detection, exploitation, multi-step chaining, and flag capture with no human in the loop. This report documents Predator's final run against every public XBOW validation benchmark: a single autonomous pass over all 104 apps, scored under deliberately conservative rules.
On that single pass Predator captured 85 of 104 flags, 81.7%. Counting any benchmark Predator has ever captured across runs, the best-of figure is 96 of 104, 92.3%. We lead with the single-pass number on purpose, because it is what one autonomous run actually produces; best-of is a more generous measure and is labeled as such throughout. Of the benchmarks that built and ran on this pass, Predator captured 85 of 87, which is 97.7%.
This pass ran the full Predator system: the orchestration harness driving the model powering Ares. That pairing is the subject of the design thesis below, where the harness is the constant and the model is swappable. A flag counts only when the real planted string is exfiltrated into a response or finding, never on a reflected payload, so detection is never scored as success.
XBOW publishes the benchmark as 104 original web-security challenges built to evaluate offensive tools and agents across realistic vulnerability classes. Each challenge carries a hidden flag objective rather than a theoretical, finding-only score, and XBOW's own release notes emphasize reproducibility, novelty, and exploit validation as the purpose of the suite.
The harness accrues value; the model is swappable. Predator is that harness.
That thesis is why the harness, not any single model, is the durable asset. In this run the harness drove the model powering Ares through planning, chaining, and converting findings into captured flags across the suite. Holding the harness fixed and changing the model underneath it is what isolates the contribution of each layer, and is the basis for the runs that follow this one.
A scanner reports a finding and stops. Predator treats the finding as the start of an exploit path. For each target it performs reconnaissance, infers authentication state, exercises the vulnerability, chains steps where a single bug is not enough, and proves impact by extracting the planted flag. Findings that cannot be converted into a captured flag are not counted as wins.
Predator completed one full-suite pass across all 104 public XBOW validation benchmarks using its standard xbow_runner.py --redo harness, with out-of-band callback support enabled. Each benchmark is a CTF-style docker-compose web app that plants a real FLAG{...} string. This pass ran the full system, the harness driving the model powering Ares, so the numbers reflect the two together.
The suite is public. The benchmarks are XBOW's open validation set at github.com/xbow-engineering/validation-benchmarks, 104 docker-compose apps anyone can clone and run.
The scoring rule is one line, and it is strict: a benchmark counts only when the real planted FLAG{...} string, checked against the container's on-disk flag, env value, and sha256, is exfiltrated into a response or finding. A reflected payload, a detected vulnerability, or a plausible-looking guess earns nothing.
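The capture-only rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Predator's actual scorer: the names `FLAG_PATTERN`, `flag_digest`, and `is_captured` are hypothetical, and so is the choice to have the scorer hold only a sha256 digest of the planted flag. The report itself says only that the real planted string must appear in a response or finding.

```python
import hashlib
import re

# Hypothetical pattern; the report only gives the FLAG{...} shape.
FLAG_PATTERN = re.compile(r"FLAG\{[^{}]*\}")


def flag_digest(flag: str) -> str:
    """Return the sha256 hex digest of a flag string."""
    return hashlib.sha256(flag.encode("utf-8")).hexdigest()


def is_captured(expected_sha256: str, outputs: list[str]) -> bool:
    """True only if some output contains a FLAG{...} string whose digest
    matches the planted flag. Detections and guesses never match."""
    for text in outputs:
        for candidate in FLAG_PATTERN.findall(text):
            if flag_digest(candidate) == expected_sha256:
                return True
    return False
```

Under this sketch a reflected XSS payload, a finding with no flag in it, and a well-formed but wrong FLAG{...} guess all score zero, which is the behavior the rule describes.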
Every one of the 104 outcomes, captured, errored, or uncaptured, is listed by ID in the appendix at the end of this report. Clone the suite, run it, and check the rows against your own results.
The path from the full suite to a captured flag has three gates on a single pass. Predator clears them without benchmark-specific answers hard-coded into the harness.
Of the 17 errors on this pass, 11 were benchmarks Predator had captured on other runs, so they are transient rather than capability gaps. Setting aside the 2 legacy images that no longer build, the capture rate on buildable targets is 85 of 102, or 83.3%, on this pass and 96 of 102, or 94.1%, best-of. The single-pass figure stays the headline; the rest is context.
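The figures in this report reconcile with simple arithmetic. The sketch below recomputes them from the five counts stated above. The variable names are ours, and it assumes that best-of equals this pass's captures plus the errored benchmarks captured on other runs, which matches the reported 96.

```python
TOTAL = 104        # public XBOW validation benchmarks
CAPTURED = 85      # flags captured on this pass
ERRORED = 17       # build or run errors on this pass
RECOVERED = 11     # errored on this pass, captured on another run
UNBUILDABLE = 2    # legacy images that no longer build


def pct(numerator: int, denominator: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


ran = TOTAL - ERRORED            # benchmarks that built and ran: 87
best_of = CAPTURED + RECOVERED   # any capture across runs (assumption): 96
buildable = TOTAL - UNBUILDABLE  # targets that can still be built: 102

rates = {
    "single_pass": pct(CAPTURED, TOTAL),
    "ran_this_pass": pct(CAPTURED, ran),
    "best_of": pct(best_of, TOTAL),
    "buildable_single_pass": pct(CAPTURED, buildable),
    "buildable_best_of": pct(best_of, buildable),
}
```

The same counts also account for the eight benchmarks best-of leaves uncaptured: 104 minus 96.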
The figures above reflect one representative single pass.
Every benchmark is grouped under a single primary vulnerability tag and ordered by volume, so the rows sum to exactly 104. Each bar's full width is every flag in that category. The crimson share is what Predator captured on this single pass, the amber share was captured on another run but missed on this pass because of a transient error, and the gray share was never captured. Counts for this pass, best-of, and the category total are printed beside every bar.
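The three-way split behind each bar can be sketched as a small aggregation. This is a minimal illustration, assuming one record per benchmark with a single primary tag. The `Outcome` type, its field names, and `category_bars` are hypothetical and not part of Predator.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    benchmark_id: str
    category: str             # single primary vulnerability tag
    captured_this_pass: bool
    captured_any_run: bool     # best-of: captured on at least one run


def category_bars(outcomes):
    """Group outcomes by category and split each bar into crimson
    (captured this pass), amber (captured only on another run), and
    gray (never captured). Returns (category, Counter) pairs ordered
    by category size, largest first."""
    bars = {}
    for o in outcomes:
        bar = bars.setdefault(o.category, Counter())
        if o.captured_this_pass:
            bar["crimson"] += 1
        elif o.captured_any_run:
            bar["amber"] += 1
        else:
            bar["gray"] += 1
    return sorted(bars.items(), key=lambda kv: sum(kv[1].values()), reverse=True)
```

Because every benchmark lands in exactly one category and exactly one share, the shares across all bars add up to the number of records passed in.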
XBOW labels each challenge level 1 through 3. A tool that only clears level 1 is pattern matching; one that clears level 3 is reasoning through a real exploit path. Predator's capture rate stays roughly flat across all three levels: the hardest tier is captured at about the same rate as the easiest.
Level 3 best-of is 7 of 8, which is 87.5%, essentially the same as level 1. That is the answer to "you just farmed the easy ones." The single level-3 benchmark that errored on this pass is a transient docker failure, not a capability ceiling.
Predator performed best on categories that require more than detection: an end-to-end path from a reachable surface to a proven, flag-bearing exploit. Several categories were captured in full. The single-pass figure is lower than best-of only where a transient error blocked a benchmark Predator captures reliably.
The playbooks behind these captures are generic techniques, audited for hard-coded paths and per-target answers. They include a JSFuck-style XSS encoder with onfocus bypasses and an out-of-band victim bot; a php filter-chain generator with log poisoning and alias traversal for file read; JWT tampering, TOCTOU race handling, and a crypto cookie oracle for authentication and privilege escalation; and out-of-band collaborator and phar deserialization for SSRF, XXE, and object injection. Predator chains these without looking up any benchmark's intended solution.
Best-of leaves eight benchmarks uncaptured. They separate cleanly into genuine capability gaps, a few remaining targets, and pure infrastructure, and we keep them apart rather than blurring them together.
The roadmap is specific: add an alternate trigger for phar and object deserialization when gopher is unavailable, build playbooks for the open SSTI, smuggling, and NoSQL cases, repair the two legacy images, and reduce docker and timeout flakiness so the single-pass number converges on best-of. None of this involves benchmark-specific answers.
On a single autonomous pass Predator's 81.7% sits above MAPTA's reported 76.9% and a few points under the XBOW baseline and the two highest community reports. On buildable targets the single-pass figure is 83.3%, which is roughly level with those baselines. The dashed marker shows best-of at 92.3%, which would top the table; we do not claim a win there, because best-of counts the best result per benchmark across multiple runs and is more generous than what most of these public numbers appear to report. The honest read is that Predator is now competitive with the strongest public figures on a single pass and ahead of the closest peer system. None of these comparisons are controlled, and the public methodologies around retries and build handling are not fully disclosed, so the single-pass number is the one to hold onto.
Of these systems, MAPTA is the closest comparator: same 104-scenario benchmark, a multi-agent design, and the same refusal to score detection as success. It also publishes per-category results, which makes a real head-to-head possible. The next section does that, including the one category where Predator still trails.
MAPTA is a multi-agent autonomous web-penetration system published in August 2025 (David et al., arXiv:2508.20816). It separates a coordinator agent, sandbox executors that share a per-assessment Docker container, and a dedicated validation agent that confirms exploits before anything is reported. On the 104-scenario XBOW benchmark it reported a 76.9% overall success rate, and it published per-category figures and operational cost data.
Both systems reject detection as a result. MAPTA enforces this with a validation agent that turns a proof-of-concept into a repeatable exploit before reporting it. Predator enforces it with capture-only scoring: a benchmark counts only when the planted flag is exfiltrated. Two teams, the same conclusion about what a real result is.
MAPTA disclosed cost and efficiency figures that this Predator run did not: roughly $0.21 per challenge on average and a 143.2-second median solve time, with early-stopping heuristics tied to tool-call budgets. Predator's run instead used a fixed 900-second cap and counted every timeout as a miss. Different priorities, both defensible.
The chart below lines up the six categories both systems report cleanly. Each Predator bar is shown two ways: the solid bar is the single-pass capture rate for that category, on the same all-challenges basis MAPTA reports; the faded extension shows best-of, the rate once any capture across runs is counted. MAPTA appears as a single steel bar. Denominators and run conditions are not identical, so read this as the fairest available alignment, not a controlled match.
This run flips the picture from earlier comparisons. On a single pass Predator now leads MAPTA on four of the six shared categories, ties on SSTI, and trails on only one. The honest reading has four parts.
MAPTA also reports 83% on broken authorization as a single grouped class. Predator's nearest primary tags are reported separately: IDOR at 12 of 12 best-of and privilege escalation at 9 of 9. They are noted here rather than forced onto the same axis. Predator's leads are real on this run, but all of these comparisons share the same caveat: different denominators, undisclosed retry and build policies, and no controlled head-to-head.
XBOW is model-agnostic by design and evaluates frontier models by running them inside its own agents. In April 2026 it published its read on OpenAI's GPT-5.5, and in May 2026 its read on Anthropic's Mythos Preview. Both belong next to this report, but as a layer comparison, not a scoreboard. XBOW measured models; this report measures a harness, and they were run on different benchmarks.
Frontier models as engines inside XBOW's agents. On their internal benchmark XBOW reported GPT-5.5 cutting the miss rate to 10%, down from GPT-5 at 40% and Opus 4.6 at 18%, the best they had seen. Mythos Preview cut missed vulnerabilities by about 42% against Opus 4.6, and about 55% when handed source code. Both are strongest at reading code.
Predator as a complete harness driving the model powering Ares. It captured 85 of 104 public benchmarks on a single pass by orchestrating recon, exploitation, chaining, and flag capture against live targets. The two sit at different layers. A model like GPT-5.5 or Mythos is something a harness like Predator mounts, not something it races.
XBOW runs every one of these models inside its own harness and picks the best one per job. Its verdict on Anthropic's most capable model is that it is a brain without a body. That is exactly Predator's architecture: the harness is the body, the model is swappable, and the harness is the constant that turns model capability into proven, flag-bearing exploits.
The accurate reading is complementary, not competitive. A stronger model, GPT-5.5 or a Mythos-class model, raises the ceiling on candidate discovery; a harness like Predator converts candidates into validated, flag-bearing exploits against live systems, with a model that is swappable underneath it. Both XBOW evaluations are independent third-party support for the case this report makes: the harness layer is where durable capability accrues. A natural next step is to hold the harness fixed and change the model underneath it, and we will report any such run the same way, single pass first.
Several choices in this run push the headline down rather than up. We list them so the score can be read for what it is.
A more permissive scoring policy would raise every figure in this report. We chose not to apply one, because an offensive-security harness sold into federal and enterprise programs is judged on what it proves, not on what it claims.
Benchmark reports invite a predictable set of objections. Rather than wait for them, here is the boundary in plain terms: the claims this report stands behind, and the claims it deliberately does not make.
If a critique lands in the left column, we agree with it in advance. If it targets the right column, the suite is public and the per-benchmark results are printed below; the fastest rebuttal is to run it.
On a single autonomous pass, Predator captured 85 of 104 flags, which is 81.7%, and 85 of the 87 that built and ran, which is 97.7%. Counting any capture across runs, best-of reaches 96 of 104, which is 92.3%. The strongest categories are the ones that demand an end-to-end exploit path rather than a signature: cross-site scripting, command injection, SQL injection and blind SQL injection, IDOR, privilege escalation, and XXE, several of them captured in full. Capture rate holds across all three difficulty levels, including the hardest.
The remaining gap is narrow and named. Two benchmarks built and ran without yielding a flag; open targets remain in SSTI, request smuggling, and NoSQL injection; two legacy images no longer build; and single-pass flakiness costs a few points that best-of recovers. That gives Assail a precise roadmap: alternate deserialization triggers, new playbooks for the open classes, repaired build environments, and lower docker and timeout flakiness so the single-pass figure converges on best-of. None of it requires benchmark-specific answers.
The harness drove the model powering Ares to capture most of the suite on one pass. The model is swappable; the harness is the constant. That is where the value accrues.
All 104 benchmarks, with difficulty, tags, and outcome. The status dot is this single pass: crimson captured, amber a build or run error, gray ran without a captured flag. The best-of column marks any capture across runs. Full disclosure is part of the point; nothing is hidden behind an aggregate.
| ID | Lvl | Tags | This pass | Best-of |
|---|---|---|---|---|