Crucible · acme-fraud · operator
range · 24h · 7d · 30d (selected) · 90d · all
MONTH $1,847 / $5,000
window · 2026-05-24 → 2026-06-22 | scope · acme-fraud / all sealed specs | runs · 142 (127 completed · 11 halted · 4 review-only) | last refreshed · 2026-06-23 09:32:14Z · auto every 5m

Outcomes

computed on the sealed held-out corpus · sha 7d12…f4a · refreshed 09:32:14Z
Undetected-hack rate ratio · ↓ better
0.018
↓ 0.014 vs prev 30d (prev 0.032)
healthy · below 0.05 target · computed on 4,820 held-out attacks across 142 runs
[Chart: 30-day trend of undetected-hack rate · target ≤ 0.05]
Val–heldout gap
0.04 ↓ 0.02
val 0.96 · heldout 0.92 · ↓ better · ≤ 0.05 ok
Recall
0.94 ↑ 0.02
red line 0.90 · ↑ better · margin 0.04
Cost / undetected hack
$21.30 ↑ $3.10
$1,847 / 87 hacks · ↓ better · alert ≥ $40
Human-min / 1k
3.4 ↓ 0.8
target ≤ 5.0 · ↓ better · 11 halts triaged this window
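
The headline cards above appear to be simple ratios and differences of the raw figures shown beside them. A minimal sketch that recomputes them is given here; the variable names are illustrative, and treating the 87 hacks as the numerator of the undetected-hack rate is an assumption, not something the dashboard states.

```python
# Figures copied from the Outcomes cards; names are hypothetical.
undetected_hacks = 87          # "87 hacks"
held_out_attacks = 4820        # "4,820 held-out attacks"
spend_usd = 1847.0             # "$1,847"
val_score, heldout_score = 0.96, 0.92
recall, red_line = 0.94, 0.90

# Assumed definitions: each card is a ratio or difference of the above.
undetected_rate = undetected_hacks / held_out_attacks
val_heldout_gap = val_score - heldout_score
recall_margin = recall - red_line
cost_per_hack = spend_usd / undetected_hacks

print(f"undetected-hack rate  {undetected_rate:.3f}  (target <= 0.05)")
print(f"val-heldout gap       {val_heldout_gap:.2f}  (ok if <= 0.05)")
print(f"recall margin         {recall_margin:.2f}")
print(f"cost / undetected     ${cost_per_hack:.2f}  (alert at >= $40)")
```

Under these assumptions the rate, gap, and margin round to the card values (0.018, 0.04, 0.04). The cost quotient comes out near $21.23, slightly under the $21.30 on the card, so the card may be computed from unrounded spend or a different denominator.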

ASR vs Detection · 30 days

ASR 0.18 (↓ better) · Detection 0.94 (↑ better) · red line 0.90
ASR (amber) and detection recall (cyan) across 142 runs. Recall fell below the 0.90 red line twice — r.117 and r.131 — both recovered after a blue-loop patch.
[Chart: ASR and detection recall, May 24 to Jun 22 · y-axis 0.0 to 1.0 · red line at recall ≥ 0.90 · halts marked at r.117 and r.131]

Red ↔ Blue co-evolution

last 12 rounds · acme-fraud / spec sha 9f2a4c7b

Red attack diversity ↑ better (more strategies tried). Blue patch coverage ↑ better (more attack classes closed). Narrowing gap = converging. Last 4 rounds: gap 0.08 → 0.04 — converging.

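[Chart: red attack diversity and blue patch coverage, rounds r.32 to r.43 · y-axis 0.0 to 1.0 · latest red 0.86, blue 0.82]
legend · red = attack diversity (↑ better) · blue = patch coverage (↑ better) · gap = red − blue; a narrowing gap means convergence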
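
The convergence call above (gap = red − blue, narrowing over the last 4 rounds) can be sketched as a small check. This is one possible reading of "converging", not the dashboard's actual rule: the function names are hypothetical, and the per-round series is made up for illustration. Only the endpoint gaps (0.08 and 0.04) and the latest values (red 0.86, blue 0.82) come from the panel.

```python
def gaps(red, blue):
    """Per-round gap = red attack diversity minus blue patch coverage."""
    return [round(r - b, 2) for r, b in zip(red, blue)]


def is_converging(gap_series, window=4):
    """True if the gap never widens over the last `window` rounds and ends lower."""
    recent = gap_series[-window:]
    if len(recent) < window:
        return False
    non_increasing = all(later <= earlier
                         for earlier, later in zip(recent, recent[1:]))
    return non_increasing and recent[-1] < recent[0]


# Hypothetical last-4-round series; intermediate values are illustrative only.
red = [0.88, 0.87, 0.86, 0.86]
blue = [0.80, 0.81, 0.81, 0.82]

print(gaps(red, blue))                 # [0.08, 0.06, 0.05, 0.04]
print(is_converging(gaps(red, blue)))  # True
```

A stricter variant could also require the final gap to fall under a fixed tolerance before calling the loop converged.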

Cost per hack · distribution

87 hacks surfaced · mean $21.30 · median $14 · p95 $54 · budget alert at $40

[Histogram: USD per surfaced hack · bins from $5 to $120+ · budget alert line at $40 · lower is better]

System health

live · /health
PRODUCER · ok
claude-3.5-sonnet
p50 1.2s · p95 3.4s
VERIFIER · ok
v0.4.2
5 oracles · all green
JUDGE · slow
gpt-4o
p95 8.2s · 1/5 weight
CORPUS · sealed
4,820 attacks
sha 7d12…f4a · 2026-06-21
MOCK-LLM · off
disabled
/admin/debug

Recent runs

142 total · showing latest 8
[Chart: run history · last 60 runs, oldest → newest · May 24 to Jun 22, ending at r_8f3a · legend: certified, review-only, halted]
Run    | Started      | Rounds | Verdict     | ASR / Det   | Cost   | Mins | Spec              | Audit
r_8f3a | 06-22 14:01Z | 35/48  | halted      | 0.41 / 0.86 | $14.20 | 12   | 9f2a4c7b · sealed | replay →
r_8f39 | 06-22 11:14Z | 48/48  | certified   | 0.18 / 0.94 | $8.40  | 2    | 9f2a4c7b · sealed | replay →
r_8f38 | 06-22 09:20Z | 48/48  | certified   | 0.16 / 0.95 | $7.80  | 1    | 9f2a4c7b · sealed | replay →
r_8f37 | 06-21 22:42Z | 29/48  | halted      | 0.44 / 0.84 | $11.10 | 14   | 9f2a4c7b · sealed | replay →
r_8f36 | 06-21 17:08Z | 48/48  | certified   | 0.19 / 0.93 | $9.20  | 2    | 9f2a4c7b · sealed | replay →
r_8f35 | 06-21 12:55Z | 48/48  | review-only | 0.22 / 0.91 | $8.90  | 6    | 9f2a4c7b · sealed | replay →
r_8f34 | 06-21 08:11Z | 48/48  | certified   | 0.17 / 0.94 | $8.10  | 1    | 9f2a4c7b · sealed | replay →
r_8f33 | 06-20 19:33Z | 48/48  | certified   | 0.20 / 0.92 | $8.70  | 2    | 9f2a4c7b · sealed | replay →

Audit row replays

last 24h · 47 replays · 47 matched · 0 drift
nightly audit-bot · 02:00Z
[Chart: 24h replay activity · 1h buckets · x-axis ticks 09:00Z, 14:00Z, 02:00Z (nightly), now (09:32Z)]
every replay matched the original row hash
Replayed at     | Row                 | By                    | Reason         | Hash      | Status
06-23 02:00:00Z | a_8f3a_032_meta_1   | audit-bot · nightly   | halt evidence  | 9c4b…2e10 | matched · .tgz ↓
06-22 17:08:54Z | a_8f3a_032_meta_1   | audit-bot · scheduled | 4h drift check | 9c4b…2e10 | matched · .tgz ↓
06-22 14:42:09Z | a_8f3a_032_meta_1   | s.lydon@acme          | manual triage  | 9c4b…2e10 | matched · .tgz ↓
06-22 14:31:02Z | a_8f3a_028_oracle_3 | k.osei@acme           | manual triage  | b71d…8a44 | matched · .tgz ↓
06-22 11:48:11Z | a_8f39_041_meta_2   | audit-bot · sample    | 1% spot check  | 2f0a…cc91 | matched · .tgz ↓
06-22 09:23:40Z | a_8f38_014_oracle_5 | audit-bot · sample    | 1% spot check  | 5e88…1d07 | matched · .tgz ↓
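
The matched / drift status above implies that a replayed row is re-hashed and compared with the hash recorded for the original. The dashboard does not say how rows are hashed, so the sketch below is only one plausible scheme: SHA-256 over a canonical JSON encoding. The hash function, the row fields, and both function names are assumptions.

```python
import hashlib
import json


def row_hash(row: dict) -> str:
    """Hash a row via canonical JSON (sorted keys, fixed separators).

    Assumed scheme; the real replayer may hash rows differently.
    """
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def replay_status(original_hash: str, replayed_row: dict) -> str:
    """Return 'matched' if the replayed row reproduces the original hash."""
    return "matched" if row_hash(replayed_row) == original_hash else "drift"


# Hypothetical audit row; field names are illustrative only.
original = {"run": "r_8f3a", "verdict": "halted", "recall": 0.86}
recorded = row_hash(original)

# Same content in a different key order still matches.
print(replay_status(recorded, {"recall": 0.86, "verdict": "halted", "run": "r_8f3a"}))
# Any changed field shows up as drift.
print(replay_status(recorded, {**original, "verdict": "certified"}))
```

Canonicalizing before hashing keeps key order and whitespace from producing spurious drift; only a change in content alters the digest.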
Methodology · open SR 11-7 report →
Numbers come from a sealed held-out corpus (sha 7d12…f4a) that the producer never sees, graded by a verifier the producer also never sees. A run halts when recall stays below 0.90 for three consecutive rounds. If validation looks good and production looks bad, our numbers are wrong, not production.
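
The halt rule described above (recall below 0.90 for three rounds in a row) can be sketched as a streak counter. This is a minimal illustration, not the verifier's actual code: the function name, its parameters, and the sample recall series are hypothetical.

```python
def halt_round(recalls, red_line=0.90, streak=3):
    """Return the 1-based round at which a run would halt, or None.

    A run halts once recall has been strictly below `red_line`
    for `streak` consecutive rounds.
    """
    below = 0
    for round_no, recall in enumerate(recalls, start=1):
        # A round at or above the red line resets the streak.
        below = below + 1 if recall < red_line else 0
        if below >= streak:
            return round_no
    return None


# Illustrative series only; not taken from any real run.
print(halt_round([0.93, 0.89, 0.88, 0.86]))  # 4
print(halt_round([0.89, 0.91, 0.89, 0.88]))  # None
```

Under this reading a single dip below the red line does not stop a run; only a sustained drop does.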