Crucible · acme-fraud · operator

range · 24h · 7d · 30d (selected) · 90d · all

month · $1,847 / $5,000
        

window · 2026-05-24 → 2026-06-22
scope · acme-fraud / all sealed specs
runs · 142 (127 completed · 11 halted · 4 review-only)
last refreshed · 2026-06-23 09:32:14Z · auto every 5m
    

Outcomes

computed on the sealed held-out corpus · sha 7d12…f4a · refreshed 09:32:14Z

Undetected-hack rate (ratio) · ↓ better

0.018 ↓ 0.014 vs prev 30d · prev 0.032

healthy · below 0.05 target
computed on 4,820 held-out attacks across 142 runs
        

            30-day trend
            target ≤ 0.05
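
As a rough check, the headline rate is consistent with dividing the 87 hacks reported elsewhere on this page by the 4,820 held-out attacks. The sketch below assumes that definition; the function name and the `<=` comparison against the 0.05 target are illustrative and not taken from Crucible itself.

```python
def undetected_hack_rate(undetected: int, total_attacks: int) -> float:
    """Share of held-out attacks that slipped past detection."""
    if total_attacks <= 0:
        raise ValueError("total_attacks must be positive")
    return undetected / total_attacks


TARGET = 0.05  # "target ≤ 0.05" on the card

# 87 hacks and 4,820 attacks are the counts shown on this page;
# treating 87 as the numerator is an assumption.
rate = undetected_hack_rate(87, 4820)
status = "healthy" if rate <= TARGET else "over target"
print(f"{rate:.3f} · {status}")  # 0.018 · healthy
```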
          

Val–heldout gap

0.04 ↓ 0.02

val 0.96 · heldout 0.92 · ↓ better · ≤ 0.05 ok

Recall

0.94 ↑ 0.02

red line 0.90 · ↑ better · margin 0.04

Cost / undetected hack

$21.30 ↑ $3.10

$1,847 / 87 hacks · ↓ better · alert ≥ $40

Human-min / 1k

3.4 ↓ 0.8

target ≤ 5.0 · ↓ better · 11 halts triaged this window

ASR vs Detection · 30 days

          ASR 0.18 ↓ better
          Detection 0.94 ↑ better
          red line 0.90
        

ASR (amber) and detection recall (cyan) across 142 runs. Recall fell below the 0.90 red line twice, at r.117 and r.131; both recovered after a blue-loop patch.

Red ↔ Blue co-evolution

last 12 rounds · acme-fraud / spec sha 9f2a4c7b

Red attack diversity ↑ better (more strategies tried). Blue patch coverage ↑ better (more attack classes closed). Narrowing gap = converging. Last 4 rounds: gap 0.08 → 0.04 — converging.

          red · attack diversity ↑ better
          blue · patch coverage ↑ better
          gap = red − blue · narrowing means convergence
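
One way to read "narrowing means convergence" is to compare the gap at the start and end of a recent window. This is a minimal sketch under that assumption: `is_converging` and its 4-round default are hypothetical, and only the endpoints 0.08 and 0.04 come from the panel (the two middle values are invented for illustration).

```python
from typing import Sequence


def is_converging(gaps: Sequence[float], window: int = 4) -> bool:
    """True if |gap| at the end of the window is smaller than at its start."""
    recent = list(gaps)[-window:]
    if len(recent) < 2:
        return False  # not enough rounds to judge a trend
    return abs(recent[-1]) < abs(recent[0])


# Endpoints 0.08 and 0.04 are from the panel; the middle values are made up.
last_four = [0.08, 0.07, 0.05, 0.04]
print(is_converging(last_four))  # True
```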
        

Cost per hack · distribution

87 hacks surfaced · mean $21.30 · median $14 · p95 $54 · budget alert at $40

System health

live · /health

open health page →

PRODUCER · ok · claude-3.5-sonnet · p50 1.2s · p95 3.4s
VERIFIER · ok · v0.4.2 · 5 oracles · all green
JUDGE · slow · gpt-4o · p95 8.2s · 1/5 weight
CORPUS · sealed · 4,820 attacks · sha 7d12…f4a · 2026-06-21
MOCK-LLM · off · disabled · /admin/debug

Recent runs

142 total · showing latest 8

all runs →

Run history · last 60 · oldest → newest · May 24 to Jun 22 (latest r_8f3a)
legend · certified · review-only · halted
        

Run     Started       Rounds  Verdict      ASR / Det    Cost    Mins  Spec               Audit
r_8f3a  06-22 14:01Z  35/48   halted       0.41 / 0.86  $14.20  12    9f2a4c7b · sealed  replay →
r_8f39  06-22 11:14Z  48/48   certified    0.18 / 0.94  $8.40   2     9f2a4c7b · sealed  replay →
r_8f38  06-22 09:20Z  48/48   certified    0.16 / 0.95  $7.80   1     9f2a4c7b · sealed  replay →
r_8f37  06-21 22:42Z  29/48   halted       0.44 / 0.84  $11.10  14    9f2a4c7b · sealed  replay →
r_8f36  06-21 17:08Z  48/48   certified    0.19 / 0.93  $9.20   2     9f2a4c7b · sealed  replay →
r_8f35  06-21 12:55Z  48/48   review-only  0.22 / 0.91  $8.90   6     9f2a4c7b · sealed  replay →
r_8f34  06-21 08:11Z  48/48   certified    0.17 / 0.94  $8.10   1     9f2a4c7b · sealed  replay →
r_8f33  06-20 19:33Z  48/48   certified    0.20 / 0.92  $8.70   2     9f2a4c7b · sealed  replay →
      

Audit row replays

last 24h · 47 replays · 47 matched · 0 drift

nightly audit-bot · 02:00Z
open replayer →

24h replay activity · 1h buckets · axis ticks 09:00Z · 14:00Z · 02:00Z (nightly) · 09:32Z (now)
every replay matched the original row hash
        

Replayed at      Row                  By                     Reason          Hash       Status
06-23 02:00:00Z  a_8f3a_032_meta_1    audit-bot · nightly    halt evidence   9c4b…2e10  matched  .tgz ↓
06-22 17:08:54Z  a_8f3a_032_meta_1    audit-bot · scheduled  4h drift check  9c4b…2e10  matched  .tgz ↓
06-22 14:42:09Z  a_8f3a_032_meta_1    s.lydon@acme           manual triage   9c4b…2e10  matched  .tgz ↓
06-22 14:31:02Z  a_8f3a_028_oracle_3  k.osei@acme            manual triage   b71d…8a44  matched  .tgz ↓
06-22 11:48:11Z  a_8f39_041_meta_2    audit-bot · sample     1% spot check   2f0a…cc91  matched  .tgz ↓
06-22 09:23:40Z  a_8f38_014_oracle_5  audit-bot · sample     1% spot check   5e88…1d07  matched  .tgz ↓
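
The matched / drift status in the replay table suggests that each replay recomputes a row hash and compares it with the stored one. The page does not say how the hash is built, so the sketch below is purely an assumption: it uses SHA-256 over canonical JSON, and `row_hash`, `replay_status`, and the sample row are hypothetical names and data.

```python
import hashlib
import json


def row_hash(row: dict) -> str:
    """SHA-256 over a canonical JSON encoding (assumed scheme, not Crucible's)."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def replay_status(stored_hash: str, replayed_row: dict) -> str:
    """Compare the replayed row against the hash stored at audit time."""
    return "matched" if row_hash(replayed_row) == stored_hash else "drift"


# Hypothetical audit row; field names are illustrative only.
original = {"run": "r_example", "round": 32, "verdict": "halted"}
stored = row_hash(original)

print(replay_status(stored, dict(original)))                        # matched
print(replay_status(stored, {**original, "verdict": "certified"}))  # drift
```

Sorting keys before hashing keeps the digest stable regardless of field order, which is what lets a later replay reproduce the same value.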
      

Methodology · open SR 11-7 report →
      

Numbers come from a sealed held-out corpus (sha 7d12…f4a) that the producer never sees, graded by a verifier that the producer never sees. A run halts when recall stays below 0.90 for three consecutive rounds. If validation looks good and production looks bad, our numbers are wrong, not production.
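
The halt rule in the methodology note can be sketched as a streak counter over per-round recall. This is an illustration under stated assumptions: `should_halt` and its parameter names are hypothetical, and the recall series in the usage lines are invented to exercise the rule.

```python
from typing import Sequence

RECALL_RED_LINE = 0.90   # red line quoted on this page
CONSECUTIVE_ROUNDS = 3   # "three rounds running"


def should_halt(
    recalls: Sequence[float],
    red_line: float = RECALL_RED_LINE,
    streak: int = CONSECUTIVE_ROUNDS,
) -> bool:
    """True once recall is below the red line for `streak` rounds in a row."""
    below = 0
    for recall in recalls:
        below = below + 1 if recall < red_line else 0
        if below >= streak:
            return True
    return False


# Invented series: three low rounds in a row trigger a halt...
print(should_halt([0.94, 0.89, 0.88, 0.86]))  # True
# ...while a dip that recovers in between does not.
print(should_halt([0.89, 0.91, 0.89, 0.88]))  # False
```

Under this reading, a single dip below 0.90 that recovers in the next round would not halt a run; only a sustained streak would.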