REVIEW ONLY · not shown in product — state 1 of 3 · streaming live
C Crucible acme-fraud · operator
/runs/r_8f3a · live run view
THIS RUN $4.18 / $25.00
16.7% · cyan <70% · amber 70–90% · danger >90%
LAST ROUND $0.21
session · context only $65.58 · 7 runs · since 09:14Z
SSE connected · 14:08:44Z
RUNNING r_8f3a · 2026-06-22
round 23 / 48 · est. complete in 6m 12s 47.9%
target fraud_adapter@7c1d
spec · operator j.okafor
ASR vs Detection · live
updating each round
ASR 0.18 ↓ better
iabout
ASR · attack success rate
unit: fraction 0–1 · share of red-team attempts the producer slipped past the verifier.
source: oracle aggregator
Detection 0.94 ↑ better
iabout
Detection · recall
unit: fraction 0–1 · share of known attacks the verifier catches on the held-out attack set. Same series as white-box self-test recall used for halt gating.
source: held-out oracle
red line 0.90
Red line · Detection recall must stay >= 0.90. ASR worsens up · Detection worsens down. Shaded bands mark the danger direction for each series.
red line · recall >= 0.90 r.23 ▾ 1.0 0.5 0.0 r.1 r.12 r.23
Verdict stream
newest at top · 4 oracles + judge per round
All · selected Fails Evasions
RoundTimeOracleVerdictCost
ROUND 23 · STREAMING started 14:08:39Z · +5.2s elapsed · producer sandbox running 3 / 5 verdicts in
Held-Out Tests ✓ PASS
pass_rate 0.97 across 220 sealed cases · 14:08:41Z · $0.041
Metamorphic ✕ FAIL
label flipped on amount × 10 · 14:08:42Z · $0.038 ·
Differential WAITING
awaiting verdict
Property Fuzz WAITING
awaiting verdict
1/5LLM Judge ONE VOTE verdict pass · narrative consistent w/ obligations · 14:08:43Z ✓ PASS
round 22 · 14:08:24Z · $0.19 · aggregate PASS 5 / 5
held-out ok · 0.98 metamorphic ok differential agree fuzz no-fail judge 1/5 pass
round 21 · 14:08:06Z · $0.22 · aggregate FAIL · +1 evasion → catalog
held-out ok · 0.96 metamorphic flip differential disagree fuzz no-fail judge 1/5 pass
rounds 1 – 20 · 20 rolled up · all PASS · $3.27 ·
Red-agent reasoning · round 23
why was the last attempt caught?
step 1 · caught by metamorphic.invariance: the label flipped when I scaled amount × 10.
step 2 · hypothesis: the producer has learned a magnitude threshold near $2,500; it crosses it at 10×.
step 3 · propose minimal change — scale amount × 1.5, keep merchant family fixed, perturb timestamp by <= 4h to preserve intent.
step 4 · reuse tactic from strategy catalog · 17 prior wins.
query_target() · awaiting score
LIVE LEDGER
spent · this run $4.18
ceiling $25.00 16.7% used
avg / round$0.182
projected at r.48
method · avg $/round × rounds remaining; band is 95% CI over last 5 rounds.
spend ceiling
bar warns at 70% (amber), danger at 90% (red). parent doc · Model Risk Governance MRG-12.
SUBCOMPONENT HEALTH
Producer sandboxok · 24ms
Oracle · held-outok · 1.2s
Oracle · metamorphicok · 0.8s
Oracle · differentialslow · 4.7s
Oracle · fuzzok · 0.6s
LLM judgeok · 1.0s
STRATEGY CATALOG · this run
tactics logged this run +1
magnitude-creepr.21
scale × 1.5, perturb ts <= 4h · evades meta + diff
STREAM
transportSSE
events / sec2.4
last event14:08:44Z
reconnects0
REVIEW ONLY · not shown in product — state 2 of 3 · transport dropped · buffered
C Crucible acme-fraud · operator
/runs/r_8f3a · live run view
THIS RUN · stale $4.18 / $25.00
SESSION · stale $65.58 · 7 runs last known 14:08:32Z
SSE reconnecting · attempt 3
Transport dropped at 14:08:32Z · 12s ago. Showing last buffered state. The run is still executing on the backend — only this view is behind.
CATCHING UP · degraded r_8f3a · 2026-06-22
buffered through round 23 / 48 · +? events queued 47.9%
target fraud_adapter@7c1d
spec · operator j.okafor
disabled · transport reconnecting
disabled · transport reconnecting
ASR vs Detection · paused at 14:08:32Z
awaiting backfill
ASR 0.18 Detection 0.94
1.0 0.5 0.0 backfill pending
Verdict stream · buffered
no new events since 14:08:32Z
ROUND 23 · STALE last update 14:08:32Z · 3 of 5 verdicts in at drop
Held-Out Tests✓ PASS
pass_rate 0.97 · received 14:08:31Z
Metamorphic✕ FAIL
label flipped · received 14:08:32Z
DifferentialUNKNOWN
verdict not received before transport drop
Property FuzzUNKNOWN
verdict not received before transport drop
a backfill of +? events will arrive on reconnect · the backend is the source of truth, not this view
round 22 · 14:08:24Z · $0.19 · PASS
held-out ok · 0.98 metamorphic ok differential agree fuzz no-fail judge 1/5 pass
rounds 1 – 21 · 21 rolled up · last refresh 14:08:24Z
RECONNECT
reconnecting · attempt 3 of 6 retry in 4s
dropped at14:08:32Z
backoff1s · 2s · 4s · 8s
last errornet::ERR_TIMED_OUT
LEDGER · last refreshed 14:08:24Z
spent · last known $4.18
backend may have advanced past this. confirmed total will replace this on reconnect.
TRANSPORT
transportSSE · reconnecting
uptime · this run99.2%
reconnects1
last event14:08:32Z
BACKEND HEALTH · /health · 14:08:43Z
backend reachable · poll fallback
the run is still executing. only the live stream is behind.
REVIEW ONLY · not shown in product — state 3 of 3 · halt-certification triggered
HALT — CERTIFICATION SUSPENDED White-box self-test recall 0.71 is below the 0.90 red line. Run stopped at round 35. Certification holds until verifier recall recovers.
C Crucible acme-fraud · operator
/runs/r_8f3a · live run view
THIS RUN · FROZEN $14.20 / $25.00
56.8% · frozen at halt
HALT AT round 35 / 48
SSE drained · run stopped
HALTED · stopped r_8f3a · 2026-06-22
round 35 / 48 · stopped at 14:14:02Z · 13 rounds not executed 72.9%
target fraud_adapter@7c1d
spec · operator j.okafor
blocked · needs risk-approver override (recall red line · MRG-12.6)
ASR vs White-box recall · halted at r.35
crossed r.32 · confirmed over 3-round debounce window · halted r.35
ASR 0.42 White-box recall 0.71 red line 0.90
red line · recall >= 0.90 r.32 · crossed r.35 · HALT ▾ 1.0 0.5 0.0 r.1 r.18 r.35
Red line · White-box recall must stay >= 0.90. Same series as "Detection" in the live view — renamed here to match the halt gate. ASR ↓ better, recall ↑ better. Crossed at r.32 · 14:13:18Z; confirmed over a 3-round debounce window (r.32 → r.34) so a single bad round can't halt a run; halted at r.35 · 14:14:02Z.
RED LINE CROSSED recall · white-box self-test policy · recall red line ·
parent · Model Risk Governance MRG-12 — §12.4 spend ceiling, §12.6 recall red line.
RED LINE
>= 0.90
OBSERVED · r.35
0.71
CROSSED AT
r.32 · 14:13:18Z
The white-box adversary — handed the verification scheme — is succeeding on 29% of evasions that would otherwise be caught. Certification is suspended; the platform cannot honestly attest to catch rate. Next runs are blocked at the launcher until recall recovers to 0.90 on a held-out attack set.
Verdict stream · last 3 rounds before halt
ROUND 35 · HALTING 14:14:02Z · $0.41 · aggregate FAIL
aggregation · 3 of 4 deterministic oracles FAIL → weight 4/5 FAIL; LLM judge PASS · weight 1/5; weighted result FAIL (4/5).
Held-Out Tests✕ FAIL
pass_rate 0.83 · 37 regressed cases
Metamorphic✕ FAIL
invariance broke on 4 of 8 relations
Differential✕ FAIL
cross-family drift 0.071 > 0.030
Property Fuzz✓ PASS
no invariant violated · 2k samples
1/5LLM Judge ONE VOTE judge says pass · ignored by aggregator; three independent oracles voted FAIL ✓ PASS
round 34 · 14:13:41Z · $0.38 · aggregate FAIL · evasion logged
held-out fail · 0.86 metamorphic flip differential agree fuzz no-fail judge 1/5 pass
round 33 · 14:13:25Z · $0.36 · aggregate FAIL · evasion logged
held-out ok · 0.94 metamorphic flip differential disagree fuzz no-fail judge 1/5 pass
LEDGER · FROZEN
spent · this run $14.20
ceiling $25.0056.8% used at halt
no further charges will accrue. 13 rounds not executed.
EVASIONS CAPTURED · this run
tactics logged → catalog +4
magnitude-creepr.21, r.27
cross-family-driftr.30
temporal-shift-4hr.33
merchant-family-shuffler.34
NEXT
1Run the blue loop on these 4 captured tactics
2Re-evaluate on held-out attack set
3Recall >= 0.90 lifts the halt
AUDIT
halt id h_3a91
policy
signed-by platform-attestor
immutable · 35 rounds · 142 verdicts captured
INSPECT · LLM CALL
Metamorphic oracle · round 23
LLM call Sandbox job
model · claude-3-5-sonnet-20251022 · temp 0.0
invoked · 14:08:41Z · 1.2s
system-prompt sha ·
PROMPT
You are the metamorphic-relations oracle.
For each invariance relation, apply the
transform to the producer input, query the
target, and report whether the verdict held.
RAW RESPONSE
{"relations":[
  {"name":"amount_x10","held":false,
   "before":"fraud","after":"clean"},
  {"name":"merchant_shuffle","held":true},
  {"name":"ts_shift_1h","held":true}
 ],"verdict":"fail"}
PARSED OUTPUT
amount_x10 flipped fraud → clean · FAIL
tokens 1,847 cost $0.038 latency 0.8s
API keys, DB credentials, and sandbox tokens are redacted — the only values ever hidden.
REPLAY · audit row a_8f3a-r23
Original vs replay · seed 0x91af
ORIGINAL · 14:08:42Z
REPLAY · just now
verdict: fail
verdict: fail
relation: amount_x10
relation: amount_x10
before: fraud
before: fraud
after: clean
after: clean
tokens: 1847
tokens: 1847
replay matches original byte-for-byte · verdict reproducible from seed
HALT POLICY · recall red line · MRG-12.6
Why this run halted
parent · Model Risk Governance MRG-12

Halt-certification stops a run when the white-box self-test recall — the catch rate measured against an adversary that already knows the verification scheme — drops below a published red line. The platform cannot honestly attest to catch rate when this line is crossed, so it stops certifying rather than ship a false claim.

RULE
// entry and exit thresholds use the identical metric on the
// identical population: white_box recall measured on a fresh
// held-out attack set, so the producer can't re-overfit to
// the known white-box scheme between halt and resume

halt if recall_white_box < 0.90 on held_out_attack_set
resume if recall_white_box >= 0.90 on held_out_attack_set (binding)
OBSERVATION
recall_white_box(r.35) = 0.71 ◄ below red line
This is capability-threshold gating in the spirit of a Responsible Scaling Policy: the platform must measurably catch an adversary that knows the scheme before certifying. The blue loop, not an override, is the path back to certification.