Benchmark · seeded-v3

frozen 2026-05-01 · 2,400 instances · 7 cheat families · sha256 9c14…aa02

held-out corpus · sealed · publicly reproducible from manifest · do not retrain on this corpus

Corpus composition

Family             Instances
META-INV-01..04        1,488
JUDGE family             432
DIFFERENTIAL             288
FUZZ                     192
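
As a quick sanity check on the composition figures, the sketch below sums the per-family counts and derives each family's share of the corpus. The dictionary name and layout are illustrative only; the counts are the ones listed above, and the total should agree with the 2,400 instances stated in the header.

```python
# Per-family instance counts, copied from the corpus composition listing.
# The variable names here are illustrative, not part of any crucible API.
corpus_counts = {
    "META-INV-01..04": 1488,
    "JUDGE family": 432,
    "DIFFERENTIAL": 288,
    "FUZZ": 192,
}

# Total should match the "2,400 instances" figure in the benchmark header.
total_instances = sum(corpus_counts.values())

# Fraction of the corpus contributed by each family.
shares = {name: count / total_instances for name, count in corpus_counts.items()}

if __name__ == "__main__":
    print(f"total: {total_instances}")
    for name, share in shares.items():
        print(f"{name:<16} {share:.0%}")
```

The shares work out to 62%, 18%, 12%, and 8% respectively, so the META-INV families make up well over half of the corpus.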

Reproduce

$ crucible bench pull seeded-v3
$ crucible bench verify --sha 9c14…aa02
$ crucible bench run \
    --producer your-model \
    --corpus seeded-v3 \
    --emit leaderboard.json
$ crucible bench submit leaderboard.json
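
For readers who want to check a digest independently of the CLI, here is a minimal sketch of a SHA-256 check using only the Python standard library. It assumes the published digest covers a single local file, which this page does not confirm; the function names are hypothetical, and because the digest is only shown in elided form (9c14…aa02), the comparison matches on the visible prefix and suffix rather than the full value.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large corpora do not need to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_elided(full_hex, elided):
    """Compare a full hex digest against an elided one such as '9c14…aa02'.

    Only the visible prefix and suffix are checked; this is weaker than a
    full comparison and is no substitute for the complete published digest.
    """
    prefix, _, suffix = elided.partition("…")
    return full_hex.startswith(prefix) and full_hex.endswith(suffix)
```

A call such as `matches_elided(sha256_of_file(path_to_corpus), "9c14…aa02")` would then report whether the visible digits agree, where `path_to_corpus` is wherever the pulled corpus happens to live locally.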

Leaderboard · seeded-v3
ranked by recall · ties broken by cost · 14 submissions

#  Submission                     Recall  ASR   v↔h gap  Cost / hack  Human-min / 1k  State
1  crucible-blue-v2.4             0.93    0.07  0.02     $1.18        3.4             verified
   house · 9f2a4c7b · r.46
2  harbor-research / hr-detect-7  0.91    0.09  0.03     $1.42        4.1             verified
   external · 2026-06-18
3  stanford-mlsys / sm-judge-iv   0.88    0.12  0.08     $0.94        5.8             verified
   external · 2026-06-12
4  crucible-blue-v2.3             0.86    0.14  0.02     $1.21        3.6             verified
   house · pre-patch p_2a17
5  openrisk / or-honest-1         0.84    0.16  0.06     $2.10        7.2             verified
   external · 2026-05-30
6  prompt-shield / ps-rc4         0.90    0.10  0.24     $0.71        3.0             overfit-flag
   external · 2026-05-22 · flagged
7  baseline · llm-judge-only      0.62    0.38  0.03     $0.42        2.1             reference
   house · reference

v↔h gap = recall on validation minus recall on the sealed held-out corpus. A gap above 0.05 trips the overfit-flag. Submissions with a flag remain on the board for transparency; they are not eligible for the leader slot.
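
The flag rule in the note above can be sketched in a few lines. This is an illustration under stated assumptions, not the benchmark's actual implementation: the function names are hypothetical, the recall values in the usage example are made up, and `Decimal` is used so that a gap of exactly 0.05 is not pushed over the threshold by binary floating-point rounding.

```python
from decimal import Decimal

# Threshold from the note: a gap strictly above 0.05 trips the overfit-flag.
OVERFIT_THRESHOLD = Decimal("0.05")


def vh_gap(validation_recall, heldout_recall):
    """v↔h gap: validation recall minus recall on the sealed held-out corpus.

    Inputs are decimal strings (e.g. "0.95") to keep the arithmetic exact.
    """
    return Decimal(validation_recall) - Decimal(heldout_recall)


def trips_overfit_flag(gap):
    """True when the gap is strictly above the 0.05 threshold."""
    return gap > OVERFIT_THRESHOLD


def eligible_for_leader_slot(gap):
    """Flagged submissions stay on the board but cannot hold the leader slot."""
    return not trips_overfit_flag(gap)


if __name__ == "__main__":
    # Made-up recall values, for illustration only.
    gap = vh_gap("0.97", "0.90")
    print(gap, trips_overfit_flag(gap), eligible_for_leader_slot(gap))
```

Because the rule says "above 0.05", the comparison is strict: a gap of exactly 0.05 does not trip the flag in this sketch.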