Crucible

Benchmark · seeded-v3

frozen 2026-05-01 · 2,400 instances · 7 cheat families · sha256 9c14…aa02
held-out corpus · sealed · publicly reproducible from manifest · do not retrain on this corpus
Corpus composition

| Family | Instances |
| --- | --- |
| META-INV-01..04 | 1,488 |
| JUDGE family | 432 |
| DIFFERENTIAL | 288 |
| FUZZ | 192 |

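
As a quick sanity check on the composition figures above, the short Python sketch below sums the four listed counts and compares the result with the 2,400 instances stated for the frozen corpus. The dictionary and function names are illustrative only and are not part of any Crucible tooling.

```python
# Instance counts copied from the corpus composition figures above.
composition = {
    "META-INV-01..04": 1488,
    "JUDGE family": 432,
    "DIFFERENTIAL": 288,
    "FUZZ": 192,
}

# Total stated in the benchmark header (2,400 instances).
EXPECTED_TOTAL = 2400


def composition_shares(counts):
    """Return each group's fraction of the whole corpus."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total = sum(composition.values())
shares = composition_shares(composition)

for name, share in shares.items():
    print(f"{name:<16} {composition[name]:>5}  {share:.0%}")
print(f"{'total':<16} {total:>5}  matches header: {total == EXPECTED_TOTAL}")
```

The four groups account for 62%, 18%, 12%, and 8% of the corpus respectively, and together they add up to the stated total.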
Reproduce
$ crucible bench pull seeded-v3
$ crucible bench verify --sha 9c14…aa02
$ crucible bench run \
    --producer your-model \
    --corpus seeded-v3 \
    --emit leaderboard.json
$ crucible bench submit leaderboard.json
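
What `crucible bench verify` does internally is not documented on this page. As a general illustration of checking a pulled artifact against a published digest, the Python sketch below hashes a file with SHA-256 and compares the result against a shortened digest of the form shown above (first and last characters only). The function names and any file path passed to them are assumptions for this example, and a prefix/suffix match is weaker than comparing the full digest from the manifest.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large tarballs are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_truncated(digest, prefix, suffix):
    """Check a full hex digest against a shortened 'prefix…suffix' form.

    This only confirms the visible ends agree; prefer the full digest
    from the manifest when it is available.
    """
    digest = digest.lower()
    return digest.startswith(prefix.lower()) and digest.endswith(suffix.lower())
```

With a locally pulled corpus file (path hypothetical), `matches_truncated(sha256_of_file(path), "9c14", "aa02")` would report whether the visible ends of the digest agree with the value shown on this page.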
Leaderboard · seeded-v3
ranked by recall · ties broken by cost · 14 submissions

| # | Submission | Recall | ASR | v↔h gap | Cost / hack | Human-min / 1k | State |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | crucible-blue-v2.4 (house · 9f2a4c7b · r.46) | 0.93 | 0.07 | 0.02 | $1.18 | 3.4 | verified |
| 2 | harbor-research / hr-detect-7 (external · 2026-06-18) | 0.91 | 0.09 | 0.03 | $1.42 | 4.1 | verified |
| 3 | stanford-mlsys / sm-judge-iv (external · 2026-06-12) | 0.88 | 0.12 | 0.08 | $0.94 | 5.8 | verified |
| 4 | crucible-blue-v2.3 (house · pre-patch p_2a17) | 0.86 | 0.14 | 0.02 | $1.21 | 3.6 | verified |
| 5 | openrisk / or-honest-1 (external · 2026-05-30) | 0.84 | 0.16 | 0.06 | $2.10 | 7.2 | verified |
| 6 | prompt-shield / ps-rc4 (external · 2026-05-22 · flagged) | 0.90 | 0.10 | 0.24 | $0.71 | 3.0 | overfit-flag |
| 7 | baseline · llm-judge-only (house · reference) | 0.62 | 0.38 | 0.03 | $0.42 | 2.1 | reference |

v↔h gap = recall on validation minus recall on the sealed held-out corpus. A gap above 0.05 trips the overfit-flag. Submissions with a flag remain on the board for transparency; they are not eligible for the leader slot.
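
The gap rule and the leader-slot rule described above can be sketched in Python as follows. This is a minimal illustration, not the board's actual implementation. It assumes that ranking uses held-out recall, that the lower cost wins a tie, and that flagged entries are only excluded from the leader slot (where they are placed on the board is not specified here). The `Submission` class, its field names, and the three sample entries are made up for the example and do not correspond to rows on the leaderboard.

```python
from dataclasses import dataclass

# From the note above: a gap above 0.05 trips the overfit-flag.
OVERFIT_GAP_THRESHOLD = 0.05


@dataclass
class Submission:
    name: str
    validation_recall: float
    heldout_recall: float
    cost_per_hack: float

    @property
    def gap(self):
        # v↔h gap = validation recall minus held-out recall.
        return self.validation_recall - self.heldout_recall

    @property
    def flagged(self):
        return self.gap > OVERFIT_GAP_THRESHOLD


def rank(subs):
    """Order by recall (descending), breaking ties by cost (ascending).

    Assumption: 'recall' means held-out recall and cheaper wins a tie.
    """
    return sorted(subs, key=lambda s: (-s.heldout_recall, s.cost_per_hack))


def leader(subs):
    """Return the best-ranked submission that carries no overfit flag."""
    for s in rank(subs):
        if not s.flagged:
            return s
    return None


# Hypothetical entries, chosen to stay clear of the 0.05 boundary.
subs = [
    Submission("alpha", validation_recall=0.94, heldout_recall=0.92, cost_per_hack=1.30),
    Submission("bravo", validation_recall=0.99, heldout_recall=0.92, cost_per_hack=0.80),
    Submission("charlie", validation_recall=0.90, heldout_recall=0.88, cost_per_hack=1.00),
]

for s in rank(subs):
    print(f"{s.name:<8} recall={s.heldout_recall:.2f} gap={s.gap:.2f} flagged={s.flagged}")
print("leader:", leader(subs).name)
```

In this made-up set, "bravo" ties "alpha" on recall and wins the tie on cost, but its gap of about 0.07 trips the flag, so "alpha" takes the leader slot while "bravo" stays listed.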