An open benchmark for evaluating LLMs as autonomous ML researchers. Models are scored on their ability to iteratively improve a neural network's validation loss through code modifications, measured by keep rate: the fraction of experiments that improve on the baseline.
| # | Model | Hardware | Dataset | Keep Rate | Crash Rate | Best val_bpb | Runs | Contributor | Status |
|---|---|---|---|---|---|---|---|---|---|
Results live under `data/results/{dataset}/{your-github-handle}-{gpu}/`. Each results TSV has the columns: `exp`, `description`, `val_bpb`, `peak_mem_gb`, `tok_sec`, `mfu`, `steps`, `status`, `notes`, `gpu_name`, `baseline_sha`, `watts`, `joules_per_token`, `total_energy_joules`.
`status` must be one of: `baseline`, `keep`, `discard`, `crash`, `skip`.
Filename pattern: `results_{model}_{rN}.tsv` (e.g. `results_sonnet46_r1.tsv`).
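As a minimal sketch of how keep rate and crash rate can be derived from a results TSV, the snippet below parses the `status` column described above and scores the non-baseline rows. The helper name `score_results` and the reduced two-column sample are illustrative assumptions, not part of the benchmark tooling; a real file would carry the full column set.

```python
import csv
import io

def score_results(tsv_text: str) -> dict:
    """Compute keep/crash rates over non-baseline rows of one results TSV.

    Keep rate  = fraction of experiments with status 'keep'.
    Crash rate = fraction of experiments with status 'crash'.
    Baseline rows are excluded from the denominator (assumption).
    """
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    experiments = [r for r in rows if r["status"] != "baseline"]
    if not experiments:
        return {"keep_rate": 0.0, "crash_rate": 0.0}
    n = len(experiments)
    return {
        "keep_rate": sum(r["status"] == "keep" for r in experiments) / n,
        "crash_rate": sum(r["status"] == "crash" for r in experiments) / n,
    }

# Reduced sample: only the columns needed for scoring.
sample = (
    "exp\tstatus\n"
    "e0\tbaseline\n"
    "e1\tkeep\n"
    "e2\tdiscard\n"
    "e3\tcrash\n"
)
print(score_results(sample))
```

With three non-baseline experiments (one `keep`, one `discard`, one `crash`), this yields a keep rate and crash rate of 1/3 each.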