Autoresearch Benchmark

Open benchmark for evaluating LLMs as autonomous ML researchers. Models are scored on their ability to iteratively improve a neural network's validation loss through code modifications, measured by keep rate (fraction of experiments that improve the baseline).
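The keep-rate metric described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual scoring code; it assumes each experiment carries one of the status values listed below, and that the denominator counts attempted experiments (keep, discard, crash) while baseline and skip rows are excluded:

```python
from typing import Iterable

def keep_rate(statuses: Iterable[str]) -> float:
    """Fraction of attempted experiments whose change improved the baseline.

    Assumption: 'keep' marks an improvement; 'baseline' and 'skip' rows
    are not counted as attempts.
    """
    attempts = [s for s in statuses if s in ("keep", "discard", "crash")]
    if not attempts:
        return 0.0
    return sum(s == "keep" for s in attempts) / len(attempts)
```

For example, a run logged as `["baseline", "keep", "discard", "keep"]` would score a keep rate of 2/3.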

| # | Model | Hardware | Dataset | Keep Rate | Crash Rate | Best val_bpb | Runs | Contributor | Status |
|---|-------|----------|---------|-----------|------------|--------------|------|-------------|--------|

How to contribute

  1. Fork elementalcollision/autoresearch-unified
  2. Run autoresearch with any LLM on any GPU
  3. Place results TSV in data/results/{dataset}/{your-github-handle}-{gpu}/
  4. Open a pull request — the leaderboard rebuilds on merge

TSV format (14 columns): exp, description, val_bpb, peak_mem_gb, tok_sec, mfu, steps, status, notes, gpu_name, baseline_sha, watts, joules_per_token, total_energy_joules.
Status must be one of: baseline, keep, discard, crash, skip.
Filename pattern: results_{model}_{rN}.tsv (e.g. results_sonnet46_r1.tsv).
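Before opening a pull request, a results file can be checked against this format. The sketch below is a hypothetical validator (not part of the repository) that verifies the 14-column header and the allowed status values:

```python
import csv

# Column order and allowed statuses from the TSV format above.
COLUMNS = [
    "exp", "description", "val_bpb", "peak_mem_gb", "tok_sec", "mfu",
    "steps", "status", "notes", "gpu_name", "baseline_sha", "watts",
    "joules_per_token", "total_energy_joules",
]
STATUSES = {"baseline", "keep", "discard", "crash", "skip"}

def validate_results_tsv(path: str) -> list:
    """Return a list of problems found in a results TSV (empty means valid)."""
    errors = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        if reader.fieldnames != COLUMNS:
            errors.append("header mismatch: %r" % (reader.fieldnames,))
        for lineno, row in enumerate(reader, start=2):
            if row.get("status") not in STATUSES:
                errors.append("line %d: bad status %r" % (lineno, row.get("status")))
    return errors
```

Running it on a candidate file (`validate_results_tsv("results_sonnet46_r1.tsv")`) returns an empty list when the header and every status value conform.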