An AI scientist that runs thousands of GPU experiments autonomously — and the code that makes it tick. No CS degree required.
Trace the journey from a single command to a trained neural network
Imagine you’re a researcher. You sit down at a computer connected to a powerful GPU, type python train.py, and walk away. Five minutes later, you have a trained language model and a score that tells you how good it is.
But what actually happened in those five minutes? Let's trace every step.
The very first thing that runs is train.py — a tiny 40-line file whose only job is to figure out what kind of GPU you have and hand off to the right specialist.
Think of it like a hospital receptionist. You walk in and say "I need help." The receptionist doesn't treat you — they figure out which department to send you to.
import os
import sys

from backends import detect_backend
from backends.registry import get_training_script

backend = detect_backend()
script_path = get_training_script(backend)
os.execv(sys.executable, [sys.executable, script_path])
Bring in the "detective" tools that can identify GPU hardware
Ask: "What GPU is installed on this machine?"
Look up: "For that GPU type, which training script should I run?"
Replace myself entirely with the specialist script — I'm done, the specialist takes over from here
This "detect and hand off" pattern is everywhere in software. Your phone does it when you tap a link — it figures out which app should open it (Safari? Chrome? YouTube?) and hands it off. When you tell an AI agent "build me a website," this pattern is how it figures out which tool to use for each step.
The detective checks hardware in a specific order, from most powerful to most common:
Intel Gaudi: specialized AI chips — rarest, but purpose-built for training
AMD ROCm: datacenter GPUs like the MI300X — 192 GB of memory, massive compute
NVIDIA CUDA: the industry standard — from RTX 4090 to H100
Apple MPS: your MacBook's built-in GPU — surprisingly capable
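What might that detective look like? The sketch below is illustrative only — the repo's real detect_backend likely inspects PyTorch directly — but it probes for each vendor's command-line tool in the same priority order, rarest hardware first:

```python
import platform
import shutil

def detect_backend() -> str:
    """Probe for each vendor's management tool, rarest hardware first."""
    if shutil.which("hl-smi"):          # Intel Gaudi's equivalent of nvidia-smi
        return "gaudi"
    if shutil.which("rocm-smi"):        # AMD's ROCm stack
        return "rocm"
    if shutil.which("nvidia-smi"):      # NVIDIA's CUDA stack
        return "cuda"
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "mps"                    # Apple Silicon's built-in GPU
    return "cpu"                        # no accelerator found
```

The ordering matters: a machine could have more than one stack installed, so the detector returns the first, most specialized match.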
Once the specialist script takes over, the same thing happens on every platform: build a small neural network, feed it text data for exactly 5 minutes, then measure how well it learned.
Every software project has characters. Here are the actors in this one.
Think of this codebase like a film production. Each character has a specific role, and the movie falls apart if any one of them goes missing.
tui/orchestrator.py: The director. Runs the experiment loop, tells everyone what to do, keeps track of results.
tui/llm_backend.py: The scientist. Analyzes past experiments and proposes what to try next.
platforms/*/train_*.py: The lab technician. Actually runs the GPU and trains the neural network.
tui/git_manager.py: The librarian. Saves every experiment as a permanent record you can revisit.
tui/results.py: The scorekeeper. Writes results to a spreadsheet that never loses data, even if the power goes out.
prepare.py: The prep cook. Downloads training data and builds the vocabulary before anyone else starts working.
Watch how the Orchestrator coordinates a single experiment:
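The cycle can be played out in miniature. Everything below is illustrative, not the project's actual API — propose_change and run_training are stand-ins for the real LLM call and the real 5-minute GPU run:

```python
from dataclasses import dataclass
import random

def propose_change(history):
    """Stand-in for the LLM scientist: propose one hyperparameter tweak."""
    return random.choice(["increase depth to 10", "lower learning rate to 0.002"])

def run_training(description):
    """Stand-in for the 5-minute training run: returns a val_bpb score."""
    return 0.95 + random.random() * 0.1

def orchestrate(n_experiments):
    """Director loop: propose, train, score, keep or discard, repeat."""
    history, best = [], float("inf")
    for exp in range(n_experiments):
        desc = propose_change(history)          # scientist proposes a change
        val_bpb = run_training(desc)            # lab technician trains
        status = "keep" if val_bpb < best else "discard"
        best = min(best, val_bpb)               # track the best score so far
        history.append((f"exp{exp}", desc, val_bpb, status))
    return history
```

Each tuple in the returned history mirrors one row of the results file: experiment id, what changed, the score, and whether the change was kept.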
The engine that runs thousands of experiments while you sleep
Traditional research: a human reads papers, forms a hypothesis, runs an experiment, analyzes results, and repeats. This system does the same thing, but the "human" is Claude, and it never needs sleep.
Each cycle takes about 6 minutes: 1 minute for Claude to think + 5 minutes of training. That means roughly 80 experiments per 8-hour overnight run.
Before each experiment, the Orchestrator sends Claude a carefully crafted message containing the experiment history and the editable code. Claude's job: change ONE thing in the hyperparameter block.
# Hyperparameters
DEPTH = 8
DEVICE_BATCH_SIZE = 64
TOTAL_BATCH_SIZE = 2 ** 17
LEARNING_RATE = 0.003
WEIGHT_DECAY = 0.05
WARMUP_ITERS = 0
COOLDOWN_FRAC = 0.4
These are the settings Claude can adjust:
DEPTH = 8 — How many layers deep the brain is (more = smarter, but slower)
DEVICE_BATCH_SIZE = 64 — How many text samples to process at once
TOTAL_BATCH_SIZE — Total samples before updating the brain
LEARNING_RATE — How aggressively to learn (too high = unstable, too low = slow)
WEIGHT_DECAY — Slight "forgetfulness" that prevents over-memorizing
WARMUP_ITERS — Steps of gentle start before full speed
COOLDOWN_FRAC — Portion of time spent gradually slowing down
Every experiment produces one row in a TSV file. This is the project's source of truth — the complete history of every experiment ever run.
exp: Experiment number (exp0, exp1, exp2...)
description: What Claude changed ("increase depth to 10")
val_bpb: The score — lower is better (bits per byte)
mfu: How efficiently the GPU was used (%)
status: keep, discard, baseline, or crash
baseline_sha: Git fingerprint linking to the exact code version

Different tokenizers split text into different-sized pieces, making raw loss numbers incomparable. Bits-per-byte converts to a universal scale: "how many bits of information do you need to predict each byte of text?" Lower means the model learned better patterns.
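The bits-per-byte conversion is simple arithmetic. A sketch, assuming the training loss is a per-token cross-entropy measured in nats (the usual PyTorch convention):

```python
import math

def bits_per_byte(mean_loss_nats, total_tokens, total_bytes):
    """Convert a per-token cross-entropy loss (in nats) to bits per byte.

    Dividing by ln(2) turns nats into bits; scaling by tokens/bytes
    normalizes away the tokenizer's choice of piece size.
    """
    bits_per_token = mean_loss_nats / math.log(2)
    return bits_per_token * (total_tokens / total_bytes)

# A model averaging 1.0 nat of loss per token, with ~4 bytes per token:
# bits_per_byte(1.0, 1000, 4000) ≈ 0.36
```

Two models with different tokenizers can now be ranked on the same axis, which is exactly what the TSV's val_bpb column relies on.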
What's actually inside the brain this system trains?
The model is a Transformer — the same family as GPT and Claude, but miniature. Where GPT-4 has hundreds of billions of parameters, this one has 1–10 million. It's small enough to train in 5 minutes, but big enough to actually learn language patterns.
Every Transformer is built from the same LEGO bricks, stacked on top of each other. Imagine a conveyor belt in a factory: raw text goes in one end, predictions come out the other, passing through specialized stations.
Embedding: Converts each word-piece into a list of numbers (a "vector") that the math can work with. Like translating English into the language of mathematics.
Attention: Each word "looks at" every other word to understand context. "Bank" means something different near "river" vs. "money." This layer figures that out.
MLP: A "thinking" layer that processes what attention found. It expands the information, applies a non-linear filter, then compresses it back down.
Output head: Converts the final numbers back into a probability for every possible next word. The highest probability wins.
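The conveyor belt can be played out in miniature with plain Python. Nothing here is the project's real model — it's a toy with random vectors and a single layer — but the data flow through the four stations is the same:

```python
import math
import random

random.seed(0)
VOCAB, DIM = 10, 4

# Station 1: embedding table maps each token id to a vector of numbers
embed = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]

def attention(vectors):
    """Station 2 (toy): each position mixes in a weighted average of all
    positions, with weights from dot-product similarity."""
    out = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) for k in vectors]
        exps = [math.exp(s) for s in scores]
        weights = [e / sum(exps) for e in exps]
        out.append([sum(w * v[d] for w, v in zip(weights, vectors))
                    for d in range(DIM)])
    return out

def mlp(vectors):
    """Station 3 (toy): a non-linear filter applied position-by-position."""
    return [[max(0.0, x) for x in v] for v in vectors]   # ReLU

def output_head(vector):
    """Station 4: similarity to every embedding, softmaxed into probabilities."""
    logits = [sum(a * b for a, b in zip(vector, e)) for e in embed]
    exps = [math.exp(l) for l in logits]
    return [e / sum(exps) for e in exps]

tokens = [3, 1, 4]
h = [embed[t] for t in tokens]        # station 1: look up vectors
h = mlp(attention(h))                 # stations 2 and 3: one "layer"
probs = output_head(h[-1])            # station 4: predict the next token
assert abs(sum(probs) - 1.0) < 1e-9   # the probabilities sum to 1
```

A real Transformer stacks many such layers and learns the numbers instead of drawing them at random, but the shape of the pipeline is identical.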
This isn't a textbook Transformer. It uses several cutting-edge techniques that improve performance on small models:
Not every layer needs to look at the full text. Alternate layers use a "short window" — like reading with a magnifying glass vs. scanning the whole page. Saves compute, barely hurts quality.
Some layers get an extra "cheat sheet" that feeds raw vocabulary information directly into the attention. Like an open-book exam instead of pure memory.
Each layer gets its own volume dial. The model learns how much to "listen" to each layer's contribution vs. the original signal.
Prevents the model from becoming "too confident." A mathematical ceiling that squashes extreme predictions, keeping the model honest.
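The "mathematical ceiling" in the last item is commonly implemented as tanh soft-capping; the cap value below is illustrative, not the project's actual setting:

```python
import math

def softcap(logit, cap=15.0):
    """Squash a raw prediction score into (-cap, cap) with tanh:
    small values pass through almost unchanged, extremes are flattened."""
    return cap * math.tanh(logit / cap)

# Modest scores are barely touched; huge ones hit the ceiling:
# softcap(1.0) ≈ 0.999, softcap(100.0) ≈ 15.0
```

Because the squashing is gradual rather than a hard clamp, gradients still flow through extreme predictions during training.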
The custom optimizer that's half math genius, half Swiss Army knife
Training a neural network is like navigating a mountainous landscape blindfolded. You can feel the slope under your feet (the gradient), but you need a strategy for which direction to step and how far. That strategy is the optimizer.
Most AI systems use one optimizer for everything. Autoresearch uses a hybrid — two different strategies combined into one, each applied to the type of parameter it works best on.
For the big matrix parameters (attention, MLP weights). Uses a fancy math trick called "orthogonalization" — imagine straightening all the learning directions so they don't interfere with each other.
For everything else (embeddings, scaling factors, biases). The industry workhorse — tracks both momentum and variance of gradients to make smooth, stable updates.
Different parameter shapes have different geometry. A 768×768 weight matrix lives in a very different mathematical space than a single scaling number. Using the right tool for each shape is like using a screwdriver for screws and a wrench for bolts, instead of hammering everything.
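The routing rule can be sketched as a simple shape check. The parameter names and the exclusion of embeddings and the output head are assumptions for illustration, based on the parameter groups this section describes — not the project's exact code:

```python
def split_param_groups(named_shapes):
    """Route each parameter to an optimizer by its shape: 2-D matrices go
    to the orthogonalizing optimizer (Muon), everything else to AdamW.
    Shapes are plain (rows, cols) tuples."""
    muon, adamw = [], []
    for name, shape in named_shapes.items():
        if len(shape) == 2 and "embed" not in name and "head" not in name:
            muon.append(name)     # attention / MLP weight matrices
        else:
            adamw.append(name)    # embeddings, output head, scales, biases
    return muon, adamw

# Hypothetical parameter inventory for a tiny model:
params = {
    "attn.qkv": (768, 768), "mlp.fc": (768, 3072),
    "tok_embed": (50304, 768), "lm_head": (768, 50304),
    "resid_scale": (8,),
}
muon, adamw = split_param_groups(params)
```

The embedding and head matrices are 2-D too, but they behave more like lookup tables than rotations of the signal, which is why hybrid setups typically hand them to AdamW anyway.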
Not all parameters learn at the same speed. The optimizer assigns different learning rates to different groups, like different speed limits for different road types:
0.04: Transformer matrices (Muon) — fast learners
0.6: Token embeddings (AdamW) — vocabulary needs lots of tuning
0.004: Output head (AdamW) — initialized near zero, learns gently
0.5: Skip-connection weights (AdamW) — quick-adjusting dials
0.005: Residual scales (AdamW) — per-layer volume, nearly fixed

How the system survives crashes, power outages, and GPU meltdowns
Imagine you start an 80-experiment run before bed. At experiment #47, the cloud provider reboots the machine, or the GPU runs out of memory, or the internet drops. Without protection, you'd lose everything and have to restart from scratch.
This system is built like a black box flight recorder — it survives almost anything.
The biggest risk is a half-written results file. If the power dies mid-write, the file could be corrupted. The fix: atomic writes.
results.tsv.tmp gets the new data
fsync() ensures the data is physically written, not just cached
os.replace() swaps the temp file for the real one in a single, uninterruptible operation
Every database, every banking system, every app that can't afford to lose data uses this same principle. The operating system guarantees that rename() either fully completes or doesn't happen at all — there's no in-between. When you tell AI to "save data safely," this is the technique you want it to use.
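The three steps above, as a sketch — the project's actual helper may differ in details, but any correct atomic_write follows this shape:

```python
import os
import tempfile

def atomic_write(path, text):
    """Write text so the file on disk is always either the old version
    or the new one — never a half-written mix."""
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(text)            # 1. write everything to a temp file
            f.flush()
            os.fsync(f.fileno())     # 2. force the bytes to physical disk
        os.replace(tmp, path)        # 3. atomic swap: all-or-nothing
    except BaseException:
        os.unlink(tmp)               # a failure leaves no debris behind
        raise
```

The temp file lives in the same directory as the target, because an atomic rename is only guaranteed within a single filesystem.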
How do you know if an overnight run is still alive? The system writes a "heartbeat" file every iteration — like a hospital monitor beeping to show the patient is still breathing.
import json
import os
import time

class Heartbeat:
    def __init__(self, path):
        self.path = path

    def update(self, status, experiment, message):
        data = {
            "status": status,
            "experiment": experiment,
            "message": message,
            "timestamp": time.time(),
            "pid": os.getpid(),
        }
        atomic_write(self.path, json.dumps(data))
Create a "heartbeat monitor" object
Every time we finish something, record a pulse:
What are we doing? ("training", "evaluating")
Which experiment number?
Exactly when? (for "how long since last beat?")
Which process is running? (to detect zombies)
Save this pulse using the atomic write technique from before
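The monitoring side is just as simple: read the pulse and check its age. A sketch, with the 10-minute threshold chosen arbitrarily for illustration (about two experiment cycles):

```python
import json
import time

def is_alive(heartbeat_path, max_age_seconds=600):
    """Return True if the run's last pulse is recent enough to trust."""
    try:
        with open(heartbeat_path) as f:
            pulse = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return False                           # no pulse file, or mid-crash
    age = time.time() - pulse["timestamp"]     # seconds since the last beat
    return age < max_age_seconds
```

Because the pulse is written atomically, a reader can never see a torn, half-written JSON file — only the previous complete pulse or the new one.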
When the system restarts after a crash, it doesn't start over. It reads the existing results file, finds the last completed experiment, cleans up any half-finished work (like orphaned git commits), and picks up where it left off.
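A sketch of that resume logic, assuming the TSV columns described earlier and experiment ids numbered exp0, exp1, and so on:

```python
import csv

def find_resume_point(tsv_path):
    """Scan the results TSV and return the next experiment number to run."""
    last = -1
    try:
        with open(tsv_path, newline="") as f:
            for row in csv.DictReader(f, delimiter="\t"):
                num = int(row["exp"].removeprefix("exp"))
                last = max(last, num)          # remember the highest id seen
    except FileNotFoundError:
        pass                                   # fresh start: no results yet
    return last + 1                            # resume with the next one
```

Since every completed experiment wrote its row atomically, the highest id in the file is guaranteed to be a fully finished experiment, so resuming one past it is always safe.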
Zoom out: how four GPU platforms become one unified research machine
The real engineering achievement here isn't any single component — it's that the same experiment can run on an Apple MacBook, an NVIDIA data center GPU, an AMD MI300X, or an Intel Gaudi chip, and the results are directly comparable.
Think of it like a recipe that works in any kitchen: gas stove, electric, induction, or campfire. The dish tastes the same because the recipe (the algorithm) is the same — only the cooking equipment (the GPU) changes.
You've just learned how a real, production AI research system works from the inside. Here's what this knowledge unlocks:
You can now say "use a dispatcher pattern" instead of "make it work on different computers." Precise vocabulary = precise results.
When something breaks, you know to check: Is it the data? The model? The optimizer? The platform layer? You can narrow down problems systematically.
Dispatcher pattern vs. monolith. Atomic writes vs. hope-for-the-best. Hybrid optimizers vs. one-size-fits-all. You have the vocabulary to choose wisely.
You're setting up autoresearch on a new AMD MI300X machine and the training runs are 4.7x slower than the same experiments on NVIDIA. You check the logs and see "torch.compile disabled." What's happening?