An AI scientist that runs thousands of GPU experiments autonomously — and the code that makes it tick. No CS degree required.
Trace the journey from a single command to a trained neural network
Imagine you’re a researcher. You sit down at a computer connected to a powerful GPU, type python train.py, and walk away. Five minutes later, you have a trained language model and a score that tells you how good it is.
But what actually happened in those five minutes? Let's trace every step.
The very first thing that runs is train.py — a tiny 40-line file whose only job is to figure out what kind of GPU you have and hand off to the right specialist.
Think of it like a hospital receptionist. You walk in and say "I need help." The receptionist doesn't treat you — they figure out which department to send you to.
import os
import sys

from backends import detect_backend
from backends.registry import get_training_script

backend = detect_backend()
script_path = get_training_script(backend)
os.execv(sys.executable, [sys.executable, script_path])
Bring in the "detective" tools that can identify GPU hardware
Ask: "What GPU is installed on this machine?"
Look up: "For that GPU type, which training script should I run?"
Replace myself entirely with the specialist script — I'm done, the specialist takes over from here
This "detect and hand off" pattern is everywhere in software. Your phone does it when you tap a link — it figures out which app should open it (Safari? Chrome? YouTube?) and hands it off. When you tell an AI agent "build me a website," this pattern is how it figures out which tool to use for each step.
The detective checks hardware in a specific order, from most powerful to most common:
Intel Gaudi: specialized AI chips — rarest, but purpose-built for training
AMD ROCm: datacenter GPUs like the MI300X — 192 GB of memory, massive compute
NVIDIA CUDA: the industry standard — from RTX 4090 to H100
Apple MPS: your MacBook's built-in GPU — surprisingly capable
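What might that detective look like? The sketch below is illustrative only — the repo's real detect_backend likely inspects PyTorch directly — but it probes for each vendor's command-line tool in the same priority order, rarest hardware first:

```python
import platform
import shutil

def detect_backend() -> str:
    """Probe for each vendor's management tool, rarest hardware first."""
    if shutil.which("hl-smi"):          # Intel Gaudi's equivalent of nvidia-smi
        return "gaudi"
    if shutil.which("rocm-smi"):        # AMD's ROCm stack
        return "rocm"
    if shutil.which("nvidia-smi"):      # NVIDIA's CUDA stack
        return "cuda"
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "mps"                    # Apple Silicon's built-in GPU
    return "cpu"                        # no accelerator found
```

The ordering matters: a machine could have more than one stack installed, so the detector returns the first, most specialized match.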
Once the specialist script takes over, the same thing happens on every platform: build a small neural network, feed it text data for exactly 5 minutes, then measure how well it learned.
Every software project has characters. Here are the actors in this one.
Think of this codebase like a film production. Each character has a specific role, and the movie falls apart if any one of them goes missing.
tui/orchestrator.py: The director. Runs the experiment loop, tells everyone what to do, keeps track of results.
tui/llm_backend.py: The scientist. Analyzes past experiments and proposes what to try next.
platforms/*/train_*.py: The lab technician. Actually runs the GPU and trains the neural network.
tui/git_manager.py: The librarian. Saves every experiment as a permanent record you can revisit.
tui/results.py: The scorekeeper. Writes results to a spreadsheet that never loses data, even if the power goes out.
prepare.py: The prep cook. Downloads training data and builds the vocabulary before anyone else starts working.
Watch how the Orchestrator coordinates a single experiment:
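The cycle can be played out in miniature. Everything below is illustrative, not the project's actual API — propose_change and run_training are stand-ins for the real LLM call and the real 5-minute GPU run:

```python
from dataclasses import dataclass
import random

def propose_change(history):
    """Stand-in for the LLM scientist: propose one hyperparameter tweak."""
    return random.choice(["increase depth to 10", "lower learning rate to 0.002"])

def run_training(description):
    """Stand-in for the 5-minute training run: returns a val_bpb score."""
    return 0.95 + random.random() * 0.1

def orchestrate(n_experiments):
    """Director loop: propose, train, score, keep or discard, repeat."""
    history, best = [], float("inf")
    for exp in range(n_experiments):
        desc = propose_change(history)          # scientist proposes a change
        val_bpb = run_training(desc)            # lab technician trains
        status = "keep" if val_bpb < best else "discard"
        best = min(best, val_bpb)               # track the best score so far
        history.append((f"exp{exp}", desc, val_bpb, status))
    return history
```

Each tuple in the returned history mirrors one row of the results file: experiment id, what changed, the score, and whether the change was kept.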
The engine that runs thousands of experiments while you sleep
Traditional research: a human reads papers, forms a hypothesis, runs an experiment, analyzes results, and repeats. This system does the same thing, but the "human" is Claude, and it never needs sleep.
Each cycle takes about 6 minutes: 1 minute for Claude to think + 5 minutes of training. That means roughly 80 experiments per 8-hour overnight run.
Before each experiment, the Orchestrator sends Claude a carefully crafted message containing the experiment history and the editable code. Claude's job: change ONE thing in the hyperparameter block.
# Hyperparameters
DEPTH = 8
DEVICE_BATCH_SIZE = 64
TOTAL_BATCH_SIZE = 2 ** 17
LEARNING_RATE = 0.003
WEIGHT_DECAY = 0.05
WARMUP_ITERS = 0
COOLDOWN_FRAC = 0.4
These are the settings Claude can adjust:
DEPTH = 8 — How many layers deep the brain is (more = smarter, but slower)
DEVICE_BATCH_SIZE = 64 — How many text samples to process at once
TOTAL_BATCH_SIZE — Total samples before updating the brain
LEARNING_RATE — How aggressively to learn (too high = unstable, too low = slow)
WEIGHT_DECAY — Slight "forgetfulness" that prevents over-memorizing
WARMUP_ITERS — Steps of gentle start before full speed
COOLDOWN_FRAC — Portion of time spent gradually slowing down
Every experiment produces one row in a TSV file. This is the project's source of truth — the complete history of every experiment ever run.
exp: Experiment number (exp0, exp1, exp2...)
description: What Claude changed ("increase depth to 10")
val_bpb: The score — lower is better (bits per byte)
mfu: How efficiently the GPU was used (%)
status: keep, discard, baseline, or crash
baseline_sha: Git fingerprint linking to the exact code version

Different tokenizers split text into different-sized pieces, making raw loss numbers incomparable. Bits-per-byte converts to a universal scale: "how many bits of information do you need to predict each byte of text?" Lower means the model learned better patterns.
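The bits-per-byte conversion is simple arithmetic. A sketch, assuming the training loss is a per-token cross-entropy measured in nats (the usual PyTorch convention):

```python
import math

def bits_per_byte(mean_loss_nats, total_tokens, total_bytes):
    """Convert a per-token cross-entropy loss (in nats) to bits per byte.

    Dividing by ln(2) turns nats into bits; scaling by tokens/bytes
    normalizes away the tokenizer's choice of piece size.
    """
    bits_per_token = mean_loss_nats / math.log(2)
    return bits_per_token * (total_tokens / total_bytes)

# A model averaging 1.0 nat of loss per token, with ~4 bytes per token:
# bits_per_byte(1.0, 1000, 4000) ≈ 0.36
```

Two models with different tokenizers can now be ranked on the same axis, which is exactly what the TSV's val_bpb column relies on.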
What's actually inside the brain this system trains?
The model is a Transformer — the same family as GPT and Claude, but miniature. Where GPT-4 has hundreds of billions of parameters, this one has 1–10 million. It's small enough to train in 5 minutes, but big enough to actually learn language patterns.
Every Transformer is built from the same LEGO bricks, stacked on top of each other. Imagine a conveyor belt in a factory: raw text goes in one end, predictions come out the other, passing through specialized stations.
Embedding: Converts each word-piece into a list of numbers (a "vector") that the math can work with. Like translating English into the language of mathematics.
Attention: Each word "looks at" every other word to understand context. "Bank" means something different near "river" vs. "money." This layer figures that out.
MLP: A "thinking" layer that processes what attention found. It expands the information, applies a non-linear filter, then compresses it back down.
Output head: Converts the final numbers back into a probability for every possible next word. The highest probability wins.
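The conveyor belt can be played out in miniature with plain Python. Nothing here is the project's real model — it's a toy with random vectors and a single layer — but the data flow through the four stations is the same:

```python
import math
import random

random.seed(0)
VOCAB, DIM = 10, 4

# Station 1: embedding table maps each token id to a vector of numbers
embed = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]

def attention(vectors):
    """Station 2 (toy): each position mixes in a weighted average of all
    positions, with weights from dot-product similarity."""
    out = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) for k in vectors]
        exps = [math.exp(s) for s in scores]
        weights = [e / sum(exps) for e in exps]
        out.append([sum(w * v[d] for w, v in zip(weights, vectors))
                    for d in range(DIM)])
    return out

def mlp(vectors):
    """Station 3 (toy): a non-linear filter applied position-by-position."""
    return [[max(0.0, x) for x in v] for v in vectors]   # ReLU

def output_head(vector):
    """Station 4: similarity to every embedding, softmaxed into probabilities."""
    logits = [sum(a * b for a, b in zip(vector, e)) for e in embed]
    exps = [math.exp(l) for l in logits]
    return [e / sum(exps) for e in exps]

tokens = [3, 1, 4]
h = [embed[t] for t in tokens]        # station 1: look up vectors
h = mlp(attention(h))                 # stations 2 and 3: one "layer"
probs = output_head(h[-1])            # station 4: predict the next token
assert abs(sum(probs) - 1.0) < 1e-9   # the probabilities sum to 1
```

A real Transformer stacks many such layers and learns the numbers instead of drawing them at random, but the shape of the pipeline is identical.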
This isn't a textbook Transformer. It uses several cutting-edge techniques that improve performance on small models:
Not every layer needs to look at the full text. Alternate layers use a "short window" — like reading with a magnifying glass vs. scanning the whole page. Saves compute, barely hurts quality.
Some layers get an extra "cheat sheet" that feeds raw vocabulary information directly into the attention. Like an open-book exam instead of pure memory.
Each layer gets its own volume dial. The model learns how much to "listen" to each layer's contribution vs. the original signal.
Prevents the model from becoming "too confident." A mathematical ceiling that squashes extreme predictions, keeping the model honest.
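The "mathematical ceiling" in the last item is commonly implemented as tanh soft-capping; the cap value below is illustrative, not the project's actual setting:

```python
import math

def softcap(logit, cap=15.0):
    """Squash a raw prediction score into (-cap, cap) with tanh:
    small values pass through almost unchanged, extremes are flattened."""
    return cap * math.tanh(logit / cap)

# Modest scores are barely touched; huge ones hit the ceiling:
# softcap(1.0) ≈ 0.999, softcap(100.0) ≈ 15.0
```

Because the squashing is gradual rather than a hard clamp, gradients still flow through extreme predictions during training.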
The custom optimizer that's half math genius, half Swiss Army knife
Training a neural network is like navigating a mountainous landscape blindfolded. You can feel the slope under your feet (the gradient), but you need a strategy for which direction to step and how far. That strategy is the optimizer.
Most AI systems use one optimizer for everything. Autoresearch uses a hybrid — two different strategies combined into one, each applied to the type of parameter it works best on.
For the big matrix parameters (attention, MLP weights). Uses a fancy math trick called "orthogonalization" — imagine straightening all the learning directions so they don't interfere with each other.
For everything else (embeddings, scaling factors, biases). The industry workhorse — tracks both momentum and variance of gradients to make smooth, stable updates.
Different parameter shapes have different geometry. A 768×768 weight matrix lives in a very different mathematical space than a single scaling number. Using the right tool for each shape is like using a screwdriver for screws and a wrench for bolts, instead of hammering everything.
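The routing rule can be sketched as a simple shape check. The parameter names and the exclusion of embeddings and the output head are assumptions for illustration, based on the parameter groups this section describes — not the project's exact code:

```python
def split_param_groups(named_shapes):
    """Route each parameter to an optimizer by its shape: 2-D matrices go
    to the orthogonalizing optimizer (Muon), everything else to AdamW.
    Shapes are plain (rows, cols) tuples."""
    muon, adamw = [], []
    for name, shape in named_shapes.items():
        if len(shape) == 2 and "embed" not in name and "head" not in name:
            muon.append(name)     # attention / MLP weight matrices
        else:
            adamw.append(name)    # embeddings, output head, scales, biases
    return muon, adamw

# Hypothetical parameter inventory for a tiny model:
params = {
    "attn.qkv": (768, 768), "mlp.fc": (768, 3072),
    "tok_embed": (50304, 768), "lm_head": (768, 50304),
    "resid_scale": (8,),
}
muon, adamw = split_param_groups(params)
```

The embedding and head matrices are 2-D too, but they behave more like lookup tables than rotations of the signal, which is why hybrid setups typically hand them to AdamW anyway.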
Not all parameters learn at the same speed. The optimizer assigns different learning rates to different groups, like different speed limits for different road types:
0.04: Transformer matrices (Muon) — fast learners
0.6: Token embeddings (AdamW) — vocabulary needs lots of tuning
0.004: Output head (AdamW) — initialized near zero, learns gently
0.5: Skip-connection weights (AdamW) — quick-adjusting dials
0.005: Residual scales (AdamW) — per-layer volume, nearly fixed

How the system survives crashes, power outages, and GPU meltdowns
Imagine you start an 80-experiment run before bed. At experiment #47, the cloud provider reboots the machine, or the GPU runs out of memory, or the internet drops. Without protection, you'd lose everything and have to restart from scratch.
This system is built like a black box flight recorder — it survives almost anything.
The biggest risk is a half-written results file. If the power dies mid-write, the file could be corrupted. The fix: atomic writes.
results.tsv.tmp gets the new data
fsync() ensures the data is physically written, not just cached
os.replace() swaps the temp file for the real one in a single, uninterruptible operation
Every database, every banking system, every app that can't afford to lose data uses this same principle. The operating system guarantees that rename() either fully completes or doesn't happen at all — there's no in-between. When you tell AI to "save data safely," this is the technique you want it to use.
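The three steps above, as a sketch — the project's actual helper may differ in details, but any correct atomic_write follows this shape:

```python
import os
import tempfile

def atomic_write(path, text):
    """Write text so the file on disk is always either the old version
    or the new one — never a half-written mix."""
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(text)            # 1. write everything to a temp file
            f.flush()
            os.fsync(f.fileno())     # 2. force the bytes to physical disk
        os.replace(tmp, path)        # 3. atomic swap: all-or-nothing
    except BaseException:
        os.unlink(tmp)               # a failure leaves no debris behind
        raise
```

The temp file lives in the same directory as the target, because an atomic rename is only guaranteed within a single filesystem.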
How do you know if an overnight run is still alive? The system writes a "heartbeat" file every iteration — like a hospital monitor beeping to show the patient is still breathing.
import json
import os
import time

class Heartbeat:
    def __init__(self, path):
        self.path = path

    def update(self, status, experiment, message):
        data = {
            "status": status,
            "experiment": experiment,
            "message": message,
            "timestamp": time.time(),
            "pid": os.getpid(),
        }
        atomic_write(self.path, json.dumps(data))
Create a "heartbeat monitor" object
Every time we finish something, record a pulse:
What are we doing? ("training", "evaluating")
Which experiment number?
Exactly when? (for "how long since last beat?")
Which process is running? (to detect zombies)
Save this pulse using the atomic write technique from before
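The monitoring side is just as simple: read the pulse and check its age. A sketch, with the 10-minute threshold chosen arbitrarily for illustration (about two experiment cycles):

```python
import json
import time

def is_alive(heartbeat_path, max_age_seconds=600):
    """Return True if the run's last pulse is recent enough to trust."""
    try:
        with open(heartbeat_path) as f:
            pulse = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return False                           # no pulse file, or mid-crash
    age = time.time() - pulse["timestamp"]     # seconds since the last beat
    return age < max_age_seconds
```

Because the pulse is written atomically, a reader can never see a torn, half-written JSON file — only the previous complete pulse or the new one.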
When the system restarts after a crash, it doesn't start over. It reads the existing results file, finds the last completed experiment, cleans up any half-finished work (like orphaned git commits), and picks up where it left off.
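A sketch of that resume logic, assuming the TSV columns described earlier and experiment ids numbered exp0, exp1, and so on:

```python
import csv

def find_resume_point(tsv_path):
    """Scan the results TSV and return the next experiment number to run."""
    last = -1
    try:
        with open(tsv_path, newline="") as f:
            for row in csv.DictReader(f, delimiter="\t"):
                num = int(row["exp"].removeprefix("exp"))
                last = max(last, num)          # remember the highest id seen
    except FileNotFoundError:
        pass                                   # fresh start: no results yet
    return last + 1                            # resume with the next one
```

Since every completed experiment wrote its row atomically, the highest id in the file is guaranteed to be a fully finished experiment, so resuming one past it is always safe.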
Zoom out: how four GPU platforms become one unified research machine
The real engineering achievement here isn't any single component — it's that the same experiment can run on an Apple MacBook, an NVIDIA data center GPU, an AMD MI300X, or an Intel Gaudi chip, and the results are directly comparable.
Think of it like a recipe that works in any kitchen: gas stove, electric, induction, or campfire. The dish tastes the same because the recipe (the algorithm) is the same — only the cooking equipment (the GPU) changes.
You've just learned how a real, production AI research system works from the inside. Here's what this knowledge unlocks:
You can now say "use a dispatcher pattern" instead of "make it work on different computers." Precise vocabulary = precise results.
When something breaks, you know to check: Is it the data? The model? The optimizer? The platform layer? You can narrow down problems systematically.
Dispatcher pattern vs. monolith. Atomic writes vs. hope-for-the-best. Hybrid optimizers vs. one-size-fits-all. You have the vocabulary to choose wisely.
You're setting up autoresearch on a new AMD MI300X machine and the training runs are 4.7x slower than the same experiments on NVIDIA. You check the logs and see "torch.compile disabled." What's happening?