Roadmap to Mastery: ML / LLM Research

Overview & philosophy

Most people don't fail at this for lack of resources. If anything, there are too many. They fail because they learn in the wrong shape. So the philosophy below matters more than the curriculum does.

The honest premise

Mastery is years, not months. Be suspicious of anyone selling "AI researcher in 8 weeks." Be a little suspicious of this roadmap too. It's a map, not a contract.
But useful competence is months. You can build a GPT from scratch and read most papers in about 3 to 6 months. You can run a clean extension of a published result in about 6 to 12.
"Researcher" is a ladder, not a binary. You climb it by shipping artifacts, not by finishing courses.
The bottleneck is reps and taste, not information. Everything you need is free. What's scarce is the discipline to build, reproduce, and write. Over and over, and in public.

The 5 principles

1 · Build first, theory just-in-time.

Don't spend 6 months on linear algebra before touching a model. Build, hit a wall, learn exactly that wall's theory, continue.

2 · Spiral, don't sequence.

The phases are a competence map, not a strict order. A vertical slice early beats a perfect horizontal foundation.

3 · Artifacts over courses.

A finished course is worth nothing; a reproduced result with a writeup is worth a lot. Every phase ends in something shippable.

4 · Reproduce before you innovate.

You earn the right to have ideas by first reproducing others'. 80% of the real learning lives here.

5 · Learn in public.

A private learner is an invisible one, with no feedback loop. So push to GitHub, write notes, join a community. This is the feedback mechanism and the network. It's not vanity. It's how you get better.

The cardinal sin of empirical ML is self-deception. You believe a result that isn't real, because you wanted it to be true. Almost everything in the research-method section is a defense against that one failure.


The levels, and what "researcher" actually means

Define your target concretely. Each level is a capability you reach by a deliverable, not a credential. Your near-term target is L2.

L0 · Tourist

Runs notebooks, calls APIs, can't explain internals.

where most stop

L1 · Reproducer

Builds and trains a small model from a blank file. Reads most papers. Reproduces a simple result.

3–6 mo · Phases 1–2

L2 · Extender

Runs a clean ablation or extension of a paper. Knows the pretrain, post-train and eval stack. Ships a reproducible artifact.

6–12 mo · Phases 3–4

L3 · Contributor

Produces a small original result others use; engaged in a community; a workshop paper or serious blog result.

1–2 yr · Phase 5

L4 · Independent

Sets their own research direction; produces recognized original work.

2–4+ yr

* at about 10 to 15 focused hours a week. The levels are the point, not the clock. The L0 to L4 scheme and the timelines are my own framing, not a cited standard. See Provenance.


The mental model

A modern language model, end to end. Every phase zooms into one part of this pipeline. Keep it in your head and you'll understand a little more of it each week.

text (input) → tokenizer (Phase 2) → embeddings (Phase 2) → transformer blocks, attention + MLP ×N (Phase 1–2) → logits → next token (output)

It's trained by next-token prediction on a big corpus (Phase 3 · pretraining). It's shaped by SFT and RL on curated data (Phase 4 · post-training). It's measured by evals that lie to you in subtle ways (Phase 4 · eval). And making one box better than anyone else at small scale is the whole game (Phase 5 · research).


The phase map

Phase 0 · Orientation

Environment, mental model, first vertical slice.

deliverable: a trained tiny model, day 1 · ~1 wk

Phase 1 · Foundations

Backprop + neural nets + ML from scratch; math just-in-time.

deliverable: micrograd + makemore, reimplemented · 4–8 wk

Phase 2 · Transformers & LLMs

Attention, build + train a GPT, the full small-LLM stack.

deliverable: a GPT trained from a blank file · 4–8 wk

Phase 3 · Training & systems

Scaling laws, efficiency, kernels, data, then reproduce a result.

deliverable: one clean reproduction + ablation (your L2 credential) · 8–12 wk

Phase 4 · Post-training & eval

SFT, RLHF/DPO/GRPO, evaluation as a discipline.

deliverable: a post-trained small model + honest eval · 4–8 wk

Phase 5 · Specialization & research

A niche + your first original result + community + publishing.

deliverable: a shipped original artifact · ongoing

Running through every phase is the research method. Start it in week 1, not at the end.


Shared core + branches

ML research isn't one path. It's a shared trunk with several branches, and this roadmap is built that way:

🌳 The trunk (universal)

Phase 0–1 plus the research method. Backprop, optimization, neural nets, the craft of research. You need these whatever you specialize in.

🔠 LLM / small-models (your primary)

Phases 2–5. Transformers → language modeling → post-training → small models.

↺ RL branch

A sibling off the same trunk, with its own theory: MDPs → policy gradients → PPO → RLHF/GRPO/RLVR.

🔀 They merge at the frontier

LLM post-training (Phase 4) is RL applied to language models. The branches meet at RLVR/GRPO.

Quick correction. RL is one of the three classic ML paradigms (supervised, unsupervised or self-supervised, reinforcement), not the "basis" of ML. The basis is the trunk above. You pick a branch at the specialization stage. The trunk is shared.


The weekly cadence

The habit that compounds matters more than any single course. Four habits, every week, every phase. Miss the courses, keep the loop.

🔨

Build / reproduce

one slice of the current phase's project, and most of your hours.

📄

Read 2–3 papers

figures-first, each with a one-paragraph note.

✍️

Write one note

a log entry, a "what I got stuck on," a short explainer.

💬

Engage once

post a result, ask/answer in a community, read others' work.

Orientation

≈ 1 week · Kill the "where do I start" paralysis by shipping a trained model in your first few days. You're a strong engineer, so the trap isn't capability. The trap is spending three weeks "preparing to learn." Don't.

Do this, in order

Environment (½ day). PyTorch, and either a consumer GPU or free Colab/Kaggle. Don't over-build it.
First vertical slice (1–2 days). Run fast.ai Lesson 1, or train nanoGPT on tiny-shakespeare. Watch a loss curve drop. You won't understand most of it, and that's fine.
Public scaffolding (½ day). A GitHub repo, a LOG.md research log (the single highest-ROI habit here), and a paper-notes file.
Join one community. The EleutherAI Discord. Just lurk for now.
Set your target. Write "my 6-month target is L2" in your log.

Milestone

A loss curve you produced, in your repo.
First LOG.md entry written.
First paper skimmed figures-first (try TinyStories) + 3-sentence note.
EleutherAI joined.

Traps. Tooling rabbit holes. Telling yourself "I need to understand it first" (no, run it, then understand it). Skipping the log.


Foundations

≈ 4–8 weeks · Understand, and implement from a blank file, the machinery under every model: backprop, gradient descent, a neural net, core ML. Your gap isn't coding. It's the intuition for why nets train, and the ML grammar that makes experiments interpretable.

What to learn (priority order)

Backpropagation. The chain rule on a computation graph. The single most important thing here. Implement reverse-mode autodiff and most of DL stops being magic.
Gradient descent & optimizers. SGD, momentum, Adam. Learning rate is the thing that matters most.
A neural net from scratch. MLP, activations, initialization.
Core ML. Cross-entropy, train/val/test, overfitting, regularization, bias–variance. The grammar of every experiment.
Math, just-in-time. Linear algebra, the chain rule, probability/cross-entropy/KL. Learn each one when a model forces you to.
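To make the backprop item concrete, here is a minimal sketch of reverse-mode autodiff on scalars. It is written in the spirit of a tiny autodiff engine, but it is not micrograd's actual code: the `Value` class, its method names, and the choice of operations (add, multiply, tanh) are assumptions made for illustration.

```python
import math


class Value:
    """A scalar that remembers how it was computed, so gradients can flow back."""

    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # leaf nodes have nothing to propagate

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))

        def _backward():
            # d tanh(x)/dx = 1 - tanh(x)^2
            self.grad += (1.0 - t * t) * out.grad

        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph so each node's gradient is complete
        # before it is pushed on to its parents.
        order, seen = [], set()

        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()


# One neuron: y = tanh(w * x + b)
x, w, b = Value(2.0), Value(-0.5), Value(0.3)
y = (w * x + b).tanh()
y.backward()
```

After `y.backward()`, `w.grad` holds dy/dw, which by the chain rule is `(1 - tanh(w*x + b)**2) * x`. The gradients accumulate with `+=` so that a value used in more than one place receives the sum of its contributions.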

Primary path

Karpathy's Neural Networks: Zero to Hero is the spine. Do the exercises. Build micrograd (a tiny autodiff engine, which is backprop itself) and makemore (a char-level LM). For breadth, keep fast.ai or d2l.ai handy. For math, look things up in Mathematics for ML when you need them. And when a concept won't click, watch the matching chapter of 3Blue1Brown's Neural Networks series. It's the best visual intuition for nets and backprop there is.

Deliverable. Reimplement micrograd and makemore from memory (not copy-paste), pushed with READMEs. Bonus: write a "what backprop actually computes" explainer. Teaching it is how you find the holes in your understanding.

Milestone: you've finished when you can…

Implement reverse-mode autodiff from a blank file, and hand-derive one gradient to check it.
Build, train, and debug an MLP without copying (diagnose a bad LR from the curve).
Explain cross-entropy, the val split, and overfitting plus three fixes, in your own words.
Read an empirical ML paper and follow its training setup.
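One way to "hand-derive one gradient to check it" is to compare the derived formula against finite differences. The sketch below does that for softmax cross-entropy, whose gradient with respect to the logits is softmax(logits) minus the one-hot target. The function names and the example logits are assumptions for illustration.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability of the target class."""
    return -math.log(softmax(logits)[target])


def analytic_grad(logits, target):
    """Hand-derived gradient: softmax(logits) minus the one-hot target."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]


def numeric_grad(logits, target, eps=1e-5):
    """Central finite differences, perturbing one logit at a time."""
    grads = []
    for i in range(len(logits)):
        up, down = list(logits), list(logits)
        up[i] += eps
        down[i] -= eps
        grads.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * eps))
    return grads


logits, target = [2.0, -1.0, 0.5], 0
max_gap = max(
    abs(a - n)
    for a, n in zip(analytic_grad(logits, target), numeric_grad(logits, target))
)
```

If `max_gap` is tiny, the derivation and the code agree. If it isn't, one of them is wrong, and finding out which is the exercise.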

Traps. Watching instead of building (close the video and rebuild from a blank file). Going down a math-first detour. Copy-pasting your way to a fake "done."


Transformers & LLMs from scratch

≈ 4–8 weeks · Build and train a GPT from a blank file, and understand every component. This is the core of modern model research. It's the difference between "I fine-tuned a model" and "I can reason about why it behaves the way it does."

What to learn

Self-attention. Q/K/V, multi-head, causal masking. The centerpiece, so implement it from scratch.
The transformer block. Attention + MLP, residuals, LayerNorm/RMSNorm.
Tokenization. BPE, and how the tokenizer shapes everything downstream (an underrated source of bugs).
Positional info. Learned vs RoPE, and why attention needs it (it's permutation-equivariant without it).
The full pipeline. Pretrain → SFT → eval, end to end, once.
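As a rough illustration of the self-attention item, here is single-head scaled dot-product attention with a causal mask in plain Python. It is a deliberately stripped-down sketch: it assumes Q, K and V are already projected, and it leaves out multiple heads, batching, and learned weights. The function name and the toy inputs are made up for the example.

```python
import math


def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v: lists of T vectors (lists of floats); q and k share dimension d.
    Returns (outputs, weights), where weights[t][s] is how much position t
    attends to position s.
    """
    d = len(q[0])
    seq_len = len(q)
    outputs, weights = [], []
    for t in range(seq_len):
        # Causal mask: position t may only look at positions 0..t.
        scores = [
            sum(qi * ki for qi, ki in zip(q[t], k[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        m = max(scores)
        exps = [math.exp(sc - m) for sc in scores]
        total = sum(exps)
        # Future positions get exactly zero weight.
        row = [e / total for e in exps] + [0.0] * (seq_len - t - 1)
        weights.append(row)
        # Each output is the attention-weighted average of the value vectors.
        outputs.append(
            [sum(row[s] * v[s][j] for s in range(seq_len)) for j in range(len(v[0]))]
        )
    return outputs, weights


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, attn = causal_attention(q, k, v)
```

The first position can only attend to itself, so `out[0]` equals `v[0]`. Printing `attn` row by row shows the lower-triangular pattern that causal masking produces.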

Primary path

Karpathy's "Let's build GPT" and nanoGPT get you attention from a blank file. Sebastian Raschka's Build an LLM (From Scratch) is the deeper companion. Then run nanochat once to see the whole modern stack end to end. The model it makes is weak, but running the pipeline is the point. And for a visual feel of how attention moves information around, read Jay Alammar's The Illustrated Transformer next to the code.

Deliverable. A from-scratch GPT, plus one question you answered empirically (say, "how does final loss change with depth at fixed params?"). That shift from building to investigating is what makes this Phase 2 and not Phase 0.

Milestone

Implement multi-head causal self-attention from a blank file; explain every line.
Explain why attention needs positional info and how RoPE provides it.
Describe what a tokenizer does + one way it can hurt quality.
Train a GPT end-to-end; reproduce nanochat's pipeline once.
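For the RoPE milestone, the property worth seeing with your own eyes is that the attention score between a rotated query and a rotated key depends only on their relative offset. The sketch below assumes the adjacent-pair convention and a base of 10000; real implementations differ in how they pair dimensions, and the vectors here are arbitrary.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # earlier pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (3 positions apart), very different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
```

`score_near` and `score_far` agree up to floating-point error, because rotating both vectors by the same extra angle leaves their dot product unchanged. Without any positional signal, the score would not depend on position at all.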

Traps. Treating attention as a formula to memorize (implement it, then visualize it). Skipping the tokenizer. Stopping at "it trains" instead of "a question answered."


Training & systems

≈ 8–12 weeks · The longest, deepest phase, and the one where your "massive gap" really closes. Learn how real models are trained efficiently, then reproduce a published result and run one clean ablation. This is your L1 to L2 transition.

What to learn

Scaling laws. Chinchilla: about 20 tokens per param is compute-optimal. The basis for "small but well-trained."
Efficiency and GPU systems. bf16, MFU, FlashAttention, a reading-level grasp of kernels (Triton) and parallelism. Know where time and memory go.
Data. Curation, filtering, dedup. It often beats architecture, and it's the highest-leverage, least-glamorous variable.
Optimization at scale. LR schedules, warmup, AdamW, Muon, gradient accumulation.
The small-models toolkit. Quantization (GPTQ/AWQ), distillation, pruning, efficient architectures.
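The Chinchilla rule of thumb turns into arithmetic quickly. The sketch below combines the roughly 20 tokens per parameter figure from the list above with the common approximation of about 6 FLOPs per parameter per token for training. That 6ND rule is an assumption brought in here, not something this roadmap states, and the 124M model size is a hypothetical example.

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Rule-of-thumb compute-optimal token budget for a given model size."""
    return n_params * tokens_per_param


def training_flops(n_params, n_tokens):
    """Common approximation: about 6 FLOPs per parameter per token
    (forward plus backward pass), ignoring attention-specific terms."""
    return 6 * n_params * n_tokens


n_params = 124_000_000  # hypothetical GPT-2-small-sized model
n_tokens = chinchilla_tokens(n_params)
flops = training_flops(n_params, n_tokens)
print(f"{n_tokens:.2e} tokens, {flops:.2e} FLOPs")
# prints: 2.48e+09 tokens, 1.85e+18 FLOPs
```

Treat both numbers as order-of-magnitude estimates. They are good enough to tell you whether a run fits on one GPU in a weekend or not.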

Primary path

The flagship here is Stanford CS336, Language Modeling from Scratch (Spring 2026). The lectures are free and the assignments are public: tokenizer, FlashAttention2 in Triton, distributed training, Common-Crawl data, SFT and RL. Do the assignments. Then read the modded-nanogpt commit history like a textbook.

Deliverable (the important one). Reproduce a result on one GPU, then run ONE clean ablation (say, Muon vs AdamW) with at least 3 seeds and a same-size baseline. Ship the repo, the eval harness, a plot, and a writeup. This artifact is your L2 credential. Pressure-test the design with the research-buddy first.

Milestone

Explain Chinchilla and MFU; estimate training cost on your hardware.
Read a profiler trace and locate the bottleneck (compute/memory/IO).
Implement or clearly explain FlashAttention's idea (IO-aware exact attention).
Reproduce a published result on one GPU + a controlled ablation.
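The "estimate training cost" milestone can be sketched as follows. MFU is taken here as model FLOPs per second divided by the hardware's peak FLOPs per second, with model FLOPs estimated by the common 6 FLOPs per parameter per token approximation. Every number below is a placeholder, not a measurement or a spec for any real GPU.

```python
def model_flops_per_s(n_params, tokens_per_s):
    """Useful FLOPs per second, using ~6 FLOPs per parameter per token."""
    return 6 * n_params * tokens_per_s


def mfu(achieved, peak):
    """Model FLOPs utilization: the fraction of peak throughput actually used."""
    return achieved / peak


def train_hours(n_params, n_tokens, peak_flops_per_s, utilization):
    """Wall-clock estimate for a full run at a given utilization."""
    total_flops = 6 * n_params * n_tokens
    return total_flops / (peak_flops_per_s * utilization) / 3600


# All values are hypothetical; substitute your own measurements.
PEAK = 100e12          # assumed peak bf16 throughput in FLOPs/s
N_PARAMS = 124e6       # assumed model size
TOKENS_PER_S = 50_000  # assumed measured training throughput

u = mfu(model_flops_per_s(N_PARAMS, TOKENS_PER_S), PEAK)
hours = train_hours(N_PARAMS, 20 * N_PARAMS, PEAK, u)
```

With these placeholder inputs the utilization comes out around 0.37 and the run at roughly 14 hours. The useful habit is measuring tokens per second early, so the estimate comes from your hardware and not from a spec sheet.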

Traps. Lectures without assignments. Chasing the 8×H100 speedrun record. Confounded ablations (vary one thing, match the baseline, seed it). Under-weighting data.


Post-training & evaluation

≈ 4–8 weeks · How a raw pretrained model becomes useful, and how to measure it without fooling yourself. Most applied model research today is post-training and eval, and evaluation is the most under-respected skill in the field.

What to learn

SFT / instruction tuning. Use LoRA/QLoRA to do it cheaply on one GPU.
Preference and RL. RLHF, DPO (no separate reward model), GRPO (DeepSeek's method, which drops the value critic), and RLVR (RL from verifiable rewards, the reasoning frontier).
Reward modeling. Reward hacking, and why a verifier with skin in the game can't be fair.
Evaluation as a discipline. Contamination, prompt sensitivity, metric≠behavior, comparative > absolute. The most important sub-topic here.
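To show why DPO needs no separate reward model, here is its loss for a single preference pair, written out by hand. The inputs are summed log-probabilities of each full response under the policy and under a frozen reference model. The function and argument names are illustrative assumptions, not TRL's API, and `beta=0.1` is just a commonly seen default.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses."""
    # Implicit rewards: how much more the policy likes each response
    # than the reference model does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), computed in a numerically stable way.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: no preference expressed yet.
loss_at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has shifted probability toward the chosen response.
loss_after_shift = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

At the start the margin is zero and the loss is log 2. It falls as the policy raises the chosen response relative to the rejected one, measured against the reference. The "reward" is implicit in those log-probability ratios.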

Primary path

Your spine is Nathan Lambert's The RLHF Book (free online): the canonical recipe, DPO, the RLVR renaissance, reward modeling, eval. Use Hugging Face TRL to actually post-train a small model.

Deliverable. Post-train a 0.5 to 1.5B model (SFT with LoRA, then DPO, or a small GRPO with a verifiable reward) and evaluate it honestly: a clean harness, a same-size baseline, and a stated way your reward could be gamed plus what the metric misses.

Milestone

Explain SFT vs DPO vs GRPO. What each optimizes, and when to use which.
Describe reward hacking with a concrete example.
Name ≥3 ways an eval can lie.
Post-train a small model with TRL, plus an honest evaluation with a baseline.
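One of the ways an eval can lie is contamination, where eval items also appear in the training data. A crude first check is word-level n-gram overlap, sketched below. The helper names, the whitespace tokenization, and the default `n=8` are assumptions for illustration; real decontamination pipelines normalize text more carefully and pick thresholds per benchmark.

```python
def ngrams(text, n):
    """Set of word-level n-grams in `text` (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(eval_examples, train_docs, n=8):
    """Fraction of eval examples sharing at least one n-gram with training text."""
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    flagged = [ex for ex in eval_examples if ngrams(ex, n) & train_grams]
    return len(flagged) / len(eval_examples), flagged


# Toy strings, invented for the example.
train_docs = ["the quick brown fox jumps over the lazy dog near the river bank"]
eval_examples = [
    "Complete: the quick brown fox jumps over the fence",
    "What is two plus two equal to in base ten",
]
rate, flagged = contamination_rate(eval_examples, train_docs, n=4)
```

Here the first eval example is flagged because it shares "the quick brown fox" with the training text. A zero rate from a check like this does not prove an eval is clean, since paraphrased or translated leaks pass straight through.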

Traps. Trusting your own numbers (assume the eval lies until you've checked). Reaching for GRPO first (earn it with SFT and DPO). Letting a model grade its own work. Metric tunnel-vision.


Specialization & doing research

ongoing · Stop following a curriculum and start doing research. Pick a niche, produce your first original result, engage a community, ship it publicly. That shift, from being handed a task to choosing your own, is the whole definition of a researcher.

1 · Pick a niche (go deep, not wide)

Choose one where you can build a real result on one GPU: efficiency (quant/distill/prune), mechanistic interpretability (small models are interpretable; the community rewards small clean results, so see ARENA + TransformerLens), data curation, or small reasoning / post-training. Your on-ramp doc is the detailed specialization guide.

2 · Produce your first original result

Reproduce something in your niche (the launchpad now, not the goal).
Find the open thread. The cheap ablation the paper didn't run. Reading deeply is idea generation.
Scope it with the research-buddy: prior-art + feasibility + confound.
Run it cleanly. One axis, ≥3 seeds, same-size baseline.
Ship the artifact. Open weights/code + reproducible eval + honest baselines + writeup. Your L3 credential.

3 · Get visible & find your people

Open-source-as-research is the modern credential. EleutherAI (#research + the SOAR program) is the highest-ROI single move. It gets you collaborators, granted compute, and the arXiv-endorsement gate, all solved at once. Realistic venues: the ICLR Blog Posts track, NeurIPS ENLSP workshop, the ML Reproducibility Challenge.

Milestone: you're operating as a researcher (L3) when…

You've shipped an original result with open code + honest eval, publicly.
Someone you don't know has used/cited/built on it.
You're active in a research community.
You can read a frontier paper and immediately see the next experiment.
You choose your own questions.

Traps. Niche-hopping (depth compounds, breadth doesn't). Reaching for novelty too early (your first "original" is a clean extension). Building tools instead of doing research (yes, that includes over-investing in the research-buddy). Working alone in a corner.



The RL branch: Reinforcement Learning

A sibling to the LLM branch, off the same trunk (Phase 0–1 plus the research method). Same philosophy: build-first, reproduce before you innovate, artifacts over courses. This just adds the RL-specific theory the LLM phases don't teach.

Two lanes, and which to pick

Classic deep RL

Games, robotics, control. The older lineage. Harder solo. It's sample-inefficient, hungry for compute and wall-clock, brittle, and it has a real reproducibility problem ("Deep Reinforcement Learning that Matters"). Pick it for love of the control problem, not for tractability.

RL-for-LLMs / RLVR ★

Reasoning models. Hottest area, most jobs, and most tractable on one GPU, because it reuses your whole LLM skillset. This is the lane to take unless robots and games are the dream.

Build-first warning. Even veterans call Sutton & Barto a slog. Do NOT read it cover-to-cover first. Ship a working agent early (HF Deep RL course or CleanRL), hit a wall, then pull the theory in. RL punishes "learn everything first" harder than any branch.

The phases

RL-0 · days Orientation

Train one working agent before you understand it (HF Deep RL course, Unit 1, on Gymnasium + Stable-Baselines3). Pick your lane. Artifact: a reward curve in your repo. Trap: starting with Sutton & Barto ch.1. Train something first.

RL-1 · 4–6 wk Foundations, the "micrograd of RL"

Learn: MDPs, returns, value functions (V/Q), Bellman, dynamic programming, exploration vs exploitation, tabular Q-learning/SARSA/TD. Primary: Sutton & Barto Part I + David Silver lectures (read alongside code). Artifact: tabular Q-learning from a blank file on a gridworld/FrozenLake. Trap: jumping to deep RL before tabular intuition.
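As a taste of the RL-1 artifact, here is tabular Q-learning on a five-state corridor, which is smaller than a gridworld or FrozenLake but uses the same update. The environment, hyperparameters, and function name are all assumptions chosen to keep the sketch short.

```python
import random


def q_learning(n_states=5, episodes=300, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a corridor: start at state 0, reward 1 at the far end."""
    rng = random.Random(seed)
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]  # q[state][action]; 0 = left, 1 = right

    for _ in range(episodes):
        state = 0
        while state != goal:
            # Epsilon-greedy action choice, breaking ties at random.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1

            # Moving left from state 0 leaves the agent where it is.
            next_state = max(0, state - 1) if action == 0 else state + 1
            reward = 1.0 if next_state == goal else 0.0

            # Q-learning update: bootstrap from the best next action.
            # The goal is terminal, so its future value is zero.
            future = 0.0 if next_state == goal else max(q[next_state])
            q[state][action] += alpha * (reward + gamma * future - q[state][action])
            state = next_state
    return q


q = q_learning()
greedy = ["left" if left > right else "right" for left, right in q[:-1]]
```

After training, `greedy` picks "right" in every non-terminal state, and the learned values shrink by roughly a factor of `gamma` per step away from the goal. Random tie-breaking matters here: with all-zero initial values, a fixed tie-break toward "left" would leave the agent pacing at the start.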

RL-2 · 8–12 wk · L2 credential Deep RL

Learn: function approximation, DQN, policy gradients (REINFORCE), actor-critic, PPO (know it cold), then SAC/DDPG/TD3. Primary: OpenAI Spinning Up + CleanRL (reproduce the single-file impls) + Gymnasium + Stable-Baselines3. Artifact: reproduce PPO on CartPole→LunarLander + one clean ablation, ≥3 seeds. Trap: reproducibility hell. RL variance is brutal, and one seed is a lie.
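Knowing PPO cold starts with its clipped surrogate objective. The sketch below evaluates it for a single sample; a real implementation averages this over a batch and adds value and entropy terms. The function name and the `clip_eps=0.2` default are illustrative assumptions.

```python
def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """PPO's clipped surrogate objective for one (state, action) sample.

    ratio = pi_new(a|s) / pi_old(a|s). The objective is maximized.
    """
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Take the more pessimistic of the unclipped and clipped estimates.
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action, policy already moved far toward it: the gain is capped.
capped = ppo_clip_objective(ratio=1.5, advantage=2.0)

# Bad action, policy moved toward it anyway: the penalty is not capped.
uncapped = ppo_clip_objective(ratio=1.5, advantage=-2.0)
```

The asymmetry is the point. Clipping removes the incentive to push the ratio further in a direction that already looks good, but it never hides the cost of a move that looks bad.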

RL-3 · 4–8 wk · = LLM Phase 4 The LLM intersection: RLHF / GRPO / RLVR

Where this branch rejoins the LLM track. Learn: reward modeling and hacking, RLHF (PPO-for-LLMs), DPO, GRPO (DeepSeek, critic-free, group-normalized), RLVR (verifiable rewards). Primary: the RLHF Book + TRL + Open-Reasoner-Zero/DAPO. Artifact: a small GRPO/RLVR run on a 0.5–1.5B model with a verifiable reward + honest eval. Trap: GRPO is finicky (KL, reward hacking, vLLM), so earn it with SFT and DPO first. And watch the RLVR debate ("faster, not smarter"), so you don't overclaim.
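"Critic-free, group-normalized" can be made concrete in a few lines. In this sketch, each sampled response to a prompt is scored against the other samples for that same prompt, which stands in for a learned value function. The function name, the use of the population standard deviation, and the `eps` term are assumptions; implementations vary on these details.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages for several sampled responses to one prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical verifiable rewards (1 = correct, 0 = incorrect) for 4 samples.
advs = group_advantages([1.0, 0.0, 0.0, 1.0])

# If every sample gets the same reward, there is nothing to learn from.
flat = group_advantages([1.0, 1.0, 1.0])
```

Correct samples get positive advantages and incorrect ones negative, with the group summing to zero. The `flat` case returns all zeros, which is one reason prompts that are too easy or too hard for the model contribute no training signal.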

RL-4 · ongoing · L3 Specialization & research

Pick one lane and ship an original result: RL-for-reasoning (most tractable solo) · small-scale classic deep RL · model-based · offline RL · exploration · multi-agent RL (MARL) in the proper sense of several learning agents, not LLM orchestration. Reproduction-as-contribution counts here (RL's reproducibility crisis makes a clean seeded repro genuinely valuable).

For a solo researcher on one GPU, take the RLVR and reasoning lane (RL-3). It's the frontier, the compute is cheap-ish, and it reuses your LLM track. Classic deep RL is worthy but less forgiving. Same trunk, same method. The branch is different, but the work is the same.

RL resources (verified)

HF Deep RL Course · start here (RL-0)
Sutton & Barto, Reinforcement Learning: An Introduction (free) + David Silver lectures · RL-1
OpenAI Spinning Up + CleanRL + Gymnasium + Stable-Baselines3 · RL-2
RLHF Book + HF TRL + Open-Reasoner-Zero/DAPO · RL-3 (shared with LLM Phase 4)
RL Field Manual, an interactive guide to LLM reinforcement learning · RL-3 (the frontier)

Full detail in rl-track.md.



Research method: the craft

Learning ML and being a researcher are different skills. This is the second one: how to read, reproduce, experiment, write, and not fool yourself. Start it in week 1.

The research loop: question → hypothesis → minimal experiment → result → interpret skeptically → write it down → next question. Keep each turn small. It's iteration speed, not raw intelligence, that separates productive researchers from stuck ones.

Reading papers

Multi-pass: skim (figures + results, 5 min) → method + baseline (15–30 min) → deep only for the few that matter. Always write a 3-sentence note. Read with the right question. Not "is this true?" but "how was this measured, and how could it be wrong?" That question is most of taste. Read 2–3 a week, forever.

Reproduction

Most of your learning and credibility come from here, and it's a valued contribution in its own right. Reproduce before you extend. Containerize (Docker/uv). When your numbers don't match the paper, that gap is the most educational thing in the process.

Experiment design

One independent variable. The right (same-size) baseline. ≥3 seeds. Ablate with and without the component. Control confounds (match params and FLOPs, keep calibration and eval disjoint). Pre-register what would confirm or refute it, before you run it.
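The "≥3 seeds" rule exists so you can compare a difference against run-to-run noise. Here is a small sketch of that comparison. The helper names are hypothetical, the check is a crude screen and not a significance test, and the loss values are placeholders invented for illustration, not results from any experiment.

```python
import statistics


def summarize(runs):
    """Mean and sample standard deviation of one metric across seeds."""
    if len(runs) < 3:
        raise ValueError("use at least 3 seeds per condition")
    return statistics.fmean(runs), statistics.stdev(runs)


def gap_exceeds_noise(baseline_runs, variant_runs):
    """Crude screen: is the difference in means larger than the
    seed-to-seed spread of either condition?"""
    b_mean, b_std = summarize(baseline_runs)
    v_mean, v_std = summarize(variant_runs)
    return abs(v_mean - b_mean) > max(b_std, v_std)


# Placeholder validation losses, made up for this example only.
baseline = [3.31, 3.29, 3.33]
variant = [3.30, 3.28, 3.34]
looks_real = gap_exceeds_noise(baseline, variant)
```

With these placeholder numbers `looks_real` is `False`: the means differ by less than the spread across seeds, so a single-seed comparison could have pointed either way. Deciding the threshold before the run is part of pre-registering what would confirm or refute the idea.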

The research log

A dated LOG.md: what I tried, what happened, what I learned, what's next. It's your external memory and the raw material of every writeup. Start it in Phase 0, and never skip.

Writing

Unwritten research barely exists. Write to find your errors. If you can't write it clearly, you don't understand it. State limitations honestly. It builds more credibility than overclaiming. Teach to learn.

Feedback & taste

Seek harsh feedback early. The person who finds your confound is doing you a favor. Before you believe a result, try to refute it. Taste is pattern-matching built by volume. You can't shortcut it, you can only run the loop faster.

The honest meta-point. While building this very project, the assistant confidently called a real API "hallucinated." That call was wrong, and it got caught only because someone pushed back and actually checked. That's the whole discipline in one anecdote: confidence is not evidence. Verify, and prefer being corrected to being wrong.


Resources

Curated, not exhaustive. One primary per phase, and go deep. Collecting resources is procrastination. Finishing one is progress.

Hands-on (the spine, do these)

Resource	Phase	What
Karpathy's Zero to Hero	1–2	From-scratch: micrograd, makemore, build-GPT.
3Blue1Brown's Neural Networks	1	The best visual intuition for nets, gradients, and backprop.
nanoGPT / nanochat	2	Minimal GPT to study; full modern stack end-to-end.
The Illustrated Transformer	2	Jay Alammar's visual walk through attention. Read it next to the code.
Raschka's Build an LLM from Scratch	2	Thorough code-first book + repo.
fast.ai / d2l.ai	0–3	Build-first DL course; interactive textbook (reference).
Stanford CS336 (Spring 2026)	3–4	The systems flagship. Free lectures + assignments.
Hugging Face TRL	4	Practical SFT / DPO / GRPO toolkit.
ARENA / TransformerLens	5	Interp / research-engineering curriculum + library.
modded-nanogpt	3	Speedrun repo; commit history = efficiency masterclass.

Books & long-form

The RLHF Book, by Nathan Lambert (Phase 4; free online)
Mathematics for ML, Deisenroth et al. (Phase 1; math reference, free)
Deep Learning, Goodfellow et al. (theory reference; dip in)

Foundational papers (read them, don't just cite them)

Attention Is All You Need · the Transformer · Phase 2
Improving Language Understanding by Generative Pre-Training · GPT-1 · Phase 2
BERT · bidirectional pre-training · Phase 2
DeepSeek-R1 · RL for reasoning, the GRPO breakthrough · Phase 4 / RL-3

People / newsletters (signal, low noise)

Sebastian Raschka's Ahead of AI (highest signal; annual reading lists)
Lilian Weng (deep technical explainers) · Neel Nanda (mech interp) · Nathan Lambert's Interconnects (post-training)

Community, compute, venues

EleutherAI. #research + SOAR. The highest-ROI single move.
Compute. Free Colab/Kaggle → Vast/RunPod community tiers → granted (TPU Research Cloud, academic credits).
Venues for independents. ICLR Blog Posts track · NeurIPS ENLSP · ML Reproducibility Challenge. Not the main-track lottery.


Progress tracker

Check a box only when you can do it from a blank file / for real, not "I watched it."



Provenance & verification

How this was built, and how much to trust each part. Because the right answer to "is this reliable?" is to show the receipts, not to assert.

Tier 1 · Verified

Resources and facts checked against live 2026 sources, plus a chunk adversarially fact-checked earlier. Load-bearing technical claims confirmed:

Chinchilla ≈ 20 tokens/param compute-optimal	✓ Epoch / Hoffmann 2022
Muon optimizer (Keller Jordan, modded-nanogpt)	✓ source
GRPO (DeepSeek; drops value critic; RLVR)	✓ arXiv
FlashAttention (IO-aware exact attention)	✓ Dao 2022
CS336 Spring 2026, RLHF Book, ARENA, fast.ai/d2l	✓ live-verified

Tier 2 · Standard knowledge

Textbook claims (attention permutation-equivariance, RoPE, GPTQ/AWQ, Hinton/Kim-Rush distillation, cross-entropy). Field-standard and correct to current knowledge, not each re-derived here.

Tier 3 · Editorial judgment

The pedagogical architecture: the six phases, the L0 to L4 ladder, the phase boundaries, the time estimates. It's a synthesis of how the field's best teachers say to learn (Karpathy, fast.ai, CS336, the reproducibility literature). It's well-grounded, because that consensus is strong and convergent, but it's still a considered opinion, not a cited authority. Pressure-test the structure against any working researcher.

Bottom line: the direction is trustworthy and the concrete facts are verified. The exact phase math deserves a pinch of salt. And it's cheap to check. Every link is real, and the structure survives a five-minute sanity check with the EleutherAI #research community.