Overview & philosophy
Most people don't fail at this for lack of resources. There are too many resources. They fail because they learn in the wrong shape. So the philosophy below matters more than the curriculum does.
The honest premise
- Mastery is years, not months. Be suspicious of anyone selling "AI researcher in 8 weeks." Be a little suspicious of this roadmap too. It's a map, not a contract.
- But useful competence is months. You can build a GPT from scratch and read most papers in about 3 to 6 months. You can run a clean extension of a published result in about 6 to 12.
- "Researcher" is a ladder, not a binary. You climb it by shipping artifacts, not by finishing courses.
- The bottleneck is reps and taste, not information. Everything you need is free. What's scarce is the discipline to build, reproduce, and write. Over and over, and in public.
The 5 principles
- Don't spend 6 months on linear algebra before touching a model. Build, hit a wall, learn exactly that wall's theory, continue.
- The phases are a competence map, not a strict order. A vertical slice early beats a perfect horizontal foundation.
- A finished course is worth nothing; a reproduced result with a writeup is worth a lot. Every phase ends in something shippable.
- You earn the right to have ideas by first reproducing others'. 80% of the real learning lives here.
- A private learner is an invisible one, with no feedback loop. So push to GitHub, write notes, join a community. This is the feedback mechanism and the network. It's not vanity. It's how you get better.
The levels, and what "researcher" actually means
Define your target concretely. Each level is a capability you reach by a deliverable, not a credential. Your near-term target is L2.
* Timelines assume about 10 to 15 focused hours a week. The levels are the point, not the clock. The L0 to L4 scheme and the timelines are my own framing, not a cited standard. See Provenance.
The mental model
A modern language model, end to end. Every phase zooms into one part of this pipeline. Keep it in your head and you'll understand a little more of it each week.
attention + MLP × N (Phase 1–2)
It's trained by next-token prediction on a big corpus (Phase 3 · pretraining). It's shaped by SFT and RL on curated data (Phase 4 · post-training). It's measured by evals that lie to you in subtle ways (Phase 4 · eval). And making one box better than anyone else at small scale is the whole game (Phase 5 · research).
The phase map
Running through every phase is the research method. Start it in week 1, not at the end.
Shared core + branches
ML research isn't one path. It's a shared trunk with several branches, and this roadmap is built that way:
🌳 The trunk (universal)
Phase 0–1 plus the research method. Backprop, optimization, neural nets, the craft of research. You need these whatever you specialize in.
🔠 LLM / small-models (your primary)
Phases 2–5. Transformers → language modeling → post-training → small models.
↺ RL branch
A sibling off the same trunk, with its own theory: MDPs → policy gradients → PPO → RLHF/GRPO/RLVR.
🔀 They merge at the frontier
LLM post-training (Phase 4) is RL applied to language models. The branches meet at RLVR/GRPO.
The weekly cadence
The habit that compounds matters more than any single course. Four habits, every week, every phase. Miss the courses, keep the loop.
Orientation
≈ 1 week
Kill the "where do I start" paralysis by shipping a trained model in your first few days. You're a strong engineer, so the trap isn't capability. The trap is spending three weeks "preparing to learn." Don't.
Do this, in order
- Environment (½ day). PyTorch, and either a consumer GPU or free Colab/Kaggle. Don't over-build it.
- First vertical slice (1–2 days). Run fast.ai Lesson 1, or train nanoGPT on tiny-shakespeare. Watch a loss curve drop. You won't understand most of it, and that's fine.
- Public scaffolding (½ day). A GitHub repo, a LOG.md research log (the single highest-ROI habit here), and a paper-notes file.
- Join one community. The EleutherAI Discord. Just lurk for now.
- Set your target. Write "my 6-month target is L2" in your log.
Milestone
- A loss curve you produced, in your repo.
- First LOG.md entry written.
- First paper skimmed figures-first (try TinyStories) + 3-sentence note.
- EleutherAI joined.
Foundations
≈ 4–8 weeks
Understand and implement from a blank file the machinery under every model: backprop, gradient descent, a neural net, core ML. Your gap isn't coding. It's the intuition for why nets train, and the ML grammar that makes experiments interpretable.
What to learn (priority order)
- Backpropagation. The chain rule on a computation graph. The single most important thing here. Implement reverse-mode autodiff and most of DL stops being magic.
- Gradient descent & optimizers. SGD, momentum, Adam. Learning rate is the thing that matters most.
- A neural net from scratch. MLP, activations, initialization.
- Core ML. Cross-entropy, train/val/test, overfitting, regularization, bias–variance. The grammar of every experiment.
- Math, just-in-time. Linear algebra, the chain rule, probability/cross-entropy/KL. Learn each one when a model forces you to.
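To make "the chain rule on a computation graph" concrete, here is a minimal reverse-mode autodiff sketch in the spirit of micrograd. It is not Karpathy's code: the `Value` class, its fields, and the one-neuron example are illustrative choices, and it supports only `+`, `*`, and `tanh`.

```python
import math


class Value:
    """A scalar that remembers how it was computed, so gradients can flow back."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents          # Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def tanh(self):
        t = math.tanh(self.data)
        return Value(t, (self,), (1.0 - t * t,))

    def backward(self):
        # Topological order: every node is handled before the nodes it depends on
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad  # chain rule, accumulated


# One neuron: y = tanh(w * x + b)
w, x, b = Value(0.5), Value(2.0), Value(-0.3)
y = (w * x + b).tanh()
y.backward()

# Hand-derived check: dy/dw = (1 - tanh(w*x + b)^2) * x
expected_dw = (1.0 - math.tanh(0.5 * 2.0 - 0.3) ** 2) * 2.0
print(w.grad, expected_dw)
```

The `+=` in `backward` is the part people most often get wrong: a value used twice receives gradient from both uses, so gradients accumulate rather than overwrite.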
Primary path
Karpathy's Neural Networks: Zero to Hero is the spine. Do the exercises. Build micrograd (a tiny autodiff engine; this is backprop itself) and makemore (a char-level LM). For breadth, keep fast.ai or d2l.ai handy. For math, look things up in Mathematics for ML when you need them. And when a concept won't click, watch the matching chapter of 3Blue1Brown's Neural Networks series. It's the best visual intuition for nets and backprop there is.
Milestone: you've finished when you can…
- Implement reverse-mode autodiff from a blank file, and hand-derive one gradient to check it.
- Build, train, and debug an MLP without copying (diagnose a bad LR from the curve).
- Explain cross-entropy, the val split, and overfitting plus three fixes, in your own words.
- Read an empirical ML paper and follow its training setup.
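One way to test whether cross-entropy has clicked is to predict a number before you compute it. A model that knows nothing spreads probability evenly, so its loss should be ln(vocabulary size). The sketch below checks that; the vocabulary size of 65 and the target index are hypothetical values chosen for illustration.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


vocab_size = 65  # hypothetical vocabulary size

# An untrained model with all-equal logits: loss should be ln(vocab_size)
uniform_loss = cross_entropy([0.0] * vocab_size, target=3)

# A model that is confident and right: loss should be close to zero
confident_logits = [10.0 if i == 3 else 0.0 for i in range(vocab_size)]
confident_loss = cross_entropy(confident_logits, target=3)

# Perplexity is exp(loss): the "effective number of choices" the model sees
perplexity = math.exp(uniform_loss)
print(uniform_loss, math.log(vocab_size), confident_loss, perplexity)
```

If the first loss value of a real training run is far from ln(vocabulary size), that usually points to an initialization or labeling bug rather than to anything about learning.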
Transformers & LLMs from scratch
≈ 4–8 weeks
Build and train a GPT from a blank file, and understand every component. This is the core of modern model research. It's the difference between "I fine-tuned a model" and "I can reason about why it behaves the way it does."
What to learn
- Self-attention. Q/K/V, multi-head, causal masking. The centerpiece, so implement it from scratch.
- The transformer block. Attention + MLP, residuals, LayerNorm/RMSNorm.
- Tokenization. BPE, and how the tokenizer shapes everything downstream (an underrated source of bugs).
- Positional info. Learned vs RoPE, and why attention needs it (it's permutation-equivariant without it).
- The full pipeline. Pretrain → SFT → eval, end to end, once.
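The claims above about causal masking and permutation-equivariance can be checked in a few lines. This is a single-head sketch in plain Python lists, with no batching, no multi-head split, and no output projection. The weight matrices and inputs are made-up numbers for illustration.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def attention(X, Wq, Wk, Wv, causal):
    """Single-head self-attention over a list of token vectors X."""
    Q = [matvec(Wq, x) for x in X]
    K = [matvec(Wk, x) for x in X]
    V = [matvec(Wv, x) for x in X]
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for i, q in enumerate(Q):
        # With causal masking, position i may only look at positions 0..i
        visible = range(i + 1) if causal else range(len(X))
        scores = [scale * sum(a * b for a, b in zip(q, K[j])) for j in visible]
        weights = softmax(scores)
        out.append([sum(w * V[j][d] for w, j in zip(weights, visible))
                    for d in range(len(V[0]))])
    return out


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three tokens, no positional info
Wq = [[0.5, -0.2], [0.1, 0.3]]
Wk = [[0.4, 0.1], [-0.3, 0.2]]
Wv = [[1.0, 0.0], [0.0, 1.0]]

perm = [2, 0, 1]
X_perm = [X[p] for p in perm]

plain = attention(X, Wq, Wk, Wv, causal=False)
plain_perm = attention(X_perm, Wq, Wk, Wv, causal=False)

# Permutation-equivariance: shuffling the inputs only shuffles the outputs
equivariant = all(
    abs(a - b) < 1e-9
    for row, p in zip(plain_perm, perm)
    for a, b in zip(row, plain[p])
)

# Under the causal mask, the first token can only attend to itself
causal_out = attention(X, Wq, Wk, Wv, causal=True)
print(equivariant, causal_out[0])
```

Without positional information the unmasked layer cannot tell "dog bites man" from "man bites dog": the same vectors come out, just in a different order.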
Primary path
Karpathy's "Let's build GPT" and nanoGPT get you attention from a blank file. Sebastian Raschka's Build an LLM (From Scratch) is the deeper companion. Then run nanochat once to see the whole modern stack end to end. The model it makes is weak, but running the pipeline is the point. And for a visual feel of how attention moves information around, read Jay Alammar's The Illustrated Transformer next to the code.
Milestone
- Implement multi-head causal self-attention from a blank file; explain every line.
- Explain why attention needs positional info and how RoPE provides it.
- Describe what a tokenizer does + one way it can hurt quality.
- Train a GPT end-to-end; reproduce nanochat's pipeline once.
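For the RoPE milestone, the property worth verifying yourself is that the query-key score depends only on the distance between two positions, not on where they sit in the sequence. Below is a small sketch that rotates adjacent pairs of dimensions; the base of 10000 follows the common convention, and the vectors and positions are arbitrary illustrative values.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles that grow with `position`."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # one rotation frequency per pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

score_a = dot(rope(q, 5), rope(k, 2))       # positions 5 and 2, offset 3
score_b = dot(rope(q, 105), rope(k, 102))   # positions 105 and 102, offset 3
score_c = dot(rope(q, 5), rope(k, 4))       # positions 5 and 4, offset 1
print(score_a, score_b, score_c)
```

The first two scores agree up to rounding because rotating both vectors by the same extra angle leaves their dot product unchanged. The third differs because the offset changed.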
Training & systems
≈ 8–12 weeks
The longest, deepest phase, and the one where your "massive gap" really closes. Learn how real models are trained efficiently, then reproduce a published result and run one clean ablation. This is your L1 to L2 transition.
What to learn
- Scaling laws. Chinchilla: about 20 tokens per param is compute-optimal. The basis for "small but well-trained."
- Efficiency and GPU systems. bf16, MFU, FlashAttention, a reading-level grasp of kernels (Triton) and parallelism. Know where time and memory go.
- Data. Curation, filtering, dedup. It often beats architecture, and it's the highest-leverage, least-glamorous variable.
- Optimization at scale. LR schedules, warmup, AdamW, Muon, gradient accumulation.
- The small-models toolkit. Quantization (GPTQ/AWQ), distillation, pruning, efficient architectures.
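The Chinchilla rule of thumb above turns into a back-of-envelope budget quickly. The sketch below pairs the 20-tokens-per-parameter figure with the widely used approximation of about 6 FLOPs per parameter per token for training. That approximation, the model size, the GPU peak, and the MFU value are all assumptions for illustration, not measurements.

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Compute-optimal token budget under the ~20 tokens/param rule of thumb."""
    return n_params * tokens_per_param


def training_flops(n_params, n_tokens):
    # Assumed approximation: ~6 FLOPs per parameter per token (forward + backward)
    return 6 * n_params * n_tokens


def training_hours(flops, peak_flops_per_sec, mfu):
    # MFU = fraction of the hardware's peak FLOP/s the run actually achieves
    return flops / (peak_flops_per_sec * mfu) / 3600


n_params = 124_000_000   # hypothetical small GPT
peak = 100e12            # hypothetical GPU: 100 TFLOP/s peak in bf16
mfu = 0.35               # hypothetical utilization

tokens = chinchilla_tokens(n_params)
flops = training_flops(n_params, tokens)
hours = training_hours(flops, peak, mfu)
print(f"{tokens / 1e9:.2f}B tokens, {flops:.2e} FLOPs, {hours:.1f} GPU-hours")
```

Swap in your own card's peak throughput and a measured MFU. The estimate is only as good as those two numbers, and MFU on a first attempt is often well below what tuned runs report.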
Primary path
The flagship here is Stanford CS336, Language Modeling from Scratch (Spring 2026). The lectures are free and the assignments are public: tokenizer, FlashAttention2 in Triton, distributed training, Common-Crawl data, SFT and RL. Do the assignments. Then read the modded-nanogpt commit history like a textbook.
Milestone
- Explain Chinchilla and MFU; estimate training cost on your hardware.
- Read a profiler trace and locate the bottleneck (compute/memory/IO).
- Implement or clearly explain FlashAttention's idea (IO-aware exact attention).
- Reproduce a published result on one GPU + a controlled ablation.
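For the FlashAttention milestone, the piece that makes "exact attention without materializing the full score matrix" possible is the online softmax: process scores block by block and rescale the running totals whenever a larger maximum appears. The sketch below shows that idea for a single query row with scalar values. It is a teaching simplification, not the actual kernel, and says nothing about tiling, SRAM, or the backward pass.

```python
import math


def attention_row_naive(scores, values):
    """Softmax-weighted average, materializing all weights at once."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def attention_row_blocked(scores, values, block_size):
    """Same result, but seeing only `block_size` scores at a time."""
    running_max = float("-inf")
    running_sum = 0.0   # softmax denominator, relative to running_max
    running_out = 0.0   # unnormalized weighted sum, relative to running_max
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        new_max = max(running_max, max(s_blk))
        # Rescale everything accumulated so far to the new reference max
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        running_out *= correction
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max
    return running_out / running_sum


scores = [2.0, -1.0, 0.5, 3.0, 1.5]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
naive = attention_row_naive(scores, values)
blocked = attention_row_blocked(scores, values, block_size=2)
print(naive, blocked)
```

The two results match up to rounding for any block size, which is what "exact" means here: the savings come from memory traffic, not from approximating the softmax.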
Post-training & evaluation
≈ 4–8 weeks
How a raw pretrained model becomes useful, and how to measure it without fooling yourself. Most applied model research today is post-training and eval, and evaluation is the most under-respected skill in the field.
What to learn
- SFT / instruction tuning. Use LoRA/QLoRA to do it cheaply on one GPU.
- Preference and RL. RLHF, DPO (no separate reward model), GRPO (DeepSeek's method, which drops the value critic), and RLVR (RL from verifiable rewards, the reasoning frontier).
- Reward modeling. Reward hacking, and why a verifier with skin in the game can't be fair.
- Evaluation as a discipline. Contamination, prompt sensitivity, metric≠behavior, comparative > absolute. The most important sub-topic here.
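"No separate reward model" is easier to believe once you see that the DPO loss needs only four log-probabilities per preference pair: the chosen and rejected responses, each scored by the policy and by a frozen reference model. The sketch below computes the loss for one pair. The function name, the `beta` default, and the log-probability values are illustrative assumptions, and this is not TRL's API.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more the policy likes each response than the reference does
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) written as log(1 + exp(-x))
    return math.log1p(math.exp(-logits))


# Policy identical to the reference: no preference learned yet, loss = ln 2
start = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy has shifted probability toward the chosen response: loss drops
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)

# Policy has shifted toward the rejected response: loss rises
worse = dpo_loss(-14.0, -13.0, -12.0, -15.0)
print(start, improved, worse)
```

Note what the loss rewards: the gap between chosen and rejected relative to the reference, not the absolute probability of the chosen response.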
Primary path
Your spine is Nathan Lambert's The RLHF Book (free online): the canonical recipe, DPO, the RLVR renaissance, reward modeling, eval. Use Hugging Face TRL to actually post-train a small model.
Milestone
- Explain SFT vs DPO vs GRPO. What each optimizes, and when to use which.
- Describe reward hacking with a concrete example.
- Name ≥3 ways an eval can lie.
- Post-train a small model with TRL, plus an honest evaluation with a baseline.
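When you explain what GRPO optimizes, the critic-free part comes down to one small computation: sample several responses to the same prompt, score them, and use each reward's distance from the group mean as its advantage. Here is a sketch of that step only. Using the population standard deviation and the `eps` constant are assumptions here; implementations differ on both, and the rewards are made-up pass/fail values.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical verifiable rewards: 1.0 if the answer checked out, else 0.0
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = group_advantages(rewards)

# If every sample gets the same reward, the group carries no learning signal
no_signal = group_advantages([1.0, 1.0, 1.0, 1.0])
print(advantages, no_signal)
```

The second case is worth remembering when a run stalls: prompts the model always solves, or never solves, contribute zero advantage.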
Specialization & doing research
ongoing
Stop following a curriculum and start doing research. Pick a niche, produce your first original result, engage a community, ship it publicly. That shift, from being handed a task to choosing your own, is the whole definition of a researcher.
1 · Pick a niche (go deep, not wide)
Choose one where you can build a real result on one GPU: efficiency (quant/distill/prune), mechanistic interpretability (small models are interpretable; the community rewards small clean results, so see ARENA + TransformerLens), data curation, or small reasoning / post-training. Your on-ramp doc is the detailed specialization guide.
2 · Produce your first original result
- Reproduce something in your niche (the launchpad now, not the goal).
- Find the open thread. The cheap ablation the paper didn't run. Reading deeply is idea generation.
- Scope it with the research-buddy: prior-art + feasibility + confound.
- Run it cleanly. One axis, ≥3 seeds, same-size baseline.
- Ship the artifact. Open weights/code + reproducible eval + honest baselines + writeup. Your L3 credential.
3 · Get visible & find your people
Open-source-as-research is the modern credential. EleutherAI (#research + the SOAR program) is the highest-ROI single move. It solves three problems at once: collaborators, granted compute, and the arXiv-endorsement gate. Realistic venues: the ICLR Blog Posts track, NeurIPS ENLSP workshop, the ML Reproducibility Challenge.
Milestone: you're operating as a researcher (L3) when…
- You've shipped an original result with open code + honest eval, publicly.
- Someone you don't know has used/cited/built on it.
- You're active in a research community.
- You can read a frontier paper and immediately see the next experiment.
- You choose your own questions.
The RL branch: Reinforcement Learning
A sibling to the LLM branch, off the same trunk (Phase 0–1 plus the research method). Same philosophy: build-first, reproduce before you innovate, artifacts over courses. This just adds the RL-specific theory the LLM phases don't teach.
Two lanes, and which to pick
Classic deep RL
Games, robotics, control. The older lineage. Harder solo. It's sample-inefficient, hungry for compute and wall-clock, brittle, and it has a real reproducibility problem ("Deep Reinforcement Learning that Matters"). Pick it for love of the control problem, not for tractability.
RL-for-LLMs / RLVR ★
Reasoning models. Hottest area, most jobs, and most tractable on one GPU, because it reuses your whole LLM skillset. This is the lane to take unless robots and games are the dream.
The phases
- RL-0. Train one working agent before you understand it (HF Deep RL course, Unit 1, on Gymnasium + Stable-Baselines3). Pick your lane. Artifact: a reward curve in your repo. Trap: starting with Sutton & Barto ch.1. Train something first.
- RL-1. Learn: MDPs, returns, value functions (V/Q), Bellman, dynamic programming, exploration vs exploitation, tabular Q-learning/SARSA/TD. Primary: Sutton & Barto Part I + David Silver lectures (read alongside code). Artifact: tabular Q-learning from a blank file on a gridworld/FrozenLake. Trap: jumping to deep RL before tabular intuition.
- RL-2. Learn: function approximation, DQN, policy gradients (REINFORCE), actor-critic, PPO (know it cold), then SAC/DDPG/TD3. Primary: OpenAI Spinning Up + CleanRL (reproduce the single-file impls) + Gymnasium + Stable-Baselines3. Artifact: reproduce PPO on CartPole→LunarLander + one clean ablation, ≥3 seeds. Trap: reproducibility hell. RL variance is brutal, and one seed is a lie.
- RL-3. Where this branch rejoins the LLM track. Learn: reward modeling and hacking, RLHF (PPO-for-LLMs), DPO, GRPO (DeepSeek, critic-free, group-normalized), RLVR (verifiable rewards). Primary: the RLHF Book + TRL + Open-Reasoner-Zero/DAPO. Artifact: a small GRPO/RLVR run on a 0.5–1.5B model with a verifiable reward + honest eval. Trap: GRPO is finicky (KL, reward hacking, vLLM), so earn it with SFT and DPO first. And watch the RLVR debate ("faster, not smarter"), so you don't overclaim.
- RL-4. Pick one lane and ship an original result: RL-for-reasoning (most tractable solo) · small-scale classic deep RL · model-based · offline RL · exploration · multi-agent RL (MARL), which is the real multi-agent RL, not LLM-orchestration. Reproduction-as-contribution counts here (RL's reproducibility crisis makes a clean seeded repro genuinely valuable).
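The tabular Q-learning artifact from the phases above fits in about thirty lines. This sketch uses a five-state corridor as a stand-in for a gridworld (it is not FrozenLake and does not use Gymnasium), with epsilon-greedy exploration. The environment, hyperparameters, and seed are illustrative choices.

```python
import random

N_STATES = 5          # states 0..4 on a line; state 4 is the goal
ACTIONS = (-1, +1)    # move left or right


def step(state, action):
    """Deterministic corridor: reward 1.0 only on reaching the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon or Q[state][0] == Q[state][1]:
                a = rng.randrange(2)  # explore, or break ties randomly
            else:
                a = 0 if Q[state][0] > Q[state][1] else 1
            next_state, reward, done = step(state, ACTIONS[a])
            # Q-learning target: bootstrap from the best next action
            target = reward if done else reward + gamma * max(Q[next_state])
            Q[state][a] += alpha * (target - Q[state][a])
            state = next_state
    return Q


Q = train()
greedy_policy = [ACTIONS[0 if q[0] > q[1] else 1] for q in Q[:-1]]
print(greedy_policy, [round(max(q), 3) for q in Q[:-1]])
```

With a discount of 0.9, the learned value of moving right should approach 0.9 raised to the number of steps remaining after that move. Checking the table against that hand calculation is the tabular version of a gradient check.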
RL resources (verified)
- HF Deep RL Course · start here (RL-0)
- Sutton & Barto, Reinforcement Learning: An Introduction (free) + David Silver lectures · RL-1
- OpenAI Spinning Up + CleanRL + Gymnasium + Stable-Baselines3 · RL-2
- RLHF Book + HF TRL + Open-Reasoner-Zero/DAPO · RL-3 (shared with LLM Phase 4)
- RL Field Manual, an interactive guide to LLM reinforcement learning · RL-3 (the frontier)
Full detail in rl-track.md.
Research method: the craft
Learning ML and being a researcher are different skills. This is the second one: how to read, reproduce, experiment, write, and not fool yourself. Start it in week 1.
Reading papers
Multi-pass: skim (figures + results, 5 min) → method + baseline (15–30) → deep only for the few that matter. Always write a 3-sentence note. Read with the right question. Not "is this true?" but "how was this measured, and how could it be wrong?" That question is most of taste. Read 2–3 a week, forever.
Reproduction
Most of your learning and credibility come from here, and it's a valued contribution in its own right. Reproduce before you extend. Containerize (Docker/uv). When your numbers don't match the paper, that gap is the most educational thing in the process.
Experiment design
One independent variable. The right (same-size) baseline. ≥3 seeds. Ablate with and without the component. Control confounds (match params and FLOPs, keep calibration and eval disjoint). Pre-register what would confirm or refute it, before you run it.
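The "≥3 seeds" rule only pays off if you actually compare the gap between runs to the spread across seeds. A minimal sketch of that comparison follows. The loss values are invented for illustration and are not real results, and the "two standard deviations" threshold is a rough rule of thumb assumed here, not a formal significance test.

```python
import statistics


def summarize(scores):
    """Mean and sample standard deviation across seeds."""
    return statistics.fmean(scores), statistics.stdev(scores)


# Hypothetical validation losses, one per seed (not real results)
baseline = [3.412, 3.398, 3.405]
variant = [3.371, 3.380, 3.365]

b_mean, b_std = summarize(baseline)
v_mean, v_std = summarize(variant)
gap = b_mean - v_mean

# Assumed rule of thumb: only trust a gap clearly larger than seed noise
looks_real = gap > 2 * max(b_std, v_std)
print(f"baseline {b_mean:.3f} ± {b_std:.3f}, variant {v_mean:.3f} ± {v_std:.3f}, "
      f"gap {gap:.3f}, looks_real={looks_real}")
```

Three seeds give a noisy estimate of the spread, so treat a borderline result as a reason to run more seeds rather than as a finding.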
The research log
A dated LOG.md: what I tried, what happened, what I learned, what's next. It's your external memory and the raw material of every writeup. Start it in Phase 0, and never skip.
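If a blank page is what stops you from logging, a tiny template helps. This sketch formats one entry with the four fields named above. The function name, field labels, and example text are hypothetical, and the layout is just one reasonable choice.

```python
from datetime import date


def log_entry(tried, happened, learned, next_step, day=None):
    """Format one dated entry: tried, happened, learned, next."""
    day = day or date.today()
    return (
        f"## {day.isoformat()}\n"
        f"- Tried: {tried}\n"
        f"- Happened: {happened}\n"
        f"- Learned: {learned}\n"
        f"- Next: {next_step}\n"
    )


entry = log_entry(
    tried="lr=3e-4 instead of 1e-3 on the char-level model",
    happened="loss stopped spiking after warmup",
    learned="the earlier divergence was the learning rate, not the data",
    next_step="repeat with 3 seeds before believing it",
    day=date(2026, 1, 5),
)
print(entry)
# To append to the log: open("LOG.md", "a").write(entry + "\n")
```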
Writing
Unwritten research barely exists. Write to find your errors. If you can't write it clearly, you don't understand it. State limitations honestly. It builds more credibility than overclaiming. Teach to learn.
Feedback & taste
Seek harsh feedback early. The person who finds your confound is doing you a favor. Before you believe a result, try to refute it. Taste is pattern-matching built by volume. You can't shortcut it, you can only run the loop faster.
Resources
Curated, not exhaustive. One primary per phase, and go deep. Collecting resources is procrastination. Finishing one is progress.
Hands-on (the spine, do these)
| Resource | Phase | What |
|---|---|---|
| Karpathy's Zero to Hero | 1–2 | From-scratch: micrograd, makemore, build-GPT. |
| 3Blue1Brown's Neural Networks | 1 | The best visual intuition for nets, gradients, and backprop. |
| nanoGPT / nanochat | 2 | Minimal GPT to study; full modern stack end-to-end. |
| The Illustrated Transformer | 2 | Jay Alammar's visual walk through attention. Read it next to the code. |
| Raschka's Build an LLM from Scratch | 2 | Thorough code-first book + repo. |
| fast.ai / d2l.ai | 0–3 | Build-first DL course; interactive textbook (reference). |
| Stanford CS336 (Spring 2026) | 3–4 | The systems flagship. Free lectures + assignments. |
| Hugging Face TRL | 4 | Practical SFT / DPO / GRPO toolkit. |
| ARENA / TransformerLens | 5 | Interp / research-engineering curriculum + library. |
| modded-nanogpt | 3 | Speedrun repo; commit history = efficiency masterclass. |
Books & long-form
- The RLHF Book, by Nathan Lambert (Phase 4; free online)
- Mathematics for ML, Deisenroth et al. (Phase 1; math reference, free)
- Deep Learning, Goodfellow et al. (theory reference; dip in)
Foundational papers (read them, don't just cite them)
- Attention Is All You Need · the Transformer · Phase 2
- Improving Language Understanding by Generative Pre-Training · GPT-1 · Phase 2
- BERT · bidirectional pre-training · Phase 2
- DeepSeek-R1 · RL for reasoning, the GRPO breakthrough · Phase 4 / RL-3
People / newsletters (signal, low noise)
- Sebastian Raschka's Ahead of AI (highest signal; annual reading lists)
- Lilian Weng (deep technical explainers) · Neel Nanda (mech interp) · Nathan Lambert's Interconnects (post-training)
Community, compute, venues
- EleutherAI. #research + SOAR. The highest-ROI single move.
- Compute. Free Colab/Kaggle → Vast/RunPod community tiers → granted (TPU Research Cloud, academic credits).
- Venues for independents. ICLR Blog Posts track · NeurIPS ENLSP · ML Reproducibility Challenge. Not the main-track lottery.
Progress tracker
Check a box only when you can do it from a blank file / for real, not "I watched it."
Provenance & verification
How this was built, and how much to trust each part. Because the right answer to "is this reliable?" is to show the receipts, not to assert.
Resources and facts checked against live 2026 sources, plus a chunk adversarially fact-checked earlier. Load-bearing technical claims confirmed:
| Claim | Verified against |
|---|---|
| Chinchilla ≈ 20 tokens/param compute-optimal | ✓ Epoch / Hoffmann 2022 |
| Muon optimizer (Keller Jordan, modded-nanogpt) | ✓ source |
| GRPO (DeepSeek; drops value critic; RLVR) | ✓ arXiv |
| FlashAttention (IO-aware exact attention) | ✓ Dao 2022 |
| CS336 Spring 2026, RLHF Book, ARENA, fast.ai/d2l | ✓ live-verified |
Textbook claims (attention permutation-equivariance, RoPE, GPTQ/AWQ, Hinton/Kim-Rush distillation, cross-entropy) are field-standard and correct to current knowledge, but not each re-derived here.
The pedagogical architecture: the six phases, the L0 to L4 ladder, the phase boundaries, the time estimates. It's a synthesis of how the field's best teachers say to learn (Karpathy, fast.ai, CS336, the reproducibility literature). It's well-grounded, because that consensus is strong and convergent, but it's still a considered opinion, not a cited authority. Pressure-test the structure against any working researcher.
Bottom line: the direction is trustworthy and the concrete facts are verified. The exact phase math deserves a pinch of salt. And it's cheap to check. Every link is real, and the structure survives a five-minute sanity check with the EleutherAI #research community.