I Spent Two Weeks Grinding RL Interview Questions So You Don't Have To: Here's What Actually Matters
Last year, a friend of mine got a PhD offer and turned it down. Just skipped straight to industry recruiting season instead.
Landed a massive offer.
I was sitting there staring at my half-finished thesis, and exactly one thought was running through my brain: Should I just bail on academia and cash out too?
So I went deep. Spent weeks collecting every RL interview question I could find. And here's the thing — RL interview prep material is scattered like someone dropped a box of thumbtacks in the dark. Some questions were buried in generalist interview roundups. Others were hiding in comment sections under Agent interview guides. A few posts promised "AI interview experience" and turned out to be thinly disguised course ads.
Two weeks of distilling everything. Plus conversations with a handful of actual interviewers. The result: 35 questions that actually matter.
Honestly? It was way more exhausting than I expected.
Let me share some observations you won't find in the standard prep guides.
The Algorithm/Infra Divide? It Doesn't Exist Anymore
Here's something nobody tells you: modern RL interviews don't separate algorithm roles from infrastructure roles.
I was interviewing at a company — I won't say which — and the conversation was flowing nicely. We're talking about advantage calculation in PPO, I'm feeling good. Then the interviewer hits me with: "Without CPU offloading, how many models sit in GPU memory during GRPO training?"
I froze for a second.
Let me walk you through the math I worked out later. GRPO keeps three models in VRAM: the Actor, the Reference Model, and the Reward Model (assuming a learned reward model rather than rule-based rewards). Compared to PPO, it ditches the Critic entirely, because GRPO uses group-relative advantage instead of a value network. For a 7B-parameter model at BF16 precision (2 bytes per parameter), each model's weights take 14GB. Three models: 42GB. Add the Actor's optimizer states (28GB, counting two Adam moments kept in BF16; FP32 states would be larger) and gradients (14GB), and you're looking at roughly 84GB total, before activations and KV cache.
Wild, right?
If you enable CPU offloading and push the Reference and Reward models off the GPU, you can squeeze down to 56GB. Offload the optimizer too, and you're at 28GB: just the Actor's weights and gradients on GPU. But latency explodes, and communication bottlenecks become brutal. The most practical setup I've seen is Reference + Reward + Optimizer all offloaded, saving about 67% of VRAM (84GB down to 28GB), with layer-wise offloading and communication-computation overlap to keep the pain manageable.
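The arithmetic above can be sketched as a small calculator. This is a back-of-the-envelope helper, not a profiler: the function name and component labels are hypothetical, and it assumes BF16 weights, BF16 gradients, and two BF16 Adam moments for the Actor, with activations, KV cache, and any FP32 master weights left out.

```python
BYTES_PER_PARAM_BF16 = 2  # BF16 stores each value in 2 bytes


def grpo_gpu_memory_gb(params_billions, offload=()):
    """Rough GPU footprint in GB for the static tensors in GRPO training.

    Assumption: three BF16 model copies, BF16 Actor gradients, and two
    BF16 Adam moments. Activations and KV cache are NOT counted.
    """
    weights = params_billions * BYTES_PER_PARAM_BF16  # GB per model copy
    components = {
        "actor": weights,
        "reference": weights,
        "reward": weights,
        "actor_grads": weights,
        "actor_optimizer": 2 * weights,  # Adam first + second moment
    }
    unknown = set(offload) - set(components)
    if unknown:
        raise ValueError(f"unknown components: {sorted(unknown)}")
    # Anything listed in `offload` lives in CPU memory instead of VRAM.
    return sum(gb for name, gb in components.items() if name not in offload)


full = grpo_gpu_memory_gb(7)
scenarios = {
    "no offload": (),
    "ref + reward offloaded": ("reference", "reward"),
    "ref + reward + optimizer offloaded": ("reference", "reward", "actor_optimizer"),
}
for label, offload in scenarios.items():
    gb = grpo_gpu_memory_gb(7, offload=offload)
    print(f"{label}: {gb} GB on GPU ({1 - gb / full:.0%} saved)")
```

Swapping in a different parameter count or a different offload set is the kind of variation an interviewer is likely to ask for on the spot.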
This moment crystallized something brutal for me: if you can't calculate memory usage by hand, you won't pass an RL interview. I'm serious. I went home that night and started hand-calculating memory footprints for different model configurations. Was still at it at 2 AM.
The Algorithm Questions Are Actually... Fun?
Some questions I genuinely enjoyed.
One in particular: Why use Actor-Critic instead of pure Critic methods? Pure Critic (value-based) approaches, DQN, I'm looking at you, are basically a disaster in continuous action spaces. Think about it. Picking an action means maximizing the Q-function over all actions at every timestep, and when your action space is continuous, that inner optimization is computationally intractable in general. Actor-Critic parameterizes the policy as a separate network, gradients flow directly through the policy parameters, and continuous spaces become natural to handle.
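One way to make the cost of that inner maximization concrete is to count how many candidate actions a naive grid search would have to score. This is only an illustration under an assumption of my own (a uniform grid with a fixed number of bins per action dimension); the helper name is hypothetical.

```python
def grid_actions(bins_per_dim, action_dims):
    """Candidate actions a uniform grid search must score for one argmax."""
    return bins_per_dim ** action_dims


# A 7-dimensional action (e.g. a robot arm) with only 10 bins per joint
# already needs ten million Q evaluations per timestep.
for dims in (1, 3, 7):
    print(f"{dims} action dims -> {grid_actions(10, dims):,} Q evaluations")
```

An actor network sidesteps this by outputting an action (or the parameters of an action distribution) in a single forward pass.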
Then there's the variance problem. REINFORCE has absurdly high variance in its gradient estimates because it weights every update by the full Monte Carlo return. Introduce a learned Critic as a baseline, meaning you subtract a function of the state that doesn't depend on the action, and variance drops dramatically while the expected gradient stays unchanged. You can prove this rigorously, but in an interview you just need to nail the intuition.
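If you want to see the baseline effect rather than take it on faith, a toy experiment helps. The sketch below assumes a single-state, three-action bandit with a fixed softmax policy and made-up rewards, and it uses the true expected reward as the baseline where a real system would use a learned Critic.

```python
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.5, 0.3, 0.2])           # fixed softmax policy over 3 actions
mean_reward = np.array([10.0, 11.0, 12.0])  # hypothetical per-action rewards

n = 200_000
actions = rng.choice(3, size=n, p=probs)
rewards = mean_reward[actions] + rng.normal(0.0, 1.0, size=n)

# Score function for softmax logits: d log pi(a) / d logits = onehot(a) - probs
score = np.eye(3)[actions] - probs

# Baseline depends only on the state (here a constant), never on the action.
baseline = probs @ mean_reward

grad_plain = score * rewards[:, None]                  # REINFORCE
grad_baseline = score * (rewards - baseline)[:, None]  # REINFORCE + baseline

print("mean (plain):   ", grad_plain.mean(axis=0))
print("mean (baseline):", grad_baseline.mean(axis=0))
print("total variance (plain):   ", grad_plain.var(axis=0).sum())
print("total variance (baseline):", grad_baseline.var(axis=0).sum())
```

The two mean gradients agree up to sampling noise, while the total variance of the baseline version is far smaller, which is the whole argument in one run.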
Where I Completely Faceplanted
My first interview, I got asked about Actor-Critic considerations specific to LLM scenarios. I hadn't prepped that deep.
Turns out, the LLM action space is token-level and absolutely enormous. The Critic has to estimate a value for every partial token sequence, and this value function's variance is way higher than in traditional RL settings. That's exactly why methods like GRPO just drop the Critic altogether and use group-relative comparisons to compute advantage.
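The group-relative comparison is small enough to write out. This is a minimal sketch of the usual formulation (reward minus group mean, divided by group standard deviation); the function name, the `eps` constant, and the 0/1 rewards are illustrative assumptions.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each response's reward against its own prompt group.

    rewards: shape (num_prompts, group_size), one scalar reward per sampled
    response. Returns advantages of the same shape.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    # eps keeps the division finite when every response in a group ties.
    return (rewards - mean) / (std + eps)


# Two prompts, four sampled responses each (hypothetical correctness rewards).
rewards = [[1.0, 0.0, 0.0, 1.0],
           [0.0, 0.0, 0.0, 1.0]]
adv = group_relative_advantages(rewards)
print(adv)
```

Every token in a response then shares that response's advantage, so no per-token value estimate is needed.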
Think about the design logic here — it's actually super straightforward once you see it, but if you haven't stumbled into this problem yourself, you'd never think of it.
MoE RL Is a Deep, Dark Rabbit Hole
I've read DeepSeek's technical reports more times than I care to admit. The R1 series clearly focuses on GRPO, multi-stage RL, cold starts, and rejection sampling. V3 emphasizes MoE architecture, aux-loss-free load balancing, MLA, and MTP — without fully revealing RL details. V3.2's main public innovation is DeepSeek Sparse Attention for long-context training and inference efficiency. As for V4 — as of June 2026, I haven't found any reliable official technical report.
Can't make stuff up.
But the difficulties of MoE RL are fair game. Token-level logprob ratios have higher variance under MoE because expert routing can shift per token. Uneven expert load distribution destabilizes training. Different parallelism strategies between training and rollout engines cause logprob and routing consistency issues. GSPO, a sequence-level objective, was reported by the Qwen team to stabilize MoE RL because sequence-level ratios are less sensitive to local token fluctuation.
This strategy is pretty clever. Bump the ratio computation granularity from token-level to sequence-level, and MoE's routing noise gets smoothed out naturally. Put another way: stop fighting token-level noise head-on and just change your perspective.
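To see why the granularity change smooths things out, compare the two ratios on a toy sequence where a single token's probability collapses. This sketch reflects my reading of the sequence-level ratio as length-normalized (the exponential of the mean per-token log-ratio); the function names and probabilities are made up, and the full objective and clipping details should be checked against the GSPO paper.

```python
import numpy as np


def token_ratios(new_logprobs, old_logprobs):
    """Per-token importance ratios pi_new / pi_old."""
    return np.exp(np.asarray(new_logprobs) - np.asarray(old_logprobs))


def sequence_ratio(new_logprobs, old_logprobs):
    """Length-normalized sequence ratio: exp of the mean per-token log-ratio."""
    diff = np.asarray(new_logprobs) - np.asarray(old_logprobs)
    return float(np.exp(diff.mean()))


old = np.log(np.full(8, 0.5))  # 8 tokens, each with probability 0.5
new = old.copy()
new[3] = np.log(0.05)          # hypothetical routing flip hits one token

print("token ratios:  ", token_ratios(new, old))
print("sequence ratio:", sequence_ratio(new, old))
```

One token's ratio falls to 0.1, but the sequence-level ratio only moves to about 0.75, because the spike is averaged over all eight tokens in log space.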
The Real Nature of RL Infrastructure
It's not about training.
It's about gluing inference systems and training systems into a pipeline that doesn't leak.
ByteDance's verl framework is doing exactly this. It doesn't force training and inference into the same parallelism mold. It accepts that they're fundamentally different beasts. Training cares about backward passes and heavy communication. Inference cares about throughput and long-tail scheduling. The framework builds bridges between the two systems, connecting data flows and parameter flows.
Synchronous training wastes throughput — within the same batch of prompts, some samples finish quickly while others drag on with long reasoning chains. The training side sits idle waiting for the slowest rollouts. Switch to async, and you inherit policy staleness, queue congestion, and failure retry hell.
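The cost of waiting for the slowest rollout is easy to estimate. The sketch below is a toy model with assumptions of my own: one worker per sample, training that starts only when the whole batch is done, and generation times drawn from an exponential distribution to mimic a long tail. The function name is hypothetical.

```python
import random


def sync_idle_fraction(rollout_seconds):
    """Fraction of total worker time spent waiting on the slowest sample.

    Assumes one worker per sample and a fully synchronous step, so every
    worker is held until the longest rollout finishes.
    """
    slowest = max(rollout_seconds)
    busy = sum(rollout_seconds)
    return 1.0 - busy / (slowest * len(rollout_seconds))


random.seed(0)
# Hypothetical long-tailed generation times with a 20-second mean.
durations = [random.expovariate(1 / 20.0) for _ in range(64)]
print(f"slowest rollout: {max(durations):.1f}s")
print(f"idle fraction:   {sync_idle_fraction(durations):.0%}")
```

An async design reclaims that idle time, at the price of training on samples generated by a slightly older policy.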
Seriously.
RL infra difficulty lives entirely in these engineering details, not in fancy algorithmic innovations. I tried building a minimal prototype one weekend, and just getting rollout results correctly fed back into the training pipeline took two days. Shape mismatches. Device misalignment. Gradient disconnections. Debugged until I questioned my life choices.
I Didn't Prepare a Single Data-Related Question
Because you can't memorize your way through them.
What reward designs have you actually built? What distribution shifts have you handled? What reward hacking traps have you stepped in? These only come from real experience. Interviewers can tell in two minutes whether you've done the work or just memorized talking points. When I got asked about reward hacking detection at one company, I launched into theory — the interviewer cut me off: "Have you actually encountered this? How did you fix it?" I froze.
Useless.
This is probably my deepest takeaway after grinding all these interview questions: memorization doesn't work. There's no substitute for hands-on experience.
But here's the counterpoint: if you can't clearly explain these 35 questions, most interviewers won't give you the chance to prove your experience in the first place. So I'd still recommend working through them. Search each one. Use LLMs to interrogate the topics. Don't expect standard answers. The room for follow-ups on these questions is enormous, and any interviewer can drill deeper until they hit the limits of your knowledge.
Which is exactly what makes RL interesting as a field. It hasn't ossified into rote memorization yet.
That's honestly a good thing.
Key Takeaways:
- RL interviews don't separate algo from infra anymore — you need both
- Hand-calculate memory usage for different training configurations. Yes, really.
- Understand why GRPO drops the Critic (LLM action spaces, variance, token-level chaos)
- MoE RL's hardest problems are engineering, not math — routing consistency, load balancing, parallelism
- RL infra is pipeline engineering, not training optimization
- Data questions can't be memorized — you either did the work or you didn't
- Don't expect standard answers. Interviewers will push until you break — that's the point
What's been your experience with RL interviews? Hit me up in the comments — especially if you've encountered questions I missed. I'm still collecting.
#RL #MachineLearning #InterviewPrep #LLM #DeepLearning
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.