
ML System Design Interview

End-to-end ML pipeline design coaching for staff+ engineers. Covers the full arc from problem definition through production monitoring -- the scope expected at L6+ interviews at top-tier ML organizations.

This skill assumes 15+ years of ML/CV/AI/NLP experience. It does not teach fundamentals. It structures the knowledge you already have into the format interviewers reward.


When to Use

Use for:

  • Practicing 45-minute ML system design rounds
  • Structuring whiteboard presentations for recommendation, ranking, RAG, fraud, perception systems
  • Analyzing serving architecture tradeoffs (batch vs online vs streaming)
  • Identifying L6+ differentiation signals (problem ownership, org constraints, data flywheels)
  • Reviewing and critiquing ML system design answers

NOT for:

  • Coding interviews (use senior-coding-interview)
  • Behavioral / leadership questions (use interview-loop-strategist)
  • ML theory or math derivations
  • Implementing models or writing training code
  • Paper reading or research review

The 7-Stage Design Framework

Every ML system design answer follows this arc. The stages are sequential but you will loop back as constraints emerge. The Mermaid diagram below is your whiteboard skeleton.

```mermaid
flowchart TD
R[1. Requirements\n- Business goal\n- Users and scale\n- Latency/throughput SLA\n- Constraints] --> M[2. Metrics\n- Offline: precision, recall, NDCG\n- Online: CTR, conversion, revenue\n- Guardrails: latency p99, fairness]
M --> D[3. Data\n- Sources and collection\n- Labeling strategy\n- Pipeline: ETL, validation\n- Freshness and staleness]
D --> F[4. Features\n- Engineering and transforms\n- Feature store architecture\n- Online vs offline features\n- Freshness requirements]
F --> Mo[5. Model\n- Architecture selection\n- Training pipeline\n- Iteration strategy\n- Baseline and ablation]
Mo --> S[6. Serving\n- Batch vs online vs streaming\n- Caching and precomputation\n- Scaling and cost\n- Canary and shadow mode]
S --> Mon[7. Monitoring\n- Data drift detection\n- Model degradation alerts\n- A/B testing framework\n- Rollback strategy\n- Feedback loops]
Mon -.->|Feedback loop| D
Mon -.->|Retrain trigger| Mo
```

Stage Details

Stage 1 -- Requirements (5 minutes)

Ask clarifying questions before designing anything. Establish: Who is the user? What is the business metric? What is the latency SLA? What scale (QPS, data volume)? What are hard constraints (cost, privacy, regulation)? An L6+ candidate owns the problem definition -- do not wait for the interviewer to hand you requirements.

Stage 2 -- Metrics (3 minutes)

Define offline metrics that you can measure before deployment AND online metrics that matter to the business. Explain the gap: "NDCG improvement offline does not always translate to CTR lift online because of position bias and novelty effects." Define guardrail metrics: latency p99, fairness across user segments, cost per prediction.
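Being able to define NDCG precisely, not just name-drop it, is a credibility signal. A minimal stdlib sketch of NDCG@k (the standard formulation; function names are illustrative):

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked items.
    Position i is discounted by log2(i + 2), so early slots dominate."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """NDCG: DCG normalized by the ideal (relevance-sorted) ordering,
    so a perfect ranking scores 1.0 regardless of label scale."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

Useful talking point: swapping two items near the top of the list moves NDCG far more than the same swap near position k, which is exactly why it proxies ranking quality better than raw accuracy.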

Stage 3 -- Data (7 minutes)

Where does training data come from? How is it labeled (human, weak supervision, implicit signals)? What is the class balance? How fresh does data need to be? What is the data pipeline (batch ETL vs streaming)? What data quality checks exist? This stage separates L6+ candidates from L5 -- junior candidates assume clean labeled data.
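When asked "what data quality checks exist?", a concrete answer beats a vague one. A sketch of the kind of pre-training batch validation worth describing -- the function, field names, and thresholds here are all illustrative assumptions, not a specific tool's API:

```python
from datetime import datetime, timedelta, timezone

def validate_batch(rows, required_fields, max_staleness_hours=24, max_null_rate=0.05):
    """Minimal pre-training checks on a batch of dict rows:
    per-field null rate and freshness against an SLA.
    Returns a list of human-readable issues (empty list = batch passes)."""
    issues = []
    for field in required_fields:
        nulls = sum(1 for r in rows if r.get(field) is None)
        if nulls / len(rows) > max_null_rate:
            issues.append(f"null rate too high for {field}: {nulls}/{len(rows)}")
    newest = max(r["event_time"] for r in rows)
    if datetime.now(timezone.utc) - newest > timedelta(hours=max_staleness_hours):
        issues.append("batch is stale: newest event exceeds freshness SLA")
    return issues
```

In a real pipeline these checks would also cover schema drift, label distribution shifts, and duplicate detection; the point in the interview is to show the checks gate training, not merely log.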

Stage 4 -- Features (5 minutes)

What features does the model need? Which are precomputed (offline) vs computed at request time (online)? Feature store architecture: online store (low-latency lookups) vs offline store (batch training). Feature freshness: user features update daily, item features update hourly, contextual features are real-time.
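The offline/online split above can be made concrete in one sketch: precomputed features are fetched by key, request-time context is merged in last. The dictionaries and field names below are toy stand-ins for an online feature store, not any particular product's API:

```python
# Hypothetical in-memory stand-ins for online feature store tables.
USER_FEATURES = {"u1": {"avg_session_len": 4.2}}   # refreshed daily by batch job
ITEM_FEATURES = {"i9": {"ctr_7d": 0.031}}          # refreshed hourly

def assemble_features(user_id, item_id, context):
    """Merge precomputed (offline-materialized) features with
    request-time context that is only knowable at serving time."""
    feats = {}
    feats.update(USER_FEATURES.get(user_id, {}))   # daily freshness
    feats.update(ITEM_FEATURES.get(item_id, {}))   # hourly freshness
    feats.update(context)                          # real-time: device, hour, etc.
    return feats
```

The design point to narrate: the same transformation code must produce these features for training (offline store, point-in-time correct) and serving (online store), or you get train/serve skew.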

Stage 5 -- Model (8 minutes)

Start with a simple baseline (logistic regression, XGBoost) and explain why. Then propose the production architecture (two-tower, transformer, etc.) and justify the upgrade. Discuss training pipeline: how often, how much data, how to handle distribution shift. Iteration strategy: what experiments to run first.
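"Logistic regression baseline" should be something you could write on demand. A from-scratch SGD sketch (stdlib only, toy data; in practice you would reach for a library, but knowing the mechanics sharpens the "why baseline first" argument):

```python
import math

def train_logreg(X, y, lr=0.1, epochs=200):
    """Plain SGD logistic regression: the 'simple baseline first' step.
    X: list of feature vectors, y: list of 0/1 labels."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-(sum(wj * xj for wj, xj in zip(w, xi)) + b)))
            g = p - yi                      # gradient of log-loss w.r.t. the logit
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """Probability of the positive class."""
    return 1.0 / (1.0 + math.exp(-(sum(wj * xj for wj, xj in zip(w, x)) + b)))

# Linearly separable toy data: the baseline should fit it easily.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
w, b = train_logreg(X, y)
```

The baseline's value in the interview narrative: it establishes a floor, surfaces data problems cheaply, and gives the complex model something to beat in an ablation.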

Stage 6 -- Serving (8 minutes)

This is where system design and ML intersect. Discuss: inference latency requirements, batch precomputation vs online inference, GPU/CPU tradeoffs, model serving framework, caching strategy, cost optimization (quantization, distillation, spot instances). Draw the serving architecture.
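Caching deserves one concrete sentence in the serving discussion: anything expensive, keyed, and slowly-changing (item embeddings, popularity scores) should be memoized in front of the model server. A minimal sketch using `functools.lru_cache`; the embedding function here is a dummy stand-in for a model forward pass or vector-store fetch:

```python
from functools import lru_cache

@lru_cache(maxsize=100_000)
def item_embedding(item_id: str) -> tuple:
    """Stand-in for an expensive lookup (model forward pass or remote store).
    Caching precomputable, slowly-changing results cuts p99 latency and GPU cost."""
    return (hash(item_id) % 97 / 97.0,)  # deterministic-per-process dummy vector

def score(user_vec, item_id):
    """Dot-product scoring; the embedding is a cache hit after the first request."""
    item_vec = item_embedding(item_id)
    return sum(u * v for u, v in zip(user_vec, item_vec))
```

Worth mentioning the tradeoff out loud: a cache converts compute cost into staleness, so the eviction/TTL policy must match the feature-freshness requirements from Stage 4.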

Stage 7 -- Monitoring (5 minutes)

What happens after deployment? Data drift detection (PSI, KL divergence). Model degradation alerts (metric decay over time). A/B testing framework (sample size, duration, novelty effects). Rollback strategy (shadow mode, canary percentage). Feedback loops that improve the model over time.
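PSI is cited constantly and defined rarely; being able to sketch it is an easy differentiator. The standard formula over binned feature distributions, with the commonly quoted (rule-of-thumb, not universal) alert thresholds noted in the docstring:

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions
    (lists of bin proportions, each summing to ~1).
    Rule-of-thumb thresholds: < 0.1 stable, 0.1-0.25 moderate shift,
    > 0.25 significant drift worth an alert."""
    return sum((a - e) * math.log((a + eps) / (e + eps))
               for e, a in zip(expected, actual))
```

In production this runs per-feature on a schedule, comparing the serving-time distribution against the training snapshot; a breach pages on-call and can gate an automatic retrain trigger.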


45-Minute Time Budget

| Phase | Minutes | What to Cover |
|---|---|---|
| Requirements + Clarification | 5 | Business goal, users, scale, SLA, constraints |
| Metrics | 3 | Offline, online, guardrails, metric alignment |
| Data | 7 | Sources, labeling, pipeline, quality, freshness |
| Features | 5 | Engineering, store architecture, online/offline split |
| Model | 8 | Baseline, production arch, training, iteration |
| Serving | 8 | Latency, architecture, cost, deployment strategy |
| Monitoring | 5 | Drift, alerts, A/B testing, rollback, feedback |
| Q&A Buffer | 4 | Interviewer deep-dives, defend tradeoffs |

If the interviewer cuts in with questions, adapt -- but cover all 7 stages, even if only briefly. Skipping monitoring is the most common L5 mistake.


Canonical Problem Set

| Problem | Key Challenges | Must-Discuss |
|---|---|---|
| Recommendation System | Cold start, position bias, multi-objective optimization | Two-tower retrieval + reranking, exploration-exploitation |
| Search Ranking | Query intent classification, relevance vs engagement, latency at scale | Inverted index + embedding retrieval, L1/L2 ranking cascade |
| Content Moderation | Multi-modal (text+image+video), adversarial evasion, precision-recall tradeoff | Human-in-the-loop, escalation tiers, appeal workflow |
| RAG Pipeline | Retrieval quality, chunk strategy, hallucination detection, evaluation | Embedding model selection, hybrid search, reranking, citation |
| Fraud Detection | Extreme class imbalance, adversarial adaptation, real-time requirement | Feature velocity, graph features, ensemble + rules, feedback delay |
| Autonomous Driving Perception | Sensor fusion, safety-critical latency, long-tail distribution | Multi-task architecture, simulation, OTA updates, regulatory |

Serving Architecture Comparison

| Pattern | Latency | Freshness | Cost | Best For |
|---|---|---|---|---|
| Batch prediction | N/A (precomputed) | Hours-stale | Low compute, high storage | Email recommendations, daily reports |
| Online inference | 10-500ms | Real-time | High compute (GPU) | Search ranking, fraud detection |
| Near-real-time | 1-60s | Minutes-fresh | Medium | Feed ranking, content moderation |
| Streaming | Sub-second | Continuous | High (always-on) | Fraud, anomaly detection, bidding |

Detailed serving tradeoffs, framework comparisons, and cost optimization strategies are in references/serving-tradeoffs.md.


L6+ Differentiation Signals

What separates a staff+ answer from a senior answer:

1. Own the Problem Definition

Do not accept the problem as stated. Ask: "What business metric are we optimizing? Is this a revenue problem or an engagement problem? What is the current solution and why is it insufficient?" L5 candidates accept "build a recommendation system." L6+ candidates ask "what are we recommending, to whom, and what does success look like?"

2. Discuss Organizational Constraints

Real systems live inside organizations. Address: team size (can we maintain a custom model or should we use a managed service?), on-call burden, cross-team data dependencies, compliance requirements, migration path from legacy systems.

3. Data Flywheel Strategy

Show that you think about the virtuous cycle: better model -> more engagement -> more data -> better model. Discuss how to accelerate it: active learning, implicit feedback loops, exploration strategies, cold-start bootstrapping.

4. Build vs Buy Decisions

Not everything should be custom. Argue for managed services where appropriate (embedding APIs, feature stores, serving platforms) and custom solutions where competitive advantage demands it. Show you understand the total cost of ownership.

5. Multi-Objective Thinking

Real systems optimize multiple objectives simultaneously: relevance AND diversity, accuracy AND fairness, quality AND latency. Discuss how to handle conflicts: Pareto optimization, constrained optimization, multi-task learning, business-rule post-processing.
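The relevance-vs-diversity conflict has a classic post-processing answer worth sketching: Maximal Marginal Relevance, a greedy rerank that penalizes similarity to already-selected items. A minimal version (the `sim` callback and trade-off weight `lam` are the knobs; names here are illustrative):

```python
def mmr_rerank(candidates, sim, k, lam=0.7):
    """Maximal Marginal Relevance: greedily pick the item maximizing
    lam * relevance - (1 - lam) * max-similarity-to-selected.
    candidates: {item: relevance_score}; sim(a, b) -> similarity in [0, 1].
    lam=1.0 is pure relevance; lam=0.0 is pure diversity."""
    selected, pool = [], dict(candidates)
    while pool and len(selected) < k:
        best = max(pool, key=lambda c: lam * pool[c]
                   - (1 - lam) * max((sim(c, s) for s in selected), default=0.0))
        selected.append(best)
        del pool[best]
    return selected
```

This is a good example of "business-rule post-processing" from the list above: the ranking model stays single-objective, and the conflict is resolved in a cheap, tunable layer after scoring.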


Whiteboard Strategy

What to draw and when:

| Time | Draw This | Purpose |
|---|---|---|
| 0-5 min | Requirements box with bullet points | Anchor the discussion, show structured thinking |
| 5-8 min | Metric table (offline vs online) | Demonstrate you think beyond model accuracy |
| 8-15 min | Data pipeline diagram (sources -> ETL -> store) | Show you understand data engineering |
| 15-20 min | Feature architecture (offline store + online store) | Demonstrate feature store knowledge |
| 20-28 min | Model architecture + serving diagram | The core system design artifact |
| 28-36 min | Full system diagram with latency annotations | Connect everything, show you can ship |
| 36-41 min | Monitoring dashboard sketch + feedback arrows | Close the loop, show production thinking |

Use boxes for components, arrows for data flow, and annotate with latency/throughput numbers. The diagram should be readable by someone who walks in at minute 30.


Anti-Patterns

Model-First Thinking

Novice: Jumps to "I would use a transformer" or "Let me describe the attention mechanism" in the first 2 minutes, before understanding the problem, defining metrics, or discussing data. Spends 70% of time on model architecture and 0% on serving.

Expert: Spends the first 10 minutes on requirements, metrics, and data before mentioning any model. Names a simple baseline first (logistic regression on handcrafted features), then argues for complexity only when the baseline's limitations are clear. Allocates equal time to serving and monitoring.

Detection: Architecture diagram has a detailed model box but no data pipeline, no feature store, no serving layer, and no monitoring component. Mentions model architecture in the first sentence.

Ignoring the Data

Novice: Assumes clean, labeled data exists at scale. Says "we would train on millions of labeled examples" without discussing where labels come from, how much they cost, what the class distribution looks like, or how stale the data gets.

Expert: Asks about data sources, labeling strategy (human vs weak supervision vs implicit signals), class imbalance handling, data freshness SLA, and data quality monitoring. Discusses the cost of labeling and proposes strategies to reduce it (active learning, semi-supervised methods, synthetic data).

Detection: No discussion of data collection, labeling costs, class imbalance, data quality checks, or data freshness anywhere in the answer. The word "label" does not appear.

No Monitoring Story

Novice: Design ends at the serving layer. No mention of what happens after the model is deployed. Does not discuss how to detect degradation, how to roll back, or how to improve the model over time.

Expert: Discusses data drift detection (population stability index, feature distribution monitoring), model performance decay alerts, A/B testing framework with proper statistical rigor, canary deployment strategy, shadow mode for safe rollouts, and explicit feedback loops that flow data back into retraining.

Detection: Architecture diagram has no monitoring component. No feedback arrows from production back to training. No mention of A/B testing, canary deployment, or rollback.


Reference Files

Consult these for deep dives -- they are NOT loaded by default:

| File | Consult When |
|---|---|
| references/ml-design-templates.md | Working through a specific problem (recommendation, search, RAG, fraud, content mod, perception). Contains 6 fully worked designs with Mermaid diagrams. |
| references/serving-tradeoffs.md | Deep-diving on serving architecture, framework selection, caching, cost optimization, deployment strategies. Contains framework comparisons and latency targets by use case. |
| references/evaluation-metrics-guide.md | Choosing metrics, understanding metric alignment, designing A/B tests, evaluating generative AI. Contains metric decision trees and formulas. |