Machine Learning
Production Project Checklist
- Frame the problem. Define a clear and concise objective with clear metrics. Write it as a design doc. To know what “good enough” means, you have to collect and annotate more data than most people and organizations are willing to.
- Get the data. Make the data tidy. Machine learning models are only as reliable as the data used to train them. The data matters more than the model. The main bottleneck is collecting enough high-quality data and getting it properly annotated and verified, then doing proper evals with humans in the loop to get it right.
- Explore the data. Verify any assumptions. Garbage in, garbage out. Remove ALL friction from looking at data.
- Create a model. Start with the simplest model! That will be the baseline model. Evaluate the model with the defined metric (see the baseline sketch below).
- Make sure everything works end to end. You design it, you train it, you deploy it. Deploy the model quickly and automatically. Add a clear description of the model. Monitor model performance in production.
- Make results (models, analysis, graphs, …) reproducible (code, environment and data). Version your code, data and configuration. Make feature dependencies explicit in the code. Separate code from configuration.
- Test every part of the system (ML Test Score): data (distributions, unexpected values, biases, …), model development, ML infrastructure, and monitoring (see the data-test sketch below).
- Iterate. Deliver value first, then iterate. Go back to the first point and change one thing at a time. Machine Learning progress is nonlinear. It’s really hard to tell in advance what’s hard and what’s easy.
- Engineering projects generally move forward, but machine learning projects can completely stall. It’s possible, even common, for a week spent on modeling data to result in no improvement whatsoever.
- Track every experiment you do. Keep a reverse-chronological doc where you bullet-point the ideas you’ve tried and how they went. Look for a data flywheel: harnessing user-generated data to rapidly improve the whole system. These are powerful feedback loops. Attempt a portfolio of approaches.
- Explain your results in terms your audience cares about. Data is only useful as long as it's being used.
These points are expanded with more details in courses like Made With ML.
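As a concrete starting point for the “simplest model” step, here is a minimal baseline sketch. It assumes scikit-learn, a binary classification task, and ROC AUC as the chosen metric; the synthetic data is only a placeholder for your real, versioned dataset.

```python
# Minimal baseline sketch (assumes scikit-learn; swap in your own data loading
# and the metric you defined in the design doc).
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder data; replace with your real, versioned dataset.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1. The simplest possible model: predict the class prior. This is the floor.
dummy = DummyClassifier(strategy="prior").fit(X_train, y_train)

# 2. The simplest "real" model: a linear classifier.
baseline = LogisticRegression(max_iter=1_000).fit(X_train, y_train)

# 3. Evaluate both with the metric defined up front, so every later model
#    has a number to beat.
for name, model in [("dummy", dummy), ("logistic", baseline)]:
    score = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: ROC AUC = {score:.3f}")
```

Every more complex model is then justified only by beating these numbers on the same metric and split.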
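For the data-testing step, a sketch of lightweight, pytest-style data tests that can run in CI. The loader, column names, and thresholds are hypothetical; adapt them to your own schema. This is not a full ML Test Score suite, just the “unexpected values and distributions” slice of it.

```python
# Lightweight data tests (assumes pandas; the loader and columns are hypothetical).
import pandas as pd


def load_training_data() -> pd.DataFrame:
    # Hypothetical loader; replace with your versioned dataset.
    return pd.read_parquet("data/train.parquet")


def test_no_unexpected_values():
    df = load_training_data()
    assert df["age"].between(0, 120).all(), "age outside plausible range"
    assert df["label"].isin([0, 1]).all(), "labels must be binary"
    assert not df["user_id"].duplicated().any(), "duplicate rows can leak into evaluation"


def test_distribution_is_stable():
    df = load_training_data()
    # Crude drift check: the positive rate should stay near what we saw at design time.
    positive_rate = df["label"].mean()
    assert 0.05 < positive_rate < 0.5, f"unexpected positive rate: {positive_rate:.3f}"
```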
Evals
Don’t hope for “great”, specify it, measure it, and improve toward it!
- Evals make fuzzy goals and abstract ideas specific and explicit. They help you systematically measure and improve a system.
- Evals are a key set of tools and methods to measure and improve the ability of an AI system to meet expectations.
- Success with AI hinges on how fast you can iterate. You must have processes and tools for evaluating quality (tests), debugging issues (logging, inspecting data), and changing the behavior of the system (prompt engineering, fine-tuning, writing code).
- Collecting good evals will make you understand the problem better.
- Working with probabilistic systems requires new kinds of measurement and deeper consideration of trade-offs.
- Evals don’t work if you cannot define what “great” means for your use case.
- Evals replace LGTM-vibes development. They systematize quality when outputs are non-deterministic.
- Error analysis workflow: build a simple trace viewer, review ~100 traces, annotate the first upstream failure (open coding), cluster into themes (axial coding), and use counts to prioritize (see the counting sketch after this list). Bootstrap with grounded synthetic data if real data is thin.
- Pick the right evaluator: code-based assertions for deterministic failures; LLM-as-judge for subjective ones. Keep labels binary (PASS/FAIL) with human critiques. Partition data so the judge cannot memorize answers; validate the judge against human labels (TPR/TNR) before trusting it (see the judge sketch after this list).
- Run evals in CI/CD and keep monitoring with production data.
- Good eval metrics:
  - Measure an error you’ve observed.
  - Relate to a non-trivial issue you will iterate on.
  - Are scoped to a specific failure.
  - Have a binary outcome (not a 1–5 score).
  - Are verifiable (e.g. human labels for LLM-as-a-Judge).
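To make the error-analysis workflow concrete, a small sketch of the annotation-and-counting step. The trace fields and theme names are hypothetical; the point is that after open and axial coding, simple counts are enough to decide what to fix first.

```python
# Counting step of error analysis (trace fields and themes are illustrative).
from collections import Counter
from dataclasses import dataclass


@dataclass
class TraceAnnotation:
    trace_id: str
    passed: bool
    first_failure: str  # open code: free-text note on the first upstream failure
    theme: str          # axial code: the cluster this failure belongs to


annotations = [
    TraceAnnotation("t1", False, "retrieved the wrong document", "retrieval"),
    TraceAnnotation("t2", False, "ignored the user's date constraint", "instruction-following"),
    TraceAnnotation("t3", True, "", ""),
    TraceAnnotation("t4", False, "retrieved the wrong document", "retrieval"),
]

failures = Counter(a.theme for a in annotations if not a.passed)
for theme, count in failures.most_common():
    print(f"{theme}: {count} failing traces")  # fix the biggest bucket first
```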
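For the LLM-as-judge option, a sketch of a binary PASS/FAIL judge plus a helper for measuring agreement with human labels (TPR/TNR). The `call_llm` function and the prompt are placeholders, not any particular provider’s API.

```python
# Binary LLM-as-judge sketch (call_llm is a placeholder for your model client).
JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}

Does the answer fully and correctly address the question?
Reply with exactly one word: PASS or FAIL."""


def call_llm(prompt: str) -> str:
    # Hypothetical: wire this to whatever model client you actually use.
    raise NotImplementedError


def judge(question: str, answer: str) -> bool:
    verdict = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return verdict.strip().upper().startswith("PASS")


def agreement_with_humans(examples: list[dict]) -> tuple[float, float]:
    # examples: [{"question": ..., "answer": ..., "human_pass": True}, ...]
    # Returns (TPR, TNR) of the judge measured against human labels.
    tp = tn = pos = neg = 0
    for e in examples:
        predicted = judge(e["question"], e["answer"])
        if e["human_pass"]:
            pos += 1
            tp += predicted
        else:
            neg += 1
            tn += not predicted
    return tp / max(pos, 1), tn / max(neg, 1)
```

Only once TPR and TNR on a held-out, human-labeled set are acceptable should the judge be used to grade new outputs.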
The Eval Loop
- Specify.
  - Define what “great” means.
  - Write down the purpose of your AI system in plain terms.
  - The resulting golden set of examples should be a living, authoritative reference of your most skilled experts’ judgement and taste for what “great” looks like.
  - The process is iterative and messy.
- Measure.
  - Test against real-world conditions. Reliably surface concrete examples of how and when the system is failing.
  - Use examples drawn from real-world situations whenever possible.
- Improve.
  - Learn from errors.
  - Addressing problems uncovered by your eval can take many forms: refining prompts, adjusting data access, updating the eval itself to better reflect your goals, …
ML In Production Resources
- Applied ML in Production
- Applied ML
- Microsoft ML Model Production Checklist and Fundamental Checklist
- Engineering best practices for Machine Learning
- Full Stack Deep Learning
- Awesome Production Machine Learning
- Awesome Machine Learning Engineer
- Machine Learning Engineer Roadmap
- Awesome MLOps. Another awesome MLOps
- Made With ML
- Scikit-Learn Related Projects
- Getting machine learning to production
Machine Learning Technical Debt
Tech debt is an analogy for the long-term buildup of costs when engineers make design choices for speed of deployment over everything else. Fixing technical debt can take a lot of work.
- Track data dependencies.
- Version the datasets.
- Make sure your data isn’t all noise and no signal by checking that your model is at least capable of overfitting (see the sketch below).
- Use reproducibility checklists when releasing code.
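One common way to run the overfitting check is to train on a single tiny batch: if the loss does not approach zero, either the pipeline is broken or the data carries no learnable signal. A minimal sketch, assuming PyTorch and placeholder data:

```python
# Overfit-a-tiny-batch sanity check (model and data are placeholders).
import torch
from torch import nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
x = torch.randn(8, 16)           # one small, fixed batch
y = torch.randint(0, 2, (8,))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for step in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

print(f"final loss: {loss.item():.4f}")  # should be close to 0 if everything is wired up
```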
Resources
- The Open-Source Data Science Masters
- The Data Visualization Catalogue and Project.
- Visualization Curriculum
- Chart Dos and Don’ts
- Machine Learning Tutorials
- Data looks better naked
- Guides for Visualizing Reality
- Model Interpretability
- Diverse Counterfactuals
- Curated ML Templates
- Data Science Project Template