Cursarium
Intermediate · Certificate · Free

Intermediate Machine Learning

by Alexis Cook · Kaggle

4.5 (7,200 reviews)
400K+ enrolled · 4 hours · Updated 2024-03

Our Verdict

Worth taking

Kaggle's free Intermediate Machine Learning is a focused, roughly 4-hour micro-course by Alexis Cook, made up of an Introduction plus six hands-on lessons. It teaches the practical scikit-learn and XGBoost workflow needed to make real tabular models work: handling missing values, encoding categorical variables, bundling preprocessing into pipelines, cross-validation, gradient boosting with XGBoost, and detecting data leakage.

Our verdict, based on the official syllabus, the GitHub mirror of its notebooks, and independent reviews: it is one of the best free ways to close the gap between a beginner who can fit a model and a practitioner who can structure a defensible ML workflow on messy data. Its genuine strength is that every lesson pairs a short reading with an in-browser coded exercise on the Ames Housing dataset, so you actually run SimpleImputer, OneHotEncoder, Pipeline, cross_val_score, and XGBRegressor (with n_estimators, learning_rate, and early_stopping_rounds) yourself.

The honest limitation, flagged by independent reviewers, is twofold: the course treats algorithms as black boxes (no math or theory of how XGBoost works), and several later lessons lean on copying the provided code rather than writing it from scratch. Take it as a fast, free skills top-up, not as a course that will teach you why the methods work.

It is free, short (~4 hours), and delivers exactly the high-leverage tabular-ML skills (pipelines, cross-validation, XGBoost, leakage) that beginners typically lack, with hands-on coded exercises and a free completion certificate. The only real caveats are its deliberate lack of theory and some copy-paste-heavy lessons, which keep it from being a standalone education but do not undermine its value as a focused practical add-on.

Best for: Learners who have finished Kaggle's Intro to Machine Learning (or equivalent) and can already train a basic scikit-learn model, and now want to handle real-world messy data and ship more accurate models. Ideal for aspiring data analysts/scientists preparing for Kaggle tabular competitions or job tasks, bootcamp students wanting a free practical supplement, and working developers who need the practical XGBoost + pipeline workflow quickly without theory.

Skip if: You are a complete beginner who has never trained a model (start with Kaggle's Intro to Machine Learning), or you want to understand the mathematics and internals of the algorithms (how gradient boosting actually works), since the course intentionally treats models as black boxes. It is also not ideal for those focused on deep learning, neural networks, NLP, or computer vision, as the scope is strictly classical tabular ML in scikit-learn and XGBoost.

About This Course

Handle missing values, categorical variables, pipelines, cross-validation, XGBoost, and data leakage in ML workflows.

What You'll Learn

Handle missing values three ways: dropping columns, imputation with SimpleImputer, and imputation plus a 'was-missing' indicator column
Encode categorical (non-numeric) variables using ordinal encoding and one-hot encoding, and know when each is appropriate
Bundle preprocessing and modeling into scikit-learn Pipelines (with ColumnTransformer) for cleaner, less bug-prone, production-ready code
Use k-fold cross-validation via cross_val_score to estimate model performance more reliably than a single validation split
Build and tune gradient-boosted models with XGBoost's XGBRegressor, adjusting n_estimators, learning_rate, and early_stopping_rounds and evaluating with mean absolute error
Detect and prevent data leakage, distinguishing target leakage from train-test contamination so models don't look great in validation but fail in production

Curriculum

Introduction

Sets up the course and the workflow, building on Intro to Machine Learning; uses the Housing Prices (Ames Housing) competition dataset for the exercises.

Missing Values

Three strategies for missing data: drop columns, impute with SimpleImputer (mean/median/most-frequent), and imputation extended with an indicator column flagging where values were originally missing.
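
The three strategies can be sketched roughly as follows. This is a minimal illustration on a made-up toy DataFrame, not the course's own notebook code; the column values and the `_was_missing` suffix are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for housing features (values are made up for illustration).
X = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan, 11250.0],
    "YearBuilt": [2003.0, 1976.0, 2001.0, np.nan],
    "Rooms": [8.0, 6.0, 6.0, 7.0],
})

# Strategy 1: drop every column that contains a missing value.
cols_with_missing = [col for col in X.columns if X[col].isna().any()]
X_dropped = X.drop(columns=cols_with_missing)

# Strategy 2: fill each gap with the column mean.
imputer = SimpleImputer(strategy="mean")
X_imputed = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)

# Strategy 3: record where values were missing, then impute.
X_plus = X.copy()
for col in cols_with_missing:
    # 1 where the value was originally missing, 0 otherwise.
    X_plus[col + "_was_missing"] = X_plus[col].isna().astype(int)
X_flagged = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(X_plus),
    columns=X_plus.columns,
)

print(X_dropped.columns.tolist())
print(X_flagged.columns.tolist())
```

On real data the imputer would be fitted on the training split only and then applied to the validation split.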

Categorical Variables

Using non-numeric data in models via three approaches: dropping categorical columns, ordinal encoding for ranked categories, and one-hot encoding to avoid imposing a false order.

Pipelines

Bundling preprocessing and modeling with ColumnTransformer and Pipeline for cleaner code, fewer bugs, and easier deployment; pipelines also enable clean cross-validation.
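
As a rough sketch of what such a bundle can look like, the example below wires a ColumnTransformer and a model into a single Pipeline. The toy data, the step names, and the choice of RandomForestRegressor are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (made up) with gaps in both numeric and categorical columns.
X = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan, 11250.0, 9550.0, 14260.0],
    "YearBuilt": [2003.0, 1976.0, 2001.0, np.nan, 1915.0, 2000.0],
    "Street": ["Pave", "Grvl", "Pave", np.nan, "Pave", "Grvl"],
})
y = pd.Series([208500, 181500, 223500, 140000, 250000, 143000])

numeric_cols = ["LotArea", "YearBuilt"]
categorical_cols = ["Street"]

# Numeric columns: fill gaps with the median.
numeric_transformer = SimpleImputer(strategy="median")

# Categorical columns: fill gaps with the most frequent value, then one-hot.
categorical_transformer = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Route each group of columns to its own preprocessing.
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_cols),
    ("cat", categorical_transformer, categorical_cols),
])

# One object that preprocesses and then fits or predicts.
model = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("regressor", RandomForestRegressor(n_estimators=50, random_state=0)),
])

model.fit(X, y)
preds = model.predict(X)
print(preds)
```

Because preprocessing lives inside the pipeline, a single `fit` call learns the imputation values and encodings from whatever data it is given, which is what makes it safe to reuse inside cross-validation.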

Cross-Validation

k-fold cross-validation with cross_val_score to get a more robust performance estimate than a single train/validation split, and how it interacts with pipelines.
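
A minimal sketch of 5-fold cross-validation over a pipeline, using synthetic data from `make_regression` rather than the course dataset. The fold count and model settings are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic regression data with roughly 5% of entries knocked out.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

pipeline = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])

# scikit-learn reports negated MAE (higher is better), so flip the sign.
scores = -1 * cross_val_score(
    pipeline, X, y, cv=5, scoring="neg_mean_absolute_error"
)

print("MAE per fold:", scores)
print("Average MAE:", scores.mean())
```

Passing the whole pipeline to `cross_val_score` means the imputer is refitted on the training portion of each fold, so the held-out fold never influences preprocessing.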

XGBoost

Gradient boosting with XGBRegressor from the xgboost library (outside scikit-learn but included for its competition performance); tuning n_estimators, learning_rate, and early_stopping_rounds, evaluated with mean_absolute_error on the Ames Housing data.

Data Leakage

Identifying and removing leakage: target leakage (predictors containing information unavailable at prediction time) versus train-test contamination (validation data influencing preprocessing), with case-study examples.
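
Train-test contamination can be illustrated with a small sketch that compares an imputer fitted on all rows against one fitted on the training split alone. The data is synthetic and the variable names are assumptions for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic single-feature data with about 20% missing values.
rng = np.random.default_rng(0)
X = rng.normal(loc=100.0, scale=20.0, size=(200, 1))
X[rng.random(X.shape) < 0.2] = np.nan

X_train, X_valid = train_test_split(X, test_size=0.25, random_state=0)

# Contaminated: the fill value is computed from ALL rows, so validation
# rows influence the preprocessing applied to the training data.
leaky_imputer = SimpleImputer(strategy="mean").fit(X)

# Clean: the fill value comes from training rows only and is then
# applied unchanged to the validation rows.
clean_imputer = SimpleImputer(strategy="mean").fit(X_train)
X_valid_clean = clean_imputer.transform(X_valid)

print("Fill value learned from all rows:  ", leaky_imputer.statistics_[0])
print("Fill value learned from train only:", clean_imputer.statistics_[0])
```

Target leakage is handled differently: the usual remedy is to drop any predictor whose value would only be known after the target is determined.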

Prerequisites

  • Completion of Kaggle's Intro to Machine Learning (or equivalent ability to train a basic scikit-learn model and use train/validation splits)
  • Working Python knowledge (functions, loops, imports)
  • Basic pandas familiarity (DataFrames, reading CSVs, selecting columns)
  • No setup required — exercises run in Kaggle's in-browser notebooks

Instructor

Alexis Cook

Instructor · Kaggle

Pros & Cons

Pros

  • Completely free, no paywall, and grants a Kaggle certificate of completion; all coding is done in Kaggle's in-browser notebooks with no local setup
  • Tightly scoped to high-leverage, immediately useful tabular-ML skills — pipelines, cross-validation, XGBoost, and leakage prevention — that beginners commonly miss
  • Hands-on by design: each lesson pairs a short reading with a coded exercise on the real Ames Housing dataset, so you actually run SimpleImputer, OneHotEncoder, Pipeline, cross_val_score, and XGBRegressor
  • Very short time investment (~4 hours), making it an efficient supplement after an intro ML course or before a Kaggle competition
  • Taught by Alexis Cook, a recognized Kaggle Learn instructor, with a clear, practical, step-by-step style noted positively by independent reviewers

Cons

  • Intentionally treats algorithms as black boxes — it teaches how to use XGBoost but not how gradient boosting works, so you must look elsewhere for the underlying math and theory
  • Several later lessons are copy-paste heavy: independent reviewers note most original coding happens in the Missing Values and Categorical Variables lessons, while later exercises largely reuse the provided code
  • Narrow scope: strictly classical tabular ML in scikit-learn/XGBoost, with no coverage of deep learning, neural networks, NLP, or model deployment beyond pipelines
  • Short length means limited practice volume — you will need separate competitions or projects to truly internalize the skills

Frequently Asked Questions

Is Intermediate Machine Learning free?

Yes. Intermediate Machine Learning is 100% free. There is no paid tier and no audit-vs-paid distinction: the full course, in-browser notebooks, and a Kaggle certificate of completion are all free with a Kaggle account.

Who is Intermediate Machine Learning for?

Learners who have finished Kaggle's Intro to Machine Learning (or equivalent) and can already train a basic scikit-learn model, and now want to handle real-world messy data and ship more accurate models. Ideal for aspiring data analysts/scientists preparing for Kaggle tabular competitions or job tasks, bootcamp students wanting a free practical supplement, and working developers who need the practical XGBoost + pipeline workflow quickly without theory.

What will you learn in Intermediate Machine Learning?

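  • Handle missing values three ways: dropping columns, imputation with SimpleImputer, and imputation plus a 'was-missing' indicator column
  • Encode categorical (non-numeric) variables using ordinal encoding and one-hot encoding, and know when each is appropriate
  • Bundle preprocessing and modeling into scikit-learn Pipelines (with ColumnTransformer) for cleaner, less bug-prone, production-ready code
  • Use k-fold cross-validation via cross_val_score to estimate model performance more reliably than a single validation split
  • Build and tune gradient-boosted models with XGBoost's XGBRegressor, adjusting n_estimators, learning_rate, and early_stopping_rounds and evaluating with mean absolute error
  • Detect and prevent data leakage, distinguishing target leakage from train-test contamination so models don't look great in validation but fail in production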

What are the prerequisites for Intermediate Machine Learning?

  • Completion of Kaggle's Intro to Machine Learning (or equivalent ability to train a basic scikit-learn model and use train/validation splits)
  • Working Python knowledge (functions, loops, imports)
  • Basic pandas familiarity (DataFrames, reading CSVs, selecting columns)
  • No setup required: exercises run in Kaggle's in-browser notebooks

Is Intermediate Machine Learning worth it?

It is free, short (~4 hours), and delivers exactly the high-leverage tabular-ML skills (pipelines, cross-validation, XGBoost, leakage) that beginners typically lack, with hands-on coded exercises and a free completion certificate. The only real caveats are its deliberate lack of theory and some copy-paste-heavy lessons, which keep it from being a standalone education but do not undermine its value as a focused practical add-on.