Cursarium
Intermediate · Certificate · $149

Data Science: Machine Learning

by Rafael Irizarry · edX

4.5
(3,800 reviews)
120K+ enrolled · 8 weeks · Updated 2024-06

Our Verdict

Worth it — with caveats

Data Science: Machine Learning (HarvardX PH125.8x, recently relisted on edX as 'Data Science: Building Machine Learning Models') is the eighth of nine courses in Harvard's Data Science Professional Certificate, taught by Professor Rafael Irizarry. Across six sections it teaches the core supervised-learning workflow in R, culminating in a hands-on MovieLens movie-recommendation system that ties together cross-validation, regularization, matrix factorization and PCA. It is a genuinely strong, rigorous course with an excellent instructor, but multiple independent reviewers and students flag it as the hardest installment of the certificate and note real friction: a steep jump in difficulty, assessments that reference topics (PCA, clustering) only lightly covered in lecture, and occasional bugs or unclear edge cases in assignments. Take it for the depth and the recommendation-system project, but do not treat it as a beginner-friendly first ML course.

A high-quality, rigorous R-based introduction to applied ML with a real capstone-style project and a respected instructor. It assumes comfort with R, linear algebra and basic calculus and is widely reported as the steepest course in the certificate, so it is a strong 'take' for the prepared and a 'skip' for true beginners.

Best for: Learners already comfortable with R (ideally after the earlier courses in the HarvardX Data Science series) who want a rigorous, math-grounded introduction to the supervised machine-learning workflow and a concrete portfolio project (a MovieLens movie-recommendation system). A good fit for self-learners and aspiring data analysts who want to understand the why behind algorithms, not just call library functions, and who are pursuing the full HarvardX Data Science Professional Certificate.

Skip if: Complete beginners with no programming or no linear-algebra/calculus background (multiple reviewers say the difficulty 'ramps up significantly' and is out of step with the certificate's beginner framing); Python-only practitioners who do not want to learn R; and people who want production MLOps, deep learning, or modern frameworks (the course omits SVMs, boosting and neural networks and is R/caret-centric).

About This Course

Part of Harvard's Data Science certificate covering cross-validation, kNN, random forests, and recommendation systems in R.

What You'll Learn

The basics of machine learning: training vs. test sets, conditional probabilities, and how to evaluate algorithms with accuracy, confusion matrices, sensitivity/specificity and F1
How to perform cross-validation to estimate error and avoid overtraining
Several popular algorithms, including k-nearest neighbors (kNN), linear/logistic regression for prediction, smoothing, and tree-based and generative methods (e.g. naive Bayes, QDA/LDA)
How to use the caret package to train, tune and compare models and handle higher-dimensional classification
What regularization is and why it is useful for improving predictions
How to build a movie recommendation system, applying matrix factorization and principal component analysis (PCA) on the MovieLens data

Curriculum

Section 1: Introduction to Machine Learning

Core terminology and concepts: features, outcomes, prediction vs. inference, and the supervised-learning framing used throughout the course.

Section 2: Machine Learning Basics

Building a first algorithm with training and test sets; the role of conditional probabilities; evaluation via accuracy, confusion matrix, sensitivity/specificity, prevalence, ROC and F1; introduction to the caret workflow.
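To make the evaluation vocabulary above concrete, here is a minimal sketch of how those metrics fall out of a confusion matrix. It is written in plain Python purely for illustration (the course itself works in R with caret), and the function names and toy labels are hypothetical, not taken from the course materials.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary outcome."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def evaluation_summary(actual, predicted, positive=1):
    """Derive the usual summary metrics from the four confusion-matrix cells.

    Assumes both classes occur and at least one positive is predicted;
    otherwise one of the ratios below divides by zero.
    """
    tp, fp, tn, fn = confusion_counts(actual, predicted, positive)
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)   # share of actual positives that were caught
    specificity = tn / (tn + fp)   # share of actual negatives that were caught
    precision = tp / (tp + fp)     # share of predicted positives that were right
    return {
        "accuracy": (tp + tn) / total,
        "prevalence": (tp + fn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Hypothetical toy labels: 4 positives and 6 negatives.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
summary = evaluation_summary(actual, predicted)
```

On this toy data accuracy is 0.8 while sensitivity is 0.75, a small reminder of why the course looks beyond accuracy alone when prevalence is unbalanced.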

Section 3: Linear Regression for Prediction, Smoothing, and Working with Matrices

Why linear (and logistic) regression is a useful but often insufficiently flexible baseline; smoothing noisy data (e.g. loess/bin smoothing); and using matrices/matrix algebra for machine learning in R.

Section 4: Distance, kNN, Cross Validation, and Generative Models

Distance metrics; k-nearest neighbors; k-fold and bootstrap cross-validation to tune parameters and avoid overtraining; and discriminative vs. generative approaches (naive Bayes, LDA/QDA).
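The idea of tuning kNN with k-fold cross-validation can be sketched in a few lines. This is an illustrative plain-Python version under stated assumptions, not the course's R/caret code: the helper names, the simulated two-cluster data, and the candidate values of k are all hypothetical, and only k-fold (not bootstrap) resampling is shown.

```python
import math
import random


def knn_predict(train_x, train_y, point, k):
    """Majority vote among the k training points closest in Euclidean distance."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: math.dist(train_x[i], point))[:k]
    votes = [train_y[i] for i in nearest]
    # sorted() makes tie-breaking deterministic (smallest label wins a tie)
    return max(sorted(set(votes)), key=votes.count)


def cv_accuracy(xs, ys, k, n_folds=5, seed=0):
    """Average held-out accuracy of kNN over n_folds cross-validation folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[f::n_folds] for f in range(n_folds)]
    fold_scores = []
    for held_out in folds:
        held = set(held_out)
        train_x = [xs[i] for i in indices if i not in held]
        train_y = [ys[i] for i in indices if i not in held]
        hits = sum(knn_predict(train_x, train_y, xs[i], k) == ys[i]
                   for i in held_out)
        fold_scores.append(hits / len(held_out))
    return sum(fold_scores) / n_folds


def make_toy_data(n_per_class=40, seed=1):
    """Simulate two well-separated 2-D Gaussian clusters (hypothetical data)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for label, center in ((0, 0.0), (1, 5.0)):
        for _ in range(n_per_class):
            xs.append((rng.gauss(center, 1.0), rng.gauss(center, 1.0)))
            ys.append(label)
    return xs, ys


xs, ys = make_toy_data()
# Compare candidate neighborhood sizes using held-out data only.
scores = {k: cv_accuracy(xs, ys, k) for k in (1, 3, 5, 7)}
best_k = max(scores, key=scores.get)
```

The point of the design is that each candidate k is judged only on observations the model did not see while predicting, which is what guards against overtraining.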

Section 5: Classification with More than Two Classes and the Caret Package

Multi-class classification; the curse of dimensionality and methods that adapt to higher dimensions, including classification and regression trees and random forests; and practical use of the caret package.

Section 6: Model Fitting and Recommendation Systems

Capstone-style synthesis: applying the algorithms learned, regularization, principal component analysis (PCA) and matrix factorization to build a movie recommendation system on the MovieLens dataset.
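As a rough illustration of why regularization helps in a recommendation setting, the sketch below shrinks per-movie average effects toward zero using a penalty term. It assumes one common penalized-average formulation (sum of residuals divided by the rating count plus a penalty `lam`); the function name, the tiny ratings list, and the choice of `lam` are hypothetical, and the code is plain Python rather than the R used in the course.

```python
from collections import defaultdict


def regularized_movie_effects(ratings, lam):
    """Return the overall mean rating and a shrunken effect per movie.

    ratings: list of (movie_id, rating) pairs.
    lam: penalty; larger values pull effects of rarely rated movies
         more strongly toward zero. lam = 0 gives the plain average residual.
    """
    mu = sum(r for _, r in ratings) / len(ratings)
    residual_sums = defaultdict(float)
    counts = defaultdict(int)
    for movie, rating in ratings:
        residual_sums[movie] += rating - mu
        counts[movie] += 1
    effects = {m: residual_sums[m] / (counts[m] + lam) for m in residual_sums}
    return mu, effects


# Hypothetical toy data: movie "A" has a single 5-star rating,
# movie "B" has four ratings of 2.5.
ratings = [("A", 5.0), ("B", 2.5), ("B", 2.5), ("B", 2.5), ("B", 2.5)]

mu, plain = regularized_movie_effects(ratings, lam=0)
_, shrunk = regularized_movie_effects(ratings, lam=3)
```

With `lam=3`, the effect for the once-rated movie "A" drops from 2.0 to 0.5, while the effect for "B", which is backed by four ratings, moves much less (from -0.5 to about -0.29). That asymmetry is the intuition behind penalizing estimates built on small samples.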

Prerequisites

  • Working knowledge of R and RStudio (the course assumes prior HarvardX Data Science courses such as R Basics, Visualization, Probability, and Inference & Modeling)
  • Comfort with basic linear algebra (matrices) and introductory calculus, plus probability/statistics fundamentals
  • Familiarity with tidyverse-style data wrangling; no prior machine-learning experience required, but the pace assumes mathematical maturity

Instructor

Rafael Irizarry

Instructor · edX

Pros & Cons

Pros

  • Excellent instruction from Professor Rafael Irizarry (Harvard biostatistics): reviewers consistently note he explains clearly, succinctly and slowly enough that motivated learners can follow even difficult material
  • Rigorous, math-grounded treatment that teaches the intuition and mechanics behind algorithms (cross-validation, regularization, generative models) rather than just API calls
  • A concrete, portfolio-worthy capstone project: building a MovieLens movie-recommendation system that integrates regularization, PCA and matrix factorization
  • Free to audit, self-paced, and backed by an active discussion board and a free companion textbook (Irizarry's 'Introduction to Data Science'), so the full content is accessible at no cost
  • Hands-on R/caret practice that maps directly onto real data-analysis work and prepares learners for the certificate's capstone

Cons

  • Steep, widely reported difficulty spike: independent reviewers and students call it the hardest course in the certificate and say it is poorly aligned with the program's 'no experience needed' beginner framing
  • Assessment mismatch: several reviewers report graded assignments/exam questions that reference topics (e.g. PCA, clustering) given only light lecture coverage, plus occasional bugs or unclear edge cases in assignments
  • Narrow algorithmic and tooling scope: it is R/caret-centric and omits widely used methods such as SVMs, boosting/gradient boosting and neural networks/deep learning
  • R-only: not suitable for learners who want or need Python, which is the more common industry ML language

Frequently Asked Questions

Is Data Science: Machine Learning free?

Yes, to audit. The full course content is free to audit; the verified certificate costs $149 on edX. The certificate is only needed if you want the credential or are completing the paid HarvardX Data Science Professional Certificate; the lectures, exercises and companion textbook are otherwise available at no cost. Audit access may be time-limited on self-paced runs. Verify the current price on edX, as promo discounts and certificate pricing change.

Who is Data Science: Machine Learning for?

Learners already comfortable with R (ideally after the earlier courses in the HarvardX Data Science series) who want a rigorous, math-grounded introduction to the supervised machine-learning workflow and a concrete portfolio project (a MovieLens movie-recommendation system). A good fit for self-learners and aspiring data analysts who want to understand the why behind algorithms, not just call library functions, and who are pursuing the full HarvardX Data Science Professional Certificate.

What will you learn in Data Science: Machine Learning?

  • The basics of machine learning: training vs. test sets, conditional probabilities, and how to evaluate algorithms with accuracy, confusion matrices, sensitivity/specificity and F1
  • How to perform cross-validation to estimate error and avoid overtraining
  • Several popular algorithms, including k-nearest neighbors (kNN), linear/logistic regression for prediction, smoothing, and tree-based and generative methods (e.g. naive Bayes, QDA/LDA)
  • How to use the caret package to train, tune and compare models and handle higher-dimensional classification

What are the prerequisites for Data Science: Machine Learning?

  • Working knowledge of R and RStudio (the course assumes prior HarvardX Data Science courses such as R Basics, Visualization, Probability, and Inference & Modeling)
  • Comfort with basic linear algebra (matrices) and introductory calculus, plus probability/statistics fundamentals
  • Familiarity with tidyverse-style data wrangling; no prior machine-learning experience required, but the pace assumes mathematical maturity

Is Data Science: Machine Learning worth it?

Yes, for prepared learners. It is a high-quality, rigorous R-based introduction to applied ML with a real capstone-style project and a respected instructor. It assumes comfort with R, linear algebra and basic calculus and is widely reported as the steepest course in the certificate, so it is a strong 'take' for the prepared and a 'skip' for true beginners.