Intermediate · Free

Building Multimodal Search and RAG

Rating: 4.5 (2800 reviews)

by Sebastian Witalec · DeepLearning.AI

45K+ enrolled · 1 hour · Updated 2024-10

Our Verdict

Worth it — with caveats

Building Multimodal Search and RAG is a free, roughly 1.5-hour DeepLearning.AI short course built with Weaviate and taught by Sebastian Witalec, Weaviate's Head of Developer Relations. It is a strong, hands-on introduction to any-to-any (text, image, video) retrieval and multimodal RAG for engineers who already know basic Python and standard text RAG.

Across eight short video lessons and six runnable notebooks, you implement contrastive multimodal embeddings on MNIST, build text-to-any and any-to-any search with Weaviate plus Google Vertex AI's multimodalembedding model, query images with Gemini Vision, assemble an end-to-end multimodal RAG pipeline, extract structured data from invoices and tables, and build a multi-vector recommender on a movies dataset. On Coursera it holds a 4.5/5 rating from 43 reviews (about 6,730 enrolled there), and learners praise it as a concise, practical primer rather than a deep theoretical course.

The major caveat is currency: as of 2026 the course has been flagged for maintenance, and community threads report broken lab code tied to Google's deprecation of the PaLM API and Vertex AI authentication headers, so the notebooks may not run as recorded without fixes. Treat it as a conceptual and pattern walkthrough worth taking for free, but expect to debug or substitute the Google/Vertex API calls if you run the labs end to end.

The content, instructor, and free price make this a high-value 1.5-hour primer on multimodal embeddings and RAG. It is also short, offers no certificate, and (as of 2026) has known lab breakage from deprecated Google PaLM/Vertex APIs, so it is an easy recommendation only for learners comfortable patching outdated API code.

Best for: Intermediate ML/AI engineers and backend developers who already understand standard text-based RAG, embeddings, and vector databases, and want a fast, hands-on introduction to extending those patterns to images and video using multimodal embeddings, Weaviate, and Gemini/Vertex. Ideal for people evaluating whether multimodal search/RAG fits a product, or who learn best from short runnable notebooks over long lectures.

Skip if: Complete beginners with no Python or no prior RAG/embeddings exposure (the course is explicitly intermediate and moves quickly); anyone who needs a completion certificate; learners who want deep theory on multimodal model training; and anyone who needs labs that run flawlessly out of the box right now, given the documented deprecated-API breakage and maintenance status.

About This Course

Build search and RAG systems that handle text, images, and video using multimodal embeddings and Weaviate.

What You'll Learn

How multimodality works via contrastive representation learning, including building a simple contrastive model on the MNIST dataset to unify text and image embeddings

Building text-to-any and any-to-any search over images and video using Weaviate and Google Vertex AI's multimodalembedding model

How large multimodal models (LMMs) combine LLMs with vision, and querying images directly with Google's Gemini Vision model

Implementing an end-to-end multimodal RAG (MM-RAG) pipeline that retrieves multimodal context from Weaviate and reasons over it to generate answers

Industry applications: extracting structured data from images such as invoices, tables, and flowcharts, then answering queries over the extracted data with LLM reasoning

Building a multi-vector recommender system over a movies dataset using both OpenAI embeddings and Vertex multimodal embeddings
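To make the first item in this list concrete, here is a minimal sketch of a contrastive objective in plain Python. It is not taken from the course notebooks, which may use a different loss; this is an InfoNCE-style loss chosen for illustration, and the 2-D vectors are hand-written stand-ins for image and text embeddings. The function name `info_nce` and the `temperature` value are assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    """Mean cross-entropy where anchors[i] should match positives[i];
    every other positive in the batch acts as a negative."""
    anchors = [normalize(a) for a in anchors]
    positives = [normalize(p) for p in positives]
    total = 0.0
    for i, a in enumerate(anchors):
        # Cosine similarities scaled by temperature act as logits.
        logits = [dot(a, p) / temperature for p in positives]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)


# Hypothetical toy embeddings: two images and their two captions.
image_vecs = [[1.0, 0.1], [0.1, 1.0]]
text_aligned = [[0.9, 0.2], [0.2, 0.9]]   # caption i points the same way as image i
text_swapped = [[0.2, 0.9], [0.9, 0.2]]   # captions paired with the wrong image

loss_aligned = info_nce(image_vecs, text_aligned)
loss_swapped = info_nce(image_vecs, text_swapped)
print(loss_aligned, loss_swapped)
```

The loss is small when each image sits close to its own caption and large when the pairs are swapped. Training an encoder to minimize a loss of this kind is what pulls matching text and image embeddings into one shared space.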

Curriculum

Introduction

Course overview and framing of multimodal search and RAG.

Overview of Multimodality

Unifying multimodal embedding models with contrastive representation learning; builds a contrastive model on the MNIST text+image dataset with dimensionality-reduction visualization.

Multimodal Search

Text-to-any and any-to-any search across images and video using Weaviate and Google Vertex AI's multimodalembedding model.

Large Multimodal Models (LMMs)

How LLMs integrate with vision to form language-vision models; querying images with Google's Gemini Vision model.

Multimodal RAG (MM-RAG)

Combining retrieval from Weaviate with a large multimodal model to build an end-to-end multimodal RAG pipeline.

Industry Applications

Extracting structured data from images (invoices, tables, flowcharts) and using LLM reasoning to answer queries over the extracted data.

Multimodal Recommender System

Building a multi-vector recommender on a movies dataset using OpenAI embeddings alongside Vertex multimodal embeddings.
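The search lessons in this curriculum rest on one idea: once every item lives in a shared embedding space, any-to-any search is a nearest-neighbor lookup regardless of modality. The sketch below illustrates that idea under stated assumptions. The vectors are hand-written stand-ins rather than output of the Vertex multimodalembedding model, the file names are hypothetical, and a vector database such as Weaviate would use an approximate index instead of this brute-force scan.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical index: items of different modalities share one vector space.
index = [
    {"id": "dog.jpg", "modality": "image", "vector": [0.9, 0.1, 0.0]},
    {"id": "dog_park.mp4", "modality": "video", "vector": [0.8, 0.2, 0.1]},
    {"id": "invoice.png", "modality": "image", "vector": [0.0, 0.1, 0.9]},
]


def search(query_vector, items, k=2):
    """Return the ids of the k items most similar to the query vector."""
    ranked = sorted(
        items,
        key=lambda item: cosine(query_vector, item["vector"]),
        reverse=True,
    )
    return [item["id"] for item in ranked[:k]]


# Stands in for an embedded text query such as "a dog".
text_query_vector = [1.0, 0.0, 0.0]
top_hits = search(text_query_vector, index)
print(top_hits)
```

The same `search` function works when the query vector comes from an image or a video clip, which is what "any-to-any" means here. In a multimodal RAG pipeline, the retrieved items would then be passed to a large multimodal model as context for generation.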

Prerequisites

Working Python proficiency (you run and modify Jupyter-style notebooks)
Familiarity with embeddings, vector databases, and basic text RAG concepts
Comfort reading API docs and patching deprecated Google/Vertex API calls if running labs in 2026
A Google Cloud / Vertex AI (and optionally OpenAI) API key to fully exercise the multimodal embedding and generation labs

Instructor

Sebastian Witalec

Instructor · DeepLearning.AI

Pros & Cons

Pros

Free, concise, and genuinely hands-on: six runnable notebooks let you implement multimodal embeddings, search, RAG, and a recommender in roughly 1.5 hours
Taught by Sebastian Witalec, Weaviate's Head of Developer Relations, so the vector-database and multimodal-search workflows reflect real production tooling
Covers an unusually broad, practical arc for a short course: contrastive embeddings, any-to-any retrieval, Gemini Vision, end-to-end MM-RAG, structured extraction, and recommendations
Strong reception on Coursera (4.5/5 from 43 reviews) and a clear, applied teaching style that builds intuition fast
Uses a recognizable industry stack (Weaviate + Google Vertex/Gemini + OpenAI embeddings) that maps directly to building your own multimodal pipelines

Cons

As of 2026 the course is flagged for maintenance and community threads report broken lab code due to Google's deprecation of the PaLM API and Vertex AI auth headers, so notebooks may not run as recorded without fixes
Very short and surface-level for the topics covered: it introduces multimodal RAG patterns but does not go deep on model training, evaluation, or production scaling
No certificate of completion, which matters for learners who want a credential
Tightly coupled to Weaviate and Google Cloud/Vertex, so concepts are taught through a specific vendor stack rather than tool-agnostically

Alternatives To Consider

LangChain for LLM Application Development

DeepLearning.AI

Generative AI with Large Language Models

Coursera

NLP Course

Hugging Face

Frequently Asked Questions

Is Building Multimodal Search and RAG free?

Yes. The course is free to take in full on DeepLearning.AI (learn.deeplearning.ai), with no payment required and no certificate. Running the labs requires your own Google Cloud / Vertex AI (and optionally OpenAI) API credentials and, as of 2026, may require patching deprecated Google API calls.

Who is Building Multimodal Search and RAG for?

Intermediate ML/AI engineers and backend developers who already understand standard text-based RAG, embeddings, and vector databases, and want a fast, hands-on introduction to extending those patterns to images and video using multimodal embeddings, Weaviate, and Gemini/Vertex. Ideal for people evaluating whether multimodal search/RAG fits a product, or who learn best from short runnable notebooks over long lectures.

What will you learn in Building Multimodal Search and RAG?

You learn how multimodality works via contrastive representation learning, including building a simple contrastive model on the MNIST dataset to unify text and image embeddings; how to build text-to-any and any-to-any search over images and video using Weaviate and Google Vertex AI's multimodalembedding model; how large multimodal models (LMMs) combine LLMs with vision, including querying images directly with Google's Gemini Vision model; and how to implement an end-to-end multimodal RAG (MM-RAG) pipeline that retrieves multimodal context from Weaviate and reasons over it to generate answers.

What are the prerequisites for Building Multimodal Search and RAG?

You need working Python proficiency (you run and modify Jupyter-style notebooks); familiarity with embeddings, vector databases, and basic text RAG concepts; comfort reading API docs and patching deprecated Google/Vertex API calls if running labs in 2026; and a Google Cloud / Vertex AI (and optionally OpenAI) API key to fully exercise the multimodal embedding and generation labs.

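Is Building Multimodal Search and RAG worth it?

Yes, with caveats. The content, instructor, and free price make it a high-value primer on multimodal embeddings and RAG, but it is short, offers no certificate, and (as of 2026) its labs have known breakage from deprecated Google PaLM/Vertex APIs.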

How we reviewed this course

This is an independent editorial assessment by Cursarium, based on DeepLearning.AI's published course materials and aggregated public learner feedback (last reviewed 2026-06). We have not independently completed the course. Links to providers are standard references, not paid placements.
