Building Multimodal Search and RAG
by Sebastian Witalec · DeepLearning.AI
Our Verdict
Worth it, with caveats
Building Multimodal Search and RAG is a free, roughly 1.5-hour DeepLearning.AI short course built with Weaviate and taught by Sebastian Witalec, Weaviate's Head of Developer Relations. It is a strong, hands-on introduction to any-to-any retrieval (text, image, video) and multimodal RAG for engineers who already know basic Python and standard text RAG. Across eight short video lessons and six runnable notebooks, you implement contrastive multimodal embeddings on MNIST, build text-to-any and any-to-any search with Weaviate plus Google Vertex AI's multimodalembedding model, query images with Gemini Vision, assemble an end-to-end multimodal RAG pipeline, extract structured data from invoices and tables, and build a multi-vector recommender on a movies dataset. It holds a 4.5/5 rating from 43 reviews on Coursera (about 6,730 enrolled there), and learners praise it as a concise, practical primer rather than a deep theoretical course. The major caveat is currency: as of 2026 the course has been flagged for maintenance, and community threads report broken lab code tied to Google's deprecation of the PaLM API and to Vertex AI authentication headers, so the notebooks may not run as recorded without fixes. Treat it as a conceptual and pattern walkthrough worth taking for free, but expect to debug or substitute the Google/Vertex API calls if you run the labs end-to-end.
The content, instructor, and free price make this a high-value 1.5-hour primer on multimodal embeddings and RAG. It is short and offers no certificate, and as of 2026 its labs have known breakage from deprecated Google PaLM/Vertex APIs, so it is an unqualified recommendation only for learners comfortable patching outdated API code.
Best for: Intermediate ML/AI engineers and backend developers who already understand standard text-based RAG, embeddings, and vector databases, and want a fast, hands-on introduction to extending those patterns to images and video using multimodal embeddings, Weaviate, and Gemini/Vertex. Ideal for people evaluating whether multimodal search/RAG fits a product, or who learn best from short runnable notebooks over long lectures.
Skip if: Complete beginners with no Python or no prior RAG/embeddings exposure (the course is explicitly intermediate and moves quickly); anyone who needs a completion certificate; learners who want deep theory on multimodal model training; and anyone who needs labs that run flawlessly out of the box right now, given the documented deprecated-API breakage and maintenance status.
About This Course
Build search and RAG systems that handle text, images, and video using multimodal embeddings and Weaviate.
Curriculum
Course overview and framing of multimodal search and RAG.
How contrastive representation learning unifies different modalities in one shared embedding space; you build a contrastive model on the MNIST handwritten-digit dataset and visualize the resulting embeddings with dimensionality reduction.
Text-to-any and any-to-any search across images and video using Weaviate and Google Vertex AI's multimodalembedding model.
How LLMs integrate with vision to form language-vision models; querying images with Google's Gemini Vision model.
Combining retrieval from Weaviate with a large multimodal model to build an end-to-end multimodal RAG pipeline.
Extracting structured data from images (invoices, tables, flowcharts) and using LLM reasoning to answer queries over the extracted data.
Building a multi-vector recommender on a movies dataset using OpenAI embeddings alongside Vertex multimodal embeddings.
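For readers who want a feel for the contrastive-learning idea in the curriculum before enrolling, here is a minimal sketch. It is not the course's notebook code: it swaps in an InfoNCE-style loss as one common formulation, uses hand-made 2-D vectors instead of a trained model, and every name in it (`info_nce_loss`, `image_vecs`, and so on) is hypothetical. It only shows that the loss is low when matching pairs sit close together in the shared space and high when they do not.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce_loss(anchors, positives, temperature=0.1):
    """Mean InfoNCE-style loss over a batch.

    anchors[i] is expected to match positives[i]; every other
    positives[j] in the batch acts as a negative for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with the max subtracted for numerical stability.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy embeddings (assumed, not produced by any real model).
image_vecs = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[0.9, 0.1], [0.1, 0.9]]   # each caption near its own image
swapped_text = [[0.1, 0.9], [0.9, 0.1]]   # each caption near the wrong image

aligned_loss = info_nce_loss(image_vecs, aligned_text)
swapped_loss = info_nce_loss(image_vecs, swapped_text)
print(f"aligned: {aligned_loss:.4f}  swapped: {swapped_loss:.4f}")
```

Training a real model means adjusting the encoders so this loss falls across many batches; the sketch stops at evaluating the loss on fixed vectors.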
Prerequisites
- Working Python proficiency (you run and modify Jupyter-style notebooks)
- Familiarity with embeddings, vector databases, and basic text RAG concepts
- Comfort reading API docs and patching deprecated Google/Vertex API calls if running labs in 2026
- A Google Cloud / Vertex AI (and optionally OpenAI) API key to fully exercise the multimodal embedding and generation labs
Instructor
Sebastian Witalec
Instructor · DeepLearning.AI
Pros & Cons
Pros
- Free, concise, and genuinely hands-on: six runnable notebooks let you implement multimodal embeddings, search, RAG, and a recommender in roughly 1.5 hours
- Taught by Sebastian Witalec, Weaviate's Head of Developer Relations, so the vector-database and multimodal-search workflows reflect real production tooling
- Covers an unusually broad, practical arc for a short course: contrastive embeddings, any-to-any retrieval, Gemini Vision, end-to-end MM-RAG, structured extraction, and recommendations
- Strong reception (4.5/5 from 43 Coursera reviews) and a clear, applied teaching style that builds intuition fast
- Uses a recognizable industry stack (Weaviate + Google Vertex/Gemini + OpenAI embeddings) that maps directly to building your own multimodal pipelines
Cons
- As of 2026 the course is flagged for maintenance and community threads report broken lab code due to Google's deprecation of the PaLM API and Vertex AI auth headers, so notebooks may not run as recorded without fixes
- Very short and surface-level for the topics covered: it introduces multimodal RAG patterns but does not go deep on model training, evaluation, or production scaling
- No certificate of completion, which matters for learners who want a credential
- Tightly coupled to Weaviate and Google Cloud/Vertex, so concepts are taught through a specific vendor stack rather than tool-agnostically
Frequently Asked Questions
Is Building Multimodal Search and RAG free?
Yes. Building Multimodal Search and RAG is free to take in full on DeepLearning.AI (learn.deeplearning.ai), with no payment and no certificate. Running the labs requires your own Google Cloud / Vertex AI (and optionally OpenAI) API credentials, and as of 2026 may require patching deprecated Google API calls.
Who is Building Multimodal Search and RAG for?
Intermediate ML/AI engineers and backend developers who already understand standard text-based RAG, embeddings, and vector databases, and want a fast, hands-on introduction to extending those patterns to images and video using multimodal embeddings, Weaviate, and Gemini/Vertex. Ideal for people evaluating whether multimodal search/RAG fits a product, or who learn best from short runnable notebooks over long lectures.
What will you learn in Building Multimodal Search and RAG?
- How multimodality works via contrastive representation learning, including building a simple contrastive model on the MNIST dataset to unify text and image embeddings
- Building text-to-any and any-to-any search over images and video using Weaviate and Google Vertex AI's multimodalembedding model
- How large multimodal models (LMMs) combine LLMs with vision, and querying images directly with Google's Gemini Vision model
- Implementing an end-to-end multimodal RAG (MM-RAG) pipeline that retrieves multimodal context from Weaviate and reasons over it to generate answers
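To make "any-to-any search" concrete, the sketch below ranks objects of different modalities against one query by cosine similarity in a single shared vector space. It is an illustration under stated assumptions, not the course's lab code: it does not call Weaviate or Vertex AI, the three-dimensional vectors are invented by hand to stand in for real multimodal embeddings, and `INDEX` and `search` are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical in-memory index: every object, whatever its modality,
# carries a vector in the same shared embedding space.
INDEX = [
    {"id": "dog.jpg", "modality": "image", "vector": [0.9, 0.1, 0.0]},
    {"id": "dog_park.mp4", "modality": "video", "vector": [0.8, 0.2, 0.1]},
    {"id": "invoice.png", "modality": "image", "vector": [0.0, 0.1, 0.9]},
    {"id": "puppy_caption.txt", "modality": "text", "vector": [0.85, 0.15, 0.05]},
]


def search(query_vector, index, limit=2, modality=None):
    """Return ids of the nearest objects, optionally filtered by modality."""
    candidates = [o for o in index if modality is None or o["modality"] == modality]
    ranked = sorted(
        candidates,
        key=lambda o: cosine(query_vector, o["vector"]),
        reverse=True,
    )
    return [o["id"] for o in ranked[:limit]]


# Assumed stand-in for the embedding of a text query about dogs.
text_query = [1.0, 0.0, 0.0]

top_any = search(text_query, INDEX, limit=2)
top_video = search(text_query, INDEX, limit=2, modality="video")
print(top_any, top_video)
```

In a multimodal RAG pipeline, the retrieved objects would then be passed to a multimodal model as context for generation; in the course that retrieval step is handled by Weaviate rather than a Python list.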
What are the prerequisites for Building Multimodal Search and RAG?
- Working Python proficiency (you run and modify Jupyter-style notebooks)
- Familiarity with embeddings, vector databases, and basic text RAG concepts
- Comfort reading API docs and patching deprecated Google/Vertex API calls if running labs in 2026
- A Google Cloud / Vertex AI (and optionally OpenAI) API key to fully exercise the multimodal embedding and generation labs
Is Building Multimodal Search and RAG worth it?
The content, instructor, and free price make this a high-value 1.5-hour primer on multimodal embeddings and RAG. It is short and offers no certificate, and as of 2026 its labs have known breakage from deprecated Google PaLM/Vertex APIs, so it is an unqualified recommendation only for learners comfortable patching outdated API code.
How we reviewed this course
This is an independent editorial assessment by Cursarium, based on DeepLearning.AI's published course materials and aggregated public learner feedback (last reviewed 2026-06). We have not independently completed the course. Links to providers are standard references, not paid placements.
Sources
- DeepLearning.AI course Q&A: Technical/maintenance update (2026)
- DeepLearning.AI community: Error in L2 code example (deprecated Google API)
- Coursera project page (rating 4.5/5, 43 reviews, ~6,730 enrolled)
- Class Central listing (4.5 rating, course metadata)
- GitHub course notes (verbatim lesson breakdown, tools)