# Multi-Modal RAG: When Your Knowledge Base Has Images, Tables, and Code
By Diesel
Tags: rag, multi-modal, documents
## The Text Illusion
Here's a fun experiment. Take any real enterprise knowledge base and count what percentage of the information lives in pure text paragraphs.
It's never as much as you think.
Financial reports have tables. Architecture docs have diagrams. SOPs have flowcharts. Training materials have screenshots. Sales decks have charts. Engineering docs have code blocks and system diagrams. Meeting notes reference whiteboard photos.
Standard RAG ignores all of this. It extracts the text, embeds the text, retrieves the text. The table that answers the user's question? Gone. The architecture diagram that would have made the response actually useful? Invisible.
Multi-modal RAG fixes this. And it's more tractable than you might expect.
## The Three Modalities That Matter
In enterprise settings, you're dealing with three non-text modalities that contain critical information:
**Tables.** Financial data, comparison matrices, specifications, schedules, permission matrices. Tables encode structured relationships that text descriptions butcher. This connects directly to [chunking strategies for mixed content](/blog/chunking-strategies-at-scale).
**Images.** Architecture diagrams, flowcharts, screenshots, charts, graphs, photos of whiteboards, scanned handwritten notes.
**Code.** Code blocks in documentation, configuration files, API examples, scripts, SQL queries, infrastructure-as-code.
Each needs a different strategy.
## Tables: The Most Underserved Modality
Tables are everywhere in enterprise docs and they're the modality most commonly destroyed by naive RAG pipelines.
A PDF extraction that turns a 10-column pricing table into a paragraph of jumbled text is worse than useless. It's confidently wrong.
### Strategy 1: Table Detection and Structured Extraction
```python
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="financial_report.pdf",
    strategy="hi_res",            # uses a layout detection model
    infer_table_structure=True,   # extract tables as HTML
)

tables = [el for el in elements if el.category == "Table"]
for table in tables:
    # table.metadata.text_as_html contains structured HTML
    # table.metadata.page_number tells you where it came from
    print(table.metadata.text_as_html)
```
The `unstructured` library (and similar tools like Docling, Azure Document Intelligence, or Amazon Textract) can detect tables in PDFs and extract them as structured HTML or markdown.
### Strategy 2: Table Summarization for Embedding
Raw table HTML doesn't embed well. The embedding model doesn't understand that row 3, column 5 contains the Q3 revenue for the EMEA region.
Solution: generate a natural language summary of each table for embedding, but store the original structured table for retrieval.
```python
TABLE_SUMMARY_PROMPT = """
Summarize this table in natural language. Include:
- What the table shows (topic/purpose)
- Key data points and relationships
- Column headers and what they represent
- Notable values or trends

Table:
{table_html}

Summary:
"""

async def process_table(table_html, llm):
    summary = await llm.generate(
        TABLE_SUMMARY_PROMPT.format(table_html=table_html)
    )
    return {
        "embedding_text": summary,        # embed this
        "retrieval_content": table_html,  # return this to the LLM
        "type": "table",
    }
```
The summary gets embedded and used for retrieval matching. The original table HTML gets passed to the LLM for answer generation. Best of both worlds.
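At query time, the lookup inverts that split: match against the summary embedding, hand back the original HTML. Here's a minimal sketch of that flow with a toy character-count embedding standing in for a real model (`TableIndex` and `embed` are illustrative, not a library API):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy bag-of-characters embedding so the sketch runs without a model.
    # Swap in your real text embedding model here.
    vec = np.zeros(128)
    for ch in text.lower():
        vec[ord(ch) % 128] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

class TableIndex:
    def __init__(self):
        self.vectors, self.records = [], []

    def add(self, record: dict):
        # Embed the natural-language summary; keep the full record around.
        self.vectors.append(embed(record["embedding_text"]))
        self.records.append(record)

    def search(self, query: str, k: int = 3) -> list[str]:
        # Match the query against summaries...
        sims = np.array(self.vectors) @ embed(query)
        top = np.argsort(-sims)[:k]
        # ...but return the structured HTML for the LLM, not the summary.
        return [self.records[i]["retrieval_content"] for i in top]
```

The key detail is the last line of `search`: what the retriever matched on and what it returns are deliberately different fields.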
## Images: From Pixels to Retrievable Knowledge
### Strategy 1: Vision Model Descriptions
Use a vision-language model to generate text descriptions of images. Embed the descriptions.
```python
import base64
from openai import OpenAI

client = OpenAI()

def describe_image(image_path: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "Describe this image in detail for a knowledge base. "
                    "Include all text visible in the image, relationships "
                    "between elements, and the purpose/context of the diagram."
                )},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{image_b64}"
                }},
            ],
        }],
    )
    return response.choices[0].message.content
```
This works well for diagrams, flowcharts, and screenshots. The description captures the semantic content. Architecture diagrams become retrievable by their components and relationships.
### Strategy 2: Multi-Modal Embeddings
Models like CLIP, SigLIP, or Nomic's multi-modal embeddings can embed images and text into the same vector space. Query with text, retrieve images directly.
```python
from sentence_transformers import SentenceTransformer, util
from PIL import Image

model = SentenceTransformer("nomic-ai/nomic-embed-vision-v1.5")

# Embed an image
image = Image.open("architecture_diagram.png")
image_embedding = model.encode(image)

# Embed a text query
query_embedding = model.encode("microservices architecture with message queue")

# These live in the same vector space - compare directly
similarity = util.cos_sim(query_embedding, image_embedding)
```
This is faster than generating descriptions (no LLM call at ingestion) but less precise for complex diagrams. The embedding captures visual similarity, not semantic understanding of what the diagram means. For a deeper look, see [hybrid retrieval pipelines](/blog/hybrid-search-rag-production).
### The Hybrid Approach (Recommended)
Do both. Generate a text description for keyword/semantic search. Store the multi-modal embedding for visual similarity search. Keep the original image for the LLM to examine during generation.
```python
def process_image(image_path, llm, vision_embedder, text_embedder):
    description = describe_image(image_path)
    return {
        "text_embedding": text_embedder.embed(description),
        "vision_embedding": vision_embedder.embed(image_path),
        "description": description,
        "image_path": image_path,
        "type": "image",
    }
```
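At query time, the two embeddings can be combined with a simple weighted blend. A sketch, assuming both similarity scores are cosine values; `rank_images` and the 0.7 default weight are illustrative choices, not a library API:

```python
import numpy as np

def cos(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_images(query_text_emb, query_vision_emb, images, text_weight=0.7):
    """Rank stored images by a weighted blend of both similarities.

    text_weight=0.7 is an assumption: description matching tends to be
    more reliable for complex diagrams. Tune it on your own corpus.
    """
    def score(img):
        return (text_weight * cos(query_text_emb, img["text_embedding"])
                + (1 - text_weight) * cos(query_vision_emb, img["vision_embedding"]))
    return sorted(images, key=score, reverse=True)
```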
## Code: The Forgotten Modality
Code blocks in documentation are often the most useful part. The API example. The config snippet. The SQL query. The Terraform module.
### Strategy 1: Code-Aware Chunking
Don't let your chunker split a code block across two chunks. Ever.
```python
def code_aware_chunk(document):
    """Treat code blocks as atomic units."""
    segments = split_on_code_fences(document)
    chunks = []
    for segment in segments:
        if segment.is_code:
            # Code block stays intact, paired with its context
            chunks.append({
                "text": f"{segment.preceding_text}\n\n```{segment.language}\n{segment.code}\n```",
                "type": "code",
                "language": segment.language,
            })
        else:
            # Regular text gets normal chunking
            for text in recursive_split(segment.text):
                chunks.append({"text": text, "type": "text"})
    return chunks
```
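`split_on_code_fences` is doing the real work there. One possible implementation, sketched with a regex over triple-backtick fences (`Segment` and the regex are illustrative; real markdown has edge cases like tilde fences and indented fences that this ignores):

```python
import re
from dataclasses import dataclass

@dataclass
class Segment:
    is_code: bool
    text: str = ""
    code: str = ""
    language: str = ""
    preceding_text: str = ""

# Matches ```lang\n...``` fenced blocks, non-greedily.
FENCE_RE = re.compile(r"```(\w*)\n(.*?)```", re.DOTALL)

def split_on_code_fences(document: str) -> list[Segment]:
    segments, last = [], 0
    for match in FENCE_RE.finditer(document):
        prose = document[last:match.start()].strip()
        if prose:
            segments.append(Segment(is_code=False, text=prose))
        # Pair the code with its immediately preceding paragraph for context.
        segments.append(Segment(
            is_code=True,
            language=match.group(1),
            code=match.group(2).rstrip("\n"),
            preceding_text=prose.split("\n\n")[-1] if prose else "",
        ))
        last = match.end()
    tail = document[last:].strip()
    if tail:
        segments.append(Segment(is_code=False, text=tail))
    return segments
```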
### Strategy 2: Code-Specific Embeddings
General text embedding models don't handle code well. They might match a Python function to a JavaScript function doing something completely different because both use the word "function."
Models like CodeBERT, StarCoder embeddings, or Voyage's code embedding model understand code structure and semantics.
```python
# For code-heavy knowledge bases, use a code-aware embedding model
from voyageai import Client

voyage = Client()

code_embedding = voyage.embed(
    ["def calculate_tax(income, rate): return income * rate"],
    model="voyage-code-3",
).embeddings[0]
```
## Putting It All Together
A multi-modal RAG pipeline looks like this:
```
Document Ingestion
├── Text   → chunk → embed (text model)
├── Tables → extract → summarize → embed summary, store original
├── Images → describe → embed description, store image
└── Code   → extract intact → embed (code model), store original

Query Time
├── Text query → embed → search all indexes
├── Fuse results across modalities
├── Rerank with cross-encoder
└── Pass to LLM with full context (text + tables + images + code)
```
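The fusion step is where per-modality indexes meet, and raw scores from different embedding spaces aren't directly comparable. Reciprocal rank fusion sidesteps that by combining ranks instead of scores. A minimal sketch; `k=60` is the conventional constant:

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists from per-modality indexes with RRF.

    Each result contributes 1 / (k + rank), so items ranked highly by
    several indexes rise to the top regardless of incomparable raw scores.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Feed it the top-k ID lists from the text, table, image, and code indexes, then rerank the fused head of the list with a cross-encoder.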
The LLM generation step is where multi-modal models shine. GPT-4o, Claude, and Gemini can all process images alongside text. So your retrieval step can return an architecture diagram, and the LLM can actually look at it while generating the answer. This connects directly to [evaluating retrieval quality](/blog/rag-evaluation-retrieval-quality).
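Concretely, that means the generation call carries the retrieved images as image parts alongside the retrieved text. A sketch of assembling that payload; `build_answer_messages` is a hypothetical helper, but the message shape matches the OpenAI vision format used in `describe_image` above:

```python
import base64

def build_answer_messages(question: str, text_chunks: list[str],
                          image_paths: list[str]) -> list[dict]:
    """Assemble a multi-modal chat payload: retrieved text plus raw images."""
    content = [{"type": "text", "text": (
        "Answer using the retrieved context below.\n\n"
        + "\n\n".join(text_chunks)
        + f"\n\nQuestion: {question}"
    )}]
    for path in image_paths:
        # Attach each retrieved image so the model can inspect it directly.
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})
    return [{"role": "user", "content": content}]
```

Pass the result to your chat completion call in place of a plain text prompt; the retrieved diagram travels with the question.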
## The Practical Reality
Multi-modal RAG adds complexity. More processing at ingestion. More storage. More index types to manage. Is it worth it?
If your knowledge base is pure text docs, no. Standard RAG is fine.
If your knowledge base has tables, diagrams, code examples, or any non-text content that users actually need, then skipping multi-modal means your RAG system literally cannot answer a significant percentage of questions. Users will ask, the system will hallucinate or say "I don't know," and they'll stop trusting it.
Start with tables. They're the highest-value, most common non-text modality in enterprise docs, and the extraction tooling is mature. Add image understanding for diagram-heavy domains. Add code-specific handling for technical documentation.
The goal isn't to build the most sophisticated pipeline. It's to make sure your RAG system can actually find the answer when it exists in your knowledge base, regardless of what format it's in.