Authored: 08.07.2025 by Sajad Ghawami
Despite the hype, most Retrieval-Augmented Generation (RAG) pipelines today are stuck in a shallow loop: embed a question, retrieve some nearby chunks, pass them to the LLM, and hope it makes sense. It works - until it doesn’t. Ask anything structurally complex or cognitively demanding, and the cracks start to show. In the following, we argue that vector similarity alone is not enough to ground real understanding. The failure isn’t the model. It’s what we’re feeding it.
We unpack the blind spots baked into today’s RAG systems: the oversimplified assumptions about question types, the lack of structural awareness in retrieval, and the over-reliance on ever-larger context windows as a crutch. Through a breakdown of five core question types - fact, comparison, bridge, aggregation, and causal - we show how naive vector search repeatedly fails to supply the model with the raw material it actually needs to reason.
What emerges is a clear diagnosis: RAG is still thinking in fragments. To evolve, retrieval must start thinking in systems - graph-native, question-aware, and semantically aligned. Without that shift, generation will always be built on shaky ground. You can’t vector your way to understanding.
We've hit a ceiling.
Retrieval-Augmented Generation (RAG) was supposed to unlock a new era of grounded, reliable LLMs. In theory, it combines the power of language models with the precision of search. In practice, most RAG pipelines are still stuck in the basics: fetch a few chunks, stuff them into the context window, and hope for the best.
This naive "fetch-and-generate" pattern is cheap to implement and works okay for surface-level questions. But ask anything more involved - a comparison, a causal link, a chain of facts - and you'll feel the cracks. You'll get hallucinations, fragments, or, worse, answers that are close enough to feel right but subtly wrong.
This isn't about context window size. Bigger windows help, but they don't fix bad retrieval. In fact, they often just delay the problem, burning tokens and compute on half-relevant chunks. What we need isn't just more context. We need better context.
To move forward, we need to rethink retrieval from the ground up. That requires a much deeper grasp of what retrieval is actually trying to do: what kinds of questions it must answer, what information is truly relevant, and how that relevance depends on structure, not just similarity. Without that clarity, every improvement is just noise.
The following analysis examines the fundamental limitations of today's RAG pipelines. It explores the five key types of questions that characterize real-world use, explains why most retrieval strategies fall short of handling them effectively, and outlines what's still missing from the current retrieval stack. The core argument is simple: today's retrieval systems are largely blind to complexity - and until we address that, our generation layers will keep stumbling.
You can't just “fetch” your way to understanding. You need structure. You need systems. You need strategy.
Before we tear it apart, let’s quickly define what vector search actually is.
A vector is just a list of numbers that captures the meaning of some input - like a sentence, a paragraph, or a whole document. This is done through embeddings, which are generated by an AI model trained to map similar meanings to nearby points in space.
So instead of searching by exact words, you search by meaning.
Ask “Who was the first person on the moon?” - that question gets turned into a vector. Then the system compares it to a bunch of other vectors (representing documents, chunks, etc.) and finds the ones that are closest in meaning. Not in wording - in intent.
This is the core promise of vector search: instead of matching strings, it matches thoughts.
flowchart LR
subgraph "Close in Meaning"
A1([🚀 rocket])
A2([🛰️ satellite])
A3([🛸 spaceship])
A1 --> A2
A1 --> A3
A2 --> A3
end
subgraph "Also Close"
B1([🐶 dog])
B2([🐱 cat])
B1 --> B2
end
A1 -.->|distant| B1
A3 -.->|distant| B2
style A1 fill:#1e1e1e,stroke:#3399cc,stroke-width:2px,color:#ffffff
style A2 fill:#1e1e1e,stroke:#3399cc,stroke-width:2px,color:#ffffff
style A3 fill:#1e1e1e,stroke:#3399cc,stroke-width:2px,color:#ffffff
style B1 fill:#1e1e1e,stroke:#cc6699,stroke-width:2px,color:#ffffff
style B2 fill:#1e1e1e,stroke:#cc6699,stroke-width:2px,color:#ffffff
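To make that concrete, here is a minimal sketch of searching by meaning: embed a handful of documents and a query, then rank by cosine similarity. It assumes the sentence-transformers package is installed, and the model name is only an example.

```python
# A minimal sketch of "searching by meaning": embed a few documents and a query,
# then rank by cosine similarity. Assumes sentence-transformers is installed;
# the model name is only an example.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Neil Armstrong was the first human to walk on the Moon in 1969.",
    "The Saturn V rocket launched the Apollo missions.",
    "Dogs and cats are common household pets.",
]

# Normalized embeddings make cosine similarity a plain dot product.
doc_vecs = model.encode(documents, normalize_embeddings=True)
query_vec = model.encode(["Who was the first person on the moon?"],
                         normalize_embeddings=True)[0]

# Closest in meaning, not in wording.
scores = doc_vecs @ query_vec
for idx in np.argsort(-scores):
    print(f"{scores[idx]:.3f}  {documents[idx]}")
```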
If we want retrieval to be effective, we have to start by understanding the shape of the questions themselves. Most RAG systems treat all questions the same: embed the query, retrieve the closest chunks, feed them into the model. But real questions are more nuanced. They vary in structure, complexity, and what they demand from a retriever. They're not all created equal.
There are five dominant types we see in practice:
Fact lookup questions are the simplest form: direct questions with a clear answer. The system only needs to find one relevant chunk that contains the fact - no reasoning required.
flowchart LR
A([🔍 Query: When did Apollo 11 land?]) --> B([🧠 Embed as Vector])
B --> C([📡 Vector Search])
C --> D([📄 Chunk: Apollo 11 landed on July 20, 1969])
D --> E([🤖 LLM Generates Answer: July 20, 1969])
style A fill:#1e1e1e,stroke:#999600,stroke-width:2px,color:#ffffff
style B fill:#1e1e1e,stroke:#3399cc,stroke-width:2px,color:#ffffff
style C fill:#1e1e1e,stroke:#3399cc,stroke-width:2px,color:#ffffff
style D fill:#1e1e1e,stroke:#00cc66,stroke-width:2px,color:#ffffff
style E fill:#1e1e1e,stroke:#6666ff,stroke-width:2px,color:#ffffff
Comparison questions require bringing together multiple entities to weigh them side by side. The challenge is not just retrieving facts, but making sure those facts are retrieved together.
flowchart LR
A([🔍 Query: Which mission was longer?]) --> B([🧠 Embed as Vector])
B --> C([📡 Vector Search])
C --> D1([📄 Chunk: Apollo 11 duration])
C --> D2([📄 Chunk: Apollo 12 duration])
D1 --> E([🤖 LLM Compares & Answers])
D2 --> E
style A fill:#1e1e1e,stroke:#999600,stroke-width:2px,color:#ffffff
style B fill:#1e1e1e,stroke:#3399cc,stroke-width:2px,color:#ffffff
style C fill:#1e1e1e,stroke:#3399cc,stroke-width:2px,color:#ffffff
style D1 fill:#1e1e1e,stroke:#00cc66,stroke-width:2px,color:#ffffff
style D2 fill:#1e1e1e,stroke:#00cc66,stroke-width:2px,color:#ffffff
style E fill:#1e1e1e,stroke:#6666ff,stroke-width:2px,color:#ffffff
Bridge questions require connecting information that doesn't co-occur in a single place. Answering one might involve joining details from two or more sources to produce a coherent answer.
flowchart LR
A([🔍 Query: Which Apollo astronaut flew a Gemini mission before?]) --> B([🧠 Embed as Vector])
B --> C([📡 Vector Search])
C --> D1([📄 Chunk A: Apollo Missions])
C --> D2([📄 Chunk B: Gemini Crew Assignments])
D1 --> E1([👤 Astronaut: John Young])
D2 --> E2([👤 Gemini Crew: John Young])
E1 --> F([🧠 LLM Links Info])
E2 --> F
F --> G([🤖 Final Answer: John Young flew both])
style A fill:#1e1e1e,stroke:#999600,stroke-width:2px,color:#ffffff
style B fill:#1e1e1e,stroke:#3399cc,stroke-width:2px,color:#ffffff
style C fill:#1e1e1e,stroke:#3399cc,stroke-width:2px,color:#ffffff
style D1 fill:#1e1e1e,stroke:#00cc66,stroke-width:2px,color:#ffffff
style D2 fill:#1e1e1e,stroke:#00cc66,stroke-width:2px,color:#ffffff
style E1 fill:#1e1e1e,stroke:#ffaa00,stroke-width:2px,color:#ffffff
style E2 fill:#1e1e1e,stroke:#ffaa00,stroke-width:2px,color:#ffffff
style F fill:#1e1e1e,stroke:#ff66cc,stroke-width:2px,color:#ffffff
style G fill:#1e1e1e,stroke:#6666ff,stroke-width:2px,color:#ffffff
Aggregation questions ask for a set of things - like listing all missions with a specific property. They demand coverage, not just similarity.
flowchart LR
A([🔍 Query: Which missions returned lunar samples?]) --> B([🧠 Embed as Vector])
B --> C([📡 Vector Search])
C --> D1([📄 Chunk: Apollo 11 - Yes])
C --> D2([📄 Chunk: Apollo 12 - Yes])
C --> D3([📄 Chunk: Apollo 13 - No])
C --> D4([📄 Chunk: Apollo 14 - Yes])
D1 --> E([🧠 LLM Aggregates])
D2 --> E
D3 --> E
D4 --> E
E --> F([🤖 Final Answer: Apollo 11, 12, 14])
style A fill:#1e1e1e,stroke:#999600,stroke-width:2px,color:#ffffff
style B fill:#1e1e1e,stroke:#3399cc,stroke-width:2px,color:#ffffff
style C fill:#1e1e1e,stroke:#3399cc,stroke-width:2px,color:#ffffff
style D1 fill:#1e1e1e,stroke:#00cc66,stroke-width:2px,color:#ffffff
style D2 fill:#1e1e1e,stroke:#00cc66,stroke-width:2px,color:#ffffff
style D3 fill:#1e1e1e,stroke:#cc6666,stroke-width:2px,color:#ffffff
style D4 fill:#1e1e1e,stroke:#00cc66,stroke-width:2px,color:#ffffff
style E fill:#1e1e1e,stroke:#ffaa00,stroke-width:2px,color:#ffffff
style F fill:#1e1e1e,stroke:#6666ff,stroke-width:2px,color:#ffffff
Causal questions are the most complex. They often require not just retrieving facts but surfacing relationships or explanations that span multiple concepts.
flowchart LR
A([🔍 Query: Why was Apollo 13 aborted?]) --> B([🧠 Embed as Vector])
B --> C([📡 Vector Search])
C --> D1([📄 Chunk: Oxygen tank exploded])
C --> D2([📄 Chunk: Power failure])
C --> D3([📄 Chunk: Abort decision by NASA])
D1 --> E([🧠 LLM Reconstructs Causal Chain])
D2 --> E
D3 --> E
E --> F([🤖 Final Answer: Explosion → Power loss → Abort])
style A fill:#1e1e1e,stroke:#999600,stroke-width:2px,color:#ffffff
style B fill:#1e1e1e,stroke:#3399cc,stroke-width:2px,color:#ffffff
style C fill:#1e1e1e,stroke:#3399cc,stroke-width:2px,color:#ffffff
style D1 fill:#1e1e1e,stroke:#00cc66,stroke-width:2px,color:#ffffff
style D2 fill:#1e1e1e,stroke:#00cc66,stroke-width:2px,color:#ffffff
style D3 fill:#1e1e1e,stroke:#00cc66,stroke-width:2px,color:#ffffff
style E fill:#1e1e1e,stroke:#ff66cc,stroke-width:2px,color:#ffffff
style F fill:#1e1e1e,stroke:#6666ff,stroke-width:2px,color:#ffffff
Together, these five types cover the majority of real-world needs. And more importantly, they surface the boundaries of vector-only retrieval. Most failures in RAG aren't generation problems. They're retrieval mismatches. We're using the wrong tools for the wrong jobs - and expecting the model to patch the gaps.
Before you tune your reranker, upgrade your embedding model, or expand your context window, ask this: what kind of question am I really trying to answer?
Most RAG systems begin with good intentions and crude tools. The default setup is straightforward: embed the query, perform a cosine search across all document chunks, and pass the top-k into the model. It's simple. It's fast. And it's shallow.
This is what we call vanilla RAG. It doesn't discriminate between types of questions, sources of knowledge, or structural dependencies. It assumes that whatever floats to the top of a similarity search is good enough to answer the question. Sometimes that's true. Often, it's not.
flowchart LR
subgraph A["🔢 Embedding Space"]
Q([🔍 Query Vector])
D1([📄 Chunk A])
D2([📄 Chunk B])
D3([📄 Chunk C - Similar])
D4([📄 Chunk D - Similar])
D5([📄 Chunk E])
end
Q --> S["🧮 Cosine Similarity Search"]
S --> K["📦 Top-k Chunks (C, D)"]
K --> LLM([🤖 LLM Generates Answer])
style A fill:#1e1e1e,stroke:#444444,stroke-width:1px,color:#ffffff
style Q fill:#1e1e1e,stroke:#999600,stroke-width:2px,color:#ffffff
style D1 fill:#1e1e1e,stroke:#333333,stroke-width:1px,color:#ffffff
style D2 fill:#1e1e1e,stroke:#333333,stroke-width:1px,color:#ffffff
style D3 fill:#1e1e1e,stroke:#00cc66,stroke-width:2px,color:#ffffff
style D4 fill:#1e1e1e,stroke:#00cc66,stroke-width:2px,color:#ffffff
style D5 fill:#1e1e1e,stroke:#333333,stroke-width:1px,color:#ffffff
style S fill:#1e1e1e,stroke:#3399cc,stroke-width:2px,color:#ffffff
style K fill:#1e1e1e,stroke:#ffaa00,stroke-width:2px,color:#ffffff
style LLM fill:#1e1e1e,stroke:#6666ff,stroke-width:2px,color:#ffffff
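For reference, the whole vanilla loop fits in a few lines. This is a sketch, not a production pipeline: embed() and generate() below are placeholders standing in for a real embedding model and a real LLM call.

```python
# A sketch of the vanilla fetch-and-generate loop. embed() and generate() are
# placeholders for a real embedding model and a real LLM call.
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder: random vectors stand in for a real embedding model (L2-normalized)."""
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(len(texts), 384))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def generate(prompt: str) -> str:
    """Placeholder: call your LLM of choice here."""
    return f"<answer based on a prompt of {len(prompt)} characters>"

chunks = [
    "Apollo 11 landed on July 20, 1969.",
    "Apollo 12 launched in November 1969.",
    "The Saturn V was a three-stage rocket.",
]
chunk_vecs = embed(chunks)

def vanilla_rag(question: str, k: int = 2) -> str:
    q_vec = embed([question])[0]
    scores = chunk_vecs @ q_vec                       # cosine similarity (vectors are normalized)
    top_k = [chunks[i] for i in np.argsort(-scores)[:k]]
    prompt = "Context:\n" + "\n".join(top_k) + f"\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)

print(vanilla_rag("When did Apollo 11 land?"))
```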
To improve performance, teams move toward optimized nearest-neighbor search. They swap brute-force scanning for smarter indexes like FAISS, HNSW, or ScaNN. These approximate-nearest-neighbor structures (HNSW, for example, is graph-based) allow much faster lookups across large corpora. This change boosts speed - but not necessarily quality. It's still vector search. You're still looking for the closest match, not the right structure.
flowchart LR
Q([🔍 Query Vector]) --> T([🧭 Traverse Index])
subgraph "🧠 Index (HNSW / FAISS)"
A([📄 Chunk A])
B([📄 Chunk B])
C([📄 Chunk C - Similar])
D([📄 Chunk D - Similar])
end
T --> C
T --> D
C & D --> K([📦 Top-k Chunks])
K --> LLM([🤖 LLM Generates Answer])
style Q fill:#1e1e1e,stroke:#999600,stroke-width:2px,color:#ffffff
style T fill:#1e1e1e,stroke:#3399cc,stroke-width:2px,color:#ffffff
style A fill:#1e1e1e,stroke:#333333,stroke-width:1px,color:#ffffff
style B fill:#1e1e1e,stroke:#333333,stroke-width:1px,color:#ffffff
style C fill:#1e1e1e,stroke:#00cc66,stroke-width:2px,color:#ffffff
style D fill:#1e1e1e,stroke:#00cc66,stroke-width:2px,color:#ffffff
style K fill:#1e1e1e,stroke:#ffaa00,stroke-width:2px,color:#ffffff
style LLM fill:#1e1e1e,stroke:#6666ff,stroke-width:2px,color:#ffffff
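As a sketch of what that upgrade looks like in practice, here is an HNSW index built with the faiss library (random vectors stand in for real embeddings). The lookup gets faster; the objective - nearest match in vector space - stays the same.

```python
# A sketch of swapping brute-force scanning for an approximate nearest-neighbor index.
# Assumes the faiss package (e.g. faiss-cpu); random vectors stand in for real embeddings.
import numpy as np
import faiss

dim = 384
rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, dim)).astype("float32")

# HNSW: a graph-based approximate nearest-neighbor index.
index = faiss.IndexHNSWFlat(dim, 32)   # 32 = neighbors kept per node in the graph
index.add(corpus)

query = rng.normal(size=(1, dim)).astype("float32")
distances, ids = index.search(query, 5)
print(ids[0])   # indices of the 5 nearest chunks: faster lookup, same similarity objective
```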
Some teams take it a step further. Instead of treating document chunks as isolated atoms, they add edges - explicit links like citations, hyperlinks, or semantic relationships. Now, after retrieving a few top nodes, the system expands outward along those edges. This creates a more context-aware retrieval process. You're not just finding the nearest answer. You're following a trail of relevance.
flowchart LR
Q([🔍 Query Vector]) --> S([📡 Initial Vector Search])
S --> N1([📄 Chunk A - Similar])
N1 --> N2([🔗 Chunk B - Cited by A])
N1 --> N3([🔗 Chunk C - Linked Topic])
N2 & N3 --> K([📦 Expanded Context])
K --> LLM([🤖 LLM Generates Answer])
style Q fill:#1e1e1e,stroke:#999600,stroke-width:2px,color:#ffffff
style S fill:#1e1e1e,stroke:#3399cc,stroke-width:2px,color:#ffffff
style N1 fill:#1e1e1e,stroke:#00cc66,stroke-width:2px,color:#ffffff
style N2 fill:#1e1e1e,stroke:#ffaa00,stroke-width:2px,color:#ffffff
style N3 fill:#1e1e1e,stroke:#ffaa00,stroke-width:2px,color:#ffffff
style K fill:#1e1e1e,stroke:#ffaa00,stroke-width:2px,color:#ffffff
style LLM fill:#1e1e1e,stroke:#6666ff,stroke-width:2px,color:#ffffff
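A sketch of that expansion step is below, with a toy edge map standing in for real citations or links; retrieve_top_k() is a placeholder for any vector search.

```python
# A sketch of expanding an initial vector hit along explicit edges (citations, links).
# retrieve_top_k() is a placeholder for any vector search; the edge map is a toy example.
chunks = {
    "A": "Apollo 13 was aborted mid-mission.",
    "B": "An oxygen tank in the service module exploded.",
    "C": "The explosion caused a loss of electrical power.",
}
edges = {"A": ["B"], "B": ["C"], "C": []}   # explicit structure between chunks

def retrieve_top_k(question: str, k: int = 1) -> list[str]:
    """Placeholder for a vector search; here it simply returns chunk 'A'."""
    return ["A"]

def expand(seed_ids: list[str], hops: int = 2) -> list[str]:
    """Follow edges outward from the initial hits for a fixed number of hops."""
    seen, frontier = list(seed_ids), list(seed_ids)
    for _ in range(hops):
        frontier = [nbr for node in frontier for nbr in edges.get(node, []) if nbr not in seen]
        seen.extend(frontier)
    return seen

context_ids = expand(retrieve_top_k("Why was Apollo 13 aborted?"))
print([chunks[i] for i in context_ids])   # A, B, C: a trail of relevance, not just the nearest hit
```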
These stages form a kind of evolution: from vanilla top-k search, to optimized nearest-neighbor indexes, to graph-expanded retrieval.
But let's be clear: even at its best, this approach is still anchored in vector space. The upgrades help. They smooth the rough edges. But they don't address the underlying mismatch between the types of questions users ask - and the kind of information vector search is good at finding.
What we have is a series of patches. Useful, necessary, but still insufficient.
Vector search is powerful, but it has blind spots. It's optimized for similarity, not structure. For surface resemblance, not semantic depth. That works when the query is simple, the answer is self-contained, and the match is obvious. But real questions aren't always like that.
Vector embeddings are great for finding text that sounds similar - but not always text that's actually relevant. Once you move beyond simple fact lookup, cracks start to show.
Fact lookup questions are straightforward - "When did Apollo 11 land?" - and vector search usually performs well. These are direct questions with localized answers, and vectors can reliably retrieve the relevant chunk.
flowchart LR
Q([🔍 Query: When did Apollo 11 land?]) --> E([🧠 Embed as Vector])
E --> V([📡 Vector Search])
V --> C([📄 Chunk: Apollo 11 landed on July 20, 1969])
C --> LLM([🤖 LLM Generates Answer: July 20, 1969 ✅])
style Q fill:#1e1e1e,stroke:#999600,stroke-width:2px,color:#ffffff
style E fill:#1e1e1e,stroke:#3399cc,stroke-width:2px,color:#ffffff
style V fill:#1e1e1e,stroke:#3399cc,stroke-width:2px,color:#ffffff
style C fill:#1e1e1e,stroke:#00cc66,stroke-width:2px,color:#ffffff
style LLM fill:#1e1e1e,stroke:#00cc66,stroke-width:2px,color:#ffffff
Comparison questions need multiple entities side by side, but vector search usually retrieves one at a time. Chunks might not include both entities, and vector space doesn't encourage that kind of joint relevance. For example, "Which mission was longer: Apollo 11 or Apollo 12?" is likely to return information about one mission, not both.
flowchart LR
Q([🔍 Query: Which mission was longer: Apollo 11 or Apollo 12?]) --> E([🧠 Embed as Vector])
E --> V([📡 Vector Search])
V --> C1([📄 Chunk: Apollo 11 duration: 8 days])
C2([📄 Chunk: Moon landing summary – missing Apollo 12])
C1 --> LLM([🤖 LLM Generates Incomplete Answer ❌])
C2 --> LLM
style Q fill:#1e1e1e,stroke:#999600,stroke-width:2px,color:#ffffff
style E fill:#1e1e1e,stroke:#3399cc,stroke-width:2px,color:#ffffff
style V fill:#1e1e1e,stroke:#3399cc,stroke-width:2px,color:#ffffff
style C1 fill:#1e1e1e,stroke:#00cc66,stroke-width:2px,color:#ffffff
style C2 fill:#1e1e1e,stroke:#cc6666,stroke-width:2px,color:#ffffff
style LLM fill:#1e1e1e,stroke:#cc3333,stroke-width:2px,color:#ffffff
linkStyle 4 stroke:#cc6666
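One hedged illustration of what "question-aware" could mean here: instead of embedding the whole comparison as a single query, issue one retrieval per entity so both sides are guaranteed to reach the model. The search() function below is a toy placeholder, not a real retriever.

```python
# A sketch of per-entity retrieval for comparison questions: query once per entity
# so both sides of the comparison end up in the context. search() is a toy placeholder.
def search(query: str, k: int = 1) -> list[str]:
    """Placeholder retriever: returns the chunk for whichever mission the query names."""
    toy_index = {
        "Apollo 11": ["Apollo 11 lasted 8 days and 3 hours."],
        "Apollo 12": ["Apollo 12 lasted 10 days and 4 hours."],
    }
    return next((chunks for name, chunks in toy_index.items() if name in query), [])

question = "Which mission was longer: Apollo 11 or Apollo 12?"
entities = ["Apollo 11", "Apollo 12"]   # extracted from the question (e.g. by an LLM or NER step)

# One retrieval per entity instead of one retrieval for the whole question.
context = []
for entity in entities:
    context += search(f"{entity} mission duration")

print(context)   # both durations are present, so the comparison is actually answerable
```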
Multi-hop or bridge questions require linking facts across chunks - something embeddings can't do without structure. These queries span documents or sections and depend on connecting disconnected data points. For instance, "Which Apollo astronaut had flown a Gemini mission before?" demands inference over multiple sources, which vector ranking alone can't manage.
flowchart LR
Q([🔍 Query: Which Apollo astronaut flew a Gemini mission before?]) --> E([🧠 Embed as Vector])
E --> V([📡 Vector Search])
V --> A([📄 Chunk A: Apollo astronauts – John Young listed])
A --> LLM([🤖 LLM Sees Partial Info ❌])
V -.->|1st Hop| M([📄 Chunk M: John Young bio])
M -.->|2nd Hop| B([📄 Chunk B: Gemini 3 mission – John Young])
style Q fill:#1e1e1e,stroke:#999600,stroke-width:2px,color:#ffffff
style E fill:#1e1e1e,stroke:#3399cc,stroke-width:2px,color:#ffffff
style V fill:#1e1e1e,stroke:#3399cc,stroke-width:2px,color:#ffffff
style A fill:#1e1e1e,stroke:#00cc66,stroke-width:2px,color:#ffffff
style M fill:#1e1e1e,stroke:#ffaa00,stroke-width:2px,color:#ffffff
style B fill:#1e1e1e,stroke:#ffaa00,stroke-width:2px,color:#ffffff
style LLM fill:#1e1e1e,stroke:#cc3333,stroke-width:2px,color:#ffffff
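A sketch of the kind of iterative, multi-hop retrieval this would take: the first hop's result supplies the query for the second hop. Both search() and extract_entity() below are toy placeholders, not a real pipeline.

```python
# A sketch of iterative, multi-hop retrieval: the first hop's result supplies the query
# for the second hop. search() and extract_entity() are toy placeholders.
def search(query: str) -> str:
    """Placeholder vector search returning a single chunk."""
    toy_index = {
        "Apollo astronaut": "Apollo 16 was commanded by John Young.",
        "John Young": "John Young flew Gemini 3 in 1965 before joining the Apollo program.",
    }
    return next((chunk for key, chunk in toy_index.items() if key in query), "")

def extract_entity(chunk: str) -> str:
    """Placeholder: in practice an LLM or NER step pulls out the bridging entity."""
    return "John Young"

hop1 = search("Apollo astronaut crew assignments")   # first hop: who flew Apollo?
bridge = extract_entity(hop1)                        # bridging entity between the two hops
hop2 = search(f"{bridge} earlier Gemini missions")   # second hop: did that astronaut fly Gemini?

print(hop1)
print(hop2)   # only together do these two chunks answer the original question
```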
Aggregation questions need full sets, like "Which missions returned lunar samples?" But vectors rank by similarity, not completeness. Top-k retrieval truncates the list, meaning only the most similar few are returned - not the most representative or exhaustive.
flowchart LR
Q([🔍 Query: Which missions returned lunar samples?]) --> E([🧠 Embed as Vector])
E --> V([📡 Vector Search])
V --> C1([📄 Chunk: Apollo 11 returned samples ✅])
V --> C2([📄 Chunk: Apollo 12 returned samples ✅])
V --> C3([📄 Chunk: Apollo 13 – no samples ❌])
%% These are missing because they didn’t make the top-k
V -.-> C4([📄 Chunk: Apollo 14 returned samples ✅])
V -.-> C5([📄 Chunk: Apollo 15 returned samples ✅])
C1 & C2 & C3 --> LLM([🤖 LLM Sees Incomplete Set ❌])
style Q fill:#1e1e1e,stroke:#999600,stroke-width:2px,color:#ffffff
style E fill:#1e1e1e,stroke:#3399cc,stroke-width:2px,color:#ffffff
style V fill:#1e1e1e,stroke:#3399cc,stroke-width:2px,color:#ffffff
style C1 fill:#1e1e1e,stroke:#00cc66,stroke-width:2px,color:#ffffff
style C2 fill:#1e1e1e,stroke:#00cc66,stroke-width:2px,color:#ffffff
style C3 fill:#1e1e1e,stroke:#cc6666,stroke-width:2px,color:#ffffff
style C4 fill:#1e1e1e,stroke:#cccccc,stroke-dasharray: 5 5,color:#ffffff
style C5 fill:#1e1e1e,stroke:#cccccc,stroke-dasharray: 5 5,color:#ffffff
style LLM fill:#1e1e1e,stroke:#cc3333,stroke-width:2px,color:#ffffff
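A toy illustration of the gap: top-k keeps only the k most similar chunks, while the question actually calls for a filter over structured fields that returns every match. The records and the boolean field below are illustrative only.

```python
# A toy illustration of top-k truncation vs. the full set an aggregation question needs.
# The records and the boolean field are illustrative only.
missions = [
    {"name": "Apollo 11", "returned_samples": True},
    {"name": "Apollo 12", "returned_samples": True},
    {"name": "Apollo 13", "returned_samples": False},
    {"name": "Apollo 14", "returned_samples": True},
    {"name": "Apollo 15", "returned_samples": True},
]

# Top-k similarity retrieval: only the k "most similar" chunks survive,
# however long the true list is.
k = 3
top_k = missions[:k]   # stand-in for the three chunks a vector search happens to rank highest

# Coverage-oriented retrieval: filter on structure and return every match.
complete = [m["name"] for m in missions if m["returned_samples"]]

print([m["name"] for m in top_k])   # truncated view of the data
print(complete)                     # the answer actually requires the full set
```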
Causal questions use abstract language, so embedding matches drift toward surface-level results. "Why was Apollo 13 aborted?" demands an explanation that spans context and intent - not just pattern matching. Vector embeddings aren't built to encode causality.
flowchart LR
Q([🔍 Query: Why was Apollo 13 aborted?]) --> E([🧠 Embed as Vector])
E --> V([📡 Vector Search])
%% Retrieved chunks sound relevant but aren't causal
V --> C1([📄 Chunk: Apollo 13 was aborted mid-mission])
V --> C2([📄 Chunk: Oxygen tank was damaged])
%% The full causal chain is not retrieved
V -.-> C3([📄 Chunk: Tank exploded → power loss → abort])
C1 & C2 --> LLM([🤖 LLM Gives Shallow Answer ❌])
style Q fill:#1e1e1e,stroke:#999600,stroke-width:2px,color:#ffffff
style E fill:#1e1e1e,stroke:#3399cc,stroke-width:2px,color:#ffffff
style V fill:#1e1e1e,stroke:#3399cc,stroke-width:2px,color:#ffffff
style C1 fill:#1e1e1e,stroke:#ffaa00,stroke-width:2px,color:#ffffff
style C2 fill:#1e1e1e,stroke:#ffaa00,stroke-width:2px,color:#ffffff
style C3 fill:#1e1e1e,stroke:#cccccc,stroke-dasharray: 5 5,color:#ffffff
style LLM fill:#1e1e1e,stroke:#cc3333,stroke-width:2px,color:#ffffff
Vector space doesn't care if you're asking about a person, date, or event - it just returns what's close. This leads to plausible-sounding but semantically misaligned results. Type mismatches often produce answers that appear correct but break under scrutiny.
flowchart LR
Q([🔍 Query: When did Apollo 11 launch?]) --> E([🧠 Embed as Vector])
E --> V([📡 Vector Search])
%% Retrieved chunk is topically similar but of the wrong type
V --> C1([📄 Chunk: Apollo 11 commander was Neil Armstrong])
C1 --> LLM([🤖 LLM Generates Wrong-Type Answer: Neil Armstrong ❌])
style Q fill:#1e1e1e,stroke:#999600,stroke-width:2px,color:#ffffff
style E fill:#1e1e1e,stroke:#3399cc,stroke-width:2px,color:#ffffff
style V fill:#1e1e1e,stroke:#3399cc,stroke-width:2px,color:#ffffff
style C1 fill:#1e1e1e,stroke:#cc6666,stroke-width:2px,color:#ffffff
style LLM fill:#1e1e1e,stroke:#cc3333,stroke-width:2px,color:#ffffff
There are many RAG variants - QA-pair retrieval, Tag-RAG, reranking pipelines - but most suffer from the same issue: they still rely on vector similarity. The interface may change, but the core remains the same, and so do the limitations.
QA-pair methods help with fact lookups by generating and embedding synthetic questions, but they break down on anything multi-hop, aggregated, or ambiguous. Tag-RAG adds metadata and labels to steer retrieval, but that only works when tags are reliable and context is local-it rarely helps with reasoning across chunks. Reranking and fusion techniques can polish the top-k, but they can't fix a bad candidate pool. If the right answer isn't there to begin with, no reranker will save you. And all of them add latency without solving the structural gaps in retrieval.
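To make the reranking point concrete, here is a sketch using a cross-encoder from sentence-transformers (the model name is only an example). The reranker can only reorder what the retriever handed it; a chunk that never made the candidate pool stays missing.

```python
# A sketch of the reranking limitation: a cross-encoder can reorder the candidate pool,
# but it cannot recover a chunk the retriever never returned.
# Assumes sentence-transformers is installed; the model name is only an example.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

question = "Which mission was longer: Apollo 11 or Apollo 12?"
candidates = [
    "Apollo 11 lasted 8 days and 3 hours.",       # retrieved
    "The Saturn V was a three-stage rocket.",     # retrieved, but irrelevant
    # "Apollo 12 lasted 10 days and 4 hours."     # never retrieved: reranking cannot add it back
]

scores = reranker.predict([(question, c) for c in candidates])
ranked = [c for _, c in sorted(zip(scores, candidates), reverse=True)]
print(ranked)   # better ordering, same incomplete pool
```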
So yes, vector search is good at finding text that sounds like the question. But sounding right and being right are not the same. And the more abstract or structured the question, the wider that gap becomes.
The real failure isn't in the language model. It's in the retrieval step. We're sending it the wrong inputs - and asking it to do too much with too little context.