sajad ghawami
8 July 2025 · ~9 min read

Road to GraphRAG: Vector Search Alone Can't Ground LLMs

Abstract

Despite the hype, most Retrieval-Augmented Generation (RAG) pipelines today are stuck in a shallow loop: embed a question, retrieve some nearby chunks, pass them to the LLM, and hope it makes sense. It works - until it doesn't. Ask anything structurally complex or cognitively demanding, and the cracks start to show. In what follows, I argue that vector similarity alone is not enough to ground real understanding. The failure isn't the model. It's what we're feeding it.

We unpack the blind spots baked into today's RAG systems: the oversimplified assumptions about question types, the lack of structural awareness in retrieval, and the over-reliance on ever-larger context windows as a crutch. Through a breakdown of five core question types - fact, comparison, bridge, aggregation, and causal - we show how naive vector search repeatedly fails to supply the model with the raw material it actually needs to reason.

What emerges is a clear diagnosis: RAG is still thinking in fragments. To evolve, retrieval must start thinking in systems - graph-native, question-aware, and semantically aligned. Without that shift, generation will always be built on shaky ground. You can't vector your way to understanding.

1. Introduction: Why Most RAG Still Misses the Point

We've hit a ceiling.

Retrieval-Augmented Generation (RAG) was supposed to unlock a new era of grounded, reliable LLMs. In theory, it combines the power of language models with the precision of search. In practice, most RAG pipelines are still stuck in the basics: fetch a few chunks, stuff them into the context window, and hope for the best.

This naive "fetch-and-generate" pattern is cheap to implement and works okay for surface-level questions. But ask anything more involved-a comparison, a causal link, a chain of facts-and you'll feel the cracks. You'll get hallucinations, fragments, or worse, answers that are close enough to feel right but subtly wrong.

This isn't about context window size. Bigger windows help, but they don't fix bad retrieval. In fact, they often just delay the problem-burning tokens and compute on half-relevant chunks. What we need isn't just more context. We need better context.

To move forward, we need to rethink retrieval from the ground up. That requires a much deeper grasp of what retrieval is actually trying to do-what kinds of questions it must answer, what information is truly relevant, and how that relevance depends on structure, not just similarity. Without that clarity, every improvement is just noise.

The following analysis examines the fundamental limitations of today's RAG pipelines. It explores the five key types of questions that characterize real-world use, explains why most retrieval strategies fall short of handling them effectively, and outlines what's still missing from the current retrieval stack. The core argument is simple: today's retrieval systems are largely blind to complexity-and until we address that, our generation layers will keep stumbling.

You can't just "fetch" your way to understanding. You need structure. You need systems. You need strategy.

1.1 Quick Primer: What Even Is Vector Search?

Before we tear it apart, let's quickly define what vector search actually is.

A vector is just a list of numbers that captures the meaning of some input - like a sentence, a paragraph, or a whole document. This is done through embeddings, which are generated by an AI model trained to map similar meanings to nearby points in space.

So instead of searching by exact words, you search by meaning.

Ask "Who was the first person on the moon?" - that question gets turned into a vector. Then the system compares it to a bunch of other vectors (representing documents, chunks, etc.) and finds the ones that are closest in meaning. Not in wording - in intent.

This is the core promise of vector search: instead of matching strings, it matches thoughts.
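To make this concrete, here's a minimal sketch in Python. The three-dimensional "embeddings" are hand-made toy values chosen for illustration - a real embedding model would produce vectors with hundreds of dimensions - but the similarity mechanic is the same:

```python
import math

def cosine(a, b):
    """Cosine similarity: ~1.0 for vectors pointing the same way, ~0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-d "embeddings" (made-up values, not real model output).
vecs = {
    "dog":       [0.9, 0.8, 0.1],
    "cat":       [0.8, 0.9, 0.1],
    "rocket":    [0.1, 0.1, 0.9],
    "satellite": [0.1, 0.2, 0.8],
}

# "dog" lands closer in meaning to "cat" than to "rocket".
assert cosine(vecs["dog"], vecs["cat"]) > cosine(vecs["dog"], vecs["rocket"])
```

Searching by meaning is then just: embed the query, compute `cosine` against every stored vector, and keep the closest matches.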

[Figure: embedding space - 🐶 dog and 🐱 cat are close in meaning; 🚀 rocket, 🛰️ satellite, and 🛸 spaceship form another close cluster; the two clusters are distant from each other.]

2. The Spectrum of Question Types

If we want retrieval to be effective, we have to start by understanding the shape of the questions themselves. Most RAG systems treat all questions the same: embed the query, retrieve the closest chunks, feed them into the model. But real questions are more nuanced. They vary in structure, complexity, and what they demand from a retriever. They're not all created equal.

There are five dominant types we see in practice:

Fact Lookup

This is the simplest form. These are direct questions with a clear answer. The system only needs to find one relevant chunk that contains the fact-no reasoning required.

🔍 Query: "When did Apollo 11 land?" → 🧠 Embed as Vector → 📡 Vector Search → 📄 Chunk: "Apollo 11 landed on July 20, 1969" → 🤖 LLM Generates Answer: July 20, 1969

Comparison

These questions require bringing together multiple entities to weigh them side by side. The challenge is not just retrieving facts, but making sure those facts are retrieved together.

🔍 Query: "Which mission was longer?" → 🧠 Embed as Vector → 📡 Vector Search → 📄 Chunk: Apollo 11 duration + 📄 Chunk: Apollo 12 duration → 🤖 LLM Compares & Answers

Bridge (Multi-hop)

These require connecting information that doesn't co-occur in a single place. It might involve joining details from two or more sources to produce a coherent answer.

🔍 Query: "Which Apollo astronaut flew a Gemini mission before?" → 🧠 Embed as Vector → 📡 Vector Search → 📄 Chunk A: Apollo missions (👤 John Young) + 📄 Chunk B: Gemini crew assignments (👤 John Young) → 🧠 LLM Links Info → 🤖 Final Answer: John Young flew both

Aggregation

This asks for a set of things-like listing all missions with a specific property. These questions demand coverage, not just similarity.

🔍 Query: "Which missions returned lunar samples?" → 🧠 Embed as Vector → 📡 Vector Search → 📄 Chunks: Apollo 11 - Yes, Apollo 12 - Yes, Apollo 13 - No, Apollo 14 - Yes → 🧠 LLM Aggregates → 🤖 Final Answer: Apollo 11, 12, 14...

Causal (Reasoning)

These are the most complex. They often require not just retrieving facts but surfacing relationships or explanations that span multiple concepts.

🔍 Query: "Why was Apollo 13 aborted?" → 🧠 Embed as Vector → 📡 Vector Search → 📄 Chunks: oxygen tank exploded, power failure, abort decision by NASA → 🧠 LLM Reconstructs Causal Chain → 🤖 Final Answer: Explosion → Power loss → Abort

Together, these five types cover the majority of real-world needs. And more importantly, they surface the boundaries of vector-only retrieval. Most failures in RAG aren't generation problems. They're retrieval mismatches. We're using the wrong tools for the wrong jobs-and expecting the model to patch the gaps.

Before you tune your reranker, upgrade your embedding model, or expand your context window, ask this: what kind of question am I really trying to answer?

3. How RAG Typically Works: The Evolution from Naive to Smart Retrieval

The Default: Vanilla RAG

Most RAG systems begin with good intentions and crude tools. The default setup is straightforward: embed the query, perform a cosine search across all document chunks, and pass the top-k into the model. It's simple. It's fast. And it's shallow.

This is what we call vanilla RAG. It doesn't discriminate between types of questions, sources of knowledge, or structural dependencies. It assumes that whatever floats to the top of a similarity search is good enough to answer the question. Sometimes that's true. Often, it's not.

🔢 Embedding Space: 🔍 Query Vector sits near 📄 Chunks C and D (similar); 📄 Chunks A, B, E are farther away → 🧮 Cosine Similarity Search → 📦 Top-k Chunks (C, D) → 🤖 LLM Generates Answer
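The whole vanilla pipeline fits in a few lines. Here's a minimal sketch, where the 2-d chunk vectors are made-up stand-ins for real embedding-model output:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve_top_k(query_vec, chunks, k=2):
    """Vanilla RAG retrieval: score every chunk, keep the k most similar."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return ranked[:k]

# Made-up 2-d embeddings standing in for real model output.
chunks = [
    {"text": "Chunk A", "vec": [0.1, 0.9]},
    {"text": "Chunk B", "vec": [0.2, 0.8]},
    {"text": "Chunk C", "vec": [0.9, 0.2]},
    {"text": "Chunk D", "vec": [0.8, 0.3]},
    {"text": "Chunk E", "vec": [0.3, 0.6]},
]

top = retrieve_top_k([1.0, 0.1], chunks, k=2)
context = "\n".join(c["text"] for c in top)  # stuffed into the LLM prompt as-is
assert [c["text"] for c in top] == ["Chunk C", "Chunk D"]
```

Everything downstream - the prompt, the answer, the citations - depends entirely on what survives that `[:k]` cut.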

Step One: Smarter Indexes

To improve performance, teams move toward optimized approximate nearest-neighbor search. They swap brute-force scanning for smarter indexes like FAISS, HNSW, or ScaNN. Some of these, like HNSW, are graph-based, allowing faster lookups across large corpora. This change boosts speed-but not necessarily quality. It's still vector search. You're still looking for the closest match, not the right structure.

🔍 Query Vector → 🧭 Traverse Index (🧠 HNSW / FAISS over 📄 Chunks A-D; C and D similar) → 📦 Top-k Chunks → 🤖 LLM Generates Answer

Step Two: Adding Structure

Some teams take it a step further. Instead of treating document chunks as isolated atoms, they add edges-explicit links like citations, hyperlinks, or semantic relationships. Now, after retrieving a few top nodes, the system expands outward along those edges. This creates a more context-aware retrieval process. You're not just finding the nearest answer. You're following a trail of relevance.

🔍 Query Vector → 📡 Initial Vector Search → 📄 Chunk A (similar) → 🔗 Chunk B (cited by A) + 🔗 Chunk C (linked topic) → 📦 Expanded Context → 🤖 LLM Generates Answer
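One way to sketch that expansion step, assuming each chunk's explicit links (citations, hyperlinks, shared topics) are stored in a plain adjacency map - the chunk IDs and link structure here are hypothetical:

```python
def expand_context(seed_ids, edges, hops=1):
    """Grow the retrieved set by following explicit links outward from the seeds."""
    seen = set(seed_ids)
    frontier = set(seed_ids)
    for _ in range(hops):
        # Every chunk one link away from the current frontier, minus what we have.
        frontier = {dst for src in frontier for dst in edges.get(src, [])} - seen
        seen |= frontier
    return seen

# Hypothetical link structure: A cites B and links to topic page C; C links to D.
edges = {"A": ["B", "C"], "C": ["D"]}

# Vector search surfaced only chunk A; one hop of expansion pulls in B and C.
assert expand_context({"A"}, edges, hops=1) == {"A", "B", "C"}
assert expand_context({"A"}, edges, hops=2) == {"A", "B", "C", "D"}
```

The design choice that matters is the hop budget: each extra hop widens coverage but also drags in less relevant chunks, so real systems pair expansion with some form of pruning.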

From Retrieval to Reasoning

These stages form a kind of evolution:

  • From blind similarity to structured navigation
  • From isolated chunks to connected knowledge
  • From brute force to systems that begin to resemble reasoning

Still Just a Patch

But let's be clear: even at its best, this approach is still anchored in vector space. The upgrades help. They smooth the rough edges. But they don't address the underlying mismatch between the types of questions users ask-and the kind of information vector search is good at finding.

What we have is a series of patches. Useful, necessary, but still insufficient.

4. The Limits of Vector-Only Retrieval

Optimized for Similarity, Not Structure

Vector search is powerful, but it has blind spots. It's optimized for similarity, not structure. For surface resemblance, not semantic depth. That works when the query is simple, the answer is self-contained, and the match is obvious. But real questions aren't always like that.

Vector embeddings are great for finding text that sounds similar - but not always text that's actually relevant. Once you move beyond simple fact lookup, cracks start to show.

The Breakdown by Question Type

Fact Lookups

Fact lookup questions are straightforward-"When did Apollo 11 land?"-and vector search usually performs well. These are direct questions with localized answers, and vectors can reliably retrieve the relevant chunk.

🔍 Query: "When did Apollo 11 land?" → 🧠 Embed as Vector → 📡 Vector Search → 📄 Chunk: "Apollo 11 landed on July 20, 1969" → 🤖 LLM Generates Answer: July 20, 1969 ✅

Comparisons

Comparison questions need multiple entities side by side, but vector search usually retrieves one at a time. Chunks might not include both entities, and vector space doesn't encourage that kind of joint relevance. For example, "Which mission was longer: Apollo 11 or Apollo 12?" is likely to return information about one mission, not both.

🔍 Query: "Which mission was longer: Apollo 11 or Apollo 12?" → 🧠 Embed as Vector → 📡 Vector Search → 📄 Chunk: Apollo 11 duration: 8 days + 📄 Chunk: moon landing summary (missing Apollo 12) → 🤖 LLM Generates Incomplete Answer ❌

Bridge (Multi-hop)

Multi-hop or bridge questions require linking facts across chunks-something embeddings can't do without structure. These queries span documents or sections and depend on connecting disconnected data points. For instance, "Which Apollo astronaut had flown a Gemini mission before?" demands inference over multiple sources, which vector ranking alone can't manage.

🔍 Query: "Which Apollo astronaut flew a Gemini mission before?" → 🧠 Embed as Vector → 📡 Vector Search (1st hop only) → 📄 Chunk A: Apollo astronauts - John Young listed → 🤖 LLM Sees Partial Info ❌
(2nd-hop chunks never retrieved: 📄 Chunk M: John Young bio, 📄 Chunk B: Gemini 3 mission - John Young)

Aggregation

Aggregation questions need full sets, like "Which missions returned lunar samples?" But vectors rank by similarity, not completeness. Top-k retrieval truncates the list, meaning only the most similar few are returned-not the most representative or exhaustive.
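The failure mode is easy to reproduce. The similarity scores below are invented, but the mechanism is real: top-k is a hard cutoff, and any member of the answer set that scores below it simply vanishes:

```python
# Invented similarity scores for "Which missions returned lunar samples?".
scores = {
    "Apollo 11 returned samples": 0.93,
    "Apollo 12 returned samples": 0.91,
    "Apollo 14 returned samples": 0.88,
    "Apollo 15 returned samples": 0.74,  # phrased differently in the corpus -> lower score
}
full_answer_set = set(scores)  # all four chunks are needed for a complete answer

k = 3
top_k = sorted(scores, key=scores.get, reverse=True)[:k]
missing = full_answer_set - set(top_k)

# The LLM never sees Apollo 15, so its "complete list" is silently wrong.
assert missing == {"Apollo 15 returned samples"}
```

Raising k helps only until the next oddly-phrased member falls below the line; similarity ranking has no notion of "return the whole set".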

🔍 Query: "Which missions returned lunar samples?" → 🧠 Embed as Vector → 📡 Vector Search → 📄 Chunks: Apollo 11 ✅, Apollo 12 ✅, Apollo 13 ❌ (no samples), Apollo 14 ✅, Apollo 15 ✅ → 🤖 LLM Sees Incomplete Set ❌

Causal Questions

Causal questions use abstract language, so embedding matches drift toward surface-level results. "Why was Apollo 13 aborted?" demands an explanation that spans context and intent-not just pattern matching. Vector embeddings aren't built to encode causality.

🔍 Query: "Why was Apollo 13 aborted?" → 🧠 Embed as Vector → 📡 Vector Search → 📄 Chunk: Apollo 13 was aborted mid-mission + 📄 Chunk: oxygen tank failure on Apollo 13 → 🤖 LLM Gets Surface Info Only ❌

Type Mismatches

Vector space doesn't care if you're asking about a person, date, or event-it just returns what's close. This leads to plausible-sounding but semantically misaligned results. Type mismatches often produce answers that appear correct but break under scrutiny.

🔍 Query: "Who commanded Apollo 11?" → 🧠 Embed as Vector → 📡 Vector Search → 📄 Chunks: Apollo 11 mission details, Apollo 11 launch date, Neil Armstrong bio → 🤖 LLM May Miss Commander Info ❌

4.1 Vector Variants Don't Fix the Core Problem

There are many RAG variants-QA-pair retrieval, Tag-RAG, reranking pipelines-but most suffer from the same issue: they still rely on vector similarity. The interface may change, but the core remains the same, and so do the limitations.

  • QA-pair methods help with fact lookups by generating and embedding synthetic questions, but they break down on anything multi-hop, aggregated, or ambiguous.
  • Tag-RAG adds metadata and labels to steer retrieval, but that only works when tags are reliable and context is local-it rarely helps with reasoning across chunks.
  • Reranking and fusion techniques can polish the top-k, but they can't fix a bad candidate pool. If the right answer isn't there to begin with, no reranker will save you.

And all of them add latency without solving the structural gaps in retrieval.
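The reranker point is worth making concrete. Whatever scoring function you plug in below (the one here is an arbitrary stand-in, not a real reranker model), reranking can only permute the candidate pool - it can never inject the missing answer:

```python
def rerank(candidates, score_fn):
    """A reranker reorders what retrieval already produced; it adds nothing."""
    return sorted(candidates, key=score_fn, reverse=True)

# Retrieval missed the chunk that actually answers "Who commanded Apollo 11?".
pool = ["Apollo 11 mission details", "Apollo 11 launch date"]
answer_chunk = "Neil Armstrong commanded Apollo 11"

reranked = rerank(pool, score_fn=len)  # any score_fn returns the same set, reordered
assert set(reranked) == set(pool)
assert answer_chunk not in reranked
```

Recall is decided at retrieval time; everything after top-k selection can only shuffle or trim.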

5. Conclusion

So yes, vector search is good at finding text that sounds like the question. But sounding right and being right are not the same. And the more abstract or structured the question, the wider that gap becomes.

The real failure isn't in the language model. It's in the retrieval step. We're sending it the wrong inputs-and asking it to do too much with too little context.