Road to GraphRAG: Vector Search Alone Can't Ground LLMs
Abstract
Despite the hype, most Retrieval-Augmented Generation (RAG) pipelines today are stuck in a shallow loop: embed a question, retrieve some nearby chunks, pass them to the LLM, and hope it makes sense. It works - until it doesn't. Ask anything structurally complex or cognitively demanding, and the cracks start to show. In the following we argue that vector similarity alone is not enough to ground real understanding. The failure isn't the model. It's what we're feeding it.
We unpack the blind spots baked into today's RAG systems: the oversimplified assumptions about question types, the lack of structural awareness in retrieval, and the over-reliance on ever-larger context windows as a crutch. Through a breakdown of five core question types - fact, comparison, bridge, aggregation, and causal - we show how naive vector search repeatedly fails to supply the model with the raw material it actually needs to reason.
What emerges is a clear diagnosis: RAG is still thinking in fragments. To evolve, retrieval must start thinking in systems - graph-native, question-aware, and semantically aligned. Without that shift, generation will always be built on shaky ground. You can't vector your way to understanding.
1. Introduction: Why Most RAG Still Misses the Point
We've hit a ceiling.
Retrieval-Augmented Generation (RAG) was supposed to unlock a new era of grounded, reliable LLMs. In theory, it combines the power of language models with the precision of search. In practice, most RAG pipelines are still stuck in the basics: fetch a few chunks, stuff them into the context window, and hope for the best.
This naive "fetch-and-generate" pattern is cheap to implement and works okay for surface-level questions. But ask anything more involved-a comparison, a causal link, a chain of facts-and you'll feel the cracks. You'll get hallucinations, fragments, or worse, answers that are close enough to feel right but subtly wrong.
This isn't about context window size. Bigger windows help, but they don't fix bad retrieval. In fact, they often just delay the problem-burning tokens and compute on half-relevant chunks. What we need isn't just more context. We need better context.
To move forward, we need to rethink retrieval from the ground up. That requires a much deeper grasp of what retrieval is actually trying to do-what kinds of questions it must answer, what information is truly relevant, and how that relevance depends on structure, not just similarity. Without that clarity, every improvement is just noise.
The following analysis examines the fundamental limitations of today's RAG pipelines. It explores the five key types of questions that characterize real-world use, explains why most retrieval strategies fall short of handling them effectively, and outlines what's still missing from the current retrieval stack. The core argument is simple: today's retrieval systems are largely blind to complexity-and until we address that, our generation layers will keep stumbling.
You can't just "fetch" your way to understanding. You need structure. You need systems. You need strategy.
1.1 Quick Primer: What Even Is Vector Search?
Before we tear it apart, let's quickly define what vector search actually is.
A vector is just a list of numbers that captures the meaning of some input - like a sentence, a paragraph, or a whole document. This is done through embeddings, which are generated by an AI model trained to map similar meanings to nearby points in space.
So instead of searching by exact words, you search by meaning.
Ask "Who was the first person on the moon?" - that question gets turned into a vector. Then the system compares it to a bunch of other vectors (representing documents, chunks, etc.) and finds the ones that are closest in meaning. Not in wording - in intent.
This is the core promise of vector search: instead of matching strings, it matches thoughts.
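A minimal sketch of the idea, using hand-made three-dimensional "embeddings" as stand-ins for a real model's output (real embeddings have hundreds or thousands of dimensions, and the vectors below are invented purely for illustration):

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Toy 3-d "embeddings" - a real model maps text to much higher dimensions.
docs = {
    "Neil Armstrong was the first person to walk on the moon.": [0.9, 0.1, 0.2],
    "Apollo 13 was aborted after an oxygen tank exploded.":     [0.1, 0.8, 0.3],
    "The Saturn V rocket launched every crewed Apollo mission.": [0.2, 0.3, 0.9],
}

# Pretend embedding of "Who was the first person on the moon?"
query_vec = [0.85, 0.15, 0.25]

# Nearest neighbor by meaning: the Armstrong sentence wins.
best = max(docs, key=lambda text: cosine(query_vec, docs[text]))
```

Nothing about the query's wording matches "walk on the moon" exactly - the match happens in vector space, not string space.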
2. The Spectrum of Question Types
If we want retrieval to be effective, we have to start by understanding the shape of the questions themselves. Most RAG systems treat all questions the same: embed the query, retrieve the closest chunks, feed them into the model. But real questions are more nuanced. They vary in structure, complexity, and what they demand from a retriever. They're not all created equal.
There are five dominant types we see in practice:
Fact Lookup
This is the simplest form. These are direct questions with a clear answer. The system only needs to find one relevant chunk that contains the fact-no reasoning required.
Comparison
These questions require bringing together multiple entities to weigh them side by side. The challenge is not just retrieving facts, but making sure those facts are retrieved together.
Bridge (Multi-hop)
These require connecting information that doesn't co-occur in a single place. It might involve joining details from two or more sources to produce a coherent answer.
Aggregation
This asks for a set of things-like listing all missions with a specific property. These questions demand coverage, not just similarity.
Causal (Reasoning)
These are the most complex. They often require not just retrieving facts but surfacing relationships or explanations that span multiple concepts.
Together, these five types cover the majority of real-world needs. And more importantly, they surface the boundaries of vector-only retrieval. Most failures in RAG aren't generation problems. They're retrieval mismatches. We're using the wrong tools for the wrong jobs-and expecting the model to patch the gaps.
Before you tune your reranker, upgrade your embedding model, or expand your context window, ask this: what kind of question am I really trying to answer?
3. How RAG Typically Works: The Evolution from Naive to Smart Retrieval
The Default: Vanilla RAG
Most RAG systems begin with good intentions and crude tools. The default setup is straightforward: embed the query, perform a cosine search across all document chunks, and pass the top-k into the model. It's simple. It's fast. And it's shallow.
This is what we call vanilla RAG. It doesn't discriminate between types of questions, sources of knowledge, or structural dependencies. It assumes that whatever floats to the top of a similarity search is good enough to answer the question. Sometimes that's true. Often, it's not.
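The whole vanilla pipeline fits in a few lines. Here is a sketch using a crude bag-of-words counter in place of a real embedding model (the `embed` stand-in and the sample chunks are ours, not any particular library's API):

```python
from collections import Counter
from math import sqrt

def embed(text):
    # Stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = lambda v: sqrt(sum(c * c for c in v.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

chunks = [
    "Apollo 11 landed on the moon on July 20, 1969.",
    "The Saturn V rocket was developed at Marshall Space Flight Center.",
    "Apollo 12 landed in the Ocean of Storms in November 1969.",
]

def vanilla_rag(question, k=2):
    # Embed the query, rank every chunk by similarity, keep the top-k.
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    # Stuff the top-k chunks into the prompt and hope for the best.
    return "Context:\n" + "\n".join(ranked[:k]) + f"\n\nQuestion: {question}"

prompt = vanilla_rag("When did Apollo 11 land on the moon?")
```

Note what's missing: no awareness of question type, no notion of which chunks belong together, no structure at all. Similarity in, context out.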
Step One: Smarter Indexes
To improve performance, teams move toward optimized approximate nearest-neighbor search. They swap brute-force scanning for smarter indexes like FAISS, HNSW, or ScaNN. HNSW navigates a layered proximity graph; FAISS and ScaNN combine clustering, quantization, and graph techniques to speed up lookups across large corpora. This change boosts speed - but not necessarily quality. It's still vector search. You're still looking for the closest match, not the right structure.
Step Two: Adding Structure
Some teams take it a step further. Instead of treating document chunks as isolated atoms, they add edges-explicit links like citations, hyperlinks, or semantic relationships. Now, after retrieving a few top nodes, the system expands outward along those edges. This creates a more context-aware retrieval process. You're not just finding the nearest answer. You're following a trail of relevance.
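A sketch of that expansion step, with made-up similarity scores and a toy edge list (the chunk names and scores are invented for illustration):

```python
# Toy corpus: each chunk carries a precomputed similarity score to the query
# and explicit edges - citations, hyperlinks, or shared-entity links.
chunks = {
    "armstrong_bio":   {"score": 0.91, "edges": ["gemini_8", "apollo_11"]},
    "apollo_11":       {"score": 0.74, "edges": ["armstrong_bio", "saturn_v"]},
    "gemini_8":        {"score": 0.12, "edges": ["armstrong_bio"]},
    "saturn_v":        {"score": 0.33, "edges": ["apollo_11"]},
    "mission_control": {"score": 0.05, "edges": []},
}

def retrieve_with_expansion(k=1, hops=1):
    # Step 1: seed with the top-k most similar chunks (plain vector search).
    seeds = sorted(chunks, key=lambda c: chunks[c]["score"], reverse=True)[:k]
    # Step 2: expand outward along explicit edges for a fixed number of hops.
    frontier, selected = set(seeds), set(seeds)
    for _ in range(hops):
        frontier = {n for c in frontier for n in chunks[c]["edges"]} - selected
        selected |= frontier
    return selected

# Vector search alone (k=1) finds only the bio; one hop of expansion also
# pulls in gemini_8 - a chunk that barely registers by similarity alone.
result = retrieve_with_expansion()
```

The key property: `gemini_8` scores near zero on similarity, yet it's one edge away from the best match - exactly the kind of chunk a bridge question needs.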
From Retrieval to Reasoning
These stages form a kind of evolution:
- From blind similarity to structured navigation
- From isolated chunks to connected knowledge
- From brute force to systems that begin to resemble reasoning
Still Just a Patch
But let's be clear: even at its best, this approach is still anchored in vector space. The upgrades help. They smooth the rough edges. But they don't address the underlying mismatch between the types of questions users ask-and the kind of information vector search is good at finding.
What we have is a series of patches. Useful, necessary, but still insufficient.
4. The Limits of Vector-Only Retrieval
Optimized for Similarity, Not Structure
Vector search is powerful, but it has blind spots. It's optimized for similarity, not structure. For surface resemblance, not semantic depth. That works when the query is simple, the answer is self-contained, and the match is obvious. But real questions aren't always like that.
Vector embeddings are great for finding text that sounds similar - but not always text that's actually relevant. Once you move beyond simple fact lookup, cracks start to show.
The Breakdown by Question Type
Fact Lookups
Fact lookup questions are straightforward-"When did Apollo 11 land?"-and vector search usually performs well. These are direct questions with localized answers, and vectors can reliably retrieve the relevant chunk.
Comparisons
Comparison questions need multiple entities side by side, but vector search usually retrieves one at a time. Chunks might not include both entities, and vector space doesn't encourage that kind of joint relevance. For example, "Which mission was longer: Apollo 11 or Apollo 12?" is likely to return information about one mission, not both.
Bridge (Multi-hop)
Multi-hop or bridge questions require linking facts across chunks-something embeddings can't do without structure. These queries span documents or sections and depend on connecting disconnected data points. For instance, "Which Apollo astronaut had flown a Gemini mission before?" demands inference over multiple sources, which vector ranking alone can't manage.
Aggregation
Aggregation questions need full sets, like "Which missions returned lunar samples?" But vectors rank by similarity, not completeness. Top-k retrieval truncates the list, meaning only the most similar few are returned-not the most representative or exhaustive.
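The truncation is easy to see in miniature. All six chunks below answer the question, but with invented similarity scores and a top-3 cutoff, half the set never reaches the model:

```python
# Six missions returned lunar samples, but a top-k similarity cutoff only
# sees the k chunks that happen to sound most like the query.
# (Scores are made up for illustration.)
scored_chunks = [
    ("Apollo 11 returned lunar samples to Earth.", 0.88),
    ("Apollo 12 returned lunar samples to Earth.", 0.85),
    ("Apollo 14 returned lunar samples to Earth.", 0.81),
    ("Apollo 15 brought back rocks from Hadley Rille.", 0.52),
    ("Apollo 16 collected material in the Descartes Highlands.", 0.47),
    ("Apollo 17 hauled home the largest sample haul of the program.", 0.41),
]

TOP_K = 3
retrieved = [text for text, score in
             sorted(scored_chunks, key=lambda p: p[1], reverse=True)[:TOP_K]]

# The model now answers "Which missions returned lunar samples?" from an
# incomplete set: Apollo 15, 16, and 17 never reach the context window.
```

Raising k helps only until the next aggregation question needs k+1 - the ranking objective itself is wrong for set-valued answers.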
Causal Questions
Causal questions use abstract language, so embedding matches drift toward surface-level results. "Why was Apollo 13 aborted?" demands an explanation that spans context and intent-not just pattern matching. Vector embeddings aren't built to encode causality.
Type Mismatches
Vector space doesn't care if you're asking about a person, date, or event-it just returns what's close. This leads to plausible-sounding but semantically misaligned results. Type mismatches often produce answers that appear correct but break under scrutiny.
4.1 Vector Variants Don't Fix the Core Problem
There are many RAG variants-QA-pair retrieval, Tag-RAG, reranking pipelines-but most suffer from the same issue: they still rely on vector similarity. The interface may change, but the core remains the same, and so do the limitations.
QA-pair methods help with fact lookups by generating and embedding synthetic questions, but they break down on anything multi-hop, aggregated, or ambiguous. Tag-RAG adds metadata and labels to steer retrieval, but that only works when tags are reliable and context is local-it rarely helps with reasoning across chunks. Reranking and fusion techniques can polish the top-k, but they can't fix a bad candidate pool. If the right answer isn't there to begin with, no reranker will save you. And all of them add latency without solving the structural gaps in retrieval.
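The reranking limitation in particular is worth seeing concretely. A reranker only permutes the candidate pool it is handed - it cannot add to it. Here's a sketch with a crude word-overlap scorer standing in for a cross-encoder (the pool contents are invented for illustration):

```python
# Top-k candidates from first-stage vector search. The chunk that actually
# answers "Why was Apollo 13 aborted?" - the oxygen tank explosion - was
# never retrieved, so it isn't in the pool at all.
candidate_pool = [
    "Apollo 13's crew trained extensively in simulators.",
    "Apollo 13 launched on April 11, 1970.",
    "The Apollo 13 film was released in 1995.",
]

def rerank(pool, query):
    # Stand-in for a cross-encoder: score by crude word overlap with the query.
    q = set(query.lower().split())
    return sorted(pool, key=lambda c: len(q & set(c.lower().split())),
                  reverse=True)

reranked = rerank(candidate_pool, "Why was Apollo 13 aborted?")
# Reordering the pool cannot conjure the missing explanation: the output is
# a permutation of the same three chunks, none of which answers the question.
```

Recall failures must be fixed at retrieval time; no amount of downstream polish recovers a chunk that was never fetched.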
5. Conclusion
So yes, vector search is good at finding text that sounds like the question. But sounding right and being right are not the same. And the more abstract or structured the question, the wider that gap becomes.
The real failure isn't in the language model. It's in the retrieval step. We're sending it the wrong inputs-and asking it to do too much with too little context.