Authored: 16.04.2025 by Sajad Ghawami
The proposed solution introduces a hybrid architecture enabling Large Language Models (LLMs) to interact naturally and efficiently with tabular data stored in relational databases. By generating type-aware Abstract Syntax Trees (ASTs) as intermediates, the approach scales beyond context-window limits, mitigates hallucination, and adapts flexibly to schema changes. Intelligent schema parsing, fuzzy search, and structured translation into SQL ensure robust and accurate querying. The architecture is designed to scale to 1 million rows and beyond, maintaining performance and reliability at large data volumes.
The emergence of LLMs has revolutionized how we think about human-computer interaction. Yet when it comes to structured data - spreadsheets, CSVs, relational databases - the limitations quickly become apparent. While smaller datasets can be processed in-memory by an LLM, larger datasets present a fundamental bottleneck: the fixed context window.
Despite numerous attempts to bridge this divide, most existing solutions fall short either in generality, scalability, or robustness. Fine-tuning models on specific schemas offers precision but lacks portability. Tool-calling approaches rely heavily on predefined functions, becoming brittle as complexity increases. Meanwhile, approaches that involve direct SQL generation often succumb to hallucinations and syntax errors, with limited safeguards against incorrect or dangerous queries.
This work presents a scalable, modular system that can generalize across tabular datasets of varying structure and size. Central to this system is a multi-layered abstraction pipeline that begins with schema interpretation and ends with a dynamically generated SQL query, mediated by a constrained, type-aware AST. I argue that this approach strikes a critical balance: preserving the flexibility of LLMs while ensuring rigorous control through structured intermediaries.
When faced with the problem of querying structured data via natural language, most engineers turn to familiar tools. And indeed, on the surface, there’s a certain appeal in giving a language model direct access to raw data or crafting custom interfaces through which it can “ask” questions. But as we’ve learned through experience, these solutions quickly buckle under the weight of complexity, variability, and scale.
The most obvious starting point is to simply give the LLM access to the data - say, a CSV or Excel file - and ask it to answer questions. This works surprisingly well on small files. The model can ingest the full dataset into its context window, reason about patterns, and even generate correct responses with a veneer of intelligence. But this only holds for trivial cases. In the real world, datasets are large, messy, and dynamic. They can't fit in memory, let alone in the context window of a language model. The illusion of competence breaks down fast.
The next natural step is to move the data into a database. This is not only more scalable but aligns with decades of hard-won engineering wisdom: databases are built to handle structured data efficiently. Here, the goal becomes clear - can we enable the LLM to generate queries that interact with the database on behalf of the user?
One approach is to fine-tune the model on a specific schema, training it to convert natural language into structured queries. This method promises accuracy. It encodes knowledge of the domain into the model itself. But this comes at a cost: fine-tuning is expensive, time-consuming, and tightly coupled to the structure of the data. A single schema change, a renamed column, a new table - and the entire model might require retraining. It is powerful, but inflexible. As a result, it struggles to adapt to new or evolving datasets.
<aside>
A fine-tuned LLM might learn to interpret “total sales per month” correctly for a given dataset, but if the sales column is renamed to txn_value or if a new date format is introduced, it may silently fail or produce invalid output.
</aside>
Another widely used technique is to define a set of functions or tools that the LLM can call. Each function is handcrafted - `getTotalSales` or `getTopCustomers` - and the model is instructed to select and parameterize the correct tool. This adds a layer of abstraction and improves safety. But it introduces a new problem: complexity balloons quickly. To support every possible user query, you need a library of functions. Aggregations, joins, filters, groupings - the function space grows combinatorially. Supporting open-ended exploration becomes a significant engineering burden.
<aside>
A user might ask, “What’s the average delivery time for repeat customers in the last quarter?” If you haven’t implemented exactly that function, the system fails. And writing generalized aggregation logic that works across arbitrary table structures is a non-trivial problem in itself.
</aside>
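To make the combinatorial growth concrete, consider what such a tool library ends up looking like. This is a hypothetical sketch - every function name below is invented for illustration:

```python
# A hypothetical tool library. Each shape of question needs its own
# handcrafted function, and the list grows with every new combination
# of metric, filter, and grouping.

def get_total_sales(start: str, end: str) -> float: ...
def get_total_sales_by_region(start: str, end: str, region: str) -> float: ...
def get_top_customers(n: int) -> list[str]: ...
def get_top_customers_by_region(n: int, region: str) -> list[str]: ...
def get_avg_delivery_time_for_repeat_customers(quarter: str) -> float: ...
# ...and so on, one function per combination the user might ask about.
```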
In search of flexibility, many systems give the LLM permission to write SQL directly. This seems elegant. SQL is expressive, databases are optimized to execute it, and models can be prompted with schema context. But this approach is precarious. LLMs are not trained to obey syntactic constraints. They hallucinate columns, confuse data types, and generate invalid syntax. Worse, they may issue dangerous queries - accidentally deleting rows or locking up the database with expensive joins.
<aside>
An LLM might generate something like `SELECT * FROM users WHERE name LIKE %john%` - forgetting the required quotes around the wildcard pattern - or assume a `users` table exists when it doesn't. Guardrails are minimal.
</aside>
A slight variation on this theme involves prompting the model with schema metadata or SQL examples, nudging it to stay within known boundaries. But the LLM is still being asked to learn the rules of SQL, plus the semantics of the schema, all at inference time. Even subtle shifts in language or structure can lead to invalid or misleading queries.
Regardless of the approach, one persistent issue is user input. People make typos. They misspell “Apple” as “Aplpe,” write “revenue” instead of “total_sales,” or mix abbreviations with full terms. These mistakes, trivial for a human to resolve, can derail a machine. Whether you rely on strict function matching or free-form query generation, handling noisy input becomes a fundamental challenge.
Most systems aren’t designed to account for this gracefully. Some attempt fuzzy matching heuristics, but these are often limited in scope or bolted on late in development. The result is a narrow interface that works well only when user input conforms perfectly to expectations. This breaks the illusion of “natural language” and leads to frustrating user experiences.
The limitations of existing approaches aren’t due to lack of ambition - they’re structural. We’re trying to bridge two fundamentally different systems: a language model trained on natural language, and a database built for structured queries. Most methods pick a side: either overfit the model to the data or over-constrain the interface to accommodate the quirks of language. The chosen path here is different.
This solution is a hybrid architecture that embraces both flexibility and rigor. It draws on the adaptability of language models, but grounds their outputs in a system that enforces correctness, type safety, and interpretability. At its core is a translation layer: a simplified, type-aware Abstract Syntax Tree (AST) that acts as an intermediate language between the LLM and the database.
This approach has three key advantages:

- The model is no longer asked to "speak SQL." It describes intent within a set of constraints, and the AST becomes a contract - one that the model must satisfy, and that the system can reliably transform into executable queries.
- The system generalizes well and remains robust under schema changes.
- Edge cases - typos, missing values, partial matches - are handled with composable logic rather than one-off code.

Instead of writing more code for every new query type, we extend the AST, expand the parser, or update the mapping logic. The model's role is simplified: describe the what. The system handles the how.
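To make this concrete, here is what such an intent description might look like for a question like "Show me total sales by region last year." The node names and fields are illustrative assumptions, not the system's exact vocabulary:

```python
# A hypothetical AST for "Show me total sales by region last year".
# A sketch of the idea, not the actual schema used by the system.
query_ast = {
    "select": [
        {"column": "region"},
        {"aggregate": "sum", "column": "total_sales", "as": "sales"},
    ],
    "filters": [
        {"column": "order_date", "op": "between",
         "value": ["2024-01-01", "2024-12-31"], "cast": "date"},
    ],
    "group_by": ["region"],
    "order_by": [{"column": "sales", "direction": "desc"}],
}
```

Nothing here is SQL. It is a description of intent that the system can validate node by node before any query is built.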
This is the foundation on which the rest of the architecture is built. In the sections that follow, we’ll break down how we prepare data for querying, implement fuzzy search, and use the AST to generate high-performance SQL on the fly.
To enable natural language querying at scale, the system first needs to make sense of the underlying data. CSV and Excel files are simple in format but messy in practice - column names are inconsistent, types are ambiguous, and semantics are often implied rather than stated. The first step is to bring order to this mess. That means understanding the schema, assigning types, and cleaning the data before it ever reaches the model.
We start by analyzing the structure of the uploaded tabular file. Each column is inspected and passed through a language model that generates a column map - a structured representation of each column’s role in the dataset.
The column map includes:

- A normalized, canonical name for each column (e.g., `pr-num` → `product_number`)
- An inferred type for each column's values
- A description of each column's role in the dataset, along with synonyms a user might plausibly use for it

This map forms the backbone of the query interface. It defines how a user's natural language is grounded in the actual structure of the data. When someone says "sales by product," we don't need exact matches - we need intent matched to structure. The column map provides that bridge.
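As a sketch of what a generated column map might look like - the keys and structure here are assumptions for illustration, not the exact format:

```python
# A hypothetical column map, keyed by the raw header from the file.
column_map = {
    "pr-num": {
        "name": "product_number",  # normalized, canonical name
        "type": "text",
        "synonyms": ["product id", "item number", "sku"],
    },
    "revenue": {
        "name": "total_sales",
        "type": "number",
        "synonyms": ["income", "net sales", "sales"],
    },
    "order-date": {
        "name": "order_date",
        "type": "date",
        "synonyms": ["purchase date", "date of order"],
    },
}
```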
Once the structure is understood, the data itself must be normalized. We parse the tabular data according to the column map and cast each value to its correct type. Dates are standardized, booleans are inferred, numbers are normalized - even when the source data is inconsistent.
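A minimal version of that normalization pass might look like the following - a sketch that assumes the type vocabulary from the column map above; real parsing handles far more formats and edge cases:

```python
from datetime import datetime

def cast_value(raw: str, col_type: str):
    """Cast one raw cell to its mapped type; messy input is tolerated."""
    raw = raw.strip()
    if raw == "":
        return None                         # missing values become NULL
    if col_type == "number":
        return float(raw.replace(",", ""))  # "1,234.5" -> 1234.5
    if col_type == "boolean":
        return raw.lower() in {"true", "yes", "y", "1"}
    if col_type == "date":
        for fmt in ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y"):
            try:
                return datetime.strptime(raw, fmt).date().isoformat()
            except ValueError:
                continue
        raise ValueError(f"unrecognized date: {raw!r}")
    return raw                              # text passes through unchanged
```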
To support varied and heterogeneous data structures in a single system, we store rows as `jsonb` in PostgreSQL. This has a few key advantages:

- Tables with entirely different column maps can share a single storage layer
- Rows aren't locked to a rigid schema, so structural changes in the source data don't require migrations
- The format pairs naturally with PostgreSQL's indexing machinery
But storing data is only half the challenge. The real gains come from how we index it.
Traditional indexes in PostgreSQL assume every row shares a uniform structure. That assumption breaks down when the data varies across rows or tables. Instead, we use sparse GIN indexes - each one tailored to a specific column map or table structure.
This approach:

- Indexes only the rows that share a given structure, keeping each index small and fast
- Avoids forcing heterogeneous rows into a single uniform global index
- Lets new table structures be added without rebuilding existing indexes
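One plausible realization of this - with illustrative table and column names, assuming the psycopg 3 driver - is a partial GIN index scoped to a single column map:

```python
import psycopg

# Illustrative storage table: one jsonb row per record, tagged with the
# column map it belongs to.
create_table = """
CREATE TABLE IF NOT EXISTS rows (
    id            bigserial PRIMARY KEY,
    column_map_id integer NOT NULL,
    data          jsonb   NOT NULL
)
"""

# A "sparse" index: it covers only the rows belonging to one column
# map, so each index stays small and matches one specific structure.
create_index = """
CREATE INDEX IF NOT EXISTS rows_cm_42_gin
    ON rows USING gin (data jsonb_path_ops)
    WHERE column_map_id = 42
"""

with psycopg.connect("dbname=tabular") as conn:
    conn.execute(create_table)
    conn.execute(create_index)
```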
In short, we create a structured substrate for the LLM to work with - where every column has meaning, every value is typed, and every variation is accounted for.
Every fuzzy match is contextualized by the column map. We don’t just search blindly across values - we understand what each field represents, how it’s typed, and what synonyms it might include. This gives the fuzzy search engine semantic awareness: it knows that “revenue,” “income,” and “net sales” might all point to the same field, even if the user didn’t get the wording exactly right.
Fuzzy matching, in this system, isn’t an afterthought. It’s an integral part of making structured data feel conversational. It gives users the freedom to express intent naturally - without having to worry about exact spellings, field names, or database structure.
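A stripped-down sketch of column-map-aware matching, using only the standard library - a production system would combine this with database-side trigram search, and the helper and its names are illustrative:

```python
import difflib

def resolve_column(user_term: str, column_map: dict) -> str | None:
    """Map a user's (possibly misspelled) term to a canonical column name."""
    candidates: dict[str, str] = {}
    for raw_header, info in column_map.items():
        candidates[raw_header.lower()] = info["name"]
        candidates[info["name"].lower()] = info["name"]
        for synonym in info["synonyms"]:
            candidates[synonym.lower()] = info["name"]
    match = difflib.get_close_matches(
        user_term.lower(), candidates.keys(), n=1, cutoff=0.6)
    return candidates[match[0]] if match else None

cmap = {"revenue": {"name": "total_sales",
                    "synonyms": ["income", "net sales", "sales"]}}
resolve_column("net sales", cmap)  # -> "total_sales"
resolve_column("revenu", cmap)     # -> "total_sales" (typo tolerated)
```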
At the heart of this system is a simple goal: a user should be able to ask a question in plain language and receive a correct, structured answer - without needing to know anything about SQL, schemas, or data formats. But achieving this requires more than just translating English into code. It demands a robust communication protocol between the language model and the database.
Direct SQL generation is tempting. It gives the impression of intelligence - converting "Show me total sales by region last year" into a query that runs instantly. But in practice, it’s unreliable. Models hallucinate table names, confuse data types, and produce syntax errors. They don’t understand the boundaries of what’s possible in a given dataset - and they have no internal guardrails to stop them from trying.
Even when prompted with examples or schema documentation, the model is still guessing at query construction, often stringing together syntax that “looks right” rather than obeying logic that is right. And there’s little visibility into why a query was constructed the way it was, making debugging painful.
Instead of generating SQL directly, we ask the LLM to generate an Abstract Syntax Tree (AST) - a structured, typed representation of the intended query. Each node in the AST corresponds to a semantic unit: a filter, a group-by operation, a sorting directive, or an aggregation. It’s like asking the model to sketch the shape of the question, not write the exact code.
This AST is:

- Constrained - only a small vocabulary of operations and node types is allowed
- Typed - values and columns are validated against the column map before anything executes
- Deterministic to compile - the same tree always produces the same SQL
I experimented with PostgreSQL's native AST format but found it too complex and verbose for our needs. Instead, we built a simplified schema - purpose-built for clarity and transformation - that includes casting rules, fuzzy search markers, and aggregation hints.
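As an illustration of what such a simplified schema might contain - the class and field names are assumptions, not the actual format - a type-aware core could be as small as this:

```python
from dataclasses import dataclass, field
from typing import Literal, Optional, Union

ColumnType = Literal["text", "number", "date", "boolean"]

@dataclass
class Filter:
    column: str
    op: Literal["eq", "lt", "gt", "between", "fuzzy"]  # "fuzzy" marks similarity search
    value: Union[str, float, list]
    cast: Optional[ColumnType] = None  # explicit casting rule for the compiler

@dataclass
class Aggregation:
    func: Literal["sum", "avg", "count", "min", "max"]  # aggregation hint
    column: str
    alias: Optional[str] = None

@dataclass
class QueryAST:
    select: list[str] = field(default_factory=list)
    aggregations: list[Aggregation] = field(default_factory=list)
    filters: list[Filter] = field(default_factory=list)
    group_by: list[str] = field(default_factory=list)
    limit: Optional[int] = None
```

Because every node is typed, malformed output from the model fails validation before it ever touches the database.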
Once the AST is generated, we transform it into SQL through a deterministic compiler. This step enforces correctness: numeric values are treated as numbers, strings are quoted and escaped, and similarity operators are inserted where needed.
For example, if the AST specifies a fuzzy match on a customer name, the generated SQL includes both trigram similarity and Levenshtein distance. If a filter involves a date range, the compiler ensures correct casting and timezone handling.
This layer of transformation acts as a final safeguard - one that guarantees all queries are syntactically valid, structurally sound, and fully optimized for execution.
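To illustrate the compile step for a single filter node - a sketch assuming rows live in a `jsonb` column named `data`, and that the standard `pg_trgm` and `fuzzystrmatch` extensions provide `similarity()` and `levenshtein()`; thresholds and names are illustrative:

```python
def compile_filter(node: dict) -> tuple[str, list]:
    """Compile one AST filter node into a SQL fragment plus bound parameters."""
    # The column name was validated against the column map upstream;
    # values are always bound as parameters, never interpolated.
    expr = f"data ->> '{node['column']}'"
    if node.get("cast") == "number":
        expr = f"({expr})::numeric"
    elif node.get("cast") == "date":
        expr = f"({expr})::date"

    op, value = node["op"], node["value"]
    if op == "fuzzy":
        # Trigram similarity catches partial matches; Levenshtein
        # distance catches small typos.
        return (f"(similarity({expr}, %s) > 0.3"
                f" OR levenshtein(lower({expr}), lower(%s)) <= 2)",
                [value, value])
    if op == "between":
        return f"{expr} BETWEEN %s AND %s", list(value)
    ops = {"eq": "=", "lt": "<", "gt": ">"}
    return f"{expr} {ops[op]} %s", [value]

sql, params = compile_filter(
    {"column": "customer", "op": "fuzzy", "value": "Aplpe"})
# sql    -> (similarity(data ->> 'customer', %s) > 0.3 OR levenshtein(...) <= 2)
# params -> ['Aplpe', 'Aplpe']
```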
This design flips the standard model-query interaction on its head. Rather than asking the LLM to produce perfect output in a fragile format, we give it a well-defined space to express intent. The AST acts as both a constraint and a guide. It lets the model operate with freedom, but within safe and predictable boundaries.
The result is a system that’s fast, interpretable, and resilient to variation in user input. A question like “Which customers spent the most in Q4?” doesn’t require brittle pattern matching or a fragile SQL generation pipeline - it’s mapped to an AST, compiled to a query, and executed with confidence.
We began with a simple question: how can we let people talk to their data? What followed was a deeper recognition - that the answer doesn’t lie in more prompts, bigger models, or one-off hacks. It lies in architecture. In system design that respects the strengths of each component: the fluidity of natural language, the structure of tabular data, and the precision of relational databases.
This hybrid approach works because it acknowledges - and resolves - the key tensions in the problem space: natural language is fluid, tabular data is irregular, and relational databases demand precision.

At every layer, there's a balance between expressiveness and safety: the model is free to describe intent, while the AST, the type system, and the deterministic compiler ensure that what reaches the database is valid, bounded, and predictable.
This isn’t just a clever workaround - it’s a scalable pattern. One that’s modular, testable, and ready to be extended. Whether you’re working with a handful of spreadsheets or an enterprise-wide data lake, this architecture adapts. You can add support for relationships, plug in new data sources, or extend the AST with new operations. The foundation remains solid.
Natural language querying doesn’t have to be a black box. With the right abstractions, it becomes a clear, traceable, and robust layer in your data stack - one that truly unlocks the value of LLMs.
This system doesn’t just solve the immediate problem - it sets the stage for what comes next. It’s a strong, production-ready foundation: modular, schema-aware, and grounded in proven infrastructure. And because it’s built with flexibility at its core, it opens the door to deeper insight, more intelligent behavior, and richer interaction with data.