The Context Problem: Why AI Database Agents Hallucinate — and How Query Logs Might Fix It

Miro's data team discovered that AI agents querying live databases failed more often than not — not because of weak models, but because they lacked the contextual scaffolding that human analysts rely on. The fix may be sitting in every organization's SQL query logs.

By Moemedi Michael Poncana | Global | 4-minute read | 28 May 2026

When Miro's data engineering team began pointing AI agents directly at their Snowflake data warehouse, the results were humbling. Across a series of internal benchmarks, the agents returned wrong answers more than 65 percent of the time, with failures ranging from incorrect aggregations to entirely hallucinated table joins. The models were not the problem. The context was.

The finding, documented by Miro's data team and reported by VentureBeat on 28 May 2026, represents a concrete data point in a broader reckoning inside engineering organizations: autonomous agents are powerful in sandboxed demonstrations and brittle in production environments where database schemas are sprawling, naming conventions are inconsistent, and the semantic relationships between tables require institutional knowledge the model never had.

The Context Gap in Database Agents

Modern AI database agents operate by translating natural-language questions into SQL queries. The architecture is elegant in theory: a model receives a user prompt, infers intent, generates a query against the target schema, and returns results. In practice, this pipeline assumes the model understands the database's logical structure — which tables exist, how they relate, what column names mean in context, and which historical queries established the patterns analysts use to validate outputs.

None of that context arrives with the prompt. When an agent receives a question like "What was our revenue growth in Southeast Asia last quarter?", it must infer which tables contain revenue data, how geography is encoded, and what definition of "growth" the organization typically uses — decisions that a human analyst makes from years of accumulated database familiarity.

Miro's team found that simply increasing model capability did not close this gap. Upgrading from one frontier model to another left error rates largely unchanged. The agents were not failing because they were not smart enough. They were failing because they were flying blind.

Query Logs as Institutional Memory

The intervention Miro's team tested was straightforward in concept: feed the agent historical SQL queries alongside the current question. Every organization with a mature analytics practice generates a continuous stream of these logs — records of which questions were asked, which queries were written to answer them, and which results were accepted or corrected. That history encodes exactly the institutional knowledge the agent lacks.

When Miro's agents received relevant prior queries as additional context, error rates dropped substantially. The agents could see how a previous analyst had structured a similar revenue question, which tables they had joined, and how they had filtered by region. The hallucinated table joins disappeared not because the model improved, but because the agent had access to the patterns humans had already established.

This approach — using query logs as a retrieval-augmented context layer — reframes the problem of agent reliability. Rather than requiring models to memorize database schemas during training, organizations can expose agents to the living record of how those schemas are actually used. The logs function as a form of organizational memory that survives staff turnover and documents the implicit conventions no schema diagram captures.
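To make the idea concrete, here is a minimal Python sketch of a query-log retrieval layer. It is not Miro's implementation, which the source does not describe in detail. The log entries, table names, and the `LoggedQuery`, `retrieve`, and `build_prompt` helpers are all hypothetical, and simple word overlap (Jaccard similarity) stands in for whatever retrieval method a production system would use, such as embedding search.

```python
import re
from dataclasses import dataclass


@dataclass
class LoggedQuery:
    question: str  # natural-language question the analyst was answering
    sql: str       # query the analyst wrote and accepted


# Hypothetical log entries; table and column names are invented for illustration.
QUERY_LOG = [
    LoggedQuery(
        "Total revenue by region for last quarter",
        "SELECT r.region_name, SUM(o.amount) FROM orders o "
        "JOIN regions r ON o.region_id = r.id GROUP BY r.region_name",
    ),
    LoggedQuery(
        "Monthly active users by plan",
        "SELECT plan, COUNT(DISTINCT user_id) FROM events GROUP BY plan",
    ),
    LoggedQuery(
        "Revenue growth in Europe quarter over quarter",
        "SELECT fiscal_quarter, SUM(amount) FROM orders o "
        "JOIN regions r ON o.region_id = r.id "
        "WHERE r.region_name = 'Europe' GROUP BY fiscal_quarter",
    ),
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two token sets: 0.0 if disjoint, 1.0 if identical."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve(question: str, log: list[LoggedQuery], k: int = 2) -> list[LoggedQuery]:
    """Return the k logged queries whose questions best match the new one."""
    q_tokens = tokenize(question)
    ranked = sorted(
        log,
        key=lambda entry: jaccard(q_tokens, tokenize(entry.question)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, examples: list[LoggedQuery]) -> str:
    """Assemble the retrieved queries and the new question into one prompt."""
    parts = ["Prior analyst queries that may be relevant:"]
    for ex in examples:
        parts.append(f"-- Question: {ex.question}\n{ex.sql}")
    parts.append(f"New question: {question}\nWrite one SQL query that answers it.")
    return "\n\n".join(parts)


if __name__ == "__main__":
    new_question = "What was our revenue growth in Southeast Asia last quarter?"
    print(build_prompt(new_question, retrieve(new_question, QUERY_LOG)))
```

In this toy log, the Southeast Asia question retrieves the two revenue queries and skips the unrelated active-users query, so the agent would see which tables earlier analysts joined and how they filtered by region before it writes its own SQL.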

What This Means for the Autonomous Agent Thesis

The broader case for AI agents rests on the premise that automation can extend from simple, well-defined tasks into the complex, judgment-laden work of knowledge workers. Database querying sits at an awkward midpoint: structured enough to be automatable in principle, but dependent on contextual understanding that current models cannot reliably derive from schema metadata alone.

Miro's findings suggest a viable path forward that does not require waiting for the next generation of foundation models. Organizations with established analytics practices already possess the key input — years of logged queries that represent hard-won institutional knowledge. Making that knowledge accessible to agents is primarily an engineering and data architecture challenge, not a frontier model research problem.

The implications for enterprise AI deployment are not trivial. If database agents can be stabilized through context enrichment rather than capability scaling, the path to production reliability shortens considerably. Engineering teams do not need to wait for a breakthrough in reasoning; they need to build better retrieval pipelines around their existing query history.

Open Questions and Industry Adoption

Several caveats deserve attention. Miro's benchmarks were conducted against a specific database architecture and a defined set of query types. Whether the gains from query-log augmentation generalize across different schema designs, different domain vocabularies, or different agent architectures remains an open empirical question. The AI engineering community has not yet converged on standardized benchmarks for database agent reliability, making cross-organization comparisons difficult.

There is also the question of log quality. Query logs encode both correct patterns and legacy workarounds — queries written to compensate for schema design decisions that have since been corrected. An agent retrieving historical context without curation may inherit the habits of analysts who were working around database limitations, amplifying errors rather than correcting them.
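One form of curation can be sketched in a few lines of Python: drop any logged query that touches a table the data team has marked as deprecated. This is an illustrative assumption, not a step the source attributes to Miro. The `DEPRECATED_TABLES` set and the table names are hypothetical, and the regular expression is a rough heuristic that ignores CTEs, quoted identifiers, and expressions such as `EXTRACT(... FROM ...)`; a real pipeline would more likely use a SQL parser.

```python
import re

# Hypothetical list of tables retired after a schema cleanup.
DEPRECATED_TABLES = {"orders_legacy"}

# Rough heuristic: capture the identifier that follows FROM or JOIN.
TABLE_PATTERN = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def referenced_tables(sql: str) -> set[str]:
    """Return the lowercase table names that appear after FROM or JOIN."""
    return {name.lower() for name in TABLE_PATTERN.findall(sql)}


def curate(logged_sql: list[str], deprecated: set[str]) -> list[str]:
    """Keep only the queries that reference no deprecated table."""
    return [sql for sql in logged_sql if not (referenced_tables(sql) & deprecated)]


if __name__ == "__main__":
    log = [
        "SELECT * FROM orders_legacy WHERE amount > 0",
        "SELECT SUM(o.amount) FROM orders o JOIN regions r ON o.region_id = r.id",
    ]
    for sql in curate(log, DEPRECATED_TABLES):
        print(sql)
```

A filter like this only catches workarounds that can be tied to a known table name. Legacy habits expressed as odd filters or manual corrections inside an otherwise current query would still need human review.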

Still, the core insight — that the hallucination problem in database agents is fundamentally a context problem, not a capability problem — offers a productive reframe. For organizations evaluating autonomous AI systems for data work, the relevant engineering question may not be which model to choose, but how to surface the institutional knowledge their existing query logs already contain.

This article was drafted from a single primary source. Monexus will update as additional benchmarks and independent replications become available.