AI
April 9, 2026

6 min

Data Documentation Is AI Infrastructure

An experiment quantifies how data documentation drives consistent AI analytics outcomes across natural language queries. It shows documentation has a far greater impact than prompt phrasing, especially as query complexity increases.

By Alex Mentch

Consistent AI analytics starts with your data dictionary

We ran an experiment to quantify something most data teams already suspect: documentation matters a lot for AI-powered analytics. Here's what we found.

Most people building natural language-to-SQL tools have an intuition that documentation helps. We wanted to put a number on it.

At TRM Labs, we're building AI-powered analytics tooling for blockchain data. When an analyst asks a question in natural language and gets data back, the answer needs to be the same regardless of how they phrase it or which agent instance handles the request. So we designed an experiment to measure how much table documentation affects consistency, and whether other factors like prompt phrasing matter as much as we expected.

The short version: documentation had a massive effect. The other things we tested mattered less than we thought.

What we tested

We spawned isolated Claude Sonnet agents and gave each one the same analytics question against our BigQuery data warehouse. Each agent was a completely independent session with no shared context and no memory of what other agents did. Every agent had full tool access: it could discover table schemas, write SQL queries, execute them, inspect results, and iterate until satisfied.

We varied two things:

  • Documentation: Some agents received column-level schema documentation explaining what each table and field means. Others got only the dataset names and had to figure everything out through tool use.
  • Prompt phrasing: Some agents received identical prompts. Others received semantically equivalent rephrasings, meaning the same question worded 20 different ways.
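As a sketch, the two documentation conditions might be set up like this. The table name, column names, and descriptions below are illustrative stand-ins, not TRM's actual schema:

```python
# Illustrative schema docs for the two experimental conditions.
# "Documented" agents get column-level descriptions in context;
# "undocumented" agents get dataset names only.

SCHEMA_DOCS = {
    "analytics.daily_flows": {
        "description": "One row per (entity, day) with aggregated transfer volume.",
        "columns": {
            "incoming_volume": "Total value received by the entity that day, in USD.",
            "outgoing_volume": "Total value sent by the entity that day, in USD.",
            "entity_type": "Category of the entity, e.g. 'exchange' or 'mixer'.",
        },
    },
}

def build_context(documented: bool) -> str:
    """Render the schema context string handed to an agent."""
    if not documented:
        # Undocumented condition: names only; the agent must discover
        # everything else through tool use.
        return "Available datasets: " + ", ".join(SCHEMA_DOCS)
    lines = []
    for table, meta in SCHEMA_DOCS.items():
        lines.append(f"Table {table}: {meta['description']}")
        for col, desc in meta["columns"].items():
            lines.append(f"  - {col}: {desc}")
    return "\n".join(lines)
```

The point of the split is that both conditions expose the same tables; only the documented condition says what the columns mean.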

We tested at two difficulty levels — a straightforward aggregation and a multi-filter analytical query — with 100 agents per scenario. Then we measured numeric agreement: did independent agents return the same numbers?
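The agreement metric itself is simple: the fraction of agents whose answer matches the most common answer. A minimal version (our own sketch, not TRM's exact scoring code):

```python
from collections import Counter

def numeric_agreement(results: list[float]) -> float:
    """Fraction of agents whose answer matches the modal (most common) answer."""
    if not results:
        return 0.0
    _, modal_count = Counter(results).most_common(1)[0]
    return modal_count / len(results)

numeric_agreement([1250.0, 1250.0, 1250.0, 1180.0])  # → 0.75
```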

What we found

| Scenario | Simple Query | Complex Query |
| --- | --- | --- |
| Documented, same prompt | 100% | 99% |
| Documented, rephrased | 99% | 62% |
| Undocumented, same prompt | 36% | 15% |
| Undocumented, rephrased | 47% | 16% |

With documentation, every agent returned identical numbers on the simple question — 100% agreement despite writing syntactically different SQL. On the complex question, 99 out of 100 agreed.

Without documentation, agreement dropped sharply even when agents received the exact same prompt. Agents could discover and inspect table schemas on their own, but they interpreted the same columns differently. One agent would count page visits, another would count rendered views, a third would count tracked events. All reasonable interpretations of "daily active users", and all producing different numbers.

One result we didn't expect: for undocumented tables, rephrased prompts actually produced higher consistency than the canonical prompt (47% vs 36% on the simple query). Different phrasings of the same question act as different search strategies for schema discovery. Some wordings happen to contain keywords that map well to table names, guiding agents more reliably to the right data.

Prompt phrasing mattered less than we expected for simple questions. Documented agents held at 99% across 20 different wordings. For complex queries, though, rephrasings caused a meaningful drop, from 99% down to 62%. When a question involves multiple filters and implicit assumptions, small differences in wording led agents to make different choices about which filters to apply.

Why documentation has such a large effect

The underlying issue is that schema discovery and documentation answer different questions. Discovering a schema tells you what columns exist. Documentation tells you what they mean and how they should be used.

An agent can see that a table has incoming_volume and outgoing_volume columns. Without documentation, it has to guess which one answers a question about "flow from A to B". Is that outgoing from A's perspective, or from B's? Both are plausible.

We saw this play out at every level. Without documentation, agents couldn't consistently agree on which table to query (38-74% table agreement vs. 96-100% with docs). And even when undocumented agents happened to pick the same table, they still disagreed on filters and aggregation logic.

This is the same kind of ambiguity that trips up human analysts joining a new team. The difference is that AI operates at scale, so the ambiguity shows up as inconsistent results across users rather than a Slack thread asking "Hey, which table should I use for this?"

Practical takeaways

Document your data layer before pointing AI at it. This probably isn't surprising, but the magnitude might be. We went from 36% to 100% agreement on simple queries just by adding column-level descriptions. Even basic documentation makes a large difference: what a metric means, what default filters should be applied, which table to use for which type of question.

Don't over-invest in prompt engineering for simple questions. For straightforward aggregations, documentation quality matters much more than how the question is phrased. Twenty different wordings produced 99% agreement when the tables were well-documented.

For complex queries, have the agent ask instead of guess. The 62% agreement on complex questions came from agents silently making different assumptions. Is "flow from mixers to exchanges" the outgoing volume from the mixer's perspective, or the incoming volume at the exchange? Both are defensible, and both produce different numbers. An agent that pauses to ask "which direction do you mean?" avoids this entirely. It's what a good analyst would do when a question is ambiguous.

Ensemble approaches could work for the remaining variance. Even at 62% individual agreement on complex rephrased questions, running a small group of agents and taking the majority answer pushes effective reliability to around 85%. It's a practical pattern for queries where consistency matters most.
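The exact lift from ensembling depends on how the disagreeing answers scatter. A quick Monte Carlo sketch under one simple model of our own (each agent returns the modal answer with probability 0.62, otherwise one of a few scattered alternatives; these assumptions are illustrative, not measured):

```python
import random
from collections import Counter

def ensemble_reliability(p: float, n_agents: int, n_wrong: int = 3,
                         trials: int = 20_000, seed: int = 0) -> float:
    """Estimate the probability that a plurality vote over n_agents recovers
    the modal answer, when each agent independently returns it with
    probability p and otherwise one of n_wrong scattered alternatives."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        votes = Counter()
        for _ in range(n_agents):
            if rng.random() < p:
                votes["modal"] += 1
            else:
                votes[f"wrong_{rng.randrange(n_wrong)}"] += 1
        ranked = votes.most_common(2)
        top, top_count = ranked[0]
        # Count a win only when the modal answer has a strict plurality.
        if top == "modal" and (len(ranked) == 1 or top_count > ranked[1][1]):
            wins += 1
    return wins / trials
```

With p = 0.62 and five agents, this model lands in the mid-to-high 70s; how close the real number gets to ~85% depends on how finely the wrong answers scatter, since wrong answers that never collide can't outvote the modal one.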

Wrapping up

None of this is particularly surprising on its own. Most data teams would guess that documentation helps AI tools work better. What we found useful was quantifying the effect and seeing where the breakdowns actually happen. The gap between documented and undocumented is large enough that it changes how we prioritize work: writing good data documentation isn't just a nice-to-have for onboarding new analysts, it's a prerequisite for reliable AI-powered analytics.

We're continuing to run experiments like this at TRM as we build out our AI tooling. If you're working on similar problems, we'd be curious to hear what you're finding.
