Tools like ChatGPT, Claude, and Gemini are quickly becoming a new “front door” to information. For those working to democratize access to high-quality data, this is a real opportunity: Large language models (LLMs) can help people more easily derive insights from public data through tools they’re already using. But how well do LLMs handle questions grounded in public data? To find out, we tested them.

We curated 100 questions written from the perspectives of parents, advocates, researchers, data analysts, Capitol Hill staffers, and education or career advisors, across 10 education and workforce topics, to ask the LLMs (GPT-5.2, Claude Sonnet 4.5, and Gemini 3 Flash). Each prompt was sent as a standalone query with no custom system prompt, conversation history, or tool access beyond what each model provides by default, approximating the out-of-the-box experience a typical user would have had in early 2026.

From this exercise, we found four significant limitations in current LLM performance:

  • Models explained concepts well but struggled to retrieve and use specific data.
  • Incorrect information was difficult to detect.
  • Models answered different questions than the ones asked.
  • Pointing models to the right sources and tools didn’t improve results.

These findings aren’t unique to education and workforce pathway data. Any organization that stewards public data—such as integrated data systems, federal statistical programs, or city open data portals—likely faces similar gaps. As AI reshapes how people find and use public information, the quality of what these systems deliver depends on the data infrastructure underneath them. Ensuring the infrastructure is AI ready will shape whether AI expands access to trustworthy public data or erodes it.

How Well Do LLMs Handle Public Data Questions?

Our evaluation of LLM behavior when prompted with queries whose answers rely on public education and workforce data raised four concerns for the responses’ accuracy and usefulness.

Models explained concepts well but struggled to retrieve and use specific data.

When asked to define a term or broadly describe where to find data, models often produced accurate, useful responses. But the quality dropped sharply when questions required retrieving specific data, like enrollment at a particular school or earnings for graduates of a specific program. The quality dropped further still for tasks that required generating code or artifacts.

Read the full article about AI's answers to public data questions by Erika Tyagi, Kristin Blagg, and Emily Gutierrez at Urban Institute.