AI Models Disagree on Basic Facts Two-Thirds of the Time, Study Finds

A study published on 29 May 2026 found that five leading AI models reached conflicting conclusions on roughly two-thirds of a battery of real-world factual claims, raising pointed questions about the reliability of automated fact-checking tools now entering newsrooms, legal workflows, and government services.
The research, first reported by Decrypt, presented each model with 1,000 claims drawn from verifiable public records: dates, figures, legal precedents, and documented historical events. The models agreed with one another in only 33 percent of cases. The finding cuts against industry assurances that the latest generation of AI systems has moved beyond the hallucination-prone behaviour of earlier large language models.
The disagreement rate was not random noise. Researchers identified systematic patterns: models diverged most sharply on claims requiring contextual judgment—distinguishing primary sources from secondary ones, weighing competing official accounts, or parsing ambiguous phrasing. Factual claims that a human fact-checker might resolve in minutes produced wildly different outputs depending on which model was queried.
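For readers who want to see how an agreement figure of this kind can be computed, the following is a minimal Python sketch. It assumes that "agreement" means all five models returned the same verdict on a claim; the article does not say whether the study used unanimity or some other measure. The function name, verdict labels, and sample data are all hypothetical and are not taken from the study.

```python
from typing import Sequence


def unanimous_agreement_rate(verdicts: Sequence[Sequence[str]]) -> float:
    """Return the fraction of claims on which every model gave the same verdict.

    Each inner sequence holds one verdict per model for a single claim.
    Assumption: agreement means unanimity across all models.
    """
    if not verdicts:
        raise ValueError("no claims supplied")
    # A claim counts as agreed when its verdicts collapse to a single value.
    unanimous = sum(1 for row in verdicts if len(set(row)) == 1)
    return unanimous / len(verdicts)


# Invented toy data: three claims, five models each (not from the study).
sample = [
    ["true", "true", "true", "true", "true"],        # unanimous
    ["true", "false", "true", "true", "unclear"],    # split
    ["false", "false", "false", "true", "false"],    # one dissenter
]

rate = unanimous_agreement_rate(sample)
print(f"{rate:.0%}")  # prints "33%"
```

Under a looser definition, such as majority agreement, the third toy claim would also count as agreed, which is why the definition matters when comparing headline figures.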
The fact-checking automation question
The findings arrive as major technology platforms and media organisations have begun deploying AI systems to handle first-pass fact-checking at scale. The logic is straightforward: human fact-checkers are expensive and slow; models can process thousands of claims per hour. Several wire services have integrated AI-assisted verification into their publishing pipelines, and at least three Western governments have piloted AI tools for flagging misinformation in public communications.
The study does not address whether any single model performed better or worse overall—a limitation the authors acknowledge. What it documents is inconsistency: the technology as a category lacks the reliable output that deployment at institutional scale would seem to require. An AI system that produces conflicting verdicts depending on which underlying model is used cannot serve as a stable anchor for editorial or legal decisions.
The counterargument from AI developers is that the models tested represent a snapshot of a fast-moving technology. Capabilities improve with each generation, and newer systems already in development may narrow the disagreement gap. Industry representatives have pointed to benchmark improvements on standardised tests as evidence of accelerating reliability. Those benchmarks, critics note, are curated environments quite different from the messy, ambiguous factual terrain of real-world claims.
What the disagreement reveals
The study's design exposed something fundamental about how current AI systems handle factual uncertainty. Large language models are trained to produce probable continuations of text, not to retrieve verified propositions from a stable ground truth. When presented with a factual claim, the model generates a response shaped by patterns in its training data—which can include contested accounts, outdated information, and conflicting official versions of the same events.
This is distinct from the hallucination problem that dominated earlier AI discourse. Hallucinations are confident false statements. The disagreement documented in this study is more subtle: models are producing different outputs not because one is hallucinating but because the boundary between "fact" and "interpretation" is genuinely contested in the source material. An AI system cannot reliably adjudicate competing official accounts if it has no independent mechanism for weighing evidence.
The implications extend beyond newsrooms. Courts in multiple jurisdictions have begun receiving AI-assisted filings, and legal technology vendors market tools that purport to verify factual assertions in case law. If the underlying models cannot agree on what the established facts are, the risk of compounding error through automated legal research is substantial.
The infrastructure question
The study was not designed to name which specific models produced which outputs, a decision researchers said was intended to prevent competitive distortion of the results. What is clear is that the set tested represents the current frontier—systems marketed as state-of-the-art for reasoning and factual tasks. Their widespread commercial availability makes the disagreement rate particularly consequential.
Enterprises building workflows around AI fact-checking face a structural problem: there is no external verification layer that can authoritatively arbitrate which model's output is correct. The technology currently lacks the equivalent of a peer-review process for factual claims. That absence is not merely a technical gap; it is a governance one. Deploying a tool whose outputs are inconsistent across providers, with no mechanism to determine which output is accurate, transfers risk from a known human source to an unknown algorithmic one.
Stakes and what comes next
The study stops short of recommending a moratorium on AI fact-checking deployments, noting that the alternative, purely manual fact-checking, carries its own costs and delays. What it establishes is a performance bar that the industry has not yet demonstrably cleared.
The pressure to deploy is real and growing. News organisations face shrinking fact-checking resources even as content production accelerates. Governments processing high-volume public communications need automated tools. The commercial incentive to claim AI reliability is powerful, and the study suggests that those claims should be received with scepticism until the technology can demonstrate consistent agreement on verifiable ground truth.
The broader question is whether the disagreement documented in this study represents a transient phase in AI development or a structural feature of how large language models process factual claims. The answer will determine whether AI fact-checking becomes a reliable institutional tool or remains a promising technology whose deployment outpaces its demonstrated capability.
This desk reported on the study of 29 May 2026, first covered by Decrypt, which represents the most systematic cross-model comparison of factual claims published to date. Monexus will continue to follow AI reliability research as deployment scales across institutional contexts.