Disagreement among frontier LLMs on real-world fact-checks

kostaj 486 points 341 comments May 28, 2026

Discussion Highlights (20 comments)

kostaj

Author here. 67% (95% CI 64–70%) of 1,000 recent real user claims to a fact-checking platform had at least one of GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro+Search, and Sonar Pro dissent from the panel majority, or no majority formed at all. Panel-level Krippendorff's α (ordinal) = 0.639, i.e. nontrivial but limited agreement.

Quick context on what's in the writeup and what isn't:

- What's measured: parsed-label agreement between the 5 models. Forced 4-choice (True / Mostly True / Misleading / False), no Abstain. No LLM grader, no reference verdict; every number is direct label equality.
- What's not measured: which model is right. There's no ground truth in this paper. The 67% figure is a floor on rubric inconsistency (at least one model is label-inconsistent under the 4-bucket rubric on 67% of claims), not "model X is factually wrong on claim Y."
- Why not AVeriTeC / PolitiFact / SimpleQA: those have been public for years and almost certainly appear in current frontier training data, so measured disagreement on them confounds inference with memorization. This corpus is structurally fresh: recent user submissions, 180-day window, near-duplicates collapsed, never paired with canonical verdicts in any public training set.
- Our own platform's verdict is deliberately NOT used in this analysis. The paper measures frontier-panel disagreement only, not Lenz-vs-frontier.
- Follow-up in progress: human-labeling every claim in this corpus so we can evaluate both the panel and our own platform verdict against a human reference.

Critiques I'd most like to hear: (a) the iid CI assumption (Lenz claims cluster around topics and news events, so Wilson is probably optimistic), (b) ordinal-α vs alternatives for a 4-class ordered scale, (c) forced-choice vs allowing Abstain.

Permanent archive: https://doi.org/10.5281/zenodo.20344847
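For readers who want to sanity-check the headline numbers, here is a minimal Python sketch of the two pieces described above: the per-claim dissent check and a Wilson interval. It is not the paper's code. The function names are hypothetical, "majority" is assumed to mean a strict majority of the five labels, and the count of 670 is an assumption derived from "67% of 1,000" (the exact count is not given in the comment).

```python
from collections import Counter
from math import sqrt


def has_dissent(labels):
    """True if no strict majority exists, or any label differs from it.

    Assumption: "majority" means more than half of the panel's labels.
    """
    top_label, top_count = Counter(labels).most_common(1)[0]
    if top_count * 2 <= len(labels):
        return True  # no strict majority formed
    return any(label != top_label for label in labels)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (assumes iid trials)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumed count: 67% of 1,000 claims, i.e. roughly 670.
low, high = wilson_interval(670, 1000)
print(f"{low:.3f} to {high:.3f}")
```

Under this strict-majority reading, `has_dissent` is true exactly when the five labels are not unanimous. The interval also treats the 1,000 claims as independent, which is the assumption the author flags as probably optimistic for topic-clustered claims.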

christophilus

They get more human by the day.

ipunchghosts

I think ppl only care about how Claude or codex does.

spacebacon

And they could all see exactly why if they chose to. https://huggingface.co/spaces/RiverRider/srt-introspect

embedding-shape

> These aren't benchmark items with public answer keys — they're claims real users submitted for verification to a fact-checking platform. Cool. I wonder if anything of this matters when the authors don't disclose exactly how much of their report was written and made with LLMs in the first place? There even is a "11. Ethics & data use" section, and the research is about LLMs being infallible in some ways, yet the usage of LLMs for the production of this report isn't even mentioned once.

simonw

Here's the prompt they used:

Classify this claim as of <date>: "<atomic claim>" Output exactly one label: True, Mostly True, Misleading, or False. No explanations, no qualifiers.

The claims look like this: https://lenz.io/research/llm-disagreement/data.csv

I put that in Datasette Lite to make it easier to explore. Here's an example of a disagreement: https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...

The claim was "All almonds are grown in the U.S. state of California.". All but one model said False, Opus 4.7 said "misleading".

I feel like having "mostly true" and "misleading" in there weakens the story, especially given the "no explanations" rule in the prompt. The almond thing is false, but I'd argue that "misleading" might be defensible if you were to accompany it with "the majority of almonds are grown in California, but not all of them".

[ Update: OK, this almond thing was a bad example and I regret picking it. Read on for better ones. ]

The prompt lacks any kind of rubric to clarify how those terms should be applied. As is so often the case with this kind of study, it's an evaluation of the prompt and harness used by the study in addition to being an evaluation of the underlying models.

Update: here's a better example: "Incomplete Egypt visa application forms are among the most common reasons Egyptian visa applications are rejected." The models were split between "true" and "mostly true". Given the "among the most" language, either of those answers means effectively the same thing.

Update 2: a much better example: "On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia". The only correct answer to that, if you don't have a search tool, is "this claim is impossible for me to verify". And that wasn't an option. The answers were split between true and false: https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...
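To make the harness point concrete, here is a rough Python sketch of how the prompt quoted in the comment above could be filled in and how a reply could be parsed into one of the four labels. The function names, the single-line layout of the prompt, and the parsing rules (case-insensitive exact match after trimming quotes and periods) are all assumptions for illustration; the study's actual parser is not described in this thread.

```python
# The four labels from the quoted prompt; there is no Abstain option.
LABELS = ("True", "Mostly True", "Misleading", "False")


def build_prompt(claim, as_of):
    """Fill the quoted template. Line breaks in the real prompt are unknown."""
    return (
        f'Classify this claim as of {as_of}: "{claim}" '
        "Output exactly one label: True, Mostly True, Misleading, or False. "
        "No explanations, no qualifiers."
    )


def parse_label(response):
    """Map a raw reply to a canonical label, or None if it is not a label.

    Hypothetical rule: trim whitespace, quotes and periods, then compare
    case-insensitively against the four allowed labels.
    """
    text = response.strip().strip(".\"'").casefold()
    for label in LABELS:
        if text == label.casefold():
            return label
    return None


prompt = build_prompt("All almonds are grown in the U.S. state of California.", "2026-05-28")
print(prompt)
print(parse_label("misleading"))
```

A reply such as "this claim is impossible for me to verify" parses to `None` here, which illustrates the gap the comment describes: with a forced four-way choice, that answer has no bucket.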

apples_oranges

That's better than all agreeing on the wrong answer, however.

f_devd

Inject some adversarial priming as is in actual usage, and you can probably get that number to >=95%

andai

This is an odd one. The paper is real, but was written by Claude? I am assuming OP is human, but also appears to be using Claude to post.

bobosmrad

Looking at the claims, I would say 5 humans would disagree even more than the LLMs. Some of the claims where the LLMs disagree:

"On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia."

"The slogan "Simon Go Back" was chanted in opposition to the Simon Commission in British India (1928–1930)."

"Neptune Deep will start delivering natural gas in 2027."

"A hotel villa in Kyrgyzstan displayed a sign stating 'no Jews, no dogs'."

"Donald Trump said that an attack on Iran was postponed at the request of Gulf allies."

proofofcontempt

What does this show that we didn't know already? LLMs cannot provide accurate answers to questions where data is not included in their training sets. This doesn't appear to have much substance

throw310822

Not sure I'm understanding this. The models are asked to evaluate the truth of random claims out of their own head (except for Gemini with search grounding)? Isn't it exactly the same as asking people to play any quiz game and then rating them as "they disagree n% of the time"? The output buckets are also pretty questionable: the difference between "True" and "Mostly true" is pretty fuzzy. Is this marked as a "disagreement"?
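On the last question: the author's comment says every number is direct label equality, so a True vs Mostly True split would count toward the 67% figure. The Python sketch below contrasts that strict comparison with a simple rank gap on the 4-point scale. The rank order and both function names are assumptions, and the rank gap is only a stand-in to show the idea of ordinal weighting; it is not the distance used by Krippendorff's ordinal α, which depends on label frequencies.

```python
# Assumed ordering of the four labels, from least to most true.
SCALE = ["False", "Misleading", "Mostly True", "True"]


def strict_disagree(a, b):
    """Direct label equality: any difference counts as a disagreement."""
    return a != b


def rank_gap(a, b):
    """Number of steps between two labels on the assumed ordered scale."""
    return abs(SCALE.index(a) - SCALE.index(b))


pairs = [("True", "Mostly True"), ("True", "False")]
for a, b in pairs:
    print(a, "vs", b, "| strict:", strict_disagree(a, b), "| gap:", rank_gap(a, b))
```

Both pairs are disagreements under strict equality, but the first is one step apart and the second is three, which is the distinction an ordinal agreement statistic is meant to capture.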

bayarearefugee

(Brought to you by) Lenz...? a crummy commercial...? ...son of a bitch

Razengan

Recently, in May 2026, I asked ChatGPT 5.5 High to search for flights to a certain city that has had a new airport since like December 2025. It said the airport code didn't exist. I mean, I get the "knowledge cut off date" and whatnot, but for that sort of thing, you'd think they'd check live information before gaslighting the user, especially since it's a "live" task anyway.

rastrojero2000

Given that models are fundamentally incapable of comprehending what truths or falsehoods are beyond their location in their self made representational space, it's actually pretty impressive that they managed to make it not a cointoss. That 17% right there is thousands of man-hours poured over making the word vomiting process slightly closer to whatever their little ports say is happening in reality.

utopiah

Don't forget, people: Goodhart's law will make this "benchmark" moot in weeks if not days. It will get integrated back into the fold, it will look "solved" but there will still be no reasoning, just more statistical technical correctness because light has been shone on a new "problem" to solve. It will then be clamored as great "progress" that will "change everything". PS: yes, I might or might not have a degree in corporate strategy & PR.

thegrim33

"None of these claims is older than February 15, 2026" All of the models they tested were trained on data from before February 15th ... being asked specific questions about things that happened after they were trained.

alvis

The problem is that it's testing claims (or some people would prefer calling them "truths") without much context. Take just one random example: `Hostels in Kota, Rajasthan commonly use caged ceiling fans as a preventive measure against student suicides` While `Hostels in Kota, Rajasthan commonly use caged ceiling fans` may be a verifiable fact (though I doubt there are any statistics for verification, but let's say there are), `a preventive measure against student suicides` is a claim that no one can prove. It can be just a belief at most. Arh. Did Biden steal Trump's 2nd term? Truth or fact or claim?

fergie

Personally I find that every llm I use is unable to consistently identify the latest npm version numbers of the node packages that I use.

cm2187

Only had a brief look at the “facts” that were submitted for checking; many are quite political, where two fact-checking organisations of opposite political persuasion would probably disagree more often than 67%.
