Agent Reading Test

kaycebasques 60 points 18 comments April 06, 2026
agentreadingtest.com


Discussion Highlights (10 comments)

kaycebasques

See also https://dacharycarey.com/2026/04/06/designing-agent-reading-...

dostick

The tests should have negative weights based on how often each issue is encountered and its impact. Test 2 (SPI) should carry something like 8 negative points out of 10, since it's the most common blocker. And the whole test should be scored inversely.
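The weighting dostick suggests could be sketched like this. The issue names and point values below are hypothetical, chosen only to illustrate the inverse, impact-weighted scheme; the published test simply counts canaries found:

```python
# Hypothetical impact weights per failure mode; higher = more common/blocking.
WEIGHTS = {
    "truncation": 3,
    "spi": 8,          # dostick's "most common blocker"
    "content_negotiation": 5,
}

def inverse_score(failed_checks):
    """Return a penalty score: higher means the agent is more broken.

    Unknown failure modes fall back to a weight of 1.
    """
    return sum(WEIGHTS.get(check, 1) for check in failed_checks)

print(inverse_score(["spi", "truncation"]))  # 11
```

An agent that fails only rare checks would score near zero, while one tripping the common blockers would be heavily penalized, which is the ranking dostick is after.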

massimoto

Would love to see some results for different providers. The tests look super logically thought out, but could use a TL;DR (too lazy; didn't run) output.

Claude Web Opus 4.6 Extended: 14 / 20 points
x: CANARY-SPA-JSONLY-prism
x: CANARY-CONNEG-MD-sigma

theyCallMeSwift

I love this idea, but have a hypothesis that 90% of agents that people actually use today would fail this test inadvertently (false negative). Industry best practice + standard implementation for most agents right now is to do web browsing / fetching via subagents. Their output is summarized using a cheaper model and then passed back to the parent. It's very unlikely that without preserving the actual content the subagents see that the `CANARY-` strings would be found in the output. Any thoughts on how you'd change the test structure with this in mind?
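One way to address the architecture theyCallMeSwift describes is to scan the raw page for verbatim canary tokens before the lossy summarization step and re-attach them to the subagent's summary. A minimal sketch, assuming a Python harness where the summarizer is any callable (e.g. a call to a cheaper model); the canary pattern is inferred from the token names seen in this thread:

```python
import re

# Matches tokens like CANARY-TRUNC-10K-fox (pattern assumed from examples).
CANARY_RE = re.compile(r"CANARY-[A-Z0-9-]+-[a-z]+")

def summarize_with_canaries(raw_page, summarize):
    """Wrap a lossy subagent summarizer so verbatim CANARY- tokens survive.

    The tokens are extracted from the raw page and appended to the summary,
    so the parent agent can still report them even though the summary itself
    discards most of the page.
    """
    canaries = CANARY_RE.findall(raw_page)
    summary = summarize(raw_page)
    if canaries:
        summary += "\n\nVerbatim tokens seen on page: " + " ".join(canaries)
    return summary

page = "Intro text CANARY-TRUNC-10K-fox followed by long API docs ..."
print(summarize_with_canaries(page, lambda p: p[:10]))
```

This keeps the cheap-summarizer design intact while making the pipeline honest about what the subagent actually read, which is arguably what the test is probing in the first place.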

numeri

11/20 for qwen/qwen3.5-flash-02-23 in Claude Code, with effort set to low.

throwatdem12311

What a great target for someone to hack and add some secret prompt injections into.

refulgentis

You're doing god's work, thanks. (There are a lot of shitty agents, and more to come.) (And I'm a lot more confident in my impl now: 17/20.)

lucb1e

I don't understand. For the first task it says:

> URL: <https://...docs...> What parameters does the Create Stream endpoint accept?

The answer that I would give is `name`, `description`, `retention_days`, and `tags`. What the answer sheet <https://agentreadingtest.com/answers.json> has is: `CANARY-TRUNC-10K-fox` ("Early in the page. All agents should find this."), `CANARY-TRUNC-40K-river`, `CANARY-TRUNC-75K-summit`, etc. These words appear on the page, but why would the LLM's output include them? The first one appears before the API endpoint subpath specification, and the second in the middle of a word in the description. They do not answer the test question of what parameters are supported.

A later test is to see if it can deal with broken pages ("an unclosed ``` fence", specifically). Wouldn't it not echo those tokens if it can deal with seemingly erroneous strings on the page? How is this test supposed to work?
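One possible reading of the scoring model lucb1e asks about: the literal question is a decoy, and the canaries mark which regions of the page the agent actually ingested. Under that reading, a scorer would just count which expected tokens appear verbatim in the agent's transcript. A minimal sketch; the answers.json field names here are assumed for illustration, not taken from the real file:

```python
def score_output(agent_output, answers):
    """Count how many expected canary strings appear verbatim in the
    agent's transcript.

    `answers` is a list of dicts with a "canary" field (field name assumed).
    The returned tuple is (found, total), matching the "14 / 20" style
    results quoted elsewhere in the thread.
    """
    expected = [a["canary"] for a in answers]
    found = [c for c in expected if c in agent_output]
    return len(found), len(expected)

answers = [
    {"canary": "CANARY-TRUNC-10K-fox"},
    {"canary": "CANARY-TRUNC-40K-river"},
]
out = "The endpoint accepts name, description, ... CANARY-TRUNC-10K-fox"
print(score_output(out, answers))  # (1, 2)
```

If this reading is right, the test would presumably need to instruct the agent to echo any unusual tokens it sees, since (as lucb1e notes) a well-behaved agent answering only the literal question would never emit them.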

hettygreen

At this point I wonder if AIs get updated just to recognize and deal with specific tests like this. Compared to solving the root issues, it's gotta be easier to add a few extra lines of code to intervene if someone asks about walking or driving to the carwash, or wants to know how many "r"s are in the word strawberry. I wonder if AI is the opaque, interesting tech it claims to be, or also thousands of extra if statements catching known/published/problematic/embarrassing inconsistencies. Anyone here work for any of the big AI companies? Is it just one big black box, or a black box with thousands of intervention points and guardrails?

psychomfa_tiger

The Hawthorne effect finding is the part that stuck with me. Agents performing differently just because the framing suggests evaluation tracks with what I see daily: rephrase the same task slightly and you get measurably different output quality. Redesigning from /tests/ to /tasks/ and "we're testing the docs, not you" is borrowed straight from usability research, and it's funny that it works on LLMs too. The pipeline vs agent distinction is also something I hit all the time but never articulated that cleanly. The agent says it followed a redirect when it actually just manually fetched the target.
