Show HN: Agent-skills-eval – Test whether Agent Skills improve outputs
darkrishabh
68 points
34 comments
May 07, 2026
Related Discussions
- Show HN: A prompt that builds the most capable AI agent system · fainir · 12 pts · March 28, 2026 · 63% similar
- Show HN: Skrun – Deploy any agent skill as an API · frizull · 50 pts · April 08, 2026 · 62% similar
- Show HN: Real-time observability for coding agents · vtemian · 14 pts · March 17, 2026 · 60% similar
- Show HN: Understudy – Teach a desktop agent by demonstrating a task once · bayes-song · 96 pts · March 12, 2026 · 59% similar
- Show HN: 49Agents – Infinite canvas IDE for AI agents · alpadurza · 16 pts · April 28, 2026 · 58% similar
Discussion Highlights (9 comments)
egeozcan
Are there any published results gathered using this?
ssgodderidge
The example model in the documentation is 4o-mini, you might want to update that to a more recent model. As an aside, 4o-mini came out months before agent skills were released… I’m curious how it performs with choosing to load skills in the first place?
ianhxu
How do you iterate on the judge prompt? Is there an auto rater?
ChairmanLmao
Depending on skill, Claude already does this when creating new skills with their skill-creator skill (what a sentence), it's pretty neat. It creates ~6 subagents with and without the skill and judges if they differ in performance.
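The with/without comparison described above can be sketched as a difference in pass rates. This is a minimal illustration, not the skill-creator's or agent-skills-eval's actual logic: the function names are hypothetical, and the per-run verdicts are made-up values standing in for whatever a judge would return.

```python
def pass_rate(verdicts: list[bool]) -> float:
    """Fraction of runs the judge marked as passing."""
    if not verdicts:
        raise ValueError("need at least one run")
    return sum(verdicts) / len(verdicts)


def skill_uplift(with_skill: list[bool], without_skill: list[bool]) -> float:
    """Pass rate with the skill minus pass rate without it."""
    return pass_rate(with_skill) - pass_rate(without_skill)


# Made-up judge verdicts for six runs per arm (illustrative only).
with_skill = [True, True, True, False, True, True]
without_skill = [True, False, False, True, False, True]

print(f"uplift: {skill_uplift(with_skill, without_skill):+.2f}")  # uplift: +0.33
```

With only a handful of runs per arm, a gap of this size can easily come from run-to-run noise, so a real comparison would likely need more samples before reading anything into it.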
hiroto_lemon
Having token counts surfaced on each side of the report would be super useful.
reedlaw
I'm skeptical skills will outperform training given that Opus 4.7 already ignores a 720-byte CLAUDE.md telling it to use tidewave (a Rails MCP server with 6 tools) for db queries. When I asked a new claude session about a record it called
> Bash(DATABASE_URL=$(grep -E '^DATABASE_URL=' .env 2>/dev/null | head -1) echo "ok")
even though I have in CLAUDE.md:
> For database queries, use tidewave first.
I then prompted:
> use tidewave as per CLAUDE.md. also diagnose why you failed to heed that
> ● Diagnosis first: I defaulted to shell habits (env grep → psql) instead of pausing to recall the CLAUDE.md rule that tidewave is the first-line DB tool. The trigger was "look at this record" — I should have read that as "run a SQL query" and reached for tidewave immediately.
If Opus 4.7 doesn't follow simple CLAUDE.md instructions, I'm not sure what benefits other markdown files could bring. I don't trust Opus's own explanation, but it could point to the fact that the system prompt for bash is much longer than CLAUDE.md with tidewave.
While LLM judging could be helpful, I think the tool-call assertions ( https://github.com/darkrishabh/agent-skills-eval#what-you-ge... ) may be the most useful thing in agent-skills-eval given that it's the only objective measure of compliance.
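A tool-call assertion of the kind described above can be sketched as a check over the recorded sequence of tool calls. This is a hedged illustration, not agent-skills-eval's actual API: the `ToolCall` record, the `violates_tidewave_first` helper, and the `tidewave.query` tool name are all hypothetical, and the shell heuristics (`DATABASE_URL`, `psql`) are taken only from the transcript quoted in the comment.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str     # hypothetical tool label, e.g. "Bash" or "tidewave.query"
    command: str  # raw command or argument text passed to the tool


def violates_tidewave_first(calls: list[ToolCall]) -> bool:
    """Return True if a shell-based DB access happens before any tidewave call."""
    for call in calls:
        if call.tool.startswith("tidewave"):
            return False  # the rule was followed before any shell DB access
        # Assumed heuristic for "touching the database from the shell".
        if call.tool == "Bash" and (
            "DATABASE_URL" in call.command or "psql" in call.command
        ):
            return True
    return False  # no database access seen at all


# The behavior reported in the comment: shell first, tidewave only afterwards.
observed = [
    ToolCall("Bash", "DATABASE_URL=$(grep -E '^DATABASE_URL=' .env | head -1)"),
    ToolCall("tidewave.query", "SELECT 1"),
]
print(violates_tidewave_first(observed))  # True
```

Unlike a judge model's verdict, a check like this gives the same answer every time for the same transcript, which is the sense in which it is an objective measure of compliance.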
scosman
Why so narrowly eval just with/without skill? Same approach is useful for everything: model, params, prompt, sub-agents, skills, rag, etc?
TheGRS
This is all still really early stuff, but there was a blog yesterday that got me thinking we need a way to send telemetry data for work being done by agents out to a central agent the org controls. It would be responsible for creating skills based on the work people are doing - or in other words the stuff they're correcting the agents on. And then you could develop skills for an entire department (customer service, engineering, marketing, etc). This tool has me thinking there's some merit to setting that up. My only real qualm is that I'm not super convinced skills are that great yet. I'm trying to get better at developing them in my workflow, but still get a lot of results where they are ignored even after spending time trying to tighten them up.
codecheers
With-skill vs without-skill evals are useful, but what about comparing skills against each other? Is there an emerging standard for saying one Skill is better than another, beyond custom pass/fail evals?