Show HN: Agent-skills-eval – Test whether Agent Skills improve outputs

darkrishabh 68 points 34 comments May 07, 2026
github.com

Discussion Highlights (9 comments)

egeozcan

Are there any published results gathered using this?

ssgodderidge

The example model in the documentation is 4o-mini; you might want to update that to a more recent model. As an aside, 4o-mini came out months before agent skills were released… I'm curious how well it does at choosing to load skills in the first place.

ianhxu

How do you iterate on the judge prompt? Is there an auto rater?

ChairmanLmao

Depending on the skill, Claude already does this when creating new skills with their skill-creator skill (what a sentence); it's pretty neat. It creates ~6 subagents with and without the skill and judges whether they differ in performance.
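The with/without comparison described in this comment can be sketched in a few lines. This is a minimal illustration, not the skill-creator's or agent-skills-eval's actual implementation: `ArmResult` and `compare_arms` are hypothetical names, and the pass/fail verdicts are assumed to come from some judge that is not shown here.

```python
from dataclasses import dataclass


@dataclass
class ArmResult:
    """Pass/fail outcomes for one arm of the comparison (hypothetical type)."""
    label: str
    passed: list[bool]  # one judge verdict per subagent run

    @property
    def pass_rate(self) -> float:
        # An empty arm is reported as 0.0 rather than raising ZeroDivisionError.
        return sum(self.passed) / len(self.passed) if self.passed else 0.0


def compare_arms(with_skill: ArmResult, without_skill: ArmResult) -> dict:
    """Summarize how the with-skill arm did relative to the baseline arm."""
    return {
        "with_skill": with_skill.pass_rate,
        "without_skill": without_skill.pass_rate,
        "delta": with_skill.pass_rate - without_skill.pass_rate,
        "runs_per_arm": (len(with_skill.passed), len(without_skill.passed)),
    }


if __name__ == "__main__":
    # Made-up verdicts for three runs per arm, purely for illustration.
    summary = compare_arms(
        ArmResult("with skill", [True, True, False]),
        ArmResult("without skill", [True, False, False]),
    )
    print(summary)
```

With only a handful of runs per arm, a difference in pass rate is a noisy signal, so the run counts are returned alongside the delta to keep that visible.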

hiroto_lemon

Having token counts surfaced for each side in the report would be super useful.

reedlaw

I'm skeptical skills will outperform training given that Opus 4.7 already ignores a 720-byte CLAUDE.md telling it to use tidewave (a Rails MCP server with 6 tools) for db queries. When I asked a new claude session about a record it called

> Bash(DATABASE_URL=$(grep -E '^DATABASE_URL=' .env 2>/dev/null | head -1) echo "ok")

even though I have in CLAUDE.md:

> For database queries, use tidewave first.

I then prompted:

> use tidewave as per CLAUDE.md. also diagnose why you failed to heed that

> ● Diagnosis first: I defaulted to shell habits (env grep → psql) instead of pausing to recall the CLAUDE.md rule that tidewave is the first-line DB tool. The trigger was "look at this record" — I should have read that as "run a SQL query" and reached for tidewave immediately.

If Opus 4.7 doesn't follow simple CLAUDE.md instructions, I'm not sure what benefits other markdown files could bring. I don't trust Opus's own explanation, but it could point to the fact that the system prompt for bash is much longer than CLAUDE.md with tidewave.

While LLM judging could be helpful, I think the tool-call assertions (https://github.com/darkrishabh/agent-skills-eval#what-you-ge...) may be the most useful thing in agent-skills-eval given that it's the only objective measure of compliance.
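To make the idea of a tool-call assertion concrete, here is a minimal sketch of the kind of check this comment describes. It does not use agent-skills-eval's API: the `ToolCall` trace format, the `preferred_tool_used_first` helper, and the tool name `tidewave_execute_sql` are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str      # tool name as recorded in the trace (hypothetical format)
    argument: str  # raw argument string passed to the tool


def preferred_tool_used_first(trace, preferred_prefix, fallback_name):
    """Return True if a preferred tool is called before any fallback call.

    Returns False if the fallback tool appears first, or if the preferred
    tool is never called at all.
    """
    for call in trace:
        if call.name.startswith(preferred_prefix):
            return True
        if call.name == fallback_name:
            return False
    return False


if __name__ == "__main__":
    # A trace shaped like the failure described above: shell first.
    ignored_rule = [
        ToolCall("Bash", "grep -E '^DATABASE_URL=' .env"),
        ToolCall("tidewave_execute_sql", "SELECT 1"),  # hypothetical tool name
    ]
    followed_rule = [ToolCall("tidewave_execute_sql", "SELECT 1")]

    print(preferred_tool_used_first(ignored_rule, "tidewave", "Bash"))
    print(preferred_tool_used_first(followed_rule, "tidewave", "Bash"))
```

The check is deliberately crude: it treats any earlier `Bash` call as a violation, whereas a stricter assertion would inspect the arguments to decide whether the shell call was actually a database query. Its appeal is that the result depends only on the recorded trace, not on a judge model's opinion.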

scosman

Why so narrowly eval just with/without skill? Same approach is useful for everything: model, params, prompt, sub-agents, skills, rag, etc?

TheGRS

This is all still really early stuff, but there was a blog post yesterday that got me thinking we need a way to send telemetry data for work being done by agents out to a central agent the org controls. It would be responsible for creating skills based on the work people are doing, or in other words, the stuff they're correcting the agents on. And then you could develop skills for an entire department (customer service, engineering, marketing, etc.). This tool has me thinking there's some merit to setting that up. My only real qualm is that I'm not super convinced skills are that great yet. I'm trying to get better at developing them in my workflow, but I still get a lot of results where they are ignored even after spending time trying to tighten them up.

codecheers

With-skill vs without-skill evals are useful, but what about comparing skills against each other? Is there an emerging standard for saying one Skill is better than another, beyond custom pass/fail evals?
