Show HN: Spec27 – Spec-driven validation for AI agents

njyx 13 points 9 comments April 30, 2026
www.spec27.ai · View on Hacker News

Hi HN! We’re a team of ML validation specialists and we’ve been building /Spec27, a tool for testing whether AI agents still do their job safely and reliably as models, prompts, tools, and surrounding systems change. We started working on this because a lot of current LLM evaluation work seems aimed at scoring general model behavior, while many teams are deploying systems that have a specific mission to fulfill. Many of the tools also assume you have full access to the agent stack and traces so you can place SDKs and Gateways, but a lot of agents are being created on vendor platforms where this isn’t possible. As a result, we approaches it from the outside in: all tests just run to the primary interfaces of an Agent and don’t assume anything about internals. The other important things about the approach is spec-driven. Instead of treating testing as a one-off benchmark or static eval set, we let teams define reusable specifications for the behavior they want from an agent, then generate tests against those specs. With this you can automatically generate adversarial and robustness checks, so you can see what an agent is sensitive to and what kinds of changes cause it to fail. We’ve worked on validation for other AI systems before, including vision and tabular workflows, and /Spec27 is our new product for language-model-based agents. Currently in early access, so we’d love feedback! The current version is strongest for single-turn agent and application validation. We do not fully support multi-turn interactions yet, and better telemetry/tool-call integration is still on our roadmap. We’ve made the product open to try for HN readers, with a sample flow so it’s easy to poke around without much setup. We’d especially love feedback from people deploying internal agents, vendor agents, or other AI systems where reliability matters more than benchmark scores.

Discussion Highlights (5 comments)

_mikz

Hey! Michal from the engineering team behind here. There are some painful experiences from the journey - async in Django, background processing in Python, scaling agent workflows with growing codebase. Happy to talk!

jovanca_

Hi! Jovanca from Spec27 team here. We started building this because agent safety/validation still feels pretty undercooked in practice. Interested in how people here think about it :D

chesh

I get so mad when responses from chat agents hallucinate. If this can rebuild trust in the results I will give Spec27 a try

eloycoto

I really like the judge from here: https://docs.spec27.ai/docs/guides/judges I didn't see any example of the full flow, do you have anything that I can see/explore?

Aniloid2

Hey, I’m Brian from Research at Spec27. I’ve been working on some of the adversarial robustness techniques in the backend and am currently working on the multi-turn extension. I’d be happy to talk about what I’ve learned and hear any suggestions!

Semantic search powered by Rivestack pgvector
8,303 stories · 78,303 chunks indexed