Show HN: Open-source playground to red-team AI agents with exploits published
We build runtime security for AI agents. The playground started as an internal tool that we used to test our own guardrails. But we kept finding the same types of vulnerabilities because we think about attacks a certain way. At some point you need people who don't think like you. So we open-sourced it. Each challenge is a live agent with real tools and a published system prompt. Whenever a challenge is over, the full winning conversation transcript and guardrail logs get documented publicly. Building the general-purpose agent itself was probably the most fun part. Getting it to reliably use tools, stay in character, and follow instructions while still being useful is harder than it sounds. That alone reminded us how early we all are in understanding and deploying these systems at scale. First challenge was to get an agent to call a tool it's been told to never call. Someone got through in around 60 seconds without ever asking for the secret directly (which taught us a lot). Next challenge is focused on data exfiltration with harder defences: https://playground.fabraix.com
Discussion Highlights (2 comments)
hellocr7
I have tried to manipulate it using base64 encoding and translaion into other languages which didnt work so far but seems to be that llm as a judge is a very fragile defence for this. Would be cool to add a leaderboard though
Mooshux
Good timing on this. Red-teaming agents pre-production is underrated and most teams skip it entirely. One thing that keeps coming up: even when red-teaming surfaces credential exfiltration vectors, the fix is usually reactive (rotate the key, patch the prompt). The more durable approach is limiting what the credential can do in the first place. Scoped per-agent keys mean a successful attack through one of these exploits can only reach what that agent was authorized to touch. The exfiltration path exists, but the payload is bounded. We built around this pattern: https://www.apistronghold.com/blog/stop-giving-ai-agents-you...