"Disregard That" Attacks

leontrolski 47 points 26 comments March 25, 2026

Discussion Highlights (7 comments)

lmm

The bowdlerisation of today's internet continues to annoy me. To be clear, the joke is traditionally "HAHA DISREGARD THAT, I SUCK COCKS".

arijun

I mean, no security is perfect, it's just trying to be "good enough" (where "good enough" varies by application). If you've ever downloaded and used a package using pip or npm and used it without poring over every line of code, you've opened yourself up to an attack. I will keep doing that for my personal projects, though. I think the question is, how much risk is involved and how much do those mitigating methods reduce it? And with that, we can figure out what applications it is appropriate for.

wenldev

I think a big part of mitigating this will probably be requiring multiple agents to think and achieve consensus before significant actions. Like planes with multiple engines

stingraycharles

I didn’t see the article talk specifically about this, or at least not in enough detail, but isn’t the de-facto standard mitigation for this to use guardrails which lets some other LLM that has been specifically tuned for these kind of things evaluate the safety of the content to be injected? There are a lot of services out there that offer these types of AI guardrails, and it doesn’t have to be expensive. Not saying that this approach is foolproof, but it’s better than relying solely on better prompting or human review.

simojo

Today I scheduled a dentist appointment over the phone with an LLM. At the end of the call, I prompted it with various math problems, all of which it answered before politely reminding me that it would prefer to help me with "all things dental." It did get me thinking the extent to which I could bypass the original prompt and use someone else's tokens for free.

marcus_holmes

The hypothetical approach I've heard of is to have two context windows, one trusted and one untrusted (usually phrased as separating the system prompt and the user prompt). I don't know enough about LLM training or architecture to know if this is actually possible, though. Anyone care to comment?

kstenerud

There are two primary issues to solve: 1: Protecting against bad things (prompt injections, overeager agents, etc) 2: Containing the blast radius (preventing agents from even reaching sensitive things) The companies building the agents make a best-effort attempt against #1 (guardrails, permissions, etc), and nothing against #2. It's why I use https://github.com/kstenerud/yoloai for everything now.

"Disregard That" Attacks

Discussion Highlights (7 comments)

Related Discussions