Voice AI Systems Are Vulnerable to Hidden Audio Attacks
SVI
114 points
30 comments
May 18, 2026
Discussion Highlights (10 comments)
naveenraj-17
I believe that depends entirely on how the AI models store voices in their neural networks. If we could debug that, we would be able to send a secret sound that an AI model could understand because of its internal connections, but that makes no sense to us. Until then, there's no harm, in my view.
leonulicnik
Does this transfer to Whisper / CLAP-type audio models, or is it ASR-decoder-specific? Whisper would be interesting given how widely it's used in prod.
nine_k
Isn't this the "adversarial image" attack, well known from (earlier) visual recognition models [1]? That would be quite an obvious vector.
[1]: https://www.science.org/content/article/turtle-or-rifle-hack...
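For readers unfamiliar with the adversarial-example idea referenced in this comment, here is a minimal sketch of a gradient-sign (FGSM-style) perturbation. It is not the attack from the linked article or from the submission: the linear "model", its weights, the input values, and `epsilon` are all hypothetical stand-ins chosen to show how a small, bounded change to every input value can flip a decision.

```python
# Toy illustration of a gradient-sign (FGSM-style) adversarial perturbation.
# The "model" is a hypothetical linear scorer, not a real ASR or vision system.

def score(weights, x):
    """Linear score: positive means class A, negative means class B."""
    return sum(w * xi for w, xi in zip(weights, x))

def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def fgsm_perturb(weights, x, epsilon):
    """Shift each input value by at most epsilon so that the score drops.

    For a linear model the gradient of the score with respect to x is the
    weight vector itself, so the strongest bounded change is
    -epsilon * sign(w) per element.
    """
    return [xi - epsilon * sign(w) for w, xi in zip(weights, x)]

weights = [0.5, -0.25, 0.75, -0.5]   # hypothetical model parameters
x = [0.2, 0.1, 0.1, 0.05]            # hypothetical "clean" input samples
x_adv = fgsm_perturb(weights, x, epsilon=0.1)

clean_score = score(weights, x)      # positive: classified as A
adv_score = score(weights, x_adv)    # negative: classified as B
max_change = max(abs(a - b) for a, b in zip(x, x_adv))

print(clean_score, adv_score, max_change)
```

No single input value moves by more than `epsilon`, yet the sign of the score changes. Real attacks on deep audio or image models apply the same principle using gradients obtained by backpropagation.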
wutwutwat
Related: Benn Jordan shows how to poison-pill music against AI harvesting it for training: "The Art Of Poison-Pilling Music Files" https://www.youtube.com/watch?v=xMYm2d9bmEA
moffkalast
Phreaking is back on the menu, boys.
JoblessWonder
This isn't fully "hidden", but I've always wondered if AI scraping is the reason why short-form videos on YouTube/TikTok/Instagram featuring film/TV clips will sometimes have 2 audio tracks: one with the actual audio from the clip, a little louder, and one with a computer-generated narrator providing running commentary on what is happening and why. As a human I'm able to tune it out, but it is very weird/jarring. In case anyone hasn't had the displeasure of viewing these, I'll link some in a comment below once I scroll through my feed and find one.
1minusp
The Bene Gesserit have entered the chat!
juvoly
> "Audio modality is really challenging to comprehend because of how limited our hearing is"
Would it help to significantly lower the hearing capabilities of the AI system? At Juvoly, we always encouraged GPs to invest in a high-quality microphone like the Jabra Speak, connected through USB. A good mic results in much better audio transcriptions, but maybe that was all for the wrong reasons?
joshstrange
I'd like to commend Apple on being ahead of the curve with this kind of attack, I don't think Siri is susceptible to this at all. Mostly due to it not being able to hear/understand what I say in the best of times /s
It's insane to me how much of a nose-dive Siri or any Apple-based STT takes when there is _any_ noise in the background. I like to play music at low levels in my house just as background noise and I've noticed that if I'm playing any music my STT just goes to complete shit (often missing the last 2-3+ words and messing up things in the middle).
On the other hand, in the exact same environment, Parakeet v3 (via MacWhisper) has zero issues even with background noise.
ericmcer
Isn't this an attack on transcribers, not on "Voice AI systems"? ASR transcribers predate LLMs and all the AI hype. If you are transcribing audio from unknown sources and feeding the output to agents that can perform authorized actions on your behalf, you are kind of screwed anyway. I guess it would be dangerous if you tricked authorized users into playing the sounds in the background while transcribing something.
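The concern in this comment, transcripts from unknown audio driving agents with real permissions, can be made concrete with a small policy sketch. Everything here is an illustrative assumption rather than any product's actual design: the action names, the two policy sets, and the `decide` function are hypothetical. The idea is simply that text produced by a transcriber is treated as untrusted input, and sensitive actions need both a trusted audio source and explicit user confirmation.

```python
# Hypothetical sketch: treat ASR output as untrusted input to an agent.
# Action names and the policy sets below are illustrative assumptions.

SAFE_ACTIONS = {"get_weather", "set_timer"}
SENSITIVE_ACTIONS = {"send_payment", "unlock_door"}

def decide(action, source_trusted, user_confirmed=False):
    """Return 'allow', 'confirm', or 'deny' for an action derived from a transcript."""
    if action in SAFE_ACTIONS:
        # Low-risk actions run regardless of where the audio came from.
        return "allow"
    if action in SENSITIVE_ACTIONS:
        if not source_trusted:
            # Audio of unknown origin can never trigger a sensitive action.
            return "deny"
        # Even trusted audio needs an out-of-band confirmation step.
        return "allow" if user_confirmed else "confirm"
    # Unknown actions are rejected by default.
    return "deny"

print(decide("set_timer", source_trusted=False))
print(decide("send_payment", source_trusted=False))
print(decide("send_payment", source_trusted=True))
```

A default-deny policy like this does not stop the transcriber from being fooled; it only limits what a fooled transcript is allowed to do.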