TADA: Speech generation through text-acoustic synchronization
smusamashah
97 points
25 comments
March 11, 2026
Related Discussions
Found 5 related stories in 59.6ms across 3,471 title embeddings via pgvector HNSW
- Cohere Transcribe: Speech Recognition gmays · 177 pts · March 31, 2026 · 53% similar
- Show HN: Free audiobooks with synchronized text for language learning floo · 11 pts · March 11, 2026 · 52% similar
- Show HN: Audio Toolkit for Agents stevehiehn · 55 pts · March 01, 2026 · 50% similar
- Speaking of Voxtral Palmik · 18 pts · March 26, 2026 · 50% similar
- Show HN: Audio-to-Video with LTX-2 runshouse · 22 pts · March 02, 2026 · 49% similar
Discussion Highlights (8 comments)
OutOfHere
Will this run on CPU? (as opposed to GPU)
qinqiang201
Could it run on a MacBook? Or only on a GPU device?
earthnail
I don’t understand the approach:

> TADA takes a different path. Instead of compressing audio into fewer fixed-rate frames of discrete audio tokens, we align audio representations directly to text tokens — one continuous acoustic vector per text token. This creates a single, synchronized stream where text and speech move in lockstep through the language model.

So basically just concatenating the audio vectors without compression or discretization? I haven’t read the full paper yet (I know, I should before commenting), but this explanation puzzles me.
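The quoted alignment scheme can be pictured with a toy sketch. This is my own illustration, not code from the paper: all names and dimensions are hypothetical. The idea is that each text token carries exactly one continuous acoustic vector, so the two modalities are fused position-wise into a single stream of the same length, rather than appending a longer sequence of discrete audio tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical, not from the paper).
n_tokens, d_text, d_acoustic = 6, 8, 4

text_emb = rng.standard_normal((n_tokens, d_text))      # one embedding per text token
acoustic = rng.standard_normal((n_tokens, d_acoustic))  # one continuous acoustic vector per token

# 1:1 text-acoustic alignment: fuse the modalities position-wise into a
# single synchronized stream. Sequence length stays n_tokens; only the
# channel dimension grows, so text and speech "move in lockstep".
stream = np.concatenate([text_emb, acoustic], axis=-1)

print(stream.shape)  # (6, 12)
```

On this reading, the answer to the comment is roughly yes: no extra compression or discretization step, just per-token continuous vectors riding alongside the text embeddings.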
microtherion
To me, the speech sounds impressively expressive, but there is something off about the audio quality that I can't quite put my finger on. The "Anger Speech" has an obvious lisp (Maybe a homage to Elmer Fudd?). But I hear a similar, but more subtle, speech impediment in the "Adoration Speech". The "Fearful Speech" might have a slight warble to it. And the "Long Speech" is difficult to evaluate because the speaker has vocal fry to an extent that I find annoying.
mpalmer
"Long speech" is a faithful synthesis of a fairly irritating modern American English speech pattern.
tcbrah
The 0.09 RTF is wild, but I wonder how much of that speed advantage disappears once you need voice cloning or fine-grained prosody control. I use Cartesia Sonic for TTS in a video pipeline, and the thing that actually matters for content creation isn't raw speed - it's whether you can get consistent emotional delivery across 50+ scenes without it drifting. The 1:1 text-acoustic alignment should help with hallucinations for sure, but does it handle things like mid-sentence pauses or emphasis on specific words? That's where most open-source TTS falls apart, IMO.
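For readers unfamiliar with the metric: RTF (real-time factor) is wall-clock synthesis time divided by the duration of the generated audio, so values below 1.0 mean faster-than-real-time generation. A minimal helper (my own, not from the TADA codebase) makes the arithmetic behind the 0.09 figure concrete:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / duration of generated audio.
    RTF < 1.0 means the model generates speech faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds

# At RTF 0.09, producing 10 s of speech takes about 0.9 s of compute.
print(real_time_factor(0.9, 10.0))  # ≈ 0.09
```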
ilaksh
Okay, so they say text continuation only, without fine-tuning. I assume that means we can't use it as a replacement for TTS in an AI agent chat, because it won't work without enough context? Could you maybe trick it into thinking it was continuing a sample for an assistant use case, if the sample was generic enough? I appreciate them being honest about it, though, because otherwise I might have spent two days trying to make it work.
kavalg
MIT license; supported languages beyond English: ar, ch, de, es, fr, it, ja, pl, pt. https://huggingface.co/HumeAI/tada-3b-ml https://github.com/HumeAI/tada