TADA: Speech generation through text-acoustic synchronization

smusamashah 97 points 25 comments March 11, 2026
www.hume.ai

Discussion Highlights (8 comments)

OutOfHere

Will this run on CPU? (as opposed to GPU)

qinqiang201

Could it run on a MacBook? Or only on a GPU device?

earthnail

I don’t understand the approach.

> TADA takes a different path. Instead of compressing audio into fewer fixed-rate frames of discrete audio tokens, we align audio representations directly to text tokens — one continuous acoustic vector per text token. This creates a single, synchronized stream where text and speech move in lockstep through the language model.

So basically just concatenating the audio vectors without compression or discretization? I haven’t read the full paper yet (I know, I should before commenting), but this explanation puzzles me.
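One way to picture the quoted description is the toy sketch below. It is not TADA's implementation: the function names, the 4-dimensional vectors, the random placeholder values, and the 12.5 frames-per-second comparison rate are all assumptions chosen for illustration. It only shows the shape of the idea, where sequence length follows the number of text tokens rather than the audio duration.

```python
import math
import random


def synchronized_stream(text_tokens, acoustic_dim=4, seed=0):
    """Pair every text token with exactly one continuous acoustic vector.

    The vectors are random placeholders standing in for whatever a real
    acoustic encoder would produce; only the 1:1 pairing is the point.
    """
    rng = random.Random(seed)
    return [
        (token, [rng.gauss(0.0, 1.0) for _ in range(acoustic_dim)])
        for token in text_tokens
    ]


def fixed_rate_frame_count(audio_seconds, frames_per_second):
    """Number of frames a fixed-rate audio tokenizer would emit."""
    return math.ceil(audio_seconds * frames_per_second)


tokens = ["Hello", ",", " world", "!"]  # hypothetical tokenization
stream = synchronized_stream(tokens)

# One acoustic vector per text token, so the stream is as long as the text.
print(len(stream), "synchronized steps")

# A fixed-rate scheme scales with audio duration instead (12.5 fps is assumed).
print(fixed_rate_frame_count(2.0, 12.5), "fixed-rate frames for 2 s of audio")
```

Under these assumed numbers, the synchronized stream has 4 steps while the fixed-rate scheme needs 25 frames for the same two seconds of audio.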

microtherion

To me, the speech sounds impressively expressive, but there is something off about the audio quality that I can't quite put my finger on. The "Anger Speech" has an obvious lisp (Maybe a homage to Elmer Fudd?). But I hear a similar, but more subtle, speech impediment in the "Adoration Speech". The "Fearful Speech" might have a slight warble to it. And the "Long Speech" is difficult to evaluate because the speaker has vocal fry to an extent that I find annoying.

mpalmer

"Long speech" is a faithful synthesis of a fairly irritating modern American English speech pattern.

tcbrah

The 0.09 RTF is wild, but I wonder how much of that speed advantage disappears once you need voice cloning or fine-grained prosody control. I use Cartesia Sonic for TTS in a video pipeline, and the thing that actually matters for content creation isn't raw speed; it's whether you can get consistent emotional delivery across 50+ scenes without it drifting. The 1:1 text-acoustic alignment should help with hallucinations for sure, but does it handle things like mid-sentence pauses or emphasis on specific words? That's where most open-source TTS falls apart, IMO.
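For readers unfamiliar with the metric: RTF (real-time factor) is conventionally the time spent synthesizing divided by the duration of the audio produced, so values below 1 are faster than real time. The small sketch below works through that arithmetic for the 0.09 figure mentioned above. The function names and the 60-second clip length are illustrative assumptions, not part of any TADA or Cartesia API.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(rtf: float, audio_seconds: float) -> float:
    """Estimated wall-clock seconds needed to generate a clip at a given RTF."""
    return rtf * audio_seconds


# Hypothetical 60-second clip at the quoted RTF of 0.09.
estimate = synthesis_time(0.09, 60.0)
print(f"about {estimate:.1f} s to generate 60 s of audio")
```

At that rate a one-minute clip would take roughly 5.4 seconds to generate, assuming the RTF holds for longer inputs and on the reader's hardware.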

ilaksh

Okay, so they say it only does text continuation without fine-tuning. I assume that means we can't use it as a replacement for TTS in an AI agent chat, because it will not work without enough context? Could you maybe trick it into thinking it was continuing a sample for an assistant use case if the sample was generic enough? I appreciate them being honest about it, though, because otherwise I might have spent two days trying to make it work.

kavalg

MIT license. Supported languages beyond English: ar, ch, de, es, fr, it, ja, pl, pt.

https://huggingface.co/HumeAI/tada-3b-ml
https://github.com/HumeAI/tada
