TADA: Speech generation through text-acoustic synchronization
smusamashah
97 points
25 comments
March 11, 2026
Related Discussions
Found 5 related stories in 59.6ms across 3,471 title embeddings via pgvector HNSW
- Cohere Transcribe: Speech Recognition gmays · 177 pts · March 31, 2026 · 53% similar
- Show HN: Free audiobooks with synchronized text for language learning floo · 11 pts · March 11, 2026 · 52% similar
- Show HN: Audio Toolkit for Agents stevehiehn · 55 pts · March 01, 2026 · 50% similar
- Speaking of Voxtral Palmik · 18 pts · March 26, 2026 · 50% similar
- Show HN: Audio-to-Video with LTX-2 runshouse · 22 pts · March 02, 2026 · 49% similar
Discussion Highlights (8 comments)
OutOfHere
Will this run on CPU? (as opposed to GPU)
qinqiang201
Could it run on a MacBook? Or only on a GPU device?
earthnail
I don’t understand the approach:

> TADA takes a different path. Instead of compressing audio into fewer fixed-rate frames of discrete audio tokens, we align audio representations directly to text tokens — one continuous acoustic vector per text token. This creates a single, synchronized stream where text and speech move in lockstep through the language model.

So basically just concatenating the audio vectors without compression or discretization? I haven’t read the full paper yet (I know, I should before commenting), but this explanation puzzles me.
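The quoted alignment scheme can be pictured with a toy sketch. This is my own illustration, not code from the paper: all names and dimensions are hypothetical. The idea is that each text token carries exactly one continuous acoustic vector, so the two modalities are fused position-wise into a single stream of the same length, rather than appending a longer sequence of discrete audio tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical, not from the paper).
n_tokens, d_text, d_acoustic = 6, 8, 4

text_emb = rng.standard_normal((n_tokens, d_text))      # one embedding per text token
acoustic = rng.standard_normal((n_tokens, d_acoustic))  # one continuous acoustic vector per token

# 1:1 text-acoustic alignment: fuse the modalities position-wise into a
# single synchronized stream. Sequence length stays n_tokens; only the
# channel dimension grows, so text and speech "move in lockstep".
stream = np.concatenate([text_emb, acoustic], axis=-1)

print(stream.shape)  # (6, 12)
```

On this reading, the answer to the comment is roughly yes: no extra compression or discretization step, just per-token continuous vectors riding alongside the text embeddings.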
microtherion
To me, the speech sounds impressively expressive, but there is something off about the audio quality that I can't quite put my finger on. The "Anger Speech" has an obvious lisp (Maybe a homage to Elmer Fudd?). But I hear a similar, but more subtle, speech impediment in the "Adoration Speech". The "Fearful Speech" might have a slight warble to it. And the "Long Speech" is difficult to evaluate because the speaker has vocal fry to an extent that I find annoying.
mpalmer
"Long speech" is a faithful synthesis of a fairly irritating modern American English speech pattern.
tcbrah
The 0.09 RTF is wild, but I wonder how much of that speed advantage disappears once you need voice cloning or fine-grained prosody control. I use Cartesia Sonic for TTS in a video pipeline, and the thing that actually matters for content creation isn't raw speed - it's whether you can get consistent emotional delivery across 50+ scenes without it drifting. The 1:1 text-acoustic alignment should help with hallucinations for sure, but does it handle things like mid-sentence pauses or emphasis on specific words? That's where most open-source TTS falls apart, IMO.
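For readers unfamiliar with the metric: RTF (real-time factor) is wall-clock synthesis time divided by the duration of the generated audio, so values below 1.0 mean faster-than-real-time generation. A minimal helper (my own, not from the TADA codebase) makes the arithmetic behind the 0.09 figure concrete:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / duration of generated audio.
    RTF < 1.0 means the model generates speech faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds

# At RTF 0.09, producing 10 s of speech takes about 0.9 s of compute.
print(real_time_factor(0.9, 10.0))  # ≈ 0.09
```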
ilaksh
Okay, so they say text continuation only, without fine-tuning. I assume that means we can't use it as a replacement for TTS in an AI agent chat, because it won't work without enough context? Could you maybe trick it into thinking it was continuing a sample for an assistant use case, if the sample was generic enough? I appreciate them being honest about it, though, because otherwise I might have spent two days trying to make it work.
kavalg
MIT license; supported languages beyond English: ar, ch, de, es, fr, it, ja, pl, pt. https://huggingface.co/HumeAI/tada-3b-ml https://github.com/HumeAI/tada