Cohere Transcribe: Speech Recognition
gmays
177 points
54 comments
March 31, 2026
Related Discussions
Found 5 related stories in 49.9ms across 3,471 title embeddings via pgvector HNSW
- TADA: Speech generation through text-acoustic synchronization smusamashah · 97 pts · March 11, 2026 · 53% similar
- Show HN: Free audiobooks with synchronized text for language learning floo · 11 pts · March 11, 2026 · 52% similar
- Show HN: IceCubes – speaker-attributed meeting transcripts without a bot Nandita_Arora · 11 pts · March 09, 2026 · 51% similar
- Show HN: Hyper – A stupidly non-corporate voice AI app for IRL conversations shainvs · 16 pts · March 11, 2026 · 48% similar
- Kagi Translate now supports LinkedIn Speak as an output language smitec · 122 pts · March 17, 2026 · 47% similar
Discussion Highlights (16 comments)
geooff_
I can't say enough nice things about Cohere's services. I migrated over to their embedding model a few months ago for CLIP-style embeddings and it's been fantastic. It has the most crisp, steady P50 of any external service I've used in a long time.
simonw
It's great that this is Apache 2.0 licensed - several of Cohere's other models are licensed free for non-commercial use only.
dinakernel
My worry is that ASR will end up like OCR. If the large multimodal systems are good enough (latency-wise), their domain understanding eats the dedicated technologies alive. In OCR, even when the characters are poorly scanned, the deep domain understanding these large multimodal models have lets them work out what the document actually meant: "this is going to be the order ID, because in the million invoices I've seen before, the order ID is normally below the order date", etc. My worry is that the same thing is going to happen in ASR too.
topazas
How hard could it be to train it on other European language(s)?
teach
Dumb question, but if this is "open source" is there source code somewhere? Or does that term mean something different in the world of models that must be trained to be useful?
gruez
> Limitations
> Timestamps/Speaker diarization. The model does not feature either of these.

What a shame. Is whisperx still the best choice if you want timestamps/diarization?
Void_
Just today I shipped support for this in Whisper Memos: https://whispermemos.com/changelog/2026-04-cohere-transcribe Accurate and fast model, very happy with it so far!
ramon156
I had to set up Fireflies for our company recently. Cool tool, but I'm sending dozens of internal meetings to an American company; our ISO inspector wouldn't be pleased to know. This is a good option. Will check it out.
stavros
To clarify, this is SOTA in its size category, right? It's not better than Parakeet, for example?
kalmuraee
Multimodal models are way better
_medihack_
Unfortunately, this model does not seem to support a custom vocabulary, word boosting or an additional prompt.
kieloo
The problem with many STT models is that they seem to mostly be trained on perfectly-accented speech and struggle a lot with foreign accents, so I'm curious to try this one as a Frenchman with a rather French English accent. So far, the best I have found while testing models for my language learning app (Copycat Cafe) is Soniox. All others performed badly for non-native accents. The worst were Whisper-based models, because when they misunderstand they hallucinate and tend to come up with random phrases that have nothing to do with the topic.
BreezyBadger
Awesome. Going to see if I can port https://scrivvy.ai to this. Based in Canada.
bkitano19
Notable omission of Deepgram models in the comparisons?
mnbbrown
Ran it over our internal dataset of ~250 recordings of people saying British postcodes (all kinds of accents, etc.) and it's competitive for sure!

- Soniox (stt-async-v4): 176/248 (71.0%)
- ElevenLabs (scribe_v2): 170/248 (68.5%)
- AssemblyAI (universal-3-pro): 166/248 (66.9%)
- Deepgram (nova-3): 158/248 (63.7%)
- AssemblyAI (universal-2): 148/248 (59.7%)
- Cohere (transcribe-03-2026): 148/248 (59.7%)
- Speechmatics (enhanced): 134/248 (54.0%)

P.S. how do I get this to render correctly on here?
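A minimal sketch of how hit-rate numbers like the ones above could be computed. The normalisation rules (uppercase, strip whitespace) and the pair-of-strings data format are my assumptions, not necessarily what the commenter's harness does:

```python
import re

def normalise(postcode: str) -> str:
    """Uppercase and drop all whitespace, e.g. 'sw1a 1aa' -> 'SW1A1AA'."""
    return re.sub(r"\s+", "", postcode.strip().upper())

def exact_match_accuracy(pairs):
    """pairs: iterable of (ground_truth, transcript) string tuples.

    Returns (hits, total, fraction correct) after normalisation.
    """
    pairs = list(pairs)
    hits = sum(1 for truth, hyp in pairs if normalise(truth) == normalise(hyp))
    total = len(pairs)
    return hits, total, hits / total

# Toy example with three recordings
results = [
    ("SW1A 1AA", "sw1a 1aa"),   # match after normalisation
    ("EC1A 1BB", "EC1A 1BD"),   # one wrong character -> miss
    ("M1 1AE", "M1 1AE"),       # exact match
]
hits, total, acc = exact_match_accuracy(results)
print(f"{hits}/{total} ({acc:.1%})")  # 2/3 (66.7%)
```

Exact match is deliberately strict; a word-error-rate metric would score near-misses more gently, but for postcodes an almost-right answer is still unusable.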
nodja
It's probably another ASR model that focuses on benchmarks and simple uses instead of more challenging real use cases. I upload edited gameplay VODs of Twitch streams to YouTube, and use whisper-large-v3 to provide subtitles for accessibility reasons (YouTube's own auto-subtitles suck, though they've been getting better). My checklist for a good ASR model for my use case is:

1. Timestamp support.
2. Support for overlapping speakers.
3. Accurate transcripts that don't coalesce half-words/interrupted sentences.
4. Support for non-verbal stuff like [coughs], [groans], [laughs], [sighs], etc.
5. Context injection of non-trivial sizes (10k+ words).

1 is obvious, because without it we can't have subtitles; forced alignment fails too often. 2 is crucial for real-world scenarios because in the real world people talk over each other all the time; in my case it's a streamer talking over gameplay audio, or the streamer having guests over. When two people speak, the transcript either ignores one of them or, in the worst case, both of them. 3 and 4 are an accessibility thing: if you're deaf or hard of hearing, a more literal transcript of what's being said conveys better how the speaker is speaking. If all subtitles are properly "spell-checked", then it's clear your model is overfit to the benchmarks. 5 is not a requirement per se, but more of a nice-to-have: in my use case the streamer is often reading stream chat, so feeding the model the list of users that recently talked, recent chat messages, text on screen, etc. would make for more accurate transcripts.

I've tried many models, and the closest that fulfill my needs are LLM-style models on top of forced alignment. That's too slow, so I've been sticking with Whisper, because with whisperx I can get a transcript in 5 minutes with just a single command. One thing all these models do (including Whisper) is just omit full sentences; it's the worst thing a model can do.
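To show why requirement 1 in the list above matters mechanically, here is a minimal sketch of packing word-level timestamps into SRT subtitle cues. The input format (a list of `{"word", "start", "end"}` dicts) and the cue-splitting thresholds are my assumptions, loosely modelled on what aligned-whisper pipelines emit, not any specific tool's output:

```python
def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 3.4 -> '00:00:03,400'."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def words_to_srt(words, max_gap=0.8, max_words=10):
    """Group consecutive timestamped words into SRT cues.

    A new cue starts after a pause longer than max_gap seconds,
    or once a cue already holds max_words words.
    """
    cues, current = [], []
    for w in words:
        if current and (w["start"] - current[-1]["end"] > max_gap
                        or len(current) >= max_words):
            cues.append(current)
            current = []
        current.append(w)
    if current:
        cues.append(current)

    blocks = []
    for i, cue in enumerate(cues, 1):
        text = " ".join(w["word"] for w in cue)
        span = f"{srt_time(cue[0]['start'])} --> {srt_time(cue[-1]['end'])}"
        blocks.append(f"{i}\n{span}\n{text}\n")
    return "\n".join(blocks)

words = [
    {"word": "hello", "start": 0.0, "end": 0.4},
    {"word": "there", "start": 0.5, "end": 0.9},
    {"word": "friend", "start": 3.0, "end": 3.4},  # long pause -> new cue
]
print(words_to_srt(words))
```

Without per-word timestamps the grouping step has nothing to work from, which is why models that only return a flat transcript force you back onto a separate forced-alignment pass.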