My accent costs me 30 IQ points on Zoom. So we built an ML model to fix it

artavazdsm 33 points 23 comments March 03, 2026
krisp.ai · View on Hacker News

Discussion Highlights (20 comments)

artavazdsm

Co-founder of Krisp here. There are 1.5B non-native English speakers in the workforce, 4x the number of native speakers, yet all comms infra is optimized for native accents. We spent 3 years building listener-side, on-device accent understanding. The hard parts: no parallel training data exists, the accent space is effectively infinite, accent is entangled with voice identity, and everything has to run on CPU at under 250 ms of latency. Built in Yerevan, Armenia. Beta is live and free. Happy to go deep on the ML side.
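
To make the latency constraint concrete, here is a back-of-the-envelope budget for a streaming, listener-side converter. The numbers are purely illustrative, not Krisp's actual figures: end-to-end latency is roughly frame buffering plus model lookahead plus per-frame compute, and every term has to be squeezed to stay under 250 ms.

```python
# Illustrative latency budget for a streaming, listener-side converter.
# All numbers are hypothetical; the point is that end-to-end latency is
# frame buffering + model lookahead + per-frame compute + output buffering.
frame_ms      = 20    # audio chunk fed to the model each step
lookahead_ms  = 120   # future context the model waits for before emitting audio
compute_ms    = 10    # per-frame CPU inference time (must stay below frame_ms)
output_buf_ms = 20    # playback buffering on the listener's device

total_ms = frame_ms + lookahead_ms + compute_ms + output_buf_ms
print(f"end-to-end latency ~= {total_ms} ms (budget: 250 ms)")

# Real-time constraint: processing a frame may not take longer than the frame
# itself, otherwise the converter falls behind the live call.
assert compute_ms <= frame_ms
```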

astipili

Will it help the barista at Starbucks finally get my name right?

snek26

Curious whether wav2vec-style embeddings played a role in your representation learning.
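
For context, "wav2vec-style embeddings" usually means features from a self-supervised speech encoder rather than raw spectrograms. A minimal sketch of extracting them with the off-the-shelf torchaudio wav2vec 2.0 pipeline (purely illustrative, nothing to do with Krisp's actual stack; "utterance.wav" is a hypothetical input file):

```python
# Extract self-supervised wav2vec 2.0 features with torchaudio's pretrained
# pipeline; downstream accent/content models are often trained on top of
# representations like these.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("utterance.wav")  # hypothetical mono recording
if sr != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    # Returns one tensor of shape (batch, frames, dim) per transformer layer;
    # intermediate layers tend to carry the most pronunciation information.
    features, _ = model.extract_features(waveform)

print([tuple(f.shape) for f in features])
```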

bebelovejan

I would like to use such a model, but only if it really preserves my voice; otherwise people would realize it's not me, or I'd have to use it all the time.

imuradyan

On-device CPU inference is the real flex here! Optimization probably mattered as much as modeling.

sssnowgirl

This is a game-changer! I remember every call I had with an investor, feeling too shy to ask "can you repeat that?"... thanks Krisp, you changed my life!!!

gyumjibashyan

How did you estimate the number of IQ points?

arshakarap

This is built for international, privacy-first teams!

armsuro

This feels adjacent to voice conversion research, but with stricter latency constraints.

amartiro

The parallel data is a problem here — you can’t crowdsource ground truth because no one can record themselves with a different accent.

aharutyunyan

Accent space is effectively infinite. Generalization must rely on invariants rather than enumeration.

nareksardaryann

Great work. Natural + clear is the combo that matters.

rasjonell

Latency can destroy conversational rhythm. What's your p95 inference time? Also, are there any benchmarks we can see?
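
For anyone unfamiliar, p95 here is just the 95th percentile of per-frame processing time. A rough way to measure it for any streaming model, with a placeholder process_frame standing in for the real inference call (illustrative only, not Krisp's benchmark):

```python
# Toy benchmark harness: measure p50/p95 per-frame processing time for a
# streaming model that consumes fixed-size audio frames.
import time
import numpy as np

SAMPLE_RATE = 16_000
FRAME_MS = 20                                   # assumed hop size
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000

def process_frame(frame: np.ndarray) -> np.ndarray:
    """Placeholder for the actual on-device inference call."""
    return frame                                # identity stand-in

audio = np.random.randn(SAMPLE_RATE * 60).astype(np.float32)   # 60 s of noise
timings_ms = []
for start in range(0, len(audio) - FRAME_LEN + 1, FRAME_LEN):
    t0 = time.perf_counter()
    process_frame(audio[start:start + FRAME_LEN])
    timings_ms.append((time.perf_counter() - t0) * 1_000)

p50, p95 = np.percentile(timings_ms, [50, 95])
print(f"p50 = {p50:.2f} ms, p95 = {p95:.2f} ms over {len(timings_ms)} frames")
```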

tritont

Nice to finally see this direction of accent conversion (that is, on incoming calls) in the Krisp app. This is a very meaningful feature.

Ani_Kh1

Curious whether wav2vec-style embeddings played a role in your representation learning.

MarAraqelyan

Really cool to see accent adaptation in real time — curious about benchmarks and how well this handles messy, real Zoom calls

achobanyan

Local CPU inference stands out. Careful optimization likely rivaled the modeling effort.

zkhalapyan

Yeah, this would be helpful for my Singlish friends out there!

1ilit

On-device CPU inference is the real flex here. Optimization probably mattered as much as modeling.

Narek21

This feels adjacent to voice conversion research, but with stricter latency constraints.
