Moving a large-scale metrics pipeline from StatsD to OpenTelemetry / Prometheus

jmarbach 73 points 20 comments April 16, 2026
medium.com

Full disclosure - I formerly worked for Grafana Labs. The size of this Grafana Mimir deployment would rank it in the top echelon of customers. The irony is that this may be a $0 revenue user for Grafana Labs.

Discussion Highlights (7 comments)

dig1

> The irony is that this may be a $0 revenue user for Grafana Labs.

Why is that ironic? Since Mimir is open-source, $0 revenue users are expected. AFAIK, Grafana Labs relies heavily on Go, TypeScript, and Linux without necessarily being their top financial contributor. They could have kept Mimir proprietary like Splunk, but whether that would have attracted the same level of adoption or community contribution is another matter.

awoimbee

Directly emitting metrics using OTLP instead of having the OTel receiver scrape the metrics endpoint is interesting. I never made that move because the Prometheus metrics endpoint works and is so simple, and it's what most projects (e.g. Kubernetes) use.

jameson

Curious why the team chose Grafana Mimir over a VictoriaMetrics (VM) cluster?

codeduck

> given Prometheus’s widespread adoption and proven reliability in diverse environments.

I have used Prometheus a lot. Reliable is not a word I would associate with it.

blueybingo

The zero injection fix for sparse counters is the most underrated part of this writeup. Injecting a synthetic zero on first flush to anchor the cumulative baseline is actually a pretty elegant solution to a problem that bites almost every team migrating from delta-based systems to Prometheus, and centralizing it in the aggregation tier rather than pushing the fix to every instrumentation callsite is exactly the right call.
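To make the zero-injection idea concrete, here is a minimal Python sketch of a delta-to-cumulative aggregator. It is not the code from the writeup: the class `DeltaToCumulative`, the `flush_interval` parameter, the choice to timestamp the synthetic zero one interval before the first real sample, and `naive_increase` (a simplified stand-in for PromQL's `increase()` with no extrapolation or reset handling) are all assumptions made for illustration.

```python
from collections import defaultdict


class DeltaToCumulative:
    """Hypothetical aggregation-tier converter: delta counters in, cumulative samples out."""

    def __init__(self, inject_zero=True, flush_interval=10):
        self.inject_zero = inject_zero
        self.flush_interval = flush_interval  # seconds between flushes (assumed)
        self.totals = {}                      # series -> running cumulative total
        self.pending = defaultdict(float)     # deltas received since the last flush

    def record(self, series, delta):
        """Accumulate a delta increment for a series."""
        self.pending[series] += delta

    def flush(self, timestamp):
        """Return (series, timestamp, value) samples for every series with new deltas."""
        samples = []
        for series, delta in self.pending.items():
            if series not in self.totals:
                self.totals[series] = 0.0
                if self.inject_zero:
                    # Synthetic baseline: without it, the first cumulative
                    # sample has no predecessor to be compared against.
                    samples.append((series, timestamp - self.flush_interval, 0.0))
            self.totals[series] += delta
            samples.append((series, timestamp, self.totals[series]))
        self.pending.clear()
        return samples


def naive_increase(samples):
    """Last value minus first value; 0.0 if fewer than two samples exist."""
    values = [value for _, _, value in samples]
    return values[-1] - values[0] if len(values) >= 2 else 0.0


# A sparse counter that fires three times and then goes quiet.
without_zero = DeltaToCumulative(inject_zero=False)
without_zero.record("errors_total", 3)
print(naive_increase(without_zero.flush(60)))  # 0.0: the first increments are invisible

with_zero = DeltaToCumulative(inject_zero=True)
with_zero.record("errors_total", 3)
print(naive_increase(with_zero.flush(60)))     # 3.0: the zero anchors the baseline
```

Because the fix lives in the aggregator's `flush`, callers of `record` need no changes, which is the centralization point made above.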

zbentley

> Initially, we anticipated that the edge case would have minimal impact, given Prometheus’s widespread adoption and proven reliability in diverse environments. However, as we migrated more users, we started seeing this issue more frequently, and it stalled migration.

That's a very professional way of saying "Wait, everyone just lives with this? What the fuck?!" Many such cases in the Prometheus ecosystem.

valyala

It would be interesting to know why Airbnb uses vmagent for streaming aggregation but didn't switch from Mimir to VictoriaMetrics. Switching could save them a lot in infrastructure and operations costs, as in the cases of Roblox, Spotify, Grammarly and others - https://docs.victoriametrics.com/victoriametrics/casestudies...
