We should get rid of average CPU utilization
JeremyTheo
31 points
23 comments
May 22, 2026
Discussion Highlights (11 comments)
rimworld
great article thanks
ahartmetz
No, we shouldn't. We should measure latency if we care about latency.
zeafoamrun
Same thing when it comes to memory. The rabbit hole goes on forever, and metrics lie to you if you don't know how to interpret them properly.
ksk23
TLDR; if app slow, give more resources
techpression
Lovely read. If you’ve ever had even remotely similar issues (you think you’re looking in the right places but you’re not), it reads like a detective novel.
CodesInChaos
It's well known that many throttling implementations are broken, usually by design. You shouldn't blame the CPU utilization metric for that footgun. In a well-designed scheduler, a task that has been granted an allotment of at least n cores should never get throttled to fewer than n cores at any time. It can be limited to fewer than n cores if CPU utilization is at 100% and another task gets scheduled at the time, since that's unavoidable when you oversubscribe the available resources.
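To make the throttling footgun concrete, here is a minimal sketch of a quota-per-period limiter, loosely modelled on how Linux CFS bandwidth control grants a cgroup a CPU-time quota for each period (100 ms by default). It is a simplified model and not the kernel's algorithm: the function name `simulate_quota` and the demand figures are assumptions chosen for illustration.

```python
def simulate_quota(demands_ms, quota_ms):
    """Simulate a CPU-time quota that refills every period.

    demands_ms: CPU time (in ms) the task newly wants in each period.
    quota_ms:   CPU time (in ms) the task may consume per period.
    Returns (used_per_period, throttled_periods, leftover_backlog_ms).
    """
    backlog = 0.0
    used = []
    throttled_periods = 0
    for demand in demands_ms:
        backlog += demand
        run = min(backlog, quota_ms)  # cannot exceed the quota in one period
        backlog -= run
        if backlog > 0:               # work was left waiting: the task was throttled
            throttled_periods += 1
        used.append(run)
    return used, throttled_periods, backlog


# Hypothetical workload: a limit of 2 CPUs (200 ms of CPU time per 100 ms
# period), observed for 10 periods (1 second). One burst of 500 ms of
# CPU work arrives in the first period, then nothing.
QUOTA_MS = 200
demands = [500] + [0] * 9
used, throttled, backlog = simulate_quota(demands, QUOTA_MS)

# Average utilization relative to the limit over the whole second.
average_utilization = sum(used) / (QUOTA_MS * len(demands))

print(used[:4])             # CPU time actually granted in the first periods
print(throttled)            # number of periods in which work was held back
print(average_utilization)  # fraction of the limit used over the window
```

In this model the task is throttled in two periods even though it uses only a quarter of its limit averaged over the second, which is the gap between an averaged utilization figure and what the task experiences.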
VimEscapeArtist
Let’s measure temperature :)
arianvanp
A more general metric that is useful to watch for is pressure stall information for CPU, IO and Memory. https://docs.kernel.org/accounting/psi.html I made a Prometheus exporter for it: https://github.com/arianvp/cgroup-exporter
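For readers who have not looked at PSI before, here is a small sketch of parsing the text format described in the kernel documentation linked above. The helper names `parse_psi` and `read_psi` are hypothetical, and the numbers in `SAMPLE` are made up for illustration. Reading the real files assumes a Linux kernel built with PSI support.

```python
from pathlib import Path


def parse_psi(text):
    """Parse the contents of a PSI file such as /proc/pressure/cpu.

    Each line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    Returns {"some": {...}, "full": {...}} with whichever lines are present.
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        values = {}
        for field in fields:
            key, _, raw = field.partition("=")
            # avg* are percentages; total is cumulative stall time in microseconds.
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result


def read_psi(resource):
    """Read PSI for 'cpu', 'io' or 'memory' from /proc/pressure (Linux only)."""
    return parse_psi(Path("/proc/pressure", resource).read_text())


# Made-up sample in the documented format, so the parser can be tried anywhere.
SAMPLE = """some avg10=1.53 avg60=0.87 avg300=0.32 total=1234567
full avg10=0.00 avg60=0.00 avg300=0.00 total=0"""

psi = parse_psi(SAMPLE)
print(psi["some"]["avg10"])  # share of the last 10 s in which some task was stalled
```

On a machine with PSI enabled, `read_psi("cpu")` would return the live values in the same shape.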
nairboon
No, not at all. Why get rid of a low-level statistical measure? It's not even quite clear what the article argues against. htop doesn't even show you "average CPU utilization", it provides a sample of the current CPU utilization. To me the problem appears to be that they try to do some hard realtime computing with strict time guarantees, but are so far up the stack (golang library, golang scheduler, docker, kubernetes, virtualization, etc.), that they don't realize that this stack can't guarantee you realtime computing. CPU utilization is a very low-level measure and, in this stack, is only indirectly related to the observed timeouts.
JanMa
I've learned the hard way that CPU resource limits in K8S are a bad idea, as can be seen in this post. Just use CPU requests without limits, so the scheduler has an estimate of your application's CPU requirements but the application can still burst to use more CPU when it's available. With memory, of course, you should set a limit, and from experience it should be the same as your memory request.
cyclonereef
I've worked with plenty of companies that provide some sort of hosting for enterprise customers, and the number of times I've seen even senior admins use only CPU Utilisation and Memory In-Use when investigating an issue is disheartening. And given that CPU Utilisation is an aggregate of all time that is not CPU idle, the same utilisation number can mean very different underlying system states. There are something like a dozen different CPU metrics that can be reported by the OS alone.
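As a rough illustration of that point, the sketch below breaks the aggregate `cpu` line of Linux's `/proc/stat` into its component shares between two samples. The counter values are invented for the example and the function names are hypothetical. Utilisation is taken as "everything that is not idle", following the definition above. Some tools also treat iowait as idle, which would give a different number.

```python
# First eight counters on the "cpu" line of /proc/stat. The trailing guest
# and guest_nice counters are skipped because they are already included in
# user and nice.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn a 'cpu ...' line from /proc/stat into a dict of tick counters."""
    parts = line.split()
    return dict(zip(FIELDS, map(int, parts[1:1 + len(FIELDS)])))


def breakdown(before, after):
    """Share of elapsed CPU time spent in each state between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {name: 0.0 for name in FIELDS}
    return {name: delta / total for name, delta in deltas.items()}


def utilisation(shares):
    """Aggregate utilisation: all time that is not idle."""
    return 1.0 - shares["idle"]


# Invented samples: one starting point, two very different outcomes.
start = parse_cpu_line("cpu 1000 0 500 8000 500 0 0 0 0 0")
busy_compute = parse_cpu_line("cpu 1600 0 700 8200 500 0 0 0 0 0")
stalled = parse_cpu_line("cpu 1100 0 600 8200 1000 0 0 100 0 0")

shares_a = breakdown(start, busy_compute)  # mostly user and system time
shares_b = breakdown(start, stalled)       # mostly iowait, plus some steal

print(utilisation(shares_a), shares_a["user"], shares_a["iowait"])
print(utilisation(shares_b), shares_b["user"], shares_b["iowait"])
```

Both intervals report the same aggregate utilisation, yet one is dominated by user time and the other by iowait and steal, so they call for very different fixes.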