We should get rid of average CPU utilization

JeremyTheo 31 points 23 comments May 22, 2026
www.theocharis.dev

Discussion Highlights (11 comments)

rimworld

great article thanks

ahartmetz

No, we shouldn't. We should measure latency if we care about latency.

zeafoamrun

Same thing when it comes to memory. The rabbit hole goes on forever, and metrics lie to you if you don't know how to interpret them properly.

ksk23

TL;DR: if the app is slow, give it more resources.

techpression

Lovely read. If you’ve ever had even remotely similar issues (you think you’re looking in the right places but you’re not), it reads like a detective novel.

CodesInChaos

It's well known that many throttling implementations are broken, usually by design. You shouldn't blame the CPU utilization metric for that footgun. In a well-designed scheduler, a task that has been granted an allotment of at least n cores should never get throttled to less than n cores at any time. It can be limited to less than n cores if CPU utilization is at 100% and another task gets scheduled at that time, since that's unavoidable when you oversubscribe the available resources.

VimEscapeArtist

Let’s measure temperature :)

arianvanp

A more general metric that is useful to watch is pressure stall information (PSI) for CPU, IO and memory: https://docs.kernel.org/accounting/psi.html
I made a Prometheus exporter for it: https://github.com/arianvp/cgroup-exporter
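For anyone who hasn't looked at PSI before, here is a minimal Python sketch of reading it. It assumes the text format described in the kernel docs linked above (a `some` line and, on newer kernels, a `full` line, each with `avg10`, `avg60`, `avg300` and `total` fields). The `parse_psi` helper and the `SAMPLE` values are made up for illustration and are not taken from the exporter mentioned above.

```python
def parse_psi(text):
    """Parse PSI text (e.g. the contents of /proc/pressure/cpu)
    into {"some": {...}, "full": {...}} with float values."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        # Each field looks like "avg10=1.50"; split on "=" and convert.
        result[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return result


# Illustrative sample only; the numbers are invented.
SAMPLE = (
    "some avg10=1.50 avg60=0.75 avg300=0.20 total=123456\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)

if __name__ == "__main__":
    psi = parse_psi(SAMPLE)
    # "some" avg10: share of the last 10 s in which at least one task
    # was stalled waiting for CPU, as a percentage.
    print(psi["some"]["avg10"])
```

On a real Linux box you would pass in `open("/proc/pressure/cpu").read()` instead of `SAMPLE`, assuming PSI is enabled in the kernel.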

nairboon

No, not at all. Why get rid of a low-level statistical measure? It's not even quite clear what the article argues against. htop doesn't even show you "average CPU utilization", it provides a sample of the current CPU utilization. To me the problem appears to be that they try to do some hard realtime computing with strict time guarantees, but are so far up the stack (golang library, golang scheduler, docker, kubernetes, virtualization, etc.), that they don't realize that this stack can't guarantee you realtime computing. CPU utilization is a very low-level measure and, in this stack, is only indirectly related to the observed timeouts.

JanMa

I've learned the hard way that CPU resource limits in K8S are a bad idea, as can be seen in this post. Just use CPU requests without limits, so the scheduler has an estimate of your application's CPU requirements but the application can still burst to use more CPU when it's available. With memory, of course, you should set a limit, and from experience it should be the same as your memory request.
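To put rough numbers on why a limit can hurt while average utilization still looks fine, here is a back-of-the-envelope Python sketch. It assumes a simplified model of CFS bandwidth control: a 100 ms period, a quota of `limit_cores * period_ms` CPU-milliseconds per period, work split evenly across threads, and enough physical cores to run all threads at once. The function names and the workload figures are hypothetical and not taken from the article.

```python
def burst_latency_ms(limit_cores, threads, cpu_ms_per_thread, period_ms=100.0):
    """Wall-clock time to finish a burst under a simplified CFS quota model."""
    if threads <= limit_cores:
        # The quota can never be exhausted, so nothing is throttled.
        return cpu_ms_per_thread
    quota_ms = limit_cores * period_ms          # CPU-ms allowed per period
    remaining = threads * cpu_ms_per_thread     # total CPU-ms of work
    elapsed = 0.0
    while remaining > quota_ms:
        # Quota is used up early in the period; all threads then sit
        # throttled until the next period starts.
        remaining -= quota_ms
        elapsed += period_ms
    # The rest fits in the current period and runs in parallel.
    return elapsed + remaining / threads


def average_utilization(limit_cores, threads, cpu_ms_per_thread, interval_ms):
    """CPU used over the interval, as a fraction of the limit."""
    return threads * cpu_ms_per_thread / interval_ms / limit_cores


if __name__ == "__main__":
    # Hypothetical workload: once per second, 8 threads each need 20 ms of CPU.
    print(burst_latency_ms(limit_cores=1, threads=8, cpu_ms_per_thread=20.0))
    print(burst_latency_ms(limit_cores=8, threads=8, cpu_ms_per_thread=20.0))
    print(average_utilization(1, 8, 20.0, interval_ms=1000.0))
```

In this toy model the burst takes 107.5 ms with a 1-core limit versus 20 ms with no effective limit, while the 1-second average sits at 16% of the limit. Real schedulers are messier, so treat the numbers as an illustration of the mechanism only.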

cyclonereef

I've worked with plenty of companies that provide some sort of hosting for enterprise customers, and the number of times I've seen even senior admins use only CPU Utilisation and Memory In-Use when investigating an issue is disheartening. And given that CPU Utilisation is an aggregate of all time that is not CPU idle, the same utilisation number can mean very different underlying system states. There's something like a dozen different CPU metrics that can be reported by the OS alone.
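A small Python sketch of that point, assuming the Linux `/proc/stat` layout, where the aggregate `cpu` line starts with the counters user, nice, system, idle, iowait, irq, softirq and steal. It follows the definition used above (everything that is not idle counts as utilisation; some tools also treat iowait as idle). The helper names and both snapshots are invented for illustration.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def breakdown(before, after):
    """Share of CPU time per category between two 'cpu ...' lines."""
    b = dict(zip(FIELDS, map(int, before.split()[1:9])))
    a = dict(zip(FIELDS, map(int, after.split()[1:9])))
    delta = {k: a[k] - b[k] for k in FIELDS}
    total = sum(delta.values())
    return {k: v / total for k, v in delta.items()}


def utilisation(shares):
    """Aggregate utilisation: all time that is not idle."""
    return 1.0 - shares["idle"]


# Invented snapshots: same start, two very different systems afterwards.
START = "cpu 0 0 0 0 0 0 0 0"
BUSY_APP = "cpu 600 0 200 200 0 0 0 0"          # mostly user + system time
STARVED_VM = "cpu 100 0 100 200 300 0 0 300"    # mostly iowait + steal

if __name__ == "__main__":
    for name, snap in (("busy app", BUSY_APP), ("starved VM", STARVED_VM)):
        shares = breakdown(START, snap)
        print(name, round(utilisation(shares), 2), shares)
```

Both snapshots come out at the same aggregate utilisation, but one machine is doing application work and the other is mostly waiting on IO or on the hypervisor.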
