io_uring, libaio performance across Linux kernels and an unexpected IOMMU trap
tanelpoder
61 points
16 comments
March 24, 2026
Related Discussions
Found 5 related stories in 33.7ms across 3,663 title embeddings via pgvector HNSW
- Linux Page Faults, MMAP, and userfaultfd for fast sandbox boot times shayonj · 14 pts · March 12, 2026 · 57% similar
- A tale about fixing eBPF spinlock issues in the Linux kernel y1n0 · 53 pts · March 18, 2026 · 55% similar
- Linux Internals: How /proc/self/mem writes to unwritable memory (2021) medbar · 59 pts · March 08, 2026 · 52% similar
- Apache Iggy: thread-per-core with io_uring in Rust ikatson · 32 pts · March 16, 2026 · 51% similar
- The State of Immutable Linux JustinGarrison · 24 pts · March 27, 2026 · 49% similar
Discussion Highlights (4 comments)
eivanov89
Dear folks, I'm the author of that post; a short summary below. We ran fio benchmarks comparing libaio and io_uring across kernels (5.4 -> 7.0-rc3). The most surprising part wasn't the io_uring gains (~2x), but a ~30% regression caused by the IOMMU becoming enabled by default between releases. Happy to share more details about the setup or help reproduce the results.
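For context, a 4K random-write engine comparison like the one summarized above could be expressed as a fio job file along these lines (a hypothetical sketch: the device path, queue depth, and runtime are placeholders, not the post's actual settings):

```ini
; Compare libaio vs io_uring on the same device and workload.
; /dev/nvme0n1 is a placeholder -- point it at a disposable test device.
[global]
filename=/dev/nvme0n1
direct=1
rw=randwrite
bs=4k
iodepth=32
numjobs=1
runtime=60
time_based=1

[libaio-randwrite]
ioengine=libaio

[io_uring-randwrite]
ioengine=io_uring
stonewall
```

`stonewall` makes the io_uring job wait for the libaio job to finish, so the two engines don't compete for the device; run it with `fio jobfile.fio`.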
hcpp
Why was 4K random write chosen as the main workload, and would the conclusion change with sequential I/O?
tanelpoder
I understand that it's the interrupt-based I/O completion workloads that suffered from IOMMU overhead in your tests? The IOMMU may induce some interrupt remapping latency, so I'd be interested in seeing:

1) interrupt counts (normalized to IOPS) from /proc/interrupts
2) "hardirqs -d" (bcc-tools) output for IRQ handling latency histograms
3) perf record -g output, to see if something inside the interrupt handling codepath takes longer (on bare metal you can see inside hardirq handler code too)

It would be interesting to see whether, with the IOMMU enabled, each interrupt takes longer to handle on CPU, or whether handling time stays roughly the same but interrupt delivery takes longer. There may also be some interrupt coalescing going on (I don't know exactly what else gets enabled with the IOMMU).

Since interrupts are raised "randomly", independently of whatever app/kernel code is running on the CPUs, it's a bit harder to visualize total interrupt overhead in something like flamegraphs, as the interrupt activity is scattered all over the chart. I used the flamegraph search/highlight feature to visually identify how much time the interrupt detours took during stress test execution. Example here (scroll down a little): https://tanelpoder.com/posts/linux-hiding-interrupt-cpu-usag...
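The /proc/interrupts suggestion above can be sketched as a snapshot-diff script (a minimal illustration, not from the post; divide the printed rates by your measured IOPS to get interrupts per I/O):

```python
import os
import time


def read_interrupt_counts(path="/proc/interrupts"):
    """Parse /proc/interrupts into {irq_label: total count summed across CPUs}."""
    counts = {}
    with open(path) as f:
        # Header row lists one column per CPU: "CPU0 CPU1 ..."
        ncpus = len(f.readline().split())
        for line in f:
            parts = line.split()
            if not parts:
                continue
            label = parts[0].rstrip(":")  # e.g. "24", "NMI", "ERR"
            # Per-CPU columns follow the label; the non-numeric tail is the
            # controller/device description, so stop at the first non-digit.
            percpu = []
            for tok in parts[1 : 1 + ncpus]:
                if tok.isdigit():
                    percpu.append(int(tok))
                else:
                    break
            if percpu:
                counts[label] = sum(percpu)
    return counts


if __name__ == "__main__" and os.path.exists("/proc/interrupts"):
    interval = 1.0
    before = read_interrupt_counts()
    time.sleep(interval)
    after = read_interrupt_counts()
    # Top 10 IRQ sources by rate over the sampling window.
    deltas = {k: after[k] - before.get(k, 0) for k in after}
    for irq, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:10]:
        print(f"{irq:>8}: {delta / interval:.0f} interrupts/s")
```

Running this while the fio workload is active (once with the IOMMU on, once with it off) gives the normalized interrupt counts from point 1; the hardirqs and perf data would then show where any extra time goes.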
skavi
what was the security situation of whatever is now being protected by the IOMMU before it was enabled by default?