Tailslayer: Library for reducing tail latency in RAM reads
hasheddan
48 points
16 comments
April 07, 2026
Related Discussions
Found 5 related stories in 35.5ms across 3,871 title embeddings via pgvector HNSW
- A tail-call interpreter in (nightly) Rust g0xA52A2A · 147 pts · April 05, 2026 · 47% similar
- Right-sizes LLM models to your system's RAM, CPU, and GPU bilsbie · 76 pts · March 01, 2026 · 47% similar
- Debunking Zswap and Zram Myths javierhonduco · 186 pts · March 24, 2026 · 43% similar
- Linux Page Faults, MMAP, and userfaultfd for fast sandbox boot times shayonj · 14 pts · March 12, 2026 · 43% similar
- Llm9p: LLM as a Plan 9 file system mleroy · 15 pts · March 08, 2026 · 42% similar
Discussion Highlights (8 comments)
shaicoleman
* Announcement [1]
* Video [2]

1. https://x.com/lauriewired/status/2041566601426956391 (https://xcancel.com/lauriewired/status/2041566601426956391)
2. https://www.youtube.com/watch?v=KKbgulTp3FE
jeffbee
This readme and this header do not seem to discuss in any way the tradeoff, which is that you're paying by the same factor in median latency to buy lower tail latency. Nobody thinks of a load as taking 800 cycles, but that is the baseline load latency here. Also, having sacrificed my own mental health to watch the disgustingly self-promoting hour-long video that announces this small git commit, I can confidently say that "Graviton doesn't have any performance counters" is one of the wrongest things I've heard in a long time. Overall, I give it an F. Anyway, if you want to hide memory refresh latency, IBM zEnterprise is your platform. It completely hides refresh latency by steering loads to the non-refreshing bank, and it only costs half the space, not up to 92% of your space like this technique.
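The "up to 92%" space figure quoted above is consistent with plain N-way replication arithmetic; here is a quick back-of-envelope sketch (my own numbers, not taken from the readme or the library):

```cpp
// Sketch: with N-way replication, (N-1)/N of the allocated space holds
// duplicate copies of the data. N = 12 replicas -> 11/12 ~= 91.7%, which
// lines up with the "up to 92%" overhead mentioned in the comment.
// This is illustrative arithmetic, not Tailslayer's documented behavior.
#include <cstdio>

int main() {
    for (int n = 2; n <= 12; ++n)
        std::printf("%2d-way replication: %.1f%% of space is duplicate copies\n",
                    n, 100.0 * (n - 1) / n);
}
```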
ysleepy
Loved the details about how memory access actually maps addresses to channels, ranks, banks and whatever; this is rarely discussed. Not sure how this works for larger data structures, but my first thought was that this should be implemented as microcode or an instruction. Most computation is not that jitter sensitive, and perception is not really on the nano-to-microsecond scale, but maybe it's a cool gadget for things like dtrace or interrupt handlers etc.
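For readers unfamiliar with the address-to-channel mapping being praised here, a minimal sketch of how a memory controller might hash a physical address onto a channel. The bit positions and the XOR hash below are invented for illustration; real controllers (and whatever map Tailslayer relies on) are platform specific:

```cpp
// Toy channel-interleave function: pick one of two channels from a physical
// address by XOR-folding a low interleave bit with a higher bit, so that
// strided access patterns still spread across channels. All bit choices here
// are assumptions for the example, not a real platform's map.
#include <cstdint>
#include <cstdio>

constexpr unsigned kChannels = 2;

unsigned channel_of(uint64_t phys_addr) {
    uint64_t low  = (phys_addr >> 6)  & 1;   // bit 6: 64 B interleave granularity (assumed)
    uint64_t high = (phys_addr >> 14) & 1;   // bit 14 folded in via XOR (assumed)
    return static_cast<unsigned>((low ^ high) % kChannels);
}

int main() {
    for (uint64_t addr = 0; addr < 0x200; addr += 0x40)
        std::printf("0x%04llx -> channel %u\n",
                    static_cast<unsigned long long>(addr), channel_of(addr));
}
```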
jagged-chisel
My understanding is that this is making a trade-off: using more space to get shorter access times. Do I have that right? OT: Tail Slayer. Not Tails Layer. My brain took longer to parse that than I'd have wanted.
addaon
This addresses the “short long tail” (known, bounded variance due to the multiple physical operations underlying a single logical memory op), but for hard real-time applications the “long long tail” of correctable-ECC-error-and-scrub may be the critical case.
inetknght
@lauriewired, I think the most interesting thing that I learned from this is that memory refresh causes read/write stalls. For some reason I thought it was completely asynchronous. But otherwise, nice work tying all the concepts together. You might want to get some better model trains though.
TeapotNotKettle
Very interesting work. But practically speaking, in a real application, isn't any performance benefit going to be lost by the reduced cache hit rate caused by having a larger working set? Or are the reads of all-but-one of the replicas non-cached? Apologies if I am missing something.
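One speculative way the cache-pollution concern above could be softened is to issue replica reads with a non-temporal prefetch hint. This is a guess at a mitigation, not what Tailslayer actually does; on x86 the NTA hint only reduces pollution (it does not eliminate it), and its exact effect varies by microarchitecture:

```cpp
// Sketch: prefetch a replica's slot with the non-temporal hint before loading
// it, asking the core to keep the line out of (or quickly evicted from) the
// outer cache levels. The function name and layout are hypothetical.
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

uint64_t read_replica_slot(const uint64_t* replica, std::size_t idx) {
    _mm_prefetch(reinterpret_cast<const char*>(&replica[idx]), _MM_HINT_NTA);
    return replica[idx];  // the load itself still allocates a cache line
}
```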
6keZbCECT2uB
I like the progression of the project: taking it from refresh-induced tail latency to racing threads assigned to addresses that are de-correlated by memory channel. Connecting this to a lookup table that is broadcast across memory channels so the lookup paths can race makes for a nice narrative, but framing this as reducing tail latency confused me, because I was expecting it to do a join where a single reader gets the faster of the two racers. From a narrative standpoint, I agree it makes more sense to focus on a duplicated lookup table where the fastest read wins; from an engineering standpoint, though, framing it in terms of channel-de-correlated reads has more possibilities. For example, if you need to evaluate multiple ML models in parallel to get a result, then by intentionally partitioning your models by channel you could ensure that a given model does reads on only fast data or only slow data. ML models might not be that interesting here, though, since they are good candidates for being resident in L3.
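For the "fastest replica wins" join this comment describes, a minimal sketch of the idea, assuming two replicas that land on different memory channels. This is not Tailslayer's API; it only illustrates the first-writer-wins join, and spawning threads per read would of course swamp the DRAM latency being raced (a real implementation would use pre-pinned worker threads or interleaved prefetches):

```cpp
// Two readers fetch the same logical slot from two replicas; whichever
// finishes first wins a compare-and-swap and publishes its value.
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

uint64_t racing_read(const std::vector<uint64_t>& replica_a,
                     const std::vector<uint64_t>& replica_b,
                     std::size_t idx) {
    std::atomic<bool> won{false};
    std::atomic<uint64_t> result{0};

    auto reader = [&](const std::vector<uint64_t>& replica) {
        uint64_t v = replica[idx];           // the memory access being raced
        bool expected = false;
        if (won.compare_exchange_strong(expected, true))
            result.store(v, std::memory_order_release);  // first finisher publishes
    };

    std::thread t1(reader, std::cref(replica_a));
    std::thread t2(reader, std::cref(replica_b));
    t1.join();
    t2.join();
    return result.load(std::memory_order_acquire);
}
```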