Attention Residuals
GaggiX
148 points
21 comments
March 20, 2026
Related Discussions
Found 5 related stories in 42.3ms across 3,471 title embeddings via pgvector HNSW
- AI (2014) bjornroberg · 69 pts · March 20, 2026 · 49% similar
- A Visual Guide to Attention Variants in Modern LLMs Anon84 · 17 pts · March 22, 2026 · 48% similar
- Autoresearch: Agents researching on single-GPU nanochat training automatically simonpure · 82 pts · March 07, 2026 · 45% similar
- Learning athletic humanoid tennis skills from imperfect human motion data danielmorozoff · 137 pts · March 15, 2026 · 44% similar
- Where did you think the training data was coming from? speckx · 48 pts · March 11, 2026 · 43% similar
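The status line above says these stories were found with a pgvector HNSW scan over title embeddings. For the curious, a minimal sketch of what such a query could look like; the `stories` table, column names, and embedding dimension are my assumptions, not the site's actual schema:

```python
import psycopg  # assumption: psycopg 3 against Postgres with the pgvector extension

# Hypothetical schema, with an HNSW index for approximate nearest-neighbor search:
#   CREATE TABLE stories (id bigserial PRIMARY KEY, title text, embedding vector(384));
#   CREATE INDEX ON stories USING hnsw (embedding vector_cosine_ops);
QUERY = """
    SELECT title, 1 - (embedding <=> %(q)s::vector) AS similarity
    FROM stories
    ORDER BY embedding <=> %(q)s::vector
    LIMIT 5;
"""

def related_stories(conn: psycopg.Connection, query_embedding: list[float]):
    # <=> is pgvector's cosine-distance operator; with an HNSW index the
    # ORDER BY ... LIMIT scan is approximate but fast (tens of ms at this scale).
    vec_literal = "[" + ",".join(map(str, query_embedding)) + "]"
    with conn.cursor() as cur:
        cur.execute(QUERY, {"q": vec_literal})
        return cur.fetchall()
```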
Discussion Highlights (5 comments)
jszymborski
This reminds me of the input gates of an LSTM.
jjcm
Two things stand out to me with this:

1. Drops compute required for training by ~20%. This approach won't just help the ever-escalating model sizes larger companies are pushing for; it means things like autoresearch can iterate on new model architectures faster.

2. WAY lower bandwidth requirements for inference, which means approaches like this should run on consumer hardware far better. It apparently requires 1/6th the memory bandwidth of a traditional approach for better results.

This is a big improvement if it can be generalized. They're claiming it's a drop-in replacement, so it seems it can be.
westurner
ScholarlyArticle: "Attention Residuals" (2026) https://arxiv.org/abs/2603.15031 :

> Abstract: Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer's contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs for large-scale model training, we introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations, reducing the memory footprint while preserving most of the gains of full AttnRes. [...]
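To make the abstract's mechanism concrete, here is a minimal PyTorch sketch of the idea: each layer softmax-attends over the stack of all preceding layer outputs instead of summing them with fixed unit weights. The scoring scheme (a learned per-layer query against a projection of each stored output) and the sublayer stand-in are my assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnResLayer(nn.Module):
    """One layer whose input is an attention-weighted mix of all preceding
    layer outputs, replacing the unit-weight residual sum (full AttnRes).
    Hypothetical sketch: the scoring scheme is an assumption."""

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)                     # PreNorm
        self.f = nn.Linear(d_model, d_model)                  # stand-in for the attn/MLP sublayer
        self.query = nn.Parameter(torch.randn(d_model) / d_model**0.5)
        self.key_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, history: list[torch.Tensor]) -> torch.Tensor:
        stack = torch.stack(history, dim=-2)                  # (batch, seq, L, d)
        scores = (self.key_proj(stack) * self.query).sum(-1)  # input-dependent, per position
        weights = F.softmax(scores, dim=-1)                   # softmax over the L prior outputs
        mixed = (weights.unsqueeze(-1) * stack).sum(dim=-2)   # selective aggregation
        return mixed + self.f(self.norm(mixed))

# Usage: each layer reads the whole history and appends its output,
# which is where the O(L*d) memory cost at scale comes from.
layers = nn.ModuleList(AttnResLayer(512) for _ in range(12))
history = [torch.randn(2, 16, 512)]                           # embedded input tokens
for layer in layers:
    history.append(layer(history))
```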
jryio
This is the key piece:

> Full AttnRes is straightforward but requires O(Ld) memory at scale. Block AttnRes partitions layers into N blocks, accumulates within each block via standard residuals, and applies attention only over block-level representations. With ~8 blocks, it recovers most of Full AttnRes's gains while serving as a practical drop-in replacement with marginal overhead.
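A sketch of that block variant, continuing the assumptions of the sketch above: standard unit-weight residuals accumulate inside each block, and the learned softmax attention runs only over the N block-level representations, so the stored history grows with the number of blocks rather than the number of layers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockAttnRes(nn.Module):
    """Block AttnRes per the quoted description: plain residuals within a
    block, attention only over block-level representations. The sublayer
    and scoring details are assumptions, as in the sketch above."""

    def __init__(self, d_model: int, layers_per_block: int, num_blocks: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.ModuleList(
                nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, d_model), nn.GELU())
                for _ in range(layers_per_block)
            )
            for _ in range(num_blocks)
        )
        self.queries = nn.Parameter(torch.randn(num_blocks, d_model) / d_model**0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        block_reps = [x]                                   # O(N*d) history, not O(L*d)
        h = x
        for i, block in enumerate(self.blocks):
            stack = torch.stack(block_reps, dim=-2)        # (batch, seq, n_blocks_so_far, d)
            scores = (stack * self.queries[i]).sum(-1)
            weights = F.softmax(scores, dim=-1)
            h = (weights.unsqueeze(-1) * stack).sum(-2)    # attend over block reps only
            for sublayer in block:
                h = h + sublayer(h)                        # standard residuals in-block
            block_reps.append(h)
        return h
```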
Murfalo
Amazingly, the first author is a high school student! https://nathanchen.me/public/About%20me.html