Making deep learning go brrrr from first principles (2022)
tosh
163 points
59 comments
May 23, 2026
Related Discussions
- There Will Be a Scientific Theory of Deep Learning jamie-simon · 191 pts · April 24, 2026 · 56% similar
- DeepSeek-V4 on Day 0: From Fast Inference to Verified RL with SGLang and Miles mji · 31 pts · April 25, 2026 · 50% similar
- A Visual Introduction to Machine Learning (2015) vismit2000 · 343 pts · March 15, 2026 · 50% similar
- DeepSeek V4: The Open-Source Model Frontier Labs Feared HelloAi · 61 pts · May 15, 2026 · 48% similar
- Executing programs inside transformers with exponentially faster inference u1hcw9nx · 17 pts · March 12, 2026 · 48% similar
Discussion Highlights (10 comments)
tosh
> in the time that Python can perform a single FLOP, an A100 could have chewed through 9.75 million FLOPS
Wild.
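The ratio in the quote can be checked with back-of-the-envelope arithmetic. The two inputs below are assumptions, not figures stated in this thread: an A100 peak of 312 TFLOPS (the dense FP16 tensor-core number) and roughly 32 million Python additions per second.

```python
# Assumed figures (not stated in this thread):
A100_PEAK_FLOPS = 312 * 10**12      # 312 TFLOPS, dense FP16 tensor-core peak
PYTHON_ADDS_PER_SEC = 32 * 10**6    # rough rate for interpreted Python additions

# How many A100 FLOPs fit in the time Python needs for one addition.
flops_per_python_op = A100_PEAK_FLOPS // PYTHON_ADDS_PER_SEC
print(f"{flops_per_python_op:,}")
```

Under those assumptions the result lines up with the 9.75 million in the quote.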
noosphr
> For example, getting good performance on a dataset with deep learning also involves a lot of guesswork. But, if your training loss is way lower than your test loss, you're in the "overfitting" regime, and you're wasting your time if you try to increase the capacity of your model.
https://arxiv.org/abs/1912.02292
jdw64
Right now, all I know how to do is pull models from Hugging Face, but someday I want to build my own small LLM from scratch.
big-chungus4
How does x.cos().cos() work faster than doing two cos calls separately? The first cos call returns a tensor either way; the only difference is that it's not assigned to a variable. But how is it even possible to know that difference in Python?
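One way to read the question: in plain eager execution Python cannot tell the two forms apart, and each cos call runs as its own kernel. Fusion has to come from a compiler or tracer that records both calls before running anything (in PyTorch, something like `torch.compile`). A toy sketch of that idea in pure Python, where `LazyVec` is a hypothetical stand-in and not a PyTorch API:

```python
import math

class LazyVec:
    """Hypothetical toy: records elementwise ops instead of running them."""

    def __init__(self, data, ops=()):
        self.data = list(data)
        self.ops = tuple(ops)

    def cos(self):
        # No work happens here; just remember that cos was requested.
        return LazyVec(self.data, self.ops + (math.cos,))

    def materialize(self):
        # "Fused" execution: one pass over the data applies every recorded op,
        # so the intermediate cos(x) vector is never written out.
        out = []
        for value in self.data:
            for op in self.ops:
                value = op(value)
            out.append(value)
        return out

x = LazyVec([0.0, 1.0, 2.0])
fused = x.cos().cos().materialize()

# Eager equivalent: two passes over the data and one intermediate list.
tmp = [math.cos(v) for v in [0.0, 1.0, 2.0]]
eager = [math.cos(v) for v in tmp]
```

Both paths produce the same numbers. What changes is how many times the data is read and written, which is the memory-bandwidth cost that fusion is meant to cut.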
ollin
This post is a classic! Also recommended: Horace gave a related talk (covering the high-level picture of modern ML systems) at Jane Street in Dec 2024: https://www.youtube.com/watch?v=139UPjoq7Kw
axpy906
Needs 2022 in title
marketingan
Deep learning is just glorified linear algebra. Master the progression: Feed-forward → CNN → RNN → LSTM → Attention. You don't even need a GPU to understand the climax; Karpathy’s llama2.c implements a full transformer inference engine in just ~300 lines of C using SIMD pragmas for CPU execution.
ThouYS
I feel like there is no portable advice for performance. A torch model exported as onnx is a different model. That onnx model run using onnxruntime with cuda ep is a different model than the one run with TRT ep. And even among the same runtime, depending on the target hardware and the memory available during tuning, the model behaves differently. It is a humongous mess
xiaod
I'd want to see more about the failure modes. Production systems need graceful degradation more than optimal performance.
liuliu
One thing people seem not to acknowledge, and this post made super clear, is that NVIDIA has kept its lead extremely well through a few years of very high growth. The TFLOPs, the bandwidth, and the interconnect mentioned in this post continue to grow at an exponential rate with no sign of stopping yet. This is a 30-year-old incumbent reminding you. The willingness to compete from NVIDIA is simply remarkable.