NanoGPT Slowrun: Language Modeling with Limited Data, Infinite Compute
sdpmas
147 points
26 comments
March 04, 2026
Related Discussions
- NanoGPT Slowrun: 10x Data Efficiency with Infinite Compute sdpmas · 122 pts · March 19, 2026 · 80% similar
- GPT 5.4 Thinking and Pro twtw99 · 64 pts · March 05, 2026 · 54% similar
- GPT-5.4 meetpateltech · 156 pts · March 05, 2026 · 54% similar
- GPT-5.4 mudkipdev · 739 pts · March 05, 2026 · 54% similar
- Cross-Model Void Convergence: GPT-5.2 and Claude Opus 4.6 Deterministic Silence rayanpal_ · 50 pts · March 22, 2026 · 54% similar
Discussion Highlights (9 comments)
suddenlybananas
Reminds me a fair bit of the BabyLM challenge. It would be good to give them a shout-out and see how this challenge differs.
archermarks
Very cool idea. Interested to see how this progresses. One question: how worried are you about over-fitting to this particular dataset, i.e. memorizing rather than generalizing? Obviously you hold out a validation set, but since you're meta-optimizing the model itself on its validation performance, you're still at risk of over-fitting to it.
lzaborowski
I like the idea of flipping the constraint. Most ML benchmarks assume unlimited data and limited compute, so people optimize for speed. If high-quality training data becomes the real bottleneck, then the interesting question is how much signal you can extract from the same dataset when compute is cheap.
navvyeanand
Amazing job!
kseniamorph
Curious about the baseline choice. modded-nanogpt was optimized for wall-clock speed, not data efficiency, so it seems like an unusual reference point for this kind of benchmark. Why not vanilla NanoGPT?
linolevan
There was a very interesting paper out of Stanford this past September on pretraining under the unlimited-compute, limited-data paradigm[0]. Pretty much the same setup, but with ~200M training tokens instead. [0] https://www.alphaxiv.org/abs/2509.14786
refulgentis
This looks awesome!!! I'm curious about the ensemble: does it mean "train 8 different models and pick the best one"? That's where my mind jumps, but it also seems wrong, because then you could just keep increasing the number of different models you train to get a win.
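(The post's exact ensembling scheme isn't quoted in this thread, so as a hypothetical sketch: a common alternative to "train N models and pick the best one" is averaging the members' predicted distributions, which uses all N models at inference time. The logits below are made up for illustration.)

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical next-token logits over a 4-token vocab,
# one row per independently trained ensemble member.
member_logits = np.array([
    [2.0, 1.0, 0.1, -1.0],
    [1.5, 1.8, 0.0, -0.5],
    [2.2, 0.5, 0.3, -1.2],
])

# "Pick the best one": keep a single member (say, the one with the
# lowest validation loss) and discard the rest.
best_single = softmax(member_logits[0])

# Ensembling: average the members' probability distributions, so
# every model contributes to the final prediction.
ensemble = softmax(member_logits).mean(axis=0)

print(best_single.round(3))
print(ensemble.round(3))
```

Under this reading, adding more members doesn't trivially "win": each new model changes the averaged distribution rather than just giving you more lottery tickets to pick from.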
bee_rider
> Directions we think are wide open
> Second-order optimizers and natural gradient methods

Do second-order optimizers help improve data efficiency? I assumed they'd help you get to the same minimum faster (but this is way outside my wheelhouse).
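(A toy illustration of the "same minimum, faster" intuition in the comment above: on a quadratic loss, a Newton step, i.e. preconditioning the gradient by the inverse Hessian, reaches the minimum in one update, while plain gradient descent on ill-conditioned curvature takes many steps. Whether that translates into data efficiency is exactly the open question; the numbers here are made up.)

```python
import numpy as np

# Quadratic loss L(w) = 0.5 * w^T H w - b^T w with ill-conditioned
# curvature (condition number 100), where gradient descent crawls.
H = np.array([[10.0, 0.0],
              [0.0, 0.1]])
b = np.array([1.0, 1.0])
w_star = np.linalg.solve(H, b)       # the true minimum

w0 = np.zeros(2)
grad = H @ w0 - b                    # gradient at the start point

# One Newton step: precondition the gradient by the inverse Hessian.
w_newton = w0 - np.linalg.solve(H, grad)

# Fifty steps of plain gradient descent with a safe step size.
w_gd = w0.copy()
for _ in range(50):
    w_gd -= 0.1 * (H @ w_gd - b)

print("Newton error:", np.linalg.norm(w_newton - w_star))
print("GD error:    ", np.linalg.norm(w_gd - w_star))
```

On a quadratic, the Newton step lands exactly on the minimum; gradient descent is still converging along the low-curvature direction after 50 steps.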
shubhamintech
The ensemble diversity point is underrated. Most teams pick one architecture and ship it, so the finding that architectural variation beats random seeds is interesting but hard to act on in practice. The more useful takeaway: low-data regimes expose every bad design decision you normally paper over with more tokens. It's basically a forcing function for understanding what actually drives model quality vs. what's just scale noise.