148,421 posts later: A model to predict the Hacker News front page
crimeacs
13 points
3 comments
May 12, 2026
Related Discussions
- Show HN: I built a tool that helps predict HN front page success · margotli · 25 pts · May 03, 2026 · 65% similar
- Show HN: Hackerbrief – Top posts on Hacker News summarized daily · p0u4a · 66 pts · March 16, 2026 · 62% similar
- Show HN: State of the Art of Coding Models, According to Hacker News Commenters · yunusabd · 82 pts · May 02, 2026 · 58% similar
- Profiling Hacker News users based on their comments · simonw · 60 pts · March 22, 2026 · 57% similar
- Crow Watch: A Hacker News Alternative · medv · 12 pts · March 09, 2026 · 56% similar
Discussion Highlights (2 comments)
crimeacs
Author here. Quick context so this doesn’t read like a hot take.

Dataset: 148,421 public HN stories from Algolia since 2007, filtered to score ≥5. The split is strictly chronological: train < Jul 2025, val Aug–Dec 2025, holdout Jan 2026+. Random splits are misleading here because kNN features leak future neighbors.

Model: LightGBM with 4 heads: median, p10, p90, and a score ≥100 classifier with isotonic calibration. Compiled to plain JS via m2cgen and runs inside a Vercel function, with no Python/ONNX/runtime. ~10 MB bundle, sub-ms inference.

Holdout:
* Spearman ρ = 0.33 on log_score
* MAE log = 1.65, roughly 5x off in raw points
* AUC for score ≥100 = 0.67
* Precision@30 = 0.83

So: not magic. About one-third of the signal seems recoverable from title/context. AUC is below ontology2’s 2014 title-only baseline, around/above recent BERT fine-tunes I found.

Two things I haven’t seen elsewhere:
1. Comment simulator grounds every fake comment in a real top comment from a kNN neighbor, with `[src]`.
2. `/predictions` runs a live calibration ledger against actual HN top 30 every 10 min, so the model can’t hide behind a static benchmark.

Open source, MIT, training scripts included: https://github.com/crimeacs/foresyn-hackernews

I ran the submitted title through the model first. It predicted 32/99 virality and ~12 points. The ledger will soon tell us whether it was calibrated. Roast away.
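For readers who want to see the chronological split and the "roughly 5x" figure concretely, here is a minimal Python sketch. It is not taken from the linked repository. The function names `assign_split` and `log_mae_to_factor` are hypothetical, the cut-off dates are a literal reading of the comment above (so July 2025 lands in no bucket), and the conversion assumes the reported MAE is in natural-log space.

```python
import math
from datetime import datetime, timezone

# Boundaries read literally from the comment above; the exact cut-offs
# and the treatment of July 2025 are assumptions, not the author's code.
TRAIN_END = datetime(2025, 7, 1, tzinfo=timezone.utc)
VAL_START = datetime(2025, 8, 1, tzinfo=timezone.utc)
HOLDOUT_START = datetime(2026, 1, 1, tzinfo=timezone.utc)


def assign_split(created_at):
    """Return 'train', 'val', 'holdout', or None for a story timestamp."""
    if created_at < TRAIN_END:
        return "train"
    if VAL_START <= created_at < HOLDOUT_START:
        return "val"
    if created_at >= HOLDOUT_START:
        return "holdout"
    return None  # July 2025: in no bucket under this literal reading


def log_mae_to_factor(mae_log):
    """Convert a mean absolute error in natural-log space
    into a typical multiplicative error in raw points."""
    return math.exp(mae_log)
```

Under these assumptions, `log_mae_to_factor(1.65)` comes out a little above 5, which is consistent with the "roughly 5x off in raw points" reading. Assigning buckets purely by timestamp is what keeps any neighbor lookup for a training story from seeing stories posted later.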
metadat
I tested it on an oldie but goodie, The Bullshit Web: https://pxlnv.com/blog/bullshit-web/ But the results were really lackluster compared to what happened IRL (https://news.ycombinator.com/item?id=17655089). Am I missing something? Definitely a neat idea!