148,421 posts later: A model to predict the Hacker News front page

crimeacs 13 points 3 comments May 12, 2026
hackernews.foresyn.ai

Discussion Highlights (2 comments)

crimeacs

Author here - quick context so this doesn’t read like a hot take.

Dataset: 148,421 public HN stories from Algolia since 2007, filtered to score ≥5. Split is strictly chronological: train < Jul 2025, val Aug–Dec 2025, holdout Jan 2026+. Random splits are misleading here because kNN features leak future neighbors.

Model: LightGBM with 4 heads: median, p10, p90, and score ≥100 classifier with isotonic calibration. Compiled to plain JS via m2cgen and runs inside a Vercel function, with no Python/ONNX/runtime. ~10 MB bundle, sub-ms inference.

Holdout:
* Spearman ρ = 0.33 on log_score
* MAE log = 1.65, roughly 5x off in raw points
* AUC for score ≥100 = 0.67
* Precision@30 = 0.83

So: not magic. About one-third of the signal seems recoverable from title/context. AUC is below ontology2’s 2014 title-only baseline, around/above recent BERT fine-tunes I found.

Two things I haven’t seen elsewhere:
1. Comment simulator grounds every fake comment in a real top comment from a kNN neighbor, with `[src]`.
2. `/predictions` runs a live calibration ledger against actual HN top 30 every 10 min, so the model can’t hide behind a static benchmark.

Open source, MIT, training scripts included: https://github.com/crimeacs/foresyn-hackernews

I ran the submitted title through the model first. It predicted 32/99 virality and ~12 points. The ledger will soon tell us whether it was calibrated. Roast away.
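For readers who want to see the evaluation setup concretely, here is a minimal sketch of a chronological split and two of the holdout metrics quoted above. It is not taken from the linked repository: the toy records, the function names, and the exact Precision@k definition (share of the k highest-predicted stories that actually reached the score threshold) are assumptions made for illustration. It also assumes "MAE log" uses the natural logarithm, which is consistent with the "roughly 5x" figure since exp(1.65) is about 5.2.

```python
import math
from datetime import date

# Date boundaries as stated in the comment above.
TRAIN_END = date(2025, 7, 1)      # train: strictly before Jul 2025
VAL_START = date(2025, 8, 1)      # val: Aug-Dec 2025
HOLDOUT_START = date(2026, 1, 1)  # holdout: Jan 2026 onward

# Hypothetical toy records: (posted date, actual score, predicted score).
STORIES = [
    (date(2025, 3, 1), 120, 80),
    (date(2025, 6, 15), 7, 10),
    (date(2025, 7, 10), 60, 40),   # July 2025 is in none of the stated ranges
    (date(2025, 9, 2), 45, 12),
    (date(2025, 11, 20), 300, 150),
    (date(2026, 1, 5), 9, 30),
    (date(2026, 2, 10), 210, 95),
    (date(2026, 3, 3), 15, 8),
]


def chronological_split(rows):
    """Split by posting date so no later story can leak into training."""
    train = [r for r in rows if r[0] < TRAIN_END]
    val = [r for r in rows if VAL_START <= r[0] < HOLDOUT_START]
    holdout = [r for r in rows if r[0] >= HOLDOUT_START]
    return train, val, holdout


def mae_log(rows):
    """Mean absolute error between log(actual) and log(predicted) scores."""
    errors = [abs(math.log(actual) - math.log(pred)) for _, actual, pred in rows]
    return sum(errors) / len(errors)


def precision_at_k(rows, k, threshold=100):
    """Share of the k highest-predicted stories whose actual score >= threshold.

    This definition is an assumption; the post does not spell out its own.
    """
    top = sorted(rows, key=lambda r: r[2], reverse=True)[:k]
    return sum(1 for _, actual, _ in top if actual >= threshold) / len(top)


train, val, holdout = chronological_split(STORIES)

# An MAE of 1.65 in natural-log space is a typical multiplicative error of exp(1.65).
typical_factor = math.exp(1.65)

holdout_mae = mae_log(holdout)
holdout_p_at_2 = precision_at_k(holdout, k=2)
```

A random split would mix 2026 stories into training, so nearest-neighbor features computed for a 2025 story could be built from stories posted after it. Splitting on date first rules that out.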

metadat

I tested it on an oldie but goodie, "The Bullshit Web": https://pxlnv.com/blog/bullshit-web/

But the results were really lackluster compared to what happened IRL (https://news.ycombinator.com/item?id=17655089). Am I missing something? Definitely a neat idea!
