'Comically bad' datasets used to train clinical models for stroke and diabetes

leephillips 56 points 10 comments May 19, 2026
retractionwatch.com

Discussion Highlights (2 comments)

Legend2440

A lot of researchers think their job is to build models. They don't want to collect their own data, so they go find whatever dataset they can on Kaggle, from a previous paper, or wherever. This is backwards. The model is the easy part. Getting good data is 99% of the job, and nearly any clown can make a good model once you hand them a good dataset.

matusp

Dataset quality is a huge issue in ML in general. You can often list a few dozen random samples from any given dataset and you will instantly find something weird going on.
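The spot check described in the comment above can be sketched in a few lines of Python. This is only an illustration: the helper name `sample_rows` and the toy `records` are hypothetical and are not taken from the article or from any real dataset. It pulls a handful of rows at random so a person can read them before any modelling starts.

```python
import random


def sample_rows(rows, k=24, seed=0):
    """Return up to k rows chosen uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the same rows come back on a re-run
    rows = list(rows)
    return rng.sample(rows, min(k, len(rows)))


# Hypothetical toy records, invented for illustration only.
records = [
    {"age": 54, "glucose": 101.0},
    {"age": -3, "glucose": 98.5},   # a negative age should catch a reader's eye
    {"age": 61, "glucose": 0.0},    # a zero may be a stand-in for a missing value
    {"age": 47, "glucose": 132.2},
]

# Print a few rows and read them by eye.
for row in sample_rows(records, k=3):
    print(row)
```

Sampling at random rather than reading the first few rows matters because files are often sorted or grouped, so the head of a file can look cleaner than the rest.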
