'Comically bad' datasets used to train clinical models for stroke and diabetes

leephillips 56 points 10 comments May 19, 2026

retractionwatch.com

Discussion Highlights (2 comments)

Legend2440

A lot of researchers think their job is to build models. They don't want to collect their own data, so they go find whatever dataset they can on Kaggle, in a previous paper, or wherever. This is backwards. The model is the easy part. Getting good data is 99% of the job, and nearly any clown can make a good model once you hand them a good dataset.

matusp

Dataset quality is a huge issue in ML in general. You can often list a few dozen random samples from any given dataset and instantly find something weird going on.
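
The spot check described in the comment above can be sketched in a few lines of Python. This is a minimal illustration, not anything posted in the thread: the helper `sample_rows` and the toy table `TOY_CSV` are hypothetical, and the column names and values are invented purely to show what eyeballing a random sample can surface.

```python
import csv
import io
import random


def sample_rows(csv_text, k=24, seed=0):
    """Return up to k randomly chosen rows (as dicts) from CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rng = random.Random(seed)  # fixed seed so the same rows come back each run
    return rng.sample(rows, min(k, len(rows)))


# Hypothetical toy data, invented for illustration only.
# It contains a negative age, a missing glucose value, and an implausible reading.
TOY_CSV = """age,glucose,stroke
54,101,0
-3,98,0
61,,1
47,9999,0
"""

if __name__ == "__main__":
    # Print a handful of random rows and read them by eye.
    for row in sample_rows(TOY_CSV, k=3):
        print(row)
```

On a real dataset the same idea applies with a larger `k` (a few dozen rows, as the comment suggests); the point is to read the raw rows before any modeling, not to automate the judgment.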
