'Comically bad' datasets used to train clinical models for stroke and diabetes
leephillips
56 points
10 comments
May 19, 2026
Related Discussions
- AI disease-prediction models were trained on dubious data Anon84 · 11 pts · April 15, 2026 · 60% similar
- Scientific datasets are riddled with copy-paste errors jruohonen · 71 pts · April 19, 2026 · 52% similar
- Twice this week, I have come across embarrassingly bad data hermitcrab · 80 pts · March 29, 2026 · 52% similar
- Ontario auditors find doctors' AI note takers routinely blow basic facts sohkamyung · 186 pts · May 14, 2026 · 46% similar
- Marcus AI Claims Dataset davegoldblatt · 60 pts · March 04, 2026 · 45% similar
Discussion Highlights (2 comments)
Legend2440
A lot of researchers think their job is to build models. They don't want to collect their own data, so they grab whatever dataset they can find on Kaggle, in a previous paper, or wherever. This is backwards. The model is the easy part. Getting good data is 99% of the job, and nearly any clown can make a good model once you hand them a good dataset.
matusp
Dataset quality is a huge issue in ML in general. You can often pull a few dozen random samples from any given dataset and spot something weird almost immediately.
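The spot check described above can be sketched in a few lines of Python. This is only an illustration, not anything from the article or the commenters: the helper names `sample_rows` and `quick_checks`, the toy CSV, and its column names are all made up here, and the two checks (exact duplicate rows, empty fields) are just examples of the kind of oddity that tends to show up.

```python
import csv
import io
import random


def sample_rows(rows, k=24, seed=0):
    """Return up to k rows chosen uniformly at random, without replacement."""
    rows = list(rows)
    rng = random.Random(seed)  # fixed seed so the spot check is repeatable
    return rng.sample(rows, min(k, len(rows)))


def quick_checks(rows):
    """Count two common red flags: exact duplicate rows and empty fields."""
    seen = set()
    duplicates = 0
    empty_fields = 0
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates += 1
        seen.add(key)
        empty_fields += sum(
            1 for v in row.values() if v is None or v.strip() == ""
        )
    return {"duplicates": duplicates, "empty_fields": empty_fields}


# Hypothetical toy data standing in for a real downloaded file.
TOY_CSV = """age,glucose,stroke
54,101,0
54,101,0
61,,1
47,88,0
"""

rows = list(csv.DictReader(io.StringIO(TOY_CSV)))

# Eyeball a handful of random rows, then run the automated checks.
for row in sample_rows(rows, k=3):
    print(row)
print(quick_checks(rows))
```

Reading the sampled rows by eye is the point; the automated counts only catch the crudest problems and are no substitute for looking at the data.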