Scientific datasets are riddled with copy-paste errors
jruohonen
71 points
12 comments
April 19, 2026
Related Discussions
Found 5 related stories in 60.3ms across 5,012 title embeddings via pgvector HNSW
- Twice this week, I have come across embarrassingly bad data hermitcrab · 80 pts · March 29, 2026 · 54% similar
- Entities enabling scientific fraud at scale (2025) peyton · 276 pts · March 11, 2026 · 50% similar
- Never Trust the Science - On the need to identify bias & interpret data yourself Luc · 23 pts · March 18, 2026 · 48% similar
- AI disease-prediction models were trained on dubious data Anon84 · 11 pts · April 15, 2026 · 46% similar
- False claims in a widely-cited paper qsi · 217 pts · March 26, 2026 · 44% similar
Discussion Highlights (2 comments)
steve_adams_86
This is legitimately so challenging to avoid, because loads of scientific processes are, to some degree or other, bespoke and difficult to fully streamline or wrap in efficient, well-structured, comprehensive QA. A LOT of labour goes into making it work. Most scientists I know and work with are very diligent people who care a lot about the outputs being as correct as possible, but wow, their workflows aren't great.

My job is to try and address this in whatever ways are practical for the data and the people doing the science. It's kind of like SaaS in that you think it should be easy enough to spot problems, solve them, and carry on/become a billionaire, but... the world is much more complicated than that, and it's easier to fail in this endeavour than it is to break even. The classic "Dropbox is just rsync" or "I could build Airbnb in a weekend" sentiments have their counterparts in science, and the reality is similarly defeating and punishing on both sides.

Making science go faster while maintaining correctness is exceedingly difficult. There are so many moving parts, and so many disparate participants: some are wildly technical and capable, others are brilliant at studying bacteria in starfish yet terrified to run a command in a terminal. Your user base has virtually nothing in common in terms of ability and willingness to do anything other than get their own work done. It's brutal. So I sympathize with the authors of these papers, and I hope readers don't assume they're bad at what they do or acting in bad faith. It's genuinely difficult.

An anecdote: I created a tool for validating biodiversity data against a specification called Darwin Core. Initially our published data was failing to validate so often that I thought I'd built the tool wrong. In fact, the spec is so complex and vast that the people I work with simply couldn't get valid data into the public repositories. And yet they were able to publish, because the public repositories' own validation is... invalid. That's the state of things.

Granted, the data is still correct enough to be useful, and the errors don't cause the results to indicate anything they shouldn't. It's mostly minor metadata issues, failures to maintain referential integrity across different datasets, and so on. But it's a very real, very difficult problem. Science isn't easy at all. So many hoops to jump through, so much rigor, so much data. Mistakes are inevitable.
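For readers unfamiliar with this kind of check, here is a minimal sketch of the referential-integrity validation the comment describes. The field names (`occurrenceID`, `basisOfRecord`, `eventID`) are real Darwin Core terms, but the rules and the `validate` helper are illustrative assumptions, not the commenter's actual tool; the real spec defines far more terms and constraints.

```python
import csv
import io

# Illustrative subset of Darwin Core terms required on each occurrence
# record (assumption for this sketch; the spec is much larger).
REQUIRED_TERMS = {"occurrenceID", "basisOfRecord", "eventID"}

def validate(events_csv: str, occurrences_csv: str) -> list[str]:
    """Return human-readable validation errors for an occurrence dataset."""
    errors = []
    events = list(csv.DictReader(io.StringIO(events_csv)))
    occurrences = list(csv.DictReader(io.StringIO(occurrences_csv)))

    known_event_ids = {row["eventID"] for row in events}

    for i, row in enumerate(occurrences, start=1):
        # Required terms must be present and non-empty.
        missing = sorted(t for t in REQUIRED_TERMS if not row.get(t))
        if missing:
            errors.append(f"occurrence row {i}: missing {missing}")
            continue
        # Referential integrity: every occurrence must point at a
        # sampling event that actually exists in the companion dataset.
        if row["eventID"] not in known_event_ids:
            errors.append(
                f"occurrence row {i}: unknown eventID {row['eventID']!r}"
            )
    return errors

events = "eventID,eventDate\nE1,2024-05-01\n"
occurrences = (
    "occurrenceID,basisOfRecord,eventID\n"
    "O1,HumanObservation,E1\n"
    "O2,HumanObservation,E9\n"  # dangling reference to a missing event
)
print(validate(events, occurrences))
# → ["occurrence row 2: unknown eventID 'E9'"]
```

The dangling `E9` reference is exactly the sort of cross-dataset error the comment mentions: harmless to the headline result, but enough to fail a strict validator.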
l5870uoo9y
> It could be either a fat-finger mistake when editing the Excel file or deliberate tampering to cover up real data that didn't tell the right story.

I can easily imagine that, after spending years or decades devoted to discovering a scientific breakthrough, some people could be tempted to slightly alter the data. I believe there was some scandal about this a few years back with climate data. Fixing this, however, is something AI could do fairly well.