Miasma: A tool to trap AI web scrapers in an endless poison pit
LucidLynx
305 points
221 comments
March 29, 2026
Related Discussions
Found 5 related stories in 55.2ms across 3,471 title embeddings via pgvector HNSW
- Aggressive AI scrapers are making it kinda suck to run wikis cookmeplox · 19 pts · March 13, 2026 · 57% similar
- Document poisoning in RAG systems: How attackers corrupt AI's sources aminerj · 98 pts · March 12, 2026 · 50% similar
- Daemons that clean up the mess agents leave behind neom · 18 pts · March 18, 2026 · 49% similar
- Show HN: I built an SDK that scrambles HTML so scrapers get garbage larsmosr · 16 pts · March 12, 2026 · 48% similar
- A curated list of AI slops xiaoyu2006 · 15 pts · March 16, 2026 · 47% similar
Discussion Highlights (20 comments)
splitbrainhack
-1 for the name
Imustaskforhelp
I wish there were some regulation that could force companies who scrape for profit to reveal who they are to the websites they scrape. Many new AI companies don't seem to respect any decision made by the person who owns the website and shares their knowledge for other humans, only for it to get distilled for a few cents.
GaggiX
These projects are the new "To-Do List" app.
meta-level
Isn't posting projects like this the most visible way to report a bug and get it fixed as soon as possible?
madeofpalk
Is there any evidence or hints that these actually work? It seems pretty reasonable that any scraper would already have mitigations for things like this as a function of just being on the internet.
rvz
> Be sure to protect friendly bots and search engines from Miasma in your robots.txt!

Can't the LLMs just ignore robots.txt or spoof their user agents anyway?
snehesht
Why not simply blacklist or rate-limit those bot IPs?
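The rate-limiting approach snehesht suggests is straightforward to sketch. Below is a minimal per-IP token bucket in plain Python; the class name, rate, and capacity are all illustrative choices, not anything from Miasma itself, and a production setup would more likely use nginx's `limit_req` or a shared store like Redis than in-process state.

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-IP token bucket: each IP may spend up to `capacity` requests,
    refilled continuously at `rate` tokens per second."""
    def __init__(self, rate=1.0, capacity=10):
        self.rate = rate
        self.capacity = capacity
        # ip -> [tokens_remaining, last_refill_timestamp]
        self.buckets = defaultdict(lambda: [capacity, time.monotonic()])

    def allow(self, ip):
        tokens, last = self.buckets[ip]
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[ip] = [tokens - 1, now]
            return True
        self.buckets[ip] = [tokens, now]
        return False

limiter = TokenBucket(rate=1.0, capacity=5)
# A burst of 6 rapid requests from one IP: the first 5 pass, the 6th is throttled.
results = [limiter.allow("203.0.113.7") for _ in range(6)]
print(results)  # → [True, True, True, True, True, False]
```

As the replies note, this only helps against scrapers that come from a stable set of IPs; fleets rotating through residential proxies sail under any per-IP threshold.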
imdsm
Applied model collapse
obsidianbases1
Why do this though? It's like if someone was trying to "trap" search crawlers back in the early 2000s. Seems counterproductive
tasuki
> If you have a public website, they are already stealing your work. I have a public website, and web scrapers are stealing my work. I just stole this article, and you are stealing my comment. Thieves, thieves, and nothing but thieves!
aldousd666
This is ultimately just going to give them training material for how to avoid this crap. They'll have to up their game to get good code. The arms race just took another step, and if you're spending money creating or hosting this kind of content, it's not going to make up for the money you're losing to your other content getting scraped. The bottom has always been threatening to fall out of the ads-for-eyeballs model, and nobody could anticipate the trigger for the downfall. Looks like we found it.
foxes
Wonder if you can just avoid hiding it, to make it more believable. Why not have a Library of Babel-esque labyrinth visible to normal users on your website, like anti-surveillance clothing: something they have to sift through.
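The labyrinth idea can be sketched in a few lines: generate pages deterministically from the URL, so every page exists, every page links to more pages, and a crawler that follows links descends forever. Everything below — the `/maze/` path scheme, the word list, the link count — is a hypothetical illustration, not Miasma's actual implementation.

```python
import hashlib
import random

# Illustrative vocabulary for the garbage prose.
WORDS = ["miasma", "archive", "ledger", "cipher", "garden", "vault", "atlas", "manifold"]

def labyrinth_page(path):
    """Deterministically render a garbage HTML page for any /maze/... path.
    The same URL always yields the same page (so the maze looks like a real,
    stable site), and each page links to five further generated pages."""
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    text = " ".join(rng.choice(WORDS) for _ in range(40))
    links = ["/maze/" + "".join(rng.choice("abcdef0123456789") for _ in range(8))
             for _ in range(5)]
    body = "<p>%s</p>\n" % text
    body += "\n".join('<a href="%s">%s</a>' % (link, link) for link in links)
    return "<html><body>%s</body></html>" % body

page = labyrinth_page("/maze/entry")
assert page == labyrinth_page("/maze/entry")  # stable across requests
```

Seeding the RNG from a hash of the path is what makes the maze traversable without storing anything: no database of fake pages, just a pure function from URL to HTML.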
nosmokewhereiam
My asthma! I'm assuming this is a reference to Lord of the Flies.
jstanley
If you want to ruin someone's web experience based on what kind of thing they are, rather than the content of their character, consider that you might be the baddies.
rob
"/brainstorming git checkout this miasma repo source code and implement a fix to prevent the scraper from not working on sites that use this tool"
theandrewbailey
Or you can block bots with these (until they start using them) https://developer.mozilla.org/en-US/docs/Glossary/Fetch_meta...
ninjagoo
This is essentially machine-generated spam. The irony of machine-generated slop to fight machine-generated slop would be funny, if it weren't for the implications.

How long before people start sharing AI-spam lists, both pro-AI and anti-AI? Just like with email, at some point these shared lists will be adopted by the big corporates, and just like with email they will make life hard for the small players. Once a website appears on one of these lists, legitimately or otherwise, what will the reputational damage be to its standing in search indexes? There have already been examples of Google delisting or dropping websites in search results. Will there be a process to appeal these blacklists? Based on how things work with email, I doubt it will be a meaningful one.

It's essentially an arms race, with the little folks getting crushed by juggernauts on all sides. This project's selective protection of the major players reinforces that effect; from the README:

"Be sure to protect friendly bots and search engines from Miasma in your robots.txt!

    User-agent: Googlebot
    User-agent: Bingbot
    User-agent: DuckDuckBot
    User-agent: Slurp
    User-agent: SomeOtherNiceBot
    Disallow: /bots
    Allow: /"
ninjagoo
Isn't this a trope at this point? That AI companies are indiscriminately training on random websites? Isn't it the case that AI models learn better and are more performant with carefully curated material, so companies do actually filter for quality input? Isn't it also the case that the use of RLHF and other refinement techniques essentially 'cures' the models of bad input? Isn't it also, potentially, the case that the AI scrapers are mostly looking for content based on user queries, rather than for training data? If the answers lean a particular way (yes to most), then isn't the solution rate-limiting incoming web queries rather than (presumed) well-poisoning? Is this a solution in search of a problem?
superkuh
Of course Googlebot, Bingbot, Applebot, Amazonbot, YandexBot, etc. from the major corps are spiders whose downloaded public content will be used by those corporations for AI training too. Might as well just drop the "AI" and say "corporate scrapers".
bobosola
I dunno... it feels like the same approach as those people who tell you gleeful stories of how they kept a phone spammer on a call for 45 minutes: "That'll teach 'em, ha ha!" Do these types of techniques really work? I'm not convinced.

Also, inserting hidden or misleading links is specifically a no-no for Google Search [0], who have this to say:

> We detect policy-violating practices both through automated systems and, as needed, human review that can result in a manual action. Sites that violate our policies may rank lower in results or not appear in results at all.

So you may well end up doing more damage to your own site than to the bots by using dodgy links in this manner.

[0] https://developers.google.com/search/docs/essentials/spam-po...