Miasma: A tool to trap AI web scrapers in an endless poison pit

LucidLynx 305 points 221 comments March 29, 2026
github.com · View on Hacker News

Discussion Highlights (20 comments)

splitbrainhack

-1 for the name

Imustaskforhelp

I wish there were some regulation that could force companies that scrape for profit to reveal who they are to the websites they scrape. Many new AI companies don't seem to respect any decision made by the person who owns the website and shares their knowledge with other humans, only for it to get distilled for a few cents.

GaggiX

These projects are the new "To-Do List" app.

meta-level

Isn't posting projects like this the most visible way to report a bug and get it fixed as soon as possible?

madeofpalk

Is there any evidence or hints that these actually work? It seems pretty reasonable that any scraper would already have mitigations for things like this as a function of just being on the internet.

rvz

> Be sure to protect friendly bots and search engines from Miasma in your robots.txt!

Can't the LLMs just ignore robots.txt or spoof their user agents anyway?

snehesht

Why not simply blacklist or rate limit those bot IPs?
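
For what it's worth, here's a minimal sketch of what per-IP rate limiting could look like, assuming a single-process, in-memory token bucket; the rate and burst numbers are made-up placeholders, not anything from the Miasma repo:

    import time
    from collections import defaultdict

    RATE = 1.0    # tokens refilled per second, per IP (illustrative)
    BURST = 20.0  # bucket capacity, i.e. allowed burst size (illustrative)

    _buckets = defaultdict(lambda: {"tokens": BURST, "ts": time.monotonic()})

    def allow(ip: str) -> bool:
        """Return True if this request from `ip` is within the rate limit."""
        b = _buckets[ip]
        now = time.monotonic()
        b["tokens"] = min(BURST, b["tokens"] + (now - b["ts"]) * RATE)
        b["ts"] = now
        if b["tokens"] >= 1.0:
            b["tokens"] -= 1.0
            return True
        return False  # caller would respond with 429 Too Many Requests

The obvious catch is that large scrapers rotate through big IP pools, so per-IP limits only go so far.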

imdsm

Applied model collapse

obsidianbases1

Why do this though? It's like if someone was trying to "trap" search crawlers back in the early 2000s. Seems counterproductive

tasuki

> If you have a public website, they are already stealing your work.

I have a public website, and web scrapers are stealing my work. I just stole this article, and you are stealing my comment. Thieves, thieves, and nothing but thieves!

aldousd666

This is ultimately just going to give them training material for how to avoid this crap. They'll have to up their game to get good code. The arms race just took another step, and if you're spending money creating or hosting this kind of content, it's not going to make up for the money you're losing by your other content getting scraped. The bottom has always been threatening to fall out of the ads-paid-for-eyeballs model, and nobody could anticipate the trigger for the downfall. Looks like we found it.

foxes

Wonder if you can just avoid hiding it, to make it more believable. Why not have a Library of Babel-esque labyrinth visible to normal users on your website, like anti-surveillance clothing: something they have to sift through.
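
As a rough sketch of that idea, assuming the labyrinth pages are generated deterministically by hashing (my own assumption, not how Miasma actually works): each page derives its onward links from its own token, so the maze needs no storage, never ends, and the same URL always resolves to the same "room":

    import hashlib

    def child_tokens(token: str, fanout: int = 3) -> list[str]:
        # Derive a few stable child page IDs from the current page ID.
        return [hashlib.sha256(f"{token}/{i}".encode()).hexdigest()[:12]
                for i in range(fanout)]

    def render_page(token: str) -> str:
        # Hypothetical /maze/<token> route: a visible page of onward links.
        links = "".join(f'<li><a href="/maze/{t}">{t}</a></li>'
                        for t in child_tokens(token))
        return f"<html><body><h1>Room {token}</h1><ul>{links}</ul></body></html>"

The determinism is what would make it "visible" rather than hidden: nothing is cloaked or random, so humans and bots see the same useless rooms.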

nosmokewhereiam

My asthmar! I'm assuming this is a reference to Lord of the Flies.

jstanley

If you want to ruin someone's web experience based on what kind of thing they are, rather than the content of their character, consider that you might be the baddies.

rob

"/brainstorming git checkout this miasma repo source code and implement a fix to prevent the scraper from not working on sites that use this tool"

theandrewbailey

Or you can block bots with these (until they start using them) https://developer.mozilla.org/en-US/docs/Glossary/Fetch_meta...
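
For context, those are the Sec-Fetch-* request headers that modern browsers attach automatically and most naive scrapers don't. A minimal sketch of checking them server-side as plain WSGI middleware (the exact policy here is an illustrative assumption, and as the parenthetical says, nothing stops a bot from sending the headers itself):

    def fetch_metadata_guard(app):
        # Sec-Fetch-Site / Sec-Fetch-Mode arrive in the WSGI environ as
        # HTTP_SEC_FETCH_SITE / HTTP_SEC_FETCH_MODE.
        def middleware(environ, start_response):
            site = environ.get("HTTP_SEC_FETCH_SITE")
            mode = environ.get("HTTP_SEC_FETCH_MODE")
            if site is None or mode is None:
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Forbidden\n"]
            return app(environ, start_response)
        return middleware

Bear in mind that older browsers and plenty of legitimate crawlers don't send these headers either, so a hard 403 like this would need carve-outs for the bots you actually want, much like the robots.txt advice in the README.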

ninjagoo

This is essentially machine-generated spam. The irony of machine-generated slop to fight machine-generated slop would be funny, if it weren't for the implications.

How long before people start sharing AI-spam lists, both pro-AI and anti-AI? Just like with email, at some point these shared lists will be adopted by the big corporates, and, just like with email, they will make life hard for the small players. Once a website appears on one of these lists, legitimately or otherwise, what will the reputational damage be to its appearance in search indexes? There have already been examples of Google delisting or dropping websites in search results. Will there be a process to appeal these blacklists? Based on how things work with email, I doubt it will be a meaningful one.

It's essentially an arms race, with the little folks getting crushed by juggernauts on all sides. This project's selective protection of the major players reinforces that effect; from the README:

> Be sure to protect friendly bots and search engines from Miasma in your robots.txt!

    User-agent: Googlebot
    User-agent: Bingbot
    User-agent: DuckDuckBot
    User-agent: Slurp
    User-agent: SomeOtherNiceBot
    Disallow: /bots
    Allow: /

ninjagoo

Isn't this a trope at this point? That AI companies are indiscriminately training on random websites? Isn't it the case that AI models learn better and are more performant with carefully curated material, so companies do actually filter for quality input? Isn't it also the case that the use of RLHF and other refinement techniques essentially 'cures' the models of bad input? Isn't it also, potentially, the case that the AI scrapers are mostly looking for content based on user queries, rather than for training data? If the answers to those questions lean a particular way (yes to most), then isn't the solution rate-limiting incoming web queries rather than (presumed) well-poisoning? Is this a solution in search of a problem?

superkuh

Of course Googlebot, Bingbot, Applebot, Amazonbot, YandexBot, etc. from the major corps are HTTP user-agent spiders whose downloaded public content will be used by those corporations for AI training too. Might as well just drop the "AI" and say "corporate scrapers".

bobosola

I dunno... it feels like the same approach as those people who tell you gleeful stories of how they kept a phone spammer on a call for 45 minutes: "That'll teach 'em, ha ha!" Do these types of techniques really work? I'm not convinced.

Also, inserting hidden or misleading links is specifically a no-no for Google Search [0], who have this to say:

> We detect policy-violating practices both through automated systems and, as needed, human review that can result in a manual action. Sites that violate our policies may rank lower in results or not appear in results at all.

So you may well end up doing more damage to your own site than to the bots by using dodgy links in this manner.

[0] https://developers.google.com/search/docs/essentials/spam-po...
