Cloudflare crawl endpoint

jeffpalmer 216 points 96 comments March 10, 2026
developers.cloudflare.com

Discussion Highlights (20 comments)

triwats

It could be cool to use Cloudflare's edge for synthetic monitoring of endpoints' actual content.

jasongill

I'm surprised that Cloudflare hasn't started hosting a pre-scraped version of websites that use Cloudflare's proxy, something like https://www.example.com/cdn-cgi/cached-contents.json. They already have the website content in their cache, so why not just cut out the middle man of scraping services and APIs like this and publish it? Obviously there are good reasons NOT to, but I am surprised they haven't started offering it (as an "on-by-default" option, naturally) yet.

8cvor6j844qw_d6

Does this bypass their own anti-AI crawl measures? I'll need to test it out, especially with the labyrinth.

memothon

I've used browser rendering at work and it's quite nice. Most solutions in the crawling space are kind of scummy and designed for side-stepping robots.txt and not being a good citizen. A crawl endpoint is a very necessary addition!

Imustaskforhelp

This might be really great! After recently buying https://mirror.forum (which I talked about on Discord and the ArchiveTeam IRC servers), I had the idea of preserving/mirroring forums, especially tech-related ones [think TinyCoreLinux]. Archive.org is really, really great, but I would like to see other efforts in this space as well.

I didn't want to scrape/crawl them myself because it would feel like yet another scraping effort for AI and would strain developers' resources. And even when you want to crawl, the issue is that you can't crawl Cloudflare-protected sites, sometimes for good reason.

So, as I understand it, can I use Cloudflare Crawl to crawl a forum's whole website, and does this only work for forums that use Cloudflare? Also, what is the pricing? Is it just a standard Cloudflare Worker, so I would get the free 100k requests and then the 1 million requests for a few cents (IIRC) offer for crawling? Considering that Cloudflare is very scalable, it might even make more sense than buying a group of cheap VPSes.

Another point: I previously thought the best way would be for maintainers of these forums to give me a periodic backup archive of the forum, as my heart believes that to be the cleanest way. But after discussing it on Linux Discord servers and with archivers in that community and in general, I couldn't find anyone maintaining such tech forums who subscribes to the idea of sharing the forum's public data as a quick backup for preservation purposes. So if anyone knows of or maintains any such forums, feel free to reply in this thread about that too.

ljm

Is Cloudflare becoming a mob outfit? Because they are selling scraping countermeasures but are now selling scraping too. And they can pull it off because of their reach over the internet with the free DNS.

pupppet

Cloudflare getting all the cool toys. AWS, anyone awake over there?

jppope

This is actually really amazing. Cloudflare is just skating to where the puck is going to be on this one.

rvz

Selling the cure (DDoS protection) and creating the poison (Authorized AI crawling) against their customers.

babelfish

Didn't they just throw a (very public) fit over Perplexity doing the exact same thing?

everfrustrated

Will this crawler be run behind or in front of their bot blocker logic?

greatgib

All as expected. First they run a huge campaign to out evil scrapers: you should use their service to make sure your website blocks LLMs and bots from scraping it. Look how bad it is. Then, once that is well set up and they have their walled garden, they can present their own API to scrape websites, all nicely packaged to be used by your LLM. But as you know, they are the gatekeeper, so the Mafia boss decides what "intermediary" fee is proper for itself to let you do what you were doing without an intermediary before.

binarymax

Really hard to understand costs here. What is a reasonable pages-per-second rate? Should I assume that, with politeness, I'm basically at 1 page per second == 3600 pages/hour? Seems painfully slow.
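
As a rough illustration of the arithmetic in the comment above, the sketch below converts a per-request delay into pages per hour and total crawl time. It assumes the delay is the only limiting factor (fetch and render time are ignored) and that pages are fetched strictly one at a time; the function names are made up for this example and are not part of any Cloudflare API.

```python
def pages_per_hour(delay_seconds: float) -> float:
    """Upper bound on pages fetched per hour with one request every delay_seconds."""
    if delay_seconds <= 0:
        raise ValueError("delay_seconds must be positive")
    return 3600.0 / delay_seconds


def hours_to_crawl(page_count: int, delay_seconds: float) -> float:
    """Estimated hours to fetch page_count pages sequentially at the given delay."""
    return page_count / pages_per_hour(delay_seconds)


if __name__ == "__main__":
    # One page per second, as in the comment: 3600 pages/hour.
    print(pages_per_hour(1.0))
    # A hypothetical 36,000-page site at that rate.
    print(hours_to_crawl(36_000, 1.0))
```

Under these assumptions a 36,000-page site takes about ten hours at one page per second, which is the kind of number the comment is worried about.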

devnotes77

Worth noting: origin owners can still detect and block CF Browser Rendering requests if needed. Workers-originated requests include a CF-Worker header identifying the workers subdomain, which distinguishes them from regular CDN proxying. You can match on this in a WAF rule or origin middleware.

The trickier issue: rendered requests originate from Cloudflare ASN 13335 with a low bot score, so if you rely on CF bot scores for content protection, requests through their own crawl product will bypass that check. The practical defense is application-layer rate limiting and behavioral analysis rather than network-level scores -- which is better practice regardless.

The structural conflict is real but similar to search engines offering webmaster tools while running the index. The incentives are misaligned, but the individual products have independent utility. The harder question is whether the combination makes it meaningfully harder to build effective bot protection on top of their platform.
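
A minimal sketch of the origin-middleware check described above, assuming (as the comment states) that the crawl requests arrive with a CF-Worker header. The function names and the allowlist policy are hypothetical, chosen only for illustration; a real deployment would wire this into whatever framework the origin uses.

```python
from typing import Collection, Mapping, Optional


def worker_origin(headers: Mapping[str, str]) -> Optional[str]:
    """Return the CF-Worker header value if present (header names are case-insensitive)."""
    for name, value in headers.items():
        if name.lower() == "cf-worker":
            return value.strip() or None
    return None


def should_block(headers: Mapping[str, str], allowed: Collection[str] = ()) -> bool:
    """Hypothetical policy: block Workers-originated requests unless allowlisted."""
    origin = worker_origin(headers)
    if origin is None:
        return False  # not marked as Workers-originated, so leave it to other checks
    return origin not in allowed


if __name__ == "__main__":
    print(should_block({"CF-Worker": "example.workers.dev"}))
    print(should_block({"User-Agent": "Mozilla/5.0"}))
```

Because a header like this is only a hint, it fits alongside the application-layer rate limiting the comment recommends rather than replacing it.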

devnotes77

To clarify the two questions raised:

First, the Cloudflare Crawl endpoint does not require the target site to use Cloudflare. It spins up a headless Chrome instance (via the Browser Rendering API) that fetches and renders any publicly accessible URL. You could crawl a site hosted on Hetzner or a bare VPS with the same call.

Second, on pricing: Browser Rendering is only available on the Workers Paid plan ($5/month). It is not part of the free tier. Usage is billed per invocation beyond the included quota - the exact limits are in the Cloudflare docs under Browser Rendering pricing, but for archival use cases with moderate crawl rates you are very unlikely to run into meaningful costs.

The practical gotcha for forum archival is pagination and authentication-gated content. If the forum requires a login to see older posts, a headless browser session with saved cookies would help, but that is more complex to orchestrate than a single-shot fetch.

patchnull

The main win here is abstracting away browser context lifecycle management. Anyone who has run Puppeteer on Workers knows the pain of handling cold starts, context reuse, and timeout cascading across navigation steps. Having crawl() bundle render-then-extract into one call covers maybe 80% of scraping use cases. The remaining 20% where you need request interception or pre-render script injection still needs the full Browser Rendering API, but for pulling structured data from public pages this is a big simplification over managing session state yourself.

radium3d

Instead of "should have been an email" this is "should have been a prompt" and can be run locally instead. There are a number of ways to do this from a Linux terminal.

```
write a custom crawler that will crawl every page on a site (internal links to the original domain only), scroll down to mimic a human, and save the output as a WebP screenshot, HTML, Markdown, and structured JSON. Make it designed to run locally in a terminal on a linux machine using headless Google Chrome and take advantage of multiple cores to run multiple pages simultaneously while keeping in mind that it might have to throttle if the server gets hit too fast from the same IP.
```

It might use available open source software such as python, playwright, beautifulsoup4, pillow, aiofiles, trafilatura.
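
To make one piece of that prompt concrete, here is a small sketch of the "internal links to the original domain only" step, using only the Python standard library rather than the packages listed above. It covers link extraction and same-host filtering only (no rendering, scrolling, screenshots, or throttling), and the names `internal_links` and `_LinkCollector` are invented for this example.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urldefrag, urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collects raw href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(page_url: str, html: str) -> List[str]:
    """Return absolute, de-duplicated http(s) links on the same host as page_url."""
    host = urlparse(page_url).netloc
    collector = _LinkCollector()
    collector.feed(html)
    result: List[str] = []
    for href in collector.hrefs:
        # Resolve relative links and drop #fragments so one page is queued once.
        absolute = urldefrag(urljoin(page_url, href)).url
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc == host:
            if absolute not in result:
                result.append(absolute)
    return result
```

A crawler built this way would feed each returned URL back into its queue; comparing on the exact host means subdomains count as external, which may or may not be what a given archive wants.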

skybrian

If two customers crawl the same website and it uses crawl-delay, how does it handle that? Are they independent, or does each one run half as fast?

arjie

Oh man, I was hoping I could offer a nicely-crawled version of my site. It would be cool if they offered that for site admins. Then everyone who wanted to crawl would just get a thing they could get for pure transfer cost. I suppose I could build one by submitting a crawl job against myself and then offering a `static.` subdomain on each thing that people could access. Then it's pure HTML instant-load.

Normal_gaussian

"Well-behaved bot - Honors robots.txt directives, including crawl-delay" From the behaviour of our peers, this seems to be the real headline news.

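For readers unfamiliar with what honoring robots.txt and crawl-delay involves, the sketch below shows the check from the crawler's side using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent, and the `allowed_with_delay` helper are illustrative assumptions; this is not a description of how Cloudflare's crawler implements it.

```python
from typing import Iterable, List, Optional, Tuple
from urllib.robotparser import RobotFileParser


def allowed_with_delay(
    robots_txt: str, user_agent: str, urls: Iterable[str]
) -> Tuple[List[str], Optional[float]]:
    """Return the URLs this user agent may fetch and the crawl-delay (seconds), if any."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = [url for url in urls if parser.can_fetch(user_agent, url)]
    return allowed, parser.crawl_delay(user_agent)


if __name__ == "__main__":
    robots = "User-agent: *\nCrawl-delay: 2\nDisallow: /private/\n"
    urls = ["https://example.com/", "https://example.com/private/x"]
    print(allowed_with_delay(robots, "ExampleBot", urls))
```

A polite crawler would then sleep for the returned delay between requests to that host and skip the disallowed URLs entirely.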