Agents that trigger the first level of rate-limiting go through a "tarpit" that holds their connection for a bit before serving it, which seems to keep most of the bad actors in check. It's impossible to block them via robots.txt, and I'm trying to avoid using too big a hammer on my CloudFlare settings.
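For anyone unfamiliar with the pattern: a tarpit just delays the response for clients that have tripped the soft limit instead of rejecting them outright. A minimal sketch of the idea in Rust with tokio follows; the verdict enum and the five-second delay are placeholders, not the actual implementation linked below.

    use std::time::Duration;
    use tokio::time::sleep;

    // Hypothetical verdict from the rate limiter: either the client is fine,
    // or it tripped the soft limit and should be tarpitted.
    enum RateLimitVerdict {
        Allowed,
        SoftLimited,
    }

    // Hold a soft-limited client's request open for a while before serving it;
    // polite crawlers slow down, and abusive ones tie up their own sockets.
    async fn tarpit_if_limited(verdict: RateLimitVerdict) {
        if let RateLimitVerdict::SoftLimited = verdict {
            sleep(Duration::from_secs(5)).await; // assumed delay, tuned per site
        }
        // ...then build and serve the response as normal.
    }

    #[tokio::main]
    async fn main() {
        tarpit_if_limited(RateLimitVerdict::Allowed).await; // served immediately
        tarpit_if_limited(RateLimitVerdict::SoftLimited).await; // held for the delay
        println!("served after tarpit delay");
    }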
EDIT: checking the logs, it seems that the only bot getting tarpitted right now is OpenAI, and they _do_ have a GPTBot signature:
2024-10-31T02:30:23.312139Z WARN progscrape::web: User hit soft rate limit: ratelimit=soft ip="20.171.206.77" browser=Some("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)") method=GET uri=/?search=science.org
https://github.com/progscrape/progscrape/blob/master/web/src...
Here's where we handle the rate limits:
https://github.com/progscrape/progscrape/blob/master/web/src...
I actually misremembered my implementation. It's rolling counting bloom filters, not a token bucket. :)
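For readers who haven't met the structure: a counting Bloom filter keeps approximate per-key counts in fixed memory, and "rolling" means the filters get rotated or reset so only recent traffic counts against a client. A rough sketch of the counting part, using only the Rust standard library; the sizes, hashing scheme, and threshold here are illustrative assumptions, not the linked implementation.

    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    // A counting Bloom filter: each of `hashes` seeded hash functions maps a
    // key to a slot; the estimated count for a key is the minimum of its slots.
    struct CountingBloom {
        slots: Vec<u32>,
        hashes: usize,
    }

    impl CountingBloom {
        fn new(size: usize, hashes: usize) -> Self {
            CountingBloom { slots: vec![0; size], hashes }
        }

        fn index(&self, key: &str, seed: usize) -> usize {
            let mut h = DefaultHasher::new();
            seed.hash(&mut h);
            key.hash(&mut h);
            (h.finish() as usize) % self.slots.len()
        }

        // Increment every slot for `key` and return the new estimate
        // (the minimum slot value, which bounds the over-counting).
        fn increment(&mut self, key: &str) -> u32 {
            let mut estimate = u32::MAX;
            for seed in 0..self.hashes {
                let i = self.index(key, seed);
                self.slots[i] = self.slots[i].saturating_add(1);
                estimate = estimate.min(self.slots[i]);
            }
            estimate
        }

        // The "rolling" part: reset (or swap in a fresh filter) on a timer so
        // the counts only ever reflect the most recent window.
        fn reset(&mut self) {
            self.slots.iter_mut().for_each(|s| *s = 0);
        }
    }

    fn main() {
        let mut limiter = CountingBloom::new(1 << 16, 3); // sizes are illustrative
        let soft_limit = 10; // assumed threshold
        for _ in 0..12 {
            if limiter.increment("20.171.206.77") > soft_limit {
                println!("soft rate limit hit: tarpit this request");
            }
        }
        limiter.reset(); // would run on a schedule in practice
    }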
Not to say it isn't an issue, but that Fortune article they reference is pretty alarmist and thin on detail.
Full disclosure: I worked there [edit: Google] a while ago, not in search, not in AI.
As someone who's built multiple (respectful) Web crawlers, for academic research and for respectable commerce, I'm wondering whether abusers are going to make it harder for legitimate crawlers to operate.
I now block all AI crawlers at the Cloudflare WAF level. On Monday I noticed a HUGE spike in traffic, and my site was not handling it well. After a lot of troubleshooting and log parsing, I found I was getting millions of requests from China that were getting past Cloudflare's bot protection.
I ended up having to force a CF managed challenge for the entire country of China to get my site back in a normal working state.
In the past 24 hours CF has blocked 1.66M bot requests. Good luck running a site without using CloudFlare or something similar.
AI crawlers are just out of control
It is not. We rely on more than User Agents because they are too often faked, so it is not just marketing. There are other signals we see that confirm whether the request came from a "legitimate" AI scraper, or a different scraper with the same user agent.
Great! What are these signals? That seems to be the meat of the post but it's conspicuously absent. How are we supposed to validate the post?
Imagine you were a vendor who was trying to trick the author into divulging his methods. Can a stranger on the Internet be trusted?
“Nearly 1% of our total traffic comes from AI crawlers
Close to 90% of that traffic is from Bytespider, by Bytedance (the parent company of TikTok)”
90% of their crawler traffic (which is 1% of their total traffic) is ByteDance.
Once you factor in salaries, though, an externally managed solution might be cheaper.
I mean, except for known good actors.
I guess known actors would need a verifiable signature.
"Verifiable signature"? That's a dangerous road to go down, and Google actually wanted to do it (Web Integrity API). Nobody supported them and they backed out.
https://developers.google.com/search/docs/crawling-indexing/...
https://www.bing.com/webmasters/help/how-to-verify-bingbot-3...
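Both of those pages describe the same forward-confirmed reverse DNS check: resolve the requester's IP to a hostname, confirm the hostname is under the crawler's published domain, then resolve that hostname back and require it to return the original IP. A rough sketch of that check in Rust, assuming the third-party dns-lookup crate; the domain suffixes are illustrative.

    use std::net::IpAddr;

    // Forward-confirmed reverse DNS, the scheme both pages document: the IP's
    // PTR record must fall under the crawler's published domain *and* resolve
    // back to the same IP.
    fn verify_crawler(ip: IpAddr, allowed_suffixes: &[&str]) -> bool {
        // Reverse lookup: IP -> hostname.
        let hostname = match dns_lookup::lookup_addr(&ip) {
            Ok(name) => name,
            Err(_) => return false,
        };

        // The hostname must end in one of the crawler's published domains.
        if !allowed_suffixes.iter().any(|s| hostname.ends_with(s)) {
            return false;
        }

        // Forward lookup: hostname -> IPs, which must include the original IP.
        match dns_lookup::lookup_host(&hostname) {
            Ok(ips) => ips.contains(&ip),
            Err(_) => false,
        }
    }

    fn main() {
        // Example IP taken from Google's verification doc.
        let ip: IpAddr = "66.249.66.1".parse().unwrap();
        let verified = verify_crawler(ip, &[".googlebot.com", ".google.com"]);
        println!("verified crawler: {verified}");
    }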