On Thu, 3 Jul 2025 at 04:30, Jörg Sonnenberger <jo...@bec.de> wrote:
>
> On 7/3/25 6:23 AM, Constantine A. Murenin wrote:
> > These AIs literally behave the exact same way as humans; they're
> > simply dumber and more persistent.  The way CVSweb is designed, it's
> > easily DoS'able with the default `wget -r` and `wget --recursive` from
> > probably like 20 years ago?
>
> This is complete BS. "wget -r" uses a single connection (at any point in
> time). It uses a consistent source address.
Yes, it's an oversimplification; and you might have to do
`wget -e robots=off -r` these days.  Yes, a single wget would use a
single connection by default, leaving breathing room for the server,
since it wouldn't need to do any of this work concurrently for a
single client.  But what happens when multiple people do it all at
once?  Because that's exactly what happens with the AI agents.

> It actually honors robots.txt by default. None of that applies to the
> current generation of AI scrapers:
>
> (1) They have no effective rate limiting mechanism on the origin side.
> (2) They are intentionally distributing requests to avoid server side
> rate limits.
> (3) The combination of the two makes most caching useless.
> (3) They (intentionally or maliciously) do not honor robots.txt.
> (4) They are intentionally faking the user agent.

The issue here is that robots.txt was effectively thrown out of the
window the minute every website went to block every bot except for
Googlebot.  How exactly do you expect Googlebot could have started
back in the day if robots.txt files everywhere were as restrictive as
they are today, and all unknown bots, including Googlebot, had already
been pre-blocked back then?

I'm not buying the idea that caching or rate limiting is ineffective.
The downtime happens when the server is overwhelmed, connections pile
up, and we end up in a situation where nothing works for anyone, as
existing connections stall and tail latency dominates every open
connection.  Do you have any evidence that the bots don't back off
even at that point?

nginx allows rate limits keyed by the resource, not just by the IP
address; and it also allows delayed processing, which signals to the
client that the server is overloaded (a rough configuration sketch is
at the end of this message).  This would ensure that the system fails
gracefully, instead of going into swapping and runaway mode.  And it
also lets the bots detect that they're causing a load issue and back
off appropriately.  For example, each page can be cached for several
hours and served from cache without facing any limits; the main
non-revision pages (100k total pages) could remain the priority, with
the rest of the revision ones (100000k pages) kept at the lowest
priority with the tightest resource limits.

There have been recent media reports of a 108s (108594ms) — almost
2 minutes — delay due to the Anubis proof-of-work on GNOME GitLab when
"many people access the same link simultaneously—such as when a GitLab
link is shared in a chat room" — how does that make any sense when the
whole thing could have been cached by nginx cheaply?  If someone
shares our CVSweb link on Slashdot or in a chatroom, would everyone
also be required to waste 2 minutes doing proof-of-work to see the
exact same page, generated hundreds of separate times?  How exactly is
that better than having nginx caching and resource limits do their
thing?

C.
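
P.S. To make the nginx part concrete, here is a minimal sketch of the
kind of configuration I have in mind, assuming cvsweb sits behind
nginx as a reverse proxy on 127.0.0.1:8080; the zone names, sizes,
rates and cache times are illustrative guesses, not a drop-in config
(a FastCGI setup would use the fastcgi_cache_* equivalents):

    # In the http{} block: a cache for rendered pages, plus two
    # rate-limit zones: one keyed by the requested resource
    # (including the query string), one keyed by the client address.
    proxy_cache_path /var/cache/nginx/cvsweb levels=1:2
                     keys_zone=cvsweb:64m max_size=2g inactive=12h;
    limit_req_zone $request_uri zone=perpage:32m rate=10r/m;
    limit_req_zone $binary_remote_addr zone=peraddr:32m rate=30r/m;

    server {
        listen 80;
        server_name cvsweb.example.org;        # placeholder name

        location / {
            proxy_pass http://127.0.0.1:8080;  # assumed cvsweb backend

            # Repeat hits on the same page are served from cache for
            # hours, so a link shared in a chatroom costs the backend
            # one regeneration, not hundreds.
            proxy_cache cvsweb;
            proxy_cache_valid 200 4h;
            proxy_cache_lock on;
            proxy_cache_use_stale updating error timeout;

            # Excess requests are first delayed, then rejected with
            # 429; that is the graceful back-pressure signal a bot can
            # notice, instead of stalling every open connection.
            limit_req zone=perpage burst=5 delay=3;
            limit_req zone=peraddr burst=20 delay=10;
            limit_req_status 429;
        }
    }

The tiered priorities described above (cheap non-revision pages vs.
the expensive revision pages) could then be expressed as separate
location blocks pointing at different limit_req zones with different
rates.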