On Thu, Jul 3, 2025 at 3:38 PM Constantine A. Murenin <muren...@gmail.com> wrote:
>
> On Thu, 3 Jul 2025 at 04:30, Jörg Sonnenberger <jo...@bec.de> wrote:
> >
> > On 7/3/25 6:23 AM, Constantine A. Murenin wrote:
> > > These AIs literally behave the exact same way as humans; they're
> > > simply dumber and more persistent.  The way CVSweb is designed, it's
> > > easily DoS'able with the default `wget -r` and `wget --recursive` from
> > > probably like 20 years ago?
> >
> > This is complete BS. "wget -r" uses a single connection (at any point in
> > time). It uses a consistent source address. It actually honors
>
> Yes, it's an oversimplification; and you might have to do `wget -e
> robots=off -r` these days.
>
> Yes, a single wget would use a single connection by default, leaving
> breathing room for the server, since it wouldn't need to do any of
> this work concurrently for a single client.
>
> But what happens when multiple people do it all at once?  Because
> that's what happens with the AI agents.
>
> > robots.txt by default. None of that applies to the current generation of
> > AI scrapers:
> >
> > (1) They have no effective rate limiting mechanism on the origin side.
> > (2) They are intentionally distributing requests to avoid server side
> >     rate limits.
> > (3) The combination of the two makes most caching useless.
> > (4) They (intentionally or maliciously) do not honor robots.txt.
> > (5) They are intentionally faking the user agent.
>
> The issue here is that robots.txt was effectively thrown out of the
> window the minute every website went to block every bot except
> Googlebot.
>
> How exactly do you expect Googlebot could have started back in the day
> if robots.txt files everywhere were as restrictive as they are today,
> and all unknown bots, including Googlebot, had already been pre-blocked
> back then?
>
> I'm not buying the idea that caching or rate limiting is ineffective.
> The downtime happens when the server is overwhelmed, connections pile
> up, and we end up in a situation where nothing works for anyone, as
> existing connections stall and tail latency dominates all open
> connections.
>
> Do you have any evidence that the bots don't back off even at that
> point?
>
> nginx allows rate limits per resource, not just per IP address; and it
> also allows delayed processing, which signals to the client that the
> server is overloaded.  This would ensure that the system fails
> gracefully instead of going into swapping and runaway mode.  It also
> lets a bot detect that it is causing a load issue and back off
> appropriately.
>
> For example, each page could be cached for several hours and served
> from cache without facing any limits; the main non-revision pages
> (100k total pages) could remain the priority, with the rest of the
> revision ones (100000k pages) on the lowest priority with the tightest
> resource limits.
>
> There have been recent media reports of a 108s (108594ms) — almost 2
> minutes — delay due to Anubis proof-of-work on GNOME GitLab when "many
> people access the same link simultaneously—such as when a GitLab link
> is shared in a chat room" — how does that make any sense when the
> whole thing could have been cached by nginx cheaply?
>
> If someone shares our CVSweb link on Slashdot or in a chatroom, would
> everyone also be required to waste 2 minutes doing proof-of-work to
> see the exact same page generated hundreds of separate times?  How
> exactly is that better than having nginx caching and resource limits
> do their thing?
>
> C.
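(For context, my reading of the setup being proposed above is roughly
the nginx sketch below.  The zone names, paths, TTLs and limits are
invented for illustration; this is not a tested configuration.)

    # (http{} context)  Cache rendered pages for a few hours.
    proxy_cache_path /var/cache/nginx/cvsweb levels=1:2
                     keys_zone=cvsweb:50m max_size=2g inactive=6h;

    # Rate-limit by requested resource rather than by client address,
    # so a hot page can only be regenerated so often regardless of how
    # many clients ask for it.
    limit_req_zone $uri zone=per_page:20m rate=30r/m;

    server {
        listen 80;
        server_name cvsweb.example.org;        # placeholder name

        location / {
            proxy_pass http://127.0.0.1:8080;  # wherever the CGI backend lives

            proxy_cache cvsweb;
            proxy_cache_valid 200 6h;          # "cached for several hours"
            proxy_cache_lock on;               # only one request regenerates a
                                               # page; everyone else gets the
                                               # cached copy once it lands
            proxy_cache_use_stale updating timeout error;

            # Excess requests are delayed rather than rejected, which is
            # the "signal to the client that the server is overloaded"
            # part of the argument above.
            limit_req zone=per_page burst=20 delay=5;
            limit_req_status 429;
        }
    }

proxy_cache_lock is what would, in theory, keep the Slashdot/chat-room
case from regenerating the same page hundreds of times.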
Source browsing software is slow and not optimized for high-traffic
websites.  The specialty repo-browsing software you propose (prioritize
this vs. that, be cache-friendly, have good SEO) does not exist to my
knowledge, and certainly not for Mercurial, an extra slow and heavy
piece of software.  Caching the near-infinite combinations (300k+
commits to src) of diffs, blames, file histories, user graphs, etc. is
not practical, but, to your point, we could attempt to cache it on the
CDN.

I don't know the specific behavior of these bots, but regular Google and
Bing were taking down anonhg before ChatGPT existed.  Even GitHub
doesn't attempt to cache this URL:

https://github.com/NetBSD/src/blame/2155f4d97ebe976de16665789fdf2c8d06f7e8e3/etc/etc.amiga/Makefile.inc#L7

It is served with [cache-control: max-age=0, private, must-revalidate],
probably because no one will ever hit it again (except bots following
the link from this email!); if cached, it would simply fall out of the
cache or, more likely, take up a slot in the finite cache resources,
wasting a potential HIT.  You also need to invalidate the cache on every
commit.  Google lies about the number of results because deep paging is
difficult and slow.
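If we did attempt it anyway, the closest approximation I can picture is
something like the sketch below: cache the few pages people actually
revisit with a short TTL (so new commits simply age out rather than
needing explicit invalidation) and bypass the cache entirely for the
deep per-revision URLs.  The URL patterns, zone names and numbers are
guesses, not anything we run.

    # (http{} context; paths and names invented for illustration)
    proxy_cache_path /var/cache/nginx/hgweb levels=1:2
                     keys_zone=hgweb:50m max_size=1g inactive=30m;

    map $uri $is_deep_page {
        ~(annotate|diff|comparison|rev/)   1;   # guessed hgweb URL patterns
        default                            0;
    }

    server {
        listen 80;
        server_name anonhg.example.org;          # placeholder

        location / {
            proxy_pass http://127.0.0.1:8000;    # hgweb backend (placeholder)

            proxy_cache hgweb;
            proxy_cache_valid 200 10m;           # short TTL: pages go stale on
                                                 # their own shortly after a
                                                 # commit instead of being purged
            proxy_cache_lock on;

            # Don't let one-hit blame/diff/changeset pages evict the few
            # pages that get revisited.
            proxy_no_cache     $is_deep_page;
            proxy_cache_bypass $is_deep_page;
        }
    }

Even then, given the combinatorics above, the hit rate on anything past
the front pages would likely be close to zero.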