On Thu, Jul 3, 2025 at 3:38 PM Constantine A. Murenin
<muren...@gmail.com> wrote:
>
> On Thu, 3 Jul 2025 at 04:30, Jörg Sonnenberger <jo...@bec.de> wrote:
> >
> > On 7/3/25 6:23 AM, Constantine A. Murenin wrote:
> > > These AIs literally behave the exact same way as humans; they're
> > > simply dumber and more persistent.  The way CVSweb is designed, it's
> > > easily DoS'able with a plain `wget -r` (`--recursive`) from
> > > probably 20 years ago?
> >
> > This is complete BS. "wget -r" uses a single connection (at any point in
> > time). It uses a consistent source address. It actually honors
>
> Yes, it's an oversimplification; and you might have to do `wget -e
> robots=off -r` these days.
>
> Yes, a single wget would use a single connection by default, leaving
> breathing room for the server, since it wouldn't need to do any of
> this work concurrently for a single client.
>
> But what happens when multiple people do it all at once?  Because
> that's what happens with the AI agents.
>
> > robots.txt by default. None of that applies to the current generation of
> > AI scrapers:
> >
> > (1) They have no effective rate limiting mechanism on the origin side.
> > (2) They are intentionally distributing requests to avoid server side
> > rate limits.
> > (3) The combination of the two makes most caching useless.
> > (4) They (intentionally or maliciously) do not honor robots.txt.
> > (5) They are intentionally faking the user agent.
>
> The issue here is that robots.txt was effectively thrown out the
> window the minute every website started blocking every bot except
> Googlebot.
>
> How exactly could Googlebot have gotten started back in the day if
> robots.txt files everywhere had been as restrictive as they are today,
> with every unknown bot, Googlebot included, pre-blocked from the
> start?
>
> I'm not buying the idea that caching or rate limiting is ineffective.
> The downtime happens when the server is overwhelmed and connections
> pile up, leaving us in a situation where nothing works for anyone:
> existing connections stall and tail latency dominates every open
> connection.
>
> Do you have any evidence that the bots don't back off even then?
>
> nginx allows rate limits per resource, not just per IP address, and it
> can delay excess requests, which signals to the client that the server
> is overloaded.  That would let the system fail gracefully instead of
> going into swap and runaway mode, and it would let a bot detect that
> it's causing a load problem and back off accordingly.
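>
> A minimal sketch of that delayed-processing idea, assuming an nginx
> front end proxying to a hypothetical CVSweb backend on 127.0.0.1:8080
> (the zone name, rate, burst, and location path are all made up for
> illustration):
>
>     # One shared request budget per client IP.
>     limit_req_zone $binary_remote_addr zone=perip:16m rate=2r/s;
>
>     server {
>         location /cgi-bin/cvsweb.cgi {
>             # Excess requests are delayed instead of dropped, so a
>             # well-behaved crawler sees rising latency and can slow
>             # down; anything beyond the burst gets a 429.
>             limit_req zone=perip burst=20 delay=5;
>             limit_req_status 429;
>
>             proxy_pass http://127.0.0.1:8080;
>         }
>     }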
>
> For example, each page could be cached for several hours and served
> from cache without facing any limits; the main non-revision pages
> (100k pages in total) would remain the priority, while the
> per-revision pages (100000k of them) would sit at the lowest priority,
> behind the tightest resource limits.
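>
> A sketch of that split, again hedged: the query parameters taken to
> mark per-revision pages (rev/r1/r2 here), plus every name, size, and
> rate, are assumptions, and in this simple form cache hits on revision
> pages still count against the shared budget:
>
>     proxy_cache_path /var/cache/nginx/cvsweb keys_zone=cvsweb:64m
>                      max_size=10g inactive=6h;
>
>     # Requests whose query string names a specific revision share one
>     # small budget; everything else maps to "" and is not limited.
>     map $args $rev_limit_key {
>         default               "";
>         "~(^|&)(rev|r1|r2)="  "rev";
>     }
>     limit_req_zone $rev_limit_key zone=revpages:1m rate=5r/s;
>
>     server {
>         location /cgi-bin/cvsweb.cgi {
>             proxy_cache       cvsweb;
>             proxy_cache_valid 200 6h;
>             limit_req         zone=revpages burst=50 delay=10;
>             limit_req_status  429;
>             proxy_pass        http://127.0.0.1:8080;
>         }
>     }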
>
> There have been recent media reports of a 108s (108594ms) delay,
> almost 2 minutes, due to the Anubis proof-of-work on GNOME's GitLab
> when "many people access the same link simultaneously—such as when a
> GitLab link is shared in a chat room".  How does that make any sense
> when the whole thing could have been cached by nginx cheaply?
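>
> For the shared-link case specifically, nginx can collapse the stampede
> to a single backend render.  A hedged sketch (cache path, TTLs, and
> the backend address are assumptions):
>
>     proxy_cache_path /var/cache/nginx/pages keys_zone=pages:32m
>                      inactive=1h;
>
>     server {
>         location / {
>             proxy_cache              pages;
>             proxy_cache_valid        200 10m;
>             # Only the first request for a given URL goes upstream;
>             # concurrent requests wait for it (or get a stale copy).
>             proxy_cache_lock         on;
>             proxy_cache_lock_timeout 10s;
>             proxy_cache_use_stale    updating error timeout;
>             proxy_pass               http://127.0.0.1:8080;
>         }
>     }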
>
> If someone shares our CVSweb link on Slashdot or in a chatroom, would
> everyone also have to waste 2 minutes on proof-of-work, with the exact
> same page being generated hundreds of separate times?  How exactly is
> that better than letting nginx caching and resource limits do their
> thing?
>
> C.

Source-browsing software is slow and not optimized for high-traffic
websites. The specialty repo-browsing software you propose (prioritize
this vs. that, be cache-friendly, have good SEO) does not exist to my
knowledge, and certainly not for Mercurial, an extra-slow and heavy
piece of software.

Caching the near-infinite permutations (300k+ commits to src) of diffs,
blames, file histories, user graphs, etc. is not practical, but, to
your point, we could attempt to cache them on the CDN. I don't know the
specific behavior of these bots, but regular Google and Bing were
taking down anonhg before ChatGPT existed.

Even GitHub doesn't attempt to cache this URL:
https://github.com/NetBSD/src/blame/2155f4d97ebe976de16665789fdf2c8d06f7e8e3/etc/etc.amiga/Makefile.inc#L7
probably because no one will ever hit it again (except bots following
the link from this email!). It is served with [cache-control:
max-age=0, private, must-revalidate], so it would either fall out of
cache or, more likely, take up a slot in the finite cache resources,
wasting a potential HIT. You would also need to invalidate the cache on
every commit.
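
To make those mechanics concrete, here is a hedged sketch of a bounded
nginx cache in front of a repo browser (paths, sizes, and the backend
address are made up): entries nobody re-requests fall out via
inactive=, responses marked "private"/"max-age=0" upstream are not
cached at all, and proxy_cache_min_uses avoids spending a slot on a
one-off URL in the first place.

    # max_size caps the cache (LRU eviction once full); inactive drops
    # entries that nobody re-requests within 30 minutes.
    proxy_cache_path /var/cache/nginx/repo keys_zone=repo:32m
                     max_size=2g inactive=30m;

    server {
        location / {
            proxy_cache          repo;
            # Don't spend a cache slot on a one-off URL: only store it
            # once the same URL has been requested twice.
            proxy_cache_min_uses 2;
            proxy_pass           http://127.0.0.1:8080;
        }
    }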

Google lies about the number of results because deep paging is
difficult and slow.
