On Thu, 3 Jul 2025 at 04:30, Jörg Sonnenberger <jo...@bec.de> wrote:
>
> On 7/3/25 6:23 AM, Constantine A. Murenin wrote:
> > These AIs literally behave the exact same way as humans; they're
> > simply dumber and more persistent.  The way CVSweb is designed, it's
> > easily DoS'able with the default `wget -r` and `wget --recursive` from
> > probably like 20 years ago?
>
> This is complete BS. "wget -r" uses a single connection (at any point in
> time). It uses a consistent source address.
Yes, it's an oversimplification; and you might have to do
`wget -e robots=off -r` these days.  Yes, a single wget would use a
single connection by default, leaving breathing room for the server,
since it wouldn't need to do any of this work concurrently for a
single client.  But what happens when multiple people do it all at
once?  Because that's exactly what happens with the AI agents.

> It actually honors robots.txt by default. None of that applies to the
> current generation of AI scrapers:
>
> (1) They have no effective rate limiting mechanism on the origin side.
> (2) They are intentionally distributing requests to avoid server side
> rate limits.
> (3) The combination of the two makes most caching useless.
> (3) They (intentionally or maliciously) do not honor robots.txt.
> (4) They are intentionally faking the user agent.

The issue here is that robots.txt was effectively thrown out of the
window the minute every website went to block every bot except for
Googlebot.  How exactly do you expect Googlebot could have started
back in the day if robots.txt files everywhere were as restrictive as
they are today, and all unknown bots, including Googlebot, had already
been pre-blocked back then?

I'm not buying the idea that caching or rate limiting is ineffective.
The downtime happens when the server is overwhelmed, connections pile
up, and we end up in a situation where nothing works for anyone, as
existing connections stall and tail latency dominates every open
connection.  Do you have any evidence that the bots don't back off
even at that point?

nginx allows rate limits keyed by the resource, not just by the IP
address; and it also allows delayed processing, which signals to the
client that the server is overloaded (a rough configuration sketch is
at the end of this message).  This would ensure that the system fails
gracefully, instead of going into swapping and runaway mode.  And it
also lets the bots detect that they're causing a load issue and back
off appropriately.  For example, each page can be cached for several
hours and served from cache without facing any limits; the main
non-revision pages (100k total pages) could remain the priority, with
the rest of the revision ones (100000k pages) kept at the lowest
priority with the tightest resource limits.

There have been recent media reports of a 108s (108594ms) — almost
2 minutes — delay due to the Anubis proof-of-work on GNOME GitLab when
"many people access the same link simultaneously—such as when a GitLab
link is shared in a chat room" — how does that make any sense when the
whole thing could have been cached by nginx cheaply?  If someone
shares our CVSweb link on Slashdot or in a chatroom, would everyone
also be required to waste 2 minutes doing proof-of-work to see the
exact same page, generated hundreds of separate times?  How exactly is
that better than having nginx caching and resource limits do their
thing?

C.
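
P.S. To make the nginx part concrete, here is a minimal sketch of the
kind of configuration I have in mind, assuming cvsweb sits behind
nginx as a reverse proxy on 127.0.0.1:8080; the zone names, sizes,
rates and cache times are illustrative guesses, not a drop-in config
(a FastCGI setup would use the fastcgi_cache_* equivalents):

    # In the http{} block: a cache for rendered pages, plus two
    # rate-limit zones: one keyed by the requested resource
    # (including the query string), one keyed by the client address.
    proxy_cache_path /var/cache/nginx/cvsweb levels=1:2
                     keys_zone=cvsweb:64m max_size=2g inactive=12h;
    limit_req_zone $request_uri zone=perpage:32m rate=10r/m;
    limit_req_zone $binary_remote_addr zone=peraddr:32m rate=30r/m;

    server {
        listen 80;
        server_name cvsweb.example.org;        # placeholder name

        location / {
            proxy_pass http://127.0.0.1:8080;  # assumed cvsweb backend

            # Repeat hits on the same page are served from cache for
            # hours, so a link shared in a chatroom costs the backend
            # one regeneration, not hundreds.
            proxy_cache cvsweb;
            proxy_cache_valid 200 4h;
            proxy_cache_lock on;
            proxy_cache_use_stale updating error timeout;

            # Excess requests are first delayed, then rejected with
            # 429; that is the graceful back-pressure signal a bot can
            # notice, instead of stalling every open connection.
            limit_req zone=perpage burst=5 delay=3;
            limit_req zone=peraddr burst=20 delay=10;
            limit_req_status 429;
        }
    }

The tiered priorities described above (cheap non-revision pages vs.
the expensive revision pages) could then be expressed as separate
location blocks pointing at different limit_req zones with different
rates.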