> [...] fairly typical load but I've documented surges upwards of 2500+
> new TCP connections per second; I typically end up banning an entire
> /16 or two to recover my VM when it happens.

One of the front-runners in my mind for why I'm not being DDoSed
similarly is that my main house router has a reject list that blocks
misbehaving IPs automatically for a week (currently 16077 IPs, typical
these days).  My border router also has a different list, manually
maintained, which blocks netblocks in three broad categories:

(1) Blocks which appear to think there is such a thing as (in the words
    of one netblock's remarks) "scanning for LEGIT purposes".  Perhaps
    the most notable on this list is UCBerkeley(!).

(2) Blocks which appear to be "please volunteer _your_ resources to
    improve _our_ commercial offerings" outfits.  An example is
    deepfield.net.

(3) Other bad actors.  An example is Digital Ocean, which apparently
    can't be bothered to staff their abuse desk commensurate with the
    level of abuse they emit (ie, they are trying to get the rest of
    the net to take on some of the costs of their abuse desk: their
    abuse autoresponse indicates that abuse reports aren't read unless
    formatted to specs they can't be bothered to even point to an
    explanation of; they handwave "tools such as ...").

Identifiable LLM crawlers would fall into (2) and, in the problematic
cases at hand (misrepresent themselves, no rate limiting, scattershot
from addresses, etc), (3).  This list covers 159611 IPs; its minimal
CIDR representation is 52 blocks.  (That's for IPv4.  The IPv6 list is
20 CIDR blocks covering 795088750969575521173945450512 IPs, not a
useful number; it consists of a /29 and two /32s, with everything else
down in the noise: nine /40s, a /44, two /48s, four /64s, and a /124.)
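
For anyone wanting to compute the same two figures (minimal CIDR
representation and number of addresses covered) for their own list,
here is a minimal sketch using Python's standard ipaddress module.
The function name and the sample blocks are hypothetical; the sample
uses documentation ranges, not entries from the real list.

```python
import ipaddress

def summarize(blocks):
    """Collapse CIDR strings (all one IP version) into a minimal list
    of networks and count the addresses that list covers."""
    nets = [ipaddress.ip_network(b) for b in blocks]
    # collapse_addresses merges adjacent blocks and drops blocks that
    # are already contained in a larger one.
    collapsed = list(ipaddress.collapse_addresses(nets))
    total = sum(n.num_addresses for n in collapsed)
    return collapsed, total

# Hypothetical sample data (documentation ranges only).
sample = [
    "192.0.2.0/25",
    "192.0.2.128/25",     # adjacent to the previous block
    "198.51.100.7/32",    # contained in the next block
    "198.51.100.0/24",
]
minimal, covered = summarize(sample)
print([str(n) for n in minimal], covered)
```

The same arithmetic explains why the IPv6 total is not a useful
number: a single /29 is 2**99 addresses, which swamps everything
smaller.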

Actually, most offenders of type (1) usually just go into the automated
list, because I don't use the top and bottom addresses of my netblock
for anything but scanner sentinels; anyone trying to access them goes
into the automated list.  Most address-range scanners hit this.  Only
the ones that are visible enough to get human handling ever go into the
manually-maintained list.
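
The sentinel idea can be sketched roughly as follows.  This is an
illustration only, not the actual router setup: the names
(sentinel_addresses, BanList, on_connection) are hypothetical, the
ban table lives in memory, and "top and bottom addresses" is assumed
to mean the first and last address of the block.

```python
import ipaddress
import time

BAN_SECONDS = 7 * 24 * 3600  # one week, as described in the text

def sentinel_addresses(netblock):
    """Assumption: the sentinels are the first and last address."""
    net = ipaddress.ip_network(netblock)
    return {net.network_address, net.broadcast_address}

class BanList:
    """In-memory ban table mapping source address to expiry time."""

    def __init__(self):
        self._expiry = {}

    def ban(self, ip, now=None):
        now = time.time() if now is None else now
        self._expiry[ipaddress.ip_address(ip)] = now + BAN_SECONDS

    def is_banned(self, ip, now=None):
        now = time.time() if now is None else now
        addr = ipaddress.ip_address(ip)
        expires = self._expiry.get(addr)
        if expires is None:
            return False
        if expires <= now:
            del self._expiry[addr]  # ban has lapsed; forget it
            return False
        return True

def on_connection(src, dst, sentinels, bans, now=None):
    """Return True if this connection should be rejected."""
    if ipaddress.ip_address(dst) in sentinels:
        bans.ban(src, now)  # touching a sentinel earns a ban
    return bans.is_banned(src, now)
```

A scanner walking the whole range hits a sentinel at one end or the
other, so its later connections to real addresses are rejected until
the ban expires.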

Another possible reason is that I don't speak HTTPS; I consider it
plausible the LLM scrapers have drunk the "HTTPS is the One True Way"
koolaid and aren't even trying HTTP.  Some of the port-80 connections
that proceed to send me binary garbage may be attempts to initiate
HTTPS (even though it's the HTTP port); whatever they are, they get
dropped into the automated ban list along with anything else sending
something I don't recognize in the position of an HTTP verb.
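
A rough sketch of that kind of check, assuming the server looks only
at the first bytes a client sends.  The method set and the function
names are hypothetical, not taken from the real server.  The TLS test
relies on a TLS record starting with content type 0x16 (handshake)
followed by major version 0x03, which is what a ClientHello sent to
port 80 would look like.

```python
# Assumption: the methods this hypothetical server recognizes.
HTTP_METHODS = {
    b"GET", b"HEAD", b"POST", b"PUT", b"DELETE",
    b"CONNECT", b"OPTIONS", b"TRACE", b"PATCH",
}

def classify_first_bytes(data):
    """Return 'http', 'tls-like', or 'unknown' for a client's opening
    bytes."""
    # TLS record header: 0x16 = handshake, 0x03 = major version.
    if len(data) >= 2 and data[0] == 0x16 and data[1] == 0x03:
        return "tls-like"
    # An HTTP request line starts with a method token and a space.
    verb, sep, _rest = data.partition(b" ")
    if sep and verb in HTTP_METHODS:
        return "http"
    return "unknown"

def should_ban(data):
    """Anything not starting with a recognized HTTP verb gets banned,
    whether it looks like TLS or is just garbage."""
    return classify_first_bytes(data) != "http"
```

Separating "tls-like" from "unknown" changes nothing about the ban
decision; it only makes it possible to count how much of the binary
garbage is probably attempted HTTPS.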

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mo...@rodents-montreal.org
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
