Are you sure it is really Googlebot and not fake bots? I get 400k requests per
day, mostly junk from Microsoft Azure. Try to filter that out, so you have more
resources left for real traffic.
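
To check whether a client claiming to be Googlebot is genuine, one common approach is a forward-confirmed reverse DNS lookup: resolve the IP to a hostname, check the domain, then resolve the hostname back and confirm it includes the same IP. A minimal sketch, assuming the accepted suffixes below match Google's published guidance (the function name and the injectable `reverse`/`forward` parameters are my own, added so it can be tried without network access):

```python
import socket

# Assumption: hostnames of genuine Googlebot crawlers end in one of these.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_real_googlebot(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for an IPv4 address.

    `reverse` and `forward` default to the stdlib resolvers and can be
    swapped for stubs when testing offline.
    """
    try:
        host = reverse(ip)[0]          # PTR lookup: IP -> hostname
    except OSError:                    # covers socket.herror / socket.gaierror
        return False

    if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False

    try:
        addrs = forward(host)[2]       # A lookup: hostname -> list of IPv4s
    except OSError:
        return False

    # A spoofed PTR record fails here, because the forward lookup
    # does not lead back to the connecting IP.
    return ip in addrs
```

Running this over the top IPs in the access log should show quickly how much of the "Googlebot" traffic is real. DNS lookups are slow, so cache the results rather than checking on every request.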
I have a honeypot page combined with robots.txt and a cron job that runs every
5 minutes: any IP with more than 100 requests goes onto an ipset blacklist.
Everything blacklisted is redirected to a lightweight, HTML-only page.
I think this only works for IPv4, since those addresses are not abundant.
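
Roughly, the cron job does something like the following. This is a sketch and not my exact script: it assumes the client IP is the first field of each access log line (as in the common/combined log formats), and the set name `blacklist` is just a placeholder.

```python
from collections import Counter

THRESHOLD = 100          # more than this many requests gets blacklisted
SET_NAME = "blacklist"   # placeholder ipset set name


def offenders(log_lines, threshold=THRESHOLD):
    """Return the sorted client IPs that appear more than `threshold` times."""
    counts = Counter()
    for line in log_lines:
        fields = line.split(None, 1)   # first whitespace-separated field = IP
        if fields:
            counts[fields[0]] += 1
    return sorted(ip for ip, n in counts.items() if n > threshold)


def ipset_commands(ips, set_name=SET_NAME):
    """Build one `ipset add` command per IPv4 address.

    IPv6 addresses (anything containing ':') are skipped, in line with
    the IPv4-only remark above. `-exist` keeps ipset from failing when
    the address is already in the set.
    """
    return [f"ipset -exist add {set_name} {ip}" for ip in ips if ":" not in ip]
```

Feed it the log lines from the last 5 minutes (or only the hits on the honeypot URL) and run the resulting commands as root. The redirect to the lightweight page is then a separate firewall or Apache rule matching that set.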
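PS. Maybe set a crawl delay in robots.txt? As far as I know Googlebot itself
ignores the Crawl-delay directive, so this would mainly help with other crawlers.
PPS. Upgrading also helps with performance.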
>
>
> Hello,
>
> I’m looking for advice on handling crawler-driven overload in an
> Apache prefork environment.
>
> Environment:
> - Apache httpd with prefork MPM
> - CentOS 7.4
> - ~2 CPU / 4 GB RAM
> - prefork must remain in use
>
> Architecture summary:
> - Multiple main domains
> - Tens of thousands of very small sites, each with its own hostname
> - All hostnames are routed through a central VirtualHost using
> vhost-level rewrite rules (no .htaccess)
> - Each hostname maps dynamically to a directory such as:
> /app/sites/{unique-sub-domain-slug}/
>
> Under normal conditions the system behaves well.
>
> Issue:
> When Googlebot crawls these small sites, Apache load spikes severely
> (load averages > 200). httpd processes grow rapidly and many sites
> become unreachable until crawler activity subsides. Main domains
> remain responsive during these events.
>
> Steps already taken:
> - All rewrite logic moved from .htaccess to VirtualHost
> - AllowOverride disabled
> - Conservative timeouts and connection limits applied
> - Resources increased compared to previous smaller deployment
>
> This same design handled ~150 sites reasonably well in the past.
> With a much larger number of sites, overload now happens daily.
>
> My questions:
> - Is this a known failure mode of prefork under heavy crawler
>   activity?
> - Are there Apache-level techniques to limit crawler impact without
>   blocking Googlebot?
> - In similar setups, what usually becomes the bottleneck first:
>   rewrite processing, filesystem checks, or process spawning?
>
> Any insight or real-world experience would be greatly appreciated.