On Wed, Feb 25, 2026 at 5:47 AM Phong Thai <[email protected]> wrote:
> Yes, I’m aware of fake Googlebot traffic, and that is a valid concern.
>
> I’m verifying Googlebot using reverse DNS (crawl-*.googlebot.com),
> and I do see both real Googlebot and a significant amount of
> cloud-provider traffic (Azure/AWS) spoofing the UA.
>
> I already filter a large portion of obvious bot traffic at the
> network/firewall level, but the overload still occurs specifically
> when legitimate crawlers hit many hostnames in parallel.
>
> The difficulty is that with prefork, processes are spawned very early
> during vhost and rewrite evaluation, so even valid crawlers can
> exhaust memory before any application-level throttling applies.
>
> I’m trying to understand whether there are Apache-level techniques
> to reduce rewrite/vhost routing cost per request,
> without blocking or misleading real Googlebot.
>
> On Wed, Feb 25, 2026, 4:15 PM Marc <[email protected]> wrote:
>
>> Are you sure it is Googlebot and not fake bots? I have 400k requests
>> per day, mostly from Microsoft Azure junk. Try to filter out the crap,
>> so you have more resources left for real traffic.
>>
>> I have a honeypot page disallowed in robots.txt; a 5-minute cron job
>> sends every IP with more than 100 requests to an ipset blacklist.
>> Everything blacklisted is redirected to a lightweight HTML-only page.
>>
>> I think this only works on IPv4, as those addresses are not abundant.
>>
>> PS. Maybe set a crawl delay in robots.txt?
>> PPS. Upgrading also helps with performance.
>>
>> > Hello,
>> >
>> > I’m looking for advice on handling crawler-driven overload in an
>> > Apache prefork environment.
>> >
>> > Environment:
>> > - Apache httpd with prefork MPM
>> > - CentOS 7.4
>> > - ~2 CPU / 4 GB RAM
>> > - prefork must remain in use
>> >
>> > Architecture summary:
>> > - Multiple main domains
>> > - Tens of thousands of very small sites, each with its own hostname
>> > - All hostnames are routed through a central VirtualHost using
>> >   vhost-level rewrite rules (no .htaccess)
>> > - Each hostname maps dynamically to a directory such as:
>> >   /app/sites/{unique-sub-domain-slug}/
>> >
>> > Under normal conditions the system behaves well.
>> >
>> > Issue:
>> > When Googlebot crawls these small sites, Apache load spikes severely
>> > (load averages > 200). httpd processes grow rapidly and many sites
>> > become unreachable until crawler activity subsides. Main domains
>> > remain responsive during these events.
>> >
>> > Steps already taken:
>> > - All rewrite logic moved from .htaccess to the VirtualHost
>> > - AllowOverride disabled
>> > - Conservative timeouts and connection limits applied
>> > - Resources increased compared to the previous smaller deployment
>> >
>> > This same design handled ~150 sites reasonably well in the past.
>> > With a much larger number of sites, overload now happens daily.
>> >
>> > My questions:
>> > - Is this a known failure mode of prefork under heavy crawler
>> >   activity?
>> > - Are there Apache-level techniques to limit crawler impact without
>> >   blocking Googlebot?
>> > - In similar setups, what usually becomes the bottleneck first:
>> >   rewrite processing, filesystem checks, or process spawning?
>> >
>> > Any insight or real-world experience would be greatly appreciated.

The solution is really to use the event MPM here; why are you bound to
the prefork approach? With prefork, the only way to scale is to
pre-spawn a large number of workers, sized so httpd uses at most ~80% of
your available memory, and make sure the processes are not killed off.
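To make that sizing concrete, here is a sketch of a prefork config for a box like yours. The numbers are illustrative, not from your setup: I'm assuming roughly 3 GB of the 4 GB is left for httpd and ~30 MB RSS per child (measure your own with ps or top), which gives 3000 / 30 ≈ 100 workers, all pre-spawned and never recycled:

```apache
# Illustrative prefork sizing: workers = (RAM budget) / (per-child RSS).
# All values below are placeholders; measure your own child RSS first.
<IfModule mpm_prefork_module>
    StartServers            100
    MinSpareServers         100
    MaxSpareServers         100
    ServerLimit             100
    MaxRequestWorkers       100
    MaxConnectionsPerChild  0    # 0 = never recycle (kill) child processes
</IfModule>
```

If MaxRequestWorkers is higher than what memory allows, the box swaps and you get exactly the load-200 spiral you described; capping it means crawlers queue on the listen backlog instead of exhausting RAM.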
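On the "reduce rewrite/vhost routing cost" question: if your rewrite rules only map the hostname to a directory, mod_vhost_alias can do that mapping by simple interpolation of the Host header, with no regex evaluation per request. A sketch, assuming the directory slug is the first label of the hostname (the ServerAlias and paths are placeholders; adjust %1/%0 to your actual layout):

```apache
# Mass virtual hosting without mod_rewrite: the DocumentRoot is derived
# from the Host header by interpolation instead of per-request RewriteRules.
<VirtualHost *:80>
    ServerAlias *.example.com        # placeholder for your main domains
    UseCanonicalName Off             # required so %N reads the client's Host header
    # %1 = first dot-separated label of the hostname
    VirtualDocumentRoot /app/sites/%1/
</VirtualHost>
```

This shifts the per-request cost from rewrite processing to a single string interpolation plus the usual filesystem checks.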
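On the reverse-DNS verification Phong mentions: the check is only safe as forward-confirmed reverse DNS (PTR, suffix check, then forward lookup back to the same IP), since the PTR alone is spoofable. A minimal sketch, with the resolver calls injectable so it can be tested offline; the function name and structure are mine, not from the thread:

```python
import socket

# Accepted PTR suffixes for Googlebot per Google's verification docs.
GOOGLEBOT_SUFFIXES = (".googlebot.com", ".google.com")

def is_verified_googlebot(ip,
                          reverse_lookup=lambda ip: socket.gethostbyaddr(ip)[0],
                          forward_lookup=lambda host: socket.gethostbyname(host)):
    """Forward-confirmed reverse DNS: True only if the PTR record points
    into a Google crawler domain AND that hostname resolves back to the
    same IP. Either lookup failing means the client is not verified."""
    try:
        host = reverse_lookup(ip)
    except OSError:
        return False
    if not host.rstrip(".").endswith(GOOGLEBOT_SUFFIXES):
        return False
    try:
        return forward_lookup(host) == ip
    except OSError:
        return False
```

Doing this inline per request would be far too slow under prefork; in practice you would cache verified IPs (or whole /27s) and feed the result to your firewall layer.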
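Marc's honeypot/cron idea can be sketched roughly as follows: count hits per client IP on a trap URL in the access log and report everything over the threshold. The trap URL, threshold, and ipset set name are placeholders, not details from the thread:

```python
from collections import Counter

def honeypot_offenders(log_lines, trap_url="/honeypot.html", threshold=100):
    """Return client IPs that requested the honeypot URL (a page
    disallowed in robots.txt that humans never fetch) more than
    `threshold` times. Assumes Apache combined log format: client IP is
    the first whitespace-separated field, and the request line is the
    first double-quoted section ("METHOD /path PROTO")."""
    hits = Counter()
    for line in log_lines:
        parts = line.split('"')
        if len(parts) < 2:
            continue
        request = parts[1].split()   # e.g. ['GET', '/honeypot.html', 'HTTP/1.1']
        if len(request) >= 2 and request[1] == trap_url:
            hits[line.split()[0]] += 1
    return sorted(ip for ip, n in hits.items() if n > threshold)

# A 5-minute cron wrapper would then feed the result to ipset, e.g.
# (placeholder set name, needs root):
#   for ip in honeypot_offenders(open("/var/log/httpd/access_log")):
#       subprocess.run(["ipset", "add", "blacklist", ip, "-exist"])
```

As Marc notes, the blacklist approach is mainly viable for IPv4; for IPv6 you would need to block whole prefixes rather than single addresses.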
