Hello,
I’m looking for advice on handling crawler-driven overload in an Apache
prefork environment.
Environment:
- Apache httpd with prefork MPM
- CentOS 7.4
- ~2 CPU / 4 GB RAM
- prefork must remain in use
Architecture summary:
- Multiple main domains
- Tens of thousands of very small sites, each with its own hostname
- All hostnames are routed through a central VirtualHost using
vhost-level rewrite rules (no .htaccess)
- Each hostname maps dynamically to a directory such as:
/app/sites/{unique-sub-domain-slug}/
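For context, the central VirtualHost is roughly the following shape. This is a sketch only: the hostnames, the rewrite pattern, and the domain are illustrative, not my actual config.

```apache
# Illustrative sketch of the central VirtualHost; example.com and the
# slug pattern are placeholders for the real values.
<VirtualHost *:80>
    ServerName fallback.example.com
    ServerAlias *

    RewriteEngine On
    # Capture the per-site slug from the Host header, e.g. "foo" in
    # foo.example.com, and map the request into that site's directory.
    RewriteCond %{HTTP_HOST} ^([a-z0-9-]+)\.example\.com$ [NC]
    RewriteRule ^/(.*)$ /app/sites/%1/$1 [L]
</VirtualHost>

# The target tree must be reachable:
<Directory /app/sites>
    Require all granted
</Directory>
```

(If the mapping really is just hostname-to-directory, mod_vhost_alias's VirtualDocumentRoot directive can express the same thing without evaluating mod_rewrite rules on every request, which may matter for the bottleneck question below.)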
Under normal conditions the system behaves well.
Issue:
When Googlebot crawls these small sites, Apache load spikes severely
(load averages > 200). The number of httpd processes climbs rapidly and
many sites become unreachable until crawler activity subsides. The main
domains remain responsive during these events.
Steps already taken:
- All rewrite logic moved from .htaccess to VirtualHost
- AllowOverride disabled
- Conservative timeouts and connection limits applied
- Resources increased compared to previous smaller deployment
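For reference, the kind of prefork cap I have been experimenting with looks like the sketch below. The per-child RSS figure is an assumption; it would need to be measured on the actual host (e.g. from top/ps) before trusting the arithmetic.

```apache
# Rough prefork sizing for ~4 GB RAM, assuming ~30 MB RSS per child
# (measure on the real host): 4096 MB * 0.75 / 30 MB ~= 100 workers.
# Capping here trades queued/refused connections for not swapping,
# which is usually the better failure mode under a crawler burst.
<IfModule mpm_prefork_module>
    StartServers            5
    MinSpareServers         5
    MaxSpareServers        10
    ServerLimit           100
    MaxRequestWorkers     100
    MaxConnectionsPerChild 1000
</IfModule>

# Long keep-alives pin prefork children to idle crawler connections;
# a short timeout frees them quickly.
KeepAlive On
KeepAliveTimeout 2
```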
This same design handled ~150 sites reasonably well in the past. With
tens of thousands of sites, overload now happens daily.
My questions:
- Is this a known failure mode of prefork under heavy crawler activity?
- Are there Apache-level techniques to limit crawler impact without
blocking Googlebot?
- In similar setups, what usually becomes the bottleneck first: rewrite
processing, filesystem checks, or process spawning?
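On the second question, one non-blocking lever I am aware of is robots.txt: Googlebot does not honor Crawl-delay (as far as I know its rate is influenced via Search Console, and it backs off on 429/503 responses), but other major crawlers such as Bingbot do honor it. Something like the following (values illustrative) would at least slow the rest of the bot traffic without blocking anyone:

```
# Served as /robots.txt on each hostname; the 10-second value is
# illustrative. Googlebot ignores Crawl-delay; Bingbot and others honor it.
User-agent: *
Crawl-delay: 10

User-agent: Googlebot
Disallow:
```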
Any insight or real-world experience would be greatly appreciated.
Thank you.
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]