There's no guarantee that crawlers will be polite and honor robots.txt
directives; the search-engine ones probably do, but the spammers' ones
definitely don't, and in fact probably pay special attention to what's excluded.
(I have a honeypot entry in my robots.txt designed to catch and then block
the malicious robots; there's a sketch of the idea below.) OTOH, since the User-Agent data is also only as reliable
as the intent of whoever sets the crawler up, filtering based on that may not be
much help either. I seem to recall reading somewhere that it's possible to
configure Apache to recognize "executables" independently of the OS's file
extensions and associations; if that's true, it might lead to a
solution to your problem.
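
For what it's worth, here's a minimal sketch of the honeypot idea (not my
exact setup; the trap path, map name, and file locations are all
illustrative, and it assumes mod_rewrite is loaded and the trap script
lives in an already-ScriptAliased /cgi-bin/). First, list a trap URL in
robots.txt that no legitimate visitor would ever request:

    User-agent: *
    Disallow: /cgi-bin/bot-trap

Then, in httpd.conf (server or vhost context; RewriteMap doesn't work in
.htaccess), refuse anyone whose address has landed in the block map:

    RewriteEngine On
    RewriteMap blocked txt:/var/www/blocked.txt
    RewriteCond ${blocked:%{REMOTE_ADDR}|0} !=0
    RewriteRule .* - [F]

The trap itself can be a trivial CGI that records the visitor, e.g.:

    #!/bin/sh
    # /cgi-bin/bot-trap: append the offender's address to the block map
    # (the file must be writable by the httpd user); every later request
    # from that address then gets a 403 from the rules above.
    echo "$REMOTE_ADDR 1" >> /var/www/blocked.txt
    echo "Content-Type: text/plain"
    echo ""
    echo "Goodbye."

Only a robot that read robots.txt and deliberately crawled the excluded
path, or one that ignored robots.txt and followed a hidden link to it,
will ever hit the trap.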
Mark
-------- Original Message --------
Subject: [EMAIL PROTECTED] Blocking crawling of CGIs
From: Tony Rice (trice) <[EMAIL PROTECTED]>
To: users@httpd.apache.org
Date: Tuesday, September 18, 2007 11:24:20 AM
We've had some instances where crawlers have stumbled onto a CGI script
which links to itself and started pounding the server with requests to
that CGI.
There are so many CGI scripts on this server that I don't want to
maintain a huge robots.txt file. Any suggestions on other techniques to
keep crawlers away from CGI scripts? Maybe check the User-Agent with
BrowserMatch and then do something creative with "deny from env="?
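
For concreteness, that approach might look something like this in a
2.2-style config (the User-Agent substrings below are just examples, and
it only catches crawlers that identify themselves honestly):

    BrowserMatchNoCase "googlebot|slurp|msnbot|spider|crawl" is_crawler

    <Directory "/usr/local/apache2/cgi-bin">
        Order Allow,Deny
        Allow from all
        Deny from env=is_crawler
    </Directory>

With Order Allow,Deny, the Deny overrides the Allow, so any request that
sets is_crawler gets a 403 on the CGI directory while everyone else gets
through.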
---------------------------------------------------------------------
The official User-To-User support forum of the Apache HTTP Server Project.
See <URL:http://httpd.apache.org/userslist.html> for more info.
To unsubscribe, e-mail: [EMAIL PROTECTED]
" from the digest: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]