On 7/25/06, prad <[EMAIL PROTECTED]> wrote:
> what is the best way to stop those robots and spiders from getting in?
The only sure way to stop robots and spiders is to shut down your web server; I don't suppose that's the answer you're looking for. Treat malicious robots the same way you would treat malicious or unwelcome human users. Whatever your definition of malicious, don't expect to easily discern between regular human users and robots: user-agent strings and similar headers are trivial to alter, so don't rely on them without precautions (as with all client-generated input).
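To give a rough idea (the "BadBot" string and document root are just placeholders, and this assumes a pre-2.4 Apache with mod_setenvif loaded), blocking by user-agent looks something like the following. Keep in mind it only stops robots that announce themselves honestly:

    # httpd.conf -- refuse requests from a self-identified robot
    SetEnvIfNoCase User-Agent "BadBot" bad_bot
    <Directory "/var/www/htdocs">
        Order allow,deny
        Allow from all
        Deny from env=bad_bot
    </Directory>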
> .htaccess?
That might help, but it won't solve the problem of discerning between human and automated clients. The usual problems/threats regarding credentials apply as well, and mind you, automated processes (robots) can use credentials too. You could also add a CAPTCHA; various modules (PHP, Perl) exist that make these easy to integrate. Whether (or when) robots will be able to fool those tests is another matter.
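As a minimal sketch (the realm name and file paths are assumptions, and it presumes AllowOverride AuthConfig is permitted for that directory), password protection via .htaccess could look like:

    # .htaccess -- require a valid user for this directory and below
    AuthType Basic
    AuthName "Members only"
    # create the password file with: htpasswd -c /var/www/.htpasswd someuser
    AuthUserFile /var/www/.htpasswd
    Require valid-user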
> robots.txt and Apache directives?
Well-behaved robots will adhere to measures such as (X)HTML meta tags, robots.txt files, and the like. Other robots may not.
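For what it's worth, a minimal robots.txt (the paths are just examples) and the per-page meta tag equivalent look like this; again, only well-behaved robots will honour them:

    # robots.txt -- goes in the document root
    User-agent: *
    Disallow: /private/
    Disallow: /cgi-bin/

    <!-- per-page equivalent, inside the <head> of an (X)HTML document -->
    <meta name="robots" content="noindex,nofollow" />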
> find them in the access_log and block with pf?
Using access_log means you're acting on information gathered after the fact; by the time you block an address, the robot has already made its requests. It may still help against repeat visitors, though.
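If you do go that route, a sketch along these lines might work (the interface name, table name and example address are assumptions; adjust to your setup):

    # pf.conf -- drop traffic from addresses collected in the <badbots> table
    ext_if = "fxp0"          # adjust to your external interface
    table <badbots> persist
    block in quick on $ext_if from <badbots> to any

    # after spotting an offender in access_log, add it by hand (or from a script):
    pfctl -t badbots -T add 192.0.2.10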
> which are good robots and which are bad?
Apart from robots/spiders potentially being excellent friends, allowing them in (e.g. Google) may also have undesirable side effects. These range from outdated information being shown to search-engine users, to sensitive data being stored on servers outside your influence; I'm sure there are many more. I'd recommend you think about your threat model first and use that to determine which information you deem sensitive and to what lengths you will go to secure it.

Cheers,
Rogier

--
If you don't know where you're going, any road will get you there.