Ah, but I want Google to look; I just want it to return links to pages, not to images. And then there are all those hits that pretend to be Google, because hey, why not. ;-)

I block a large number of bots simply at the firewall. I started with the obvious ones like Amazon and all the major hosting companies. I also have an extensive nginx "map" that looks for obvious hacking attempts, such as trying to log into WordPress; those requests all get the nginx 444 code.

A script pulls all the IPs that generated a 444. Those I feed to ip2location.com, selecting for data centers. The manual part of the process is verifying that each IP really is in a data center. Then I use bgp.he.net to get the CIDRs and generate a section of text to block them in ipfw.

This all sounds like work, but it isn't; most days there is no IP to block. And because I block the entire IP space of the data center, I eliminate many other IP addresses that lack eyeballs.
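For the curious, the map side is small. A trimmed-down sketch, with example patterns only (the real map carries many more, and yours will differ):

    # Lives in the http{} context. These patterns are illustrative;
    # a real map accumulates many more over time.
    map $request_uri $bad_request {
        default                 0;
        ~*^/wp-login\.php       1;
        ~*^/xmlrpc\.php         1;
        ~*/wp-admin             1;
    }

    server {
        listen 80;
        server_name example.com;

        # 444 is nginx-specific: close the connection without
        # sending any response at all.
        if ($bad_request) {
            return 444;
        }
    }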
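The log side is a couple of lines of shell. This sketch assumes the stock "combined" log format (client IP in field 1, status in field 9) and made-up file names:

    # Unique client IPs that were answered with 444, ready to paste
    # into ip2location.com for the data-center check.
    awk '$9 == 444 { print $1 }' /var/log/nginx/access.log | sort -u > 444_ips.txt

    # After the manual ip2location.com / bgp.he.net verification,
    # turn the confirmed CIDRs (one per line) into ipfw deny rules.
    n=5000   # arbitrary starting rule number
    while read -r cidr; do
        printf 'ipfw add %d deny ip from %s to any\n' "$n" "$cidr"
        n=$((n + 1))
    done < blocked_cidrs.txt

With a long list, loading the CIDRs into an ipfw table behind a single deny rule is tidier, but the flat rule list is easier to eyeball.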
From experience, this stuff is a lot harder and more nuanced than it might seem. Google's agents are well behaved and obey robots.txt. The last high-traffic website I worked on had over 250 different web spiders/bots scraping it. That's 250 different user agents that didn't map to a "real" browser. Identifying them required several different techniques, including looking at request patterns. It's not always obvious which requests are the ones you want.

Sent from my iPhone
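A crude first pass that surfaces most of them is just tallying the agent strings. A sketch, assuming the stock combined log format (the user agent is the sixth double-quote-delimited field):

    # Count requests per user agent, busiest first; agents that
    # don't look like any real browser tend to float to the top.
    awk -F'"' '{ print $6 }' /var/log/nginx/access.log \
        | sort | uniq -c | sort -rn | head -50

The hard part is the long tail: the request-pattern analysis starts where this list stops being obvious.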
_______________________________________________
nginx mailing list
nginx@nginx.org
http://mailman.nginx.org/mailman/listinfo/nginx