Hey, why not just compare the X-Forwarded-For address against the connecting IP? If they don't match and the request claims to be a bot, drop it.
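A minimal sketch of that check in nginx config (the variable names here are illustrative, not from the thread): flag requests whose User-Agent claims to be a crawler but which arrive with a non-empty X-Forwarded-For header, i.e. through a proxy rather than from the crawler's own address.

```nginx
# Sketch only: real Googlebot/Bingbot fetch pages directly, so
# X-Forwarded-For is normally empty; a mirror pulling pages through
# a proxy chain sets it.
map $http_user_agent $claims_bot {
    default        0;
    "~*googlebot"  1;
    "~*bingbot"    1;
}

# Combine the two signals: UA claims to be a bot AND XFF is non-empty.
map "$claims_bot:$http_x_forwarded_for" $fake_bot {
    default   0;
    "~^1:.+"  1;
}

# Then, inside the server block:
if ($fake_bot) {
    return 444;   # close the connection without sending a response
}
```

Doing the test in `map` blocks keeps the `if` trivial, which avoids most of the well-known "if is evil" pitfalls in nginx.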
--
Payam Chychi
Network Engineer / Security Specialist

On Tuesday, May 5, 2015 at 5:38 AM, meteor8488 wrote:

> Hi All,
>
> Recently I found that some guys are trying to mirror my website. They are
> doing this in two ways:
>
> 1. Pretending to be Google spiders. Access logs are as follows:
>
> 89.85.93.235 - - [05/May/2015:20:23:16 +0800] "GET /robots.txt HTTP/1.0" 444 0 "http://www.example.com" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "66.249.79.138"
> 79.85.93.235 - - [05/May/2015:20:23:34 +0800] "GET /robots.txt HTTP/1.0" 444 0 "http://www.example.com" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "66.249.79.154"
>
> The http_x_forwarded_for addresses are Google addresses.
>
> 2. Pretending to be a normal web browser.
>
> I'm trying to use the configuration below to block their access.
>
> For case 1 above, I check the X-Forwarded-For address: if the user agent
> is a spider and X-Forwarded-For is not empty, I block the request.
> I'm using:
>
> map $http_x_forwarded_for $xf {
>     default 1;
>     "" 0;
> }
> map $http_user_agent $fakebots {
>     default 0;
>     "~*bot" $xf;
>     "~*bing" $xf;
>     "~*search" $xf;
> }
> if ($fakebots) {
>     return 444;
> }
>
> With this configuration, it seems the fake Google spiders can't access the
> root of my website, but they can still access my php files, and they can't
> access any js or css files. Very strange. I don't know what's wrong.
>
> 2. For user agents that don't claim to be spiders, I use ngx_lua to
> generate a random value, add it to a cookie, and then check whether the
> client sends the value back. If it can't, that means it's a robot, and I
> block it.
>
> map $http_user_agent $ifbot {
>     default 0;
>     "~*Yahoo" 1;
>     "~*archive" 1;
>     "~*search" 1;
>     "~*Googlebot" 1;
>     "~Mediapartners-Google" 1;
>     "~*bingbot" 1;
>     "~*msn" 1;
>     "~*rogerbot" 3;
>     "~*ChinasoSpider" 3;
> }
>
> if ($ifbot = "0") {
>     set $humanfilter 1;
> }
> # The section below excludes the Flash uploader
> if ( $request_uri !~ "~mod\=swfupload\&action\=swfupload" ) {
>     set $humanfilter "${humanfilter}1";
> }
>
> if ($humanfilter = "11") {
>     rewrite_by_lua '
>         local random = ngx.var.cookie_random
>         if (random == nil) then
>             random = math.random(999999)
>         end
>         local token = ngx.md5("hello" .. ngx.var.remote_addr .. random)
>         if (ngx.var.cookie_token ~= token) then
>             ngx.header["Set-Cookie"] = {"token=" .. token, "random=" .. random}
>             return ngx.redirect(ngx.var.scheme .. "://" .. ngx.var.host .. ngx.var.request_uri)
>         end
>     ';
> }
>
> But it seems that with the above configuration, Googlebot is also blocked
> when it shouldn't be.
>
> Can anyone help?
>
> Thanks
>
> Posted at Nginx Forum:
> http://forum.nginx.org/read.php?2,258659,258659#msg-258659
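One detail in the quoted config may be worth checking: in `if ( $request_uri !~ "~mod\=swfupload\&action\=swfupload" )`, the leading `~` sits inside the quoted string, so it is part of the regex. The pattern then requires a literal `~` character in the URI and will never match a normal swfupload request, so the Flash-upload exclusion never takes effect. A sketch of what was (I assume) intended:

```nginx
# Assumed intent: skip the cookie challenge for the Flash uploader.
# The "~" was likely meant as the match operator, not part of the pattern
# (the "!~" operator is already the negated regex match in nginx).
if ($request_uri !~ "mod=swfupload&action=swfupload") {
    set $humanfilter "${humanfilter}1";
}
```

This does not by itself explain why Googlebot is blocked, but it means swfupload requests are currently being sent through the cookie challenge rather than exempted from it.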
_______________________________________________
nginx mailing list
nginx@nginx.org
http://mailman.nginx.org/mailman/listinfo/nginx