Ah, but I do want Google to look. I just want it to return links to pages, not images.

Then there are all those hits that pretend to be Google, because hey, why not. ;-)

I block a large number of bots simply at the firewall. I started with the obvious ones like Amazon and all the major hosting companies. I have an extensive "map" that looks for obvious hacking attempts, such as trying to log in to WordPress. They all get the nginx 444 code. I have a script that pulls all the IPs that generated a 444. I feed that list to ip2location.com and select the data centers. The manual part of this process is to verify that the IP really belongs to a data center. Then I use bgp.he.net to get the CIDRs, and I generate a section of text to block them in ipfw.
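
The log-to-firewall steps above could be sketched roughly as follows. This is only an illustration, not the actual script: it assumes an access log whose first two fields are the status code and the client IP (as in the sample log line further down this thread), and it assumes the blocked CIDRs are loaded into an ipfw lookup table. The function names and the table number are made up for the example. The data center check and the CIDR lookup stay manual.

```python
import ipaddress
from collections import Counter


def ips_with_status(log_lines, status="444"):
    """Count client IPs whose requests were answered with `status`.

    Assumed log layout (hypothetical): "<status> <client-ip> ...".
    """
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) < 2 or fields[0] != status:
            continue
        try:
            ip = ipaddress.ip_address(fields[1])
        except ValueError:
            continue  # second field is not an IP address, skip the line
        counts[str(ip)] += 1
    return counts


def ipfw_table_lines(cidrs, table=1):
    """Turn CIDR strings into "ipfw table N add ..." lines.

    Adjacent and overlapping networks are merged first so the table
    stays short. IPv4 and IPv6 are collapsed separately because
    collapse_addresses() rejects mixed versions.
    """
    nets = [ipaddress.ip_network(c, strict=False) for c in cidrs]
    merged = []
    for version in (4, 6):
        same = [n for n in nets if n.version == version]
        merged.extend(ipaddress.collapse_addresses(same))
    return [f"ipfw table {table} add {net}" for net in merged]


if __name__ == "__main__":
    sample_log = [
        '444 203.0.113.7 - - [21/Jun/2017:07:21:08 +0000] "GET /wp-login.php HTTP/1.1" 0 "-" "-" "-"',
        '403 107.2.5.162 - - [21/Jun/2017:07:21:08 +0000] "GET /images/photo.jpg HTTP/1.1" 140 "-" "-" "-"',
    ]
    for ip, hits in ips_with_status(sample_log).items():
        print(ip, hits)  # candidates to look up by hand
    # CIDRs as they might be copied from a routing lookup (documentation ranges here)
    for line in ipfw_table_lines(["192.0.2.0/25", "192.0.2.128/25", "198.51.100.0/24"]):
        print(line)
```

The printed table lines only fill the table. A rule along the lines of `ipfw add deny ip from 'table(1)' to any` would still be needed to act on it, and that too is an assumption about the ruleset.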

This all sounds like work, but it isn't. Most days there is no IP to block. Because I block the entire IP space of the data center, I eliminate many other IP addresses that lack eyeballs.

From: Peter Booth
Sent: Thursday, June 22, 2017 12:17 PM
To: nginx@nginx.org
Reply To: nginx@nginx.org
Subject: Re: block google app

From experience this stuff is a lot harder and more nuanced than it might seem. Google's agents are well behaved and obey robots.txt. The last high traffic website I worked on had over 250 different web spiders/bots scraping it. That's 250 different user agents that didn't map to a "real" browser. Identifying them required multiple different techniques, looking at request patterns. It's not always obvious which requests are the ones that you want.

Sent from my iPhone

On Jun 22, 2017, at 11:50 AM, li...@lazygranch.com wrote:

The IP addresses from the Google app aren't those of Google. They are ISPs generally. 

What bugs me is that a fair number of these IP addresses never read my web pages. That is easy enough to see from access.log. They just look for photos. If I served ads, I would be furious. What I perceive is that Google provides hotlinking, pure and simple. I find it annoying. So now the app is tamed. They can always click on "visit page".

At one time a Google image search, as run from the browser, would be blocked if the user clicked on the image. I have the code to stop hotlinking in my conf file. But now Google does some weird thing where the image link is not to my website, but is some conglomeration of my URL embedded in a Google URL. I assume there is a redirect scheme going on, but the bottom line is that the browser gets the full-size image without ever requesting an HTML file.

I try to be as unobtrusive as possible on my website. I don't use Google Analytics. I don't serve ads. Most pages have no JavaScript, so you can use NoScript if you want. All that said, I'm probably going to set up a scheme where if the IP hasn't read an HTML file within a given time period, I will 403 image requests. I'd like to do it without a session cookie.
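
The decision logic of such a scheme might look like the sketch below. It is only a model of the idea, not something wired into nginx: the class name, the ten-minute window, and the file extensions are all assumptions, and the state is keyed on the client IP alone so that no cookie is involved.

```python
import time


class RecentPageVisitors:
    """Remember when each IP last fetched an HTML page (hypothetical helper)."""

    PAGE_SUFFIXES = (".html", ".htm", "/")
    IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif")

    def __init__(self, window_seconds=600, clock=time.monotonic):
        self.window = window_seconds
        self.clock = clock          # injectable so the logic can be tested
        self.last_html = {}         # ip -> time of last page request (never pruned here)

    def status_for(self, ip, path):
        """Return the HTTP status this scheme would give the request."""
        now = self.clock()
        path = path.split("?", 1)[0].lower()  # ignore any query string
        if path.endswith(self.PAGE_SUFFIXES):
            self.last_html[ip] = now
            return 200
        if path.endswith(self.IMAGE_SUFFIXES):
            seen = self.last_html.get(ip)
            if seen is None or now - seen > self.window:
                return 403          # no recent page view from this IP
        return 200


if __name__ == "__main__":
    gate = RecentPageVisitors()
    print(gate.status_for("198.51.100.9", "/images/photo.jpg"))  # image first
    print(gate.status_for("198.51.100.9", "/index.html"))        # page view
    print(gate.status_for("198.51.100.9", "/images/photo.jpg"))  # image after page
```

One trade-off of keying on the IP alone is that visitors behind the same NAT share one entry, so a single page view there opens images for all of them until the window expires.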

I don't have an issue with the Google bot reading image files for indexing. What I want is for Google to provide links to the relevant page, not serve the image directly. 

I've used the Google image search from time to time to judge the user experience, and it isn't good in general other than finding photos of famous people.

Case in point: do a search on the SU-27, which is a plane recently in the news. You get a lot of SU-35s. Is this really rocket science? I assume Google has no trust in image tags. But many images have SU-35 in text, which could be read using OpenCV, as is done with OpenALPR. But I'm rambling...


From: Richard Stanway
Sent: Thursday, June 22, 2017 8:03 AM
Reply To: nginx@nginx.org
Subject: Re: block google app

That user agent doesn't belong to a Google crawler - they are end-user requests from the Google App (mobile application). I'm not sure what the motivation is for blocking them but I wouldn't consider it malicious / unwanted traffic.

On Thu, Jun 22, 2017 at 4:47 PM, Jeff Dyke <jeff.d...@gmail.com> wrote:
I'm glad you found the solution, but being a Google crawler, it would likely respect a robots.txt file with Disallow: /images/, which, if it worked, would let you remove an if clause that is evaluated on every request to that location.

You may have already tried it. But I have a feeling you'll start to find more bots that are after this directory. When I was at an image-heavy startup, we had every one imaginable.

Best,
Jeff

On Wed, Jun 21, 2017 at 3:40 PM, li...@lazygranch.com <li...@lazygranch.com> wrote:
I'm sending 403 responses now, so I screwed up by mistaking the fields
in the logs. I'm going back to lurking mode again with my tail
shamefully between my legs.

This code in the image location section will block the google app:
------------
# Dots are escaped so they match literally rather than any character.
if ($http_user_agent ~* "com\.google\.GoogleMobile") {
    return 403;
}
---------

403 107.2.5.162 - - [21/Jun/2017:07:21:08 +0000] "GET /images/photo.jpg HTTP/1.1" 140 "-" "com.google.GoogleMobile/28.0.0 iPad/10.3.2 hw/iPad6_7" "-"



_______________________________________________
nginx mailing list
nginx@nginx.org
http://mailman.nginx.org/mailman/listinfo/nginx

