From experience, this stuff is a lot harder and more nuanced than it might seem. 
Google's agents are well behaved and obey robots.txt. The last high-traffic 
website I worked on had over 250 different web spiders/bots scraping it. That's 
250 different user agents that didn't map to a "real" browser. Identifying them 
required several different techniques, including looking at request patterns. 
It's not always obvious which requests are the ones that you want.
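
A rough sketch of one request-pattern check, for illustration only: group the 
access log by user agent and list the agents that fetch images but never a 
page. The field order is assumed from the sample log line quoted further down 
in this thread; the function name and the image suffix list are hypothetical.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the sample log line quoted in this thread:
# status ip - - [time] "METHOD path proto" bytes "referer" "user agent" "-"
LOG_LINE = re.compile(
    r'^(?P<status>\d{3}) (?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<bytes>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Assumption: these suffixes are what counts as an image on the site.
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif")


def image_only_agents(lines):
    """Return {user_agent: image_request_count} for agents that never
    requested anything other than an image."""
    images = defaultdict(int)
    pages = defaultdict(int)
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        # Drop any query string before checking the file suffix.
        path = match.group("path").split("?", 1)[0].lower()
        agent = match.group("agent")
        if path.endswith(IMAGE_SUFFIXES):
            images[agent] += 1
        else:
            pages[agent] += 1
    return {
        agent: count
        for agent, count in images.items()
        if pages.get(agent, 0) == 0
    }


if __name__ == "__main__":
    # Assumption: the log lives at ./access.log in the layout above.
    with open("access.log", encoding="utf-8", errors="replace") as log:
        for agent, count in sorted(image_only_agents(log).items()):
            print(count, agent)
```

Grouping on the client IP instead of the user agent would give the 
per-address view described in the quoted message below. Either way this is 
only one signal, and it will miss bots that also fetch pages.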

> On Jun 22, 2017, at 11:50 AM, li...@lazygranch.com wrote:
> 
> The IP addresses from the Google app aren't those of Google. They generally 
> belong to ISPs. 
> 
> What bugs me is a fair number of these IP addresses never read my web pages. 
> Easy enough to see from access.log. They just look for photos. If I served 
> ads, I would be furious. But what I perceive is Google provides hot linking, 
> pure and simple. I find it annoying. So now the app is tamed. They can always 
> click on visit page.
> 
> At one time the Google image search, as run from the browser, would be 
> blocked if the user clicked on the image. I have the code to stop hot linking 
> in my conf file. But now Google does some weird thing where the image link is 
> not to my website, but is some conglomeration of my URL embedded in a google 
> URL. I assume there is a redirect scheme going on, but the bottom line is the 
> browser gets the full-size image without ever clicking on an HTML file.
> 
> I try to be as unobtrusive as possible on my website. I don't use Google 
> Analytics. I don't serve ads. Most pages have no JavaScript, so you can use 
> NoScript if you want. All that said, I'm probably going to set up a scheme 
> where if the IP hasn't read an HTML file within a given time period, I will 
> 403 image requests. I'd like to do it without a session cookie. 
> 
> I don't have an issue with the Google bot reading image files for indexing. 
> What I want is for Google to provide links to the relevant page, not serve 
> the image directly. 
> 
> I've used the Google image search from time to time to judge the user 
> experience, and it isn't good in general other than finding photos of famous 
> people.
> 
> Case in point, do a search on the SU-27, which is a plane recently in the 
> news. You get a lot of SU-35s. Is this really rocket science? I assume Google 
> has no trust in image tags. But many images have SU-35 in text, which could 
> be read using OpenCV, as is done with OpenALPR. But I'm rambling...
> 
> 
> From: Richard Stanway
> Sent: Thursday, June 22, 2017 8:03 AM
> To: nginx@nginx.org
> Reply To: nginx@nginx.org
> Subject: Re: block google app
> 
> That user agent doesn't belong to a Google crawler - they are end-user 
> requests from the Google App (mobile application). I'm not sure what the 
> motivation is for blocking them but I wouldn't consider it malicious / 
> unwanted traffic.
> 
>> On Thu, Jun 22, 2017 at 4:47 PM, Jeff Dyke <jeff.d...@gmail.com> wrote:
>> I'm glad you found the solution, but being a Google crawler, it would likely 
>> respect a robots.txt file with Disallow: /images/, which, if it worked, would 
>> allow you to remove an if clause from being evaluated on every page load.  
>> 
>> You may have already tried it.  But I have a feeling you'll start to find 
>> more that are after this directory.  When I was at an image-heavy startup, 
>> we had every one imaginable.  
>> 
>> Best,
>> Jeff
>> 
>>> On Wed, Jun 21, 2017 at 3:40 PM, li...@lazygranch.com 
>>> <li...@lazygranch.com> wrote:
>>> I'm sending 403 responses now, so I screwed up by mistaking the fields
>>> in the logs. I'm going back to lurking mode again with my tail
>>> shamefully between my legs.
>>> 
>>> This code in the image location section will block the google app:
>>> ------------
>>> # Reject requests whose user agent identifies the Google mobile app.
>>> if ($http_user_agent ~* "com\.google\.GoogleMobile") {
>>>     return 403;
>>> }
>>> ---------
>>> 
>>> 403 107.2.5.162 - - [21/Jun/2017:07:21:08 +0000] "GET /images/photo.jpg 
>>> HTTP/1.1" 140 "-" "com.google.GoogleMobile/28.0.0 iPad/10.3.2 hw/iPad6_7" 
>>> "-"
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> nginx mailing list
>>> nginx@nginx.org
>>> http://mailman.nginx.org/mailman/listinfo/nginx
