From experience, this stuff is a lot harder and more nuanced than it might seem. Google's agents are well behaved and obey robots.txt. The last high-traffic website I worked on had over 250 different web spiders/bots scraping it. That's 250 different user agents that didn't map to a "real" browser. Identifying them required several different techniques, including looking at request patterns. It's not always obvious which requests are the ones you want.
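
As one example of a request-pattern check, here is a rough Python sketch that lists IPs which fetched images but never an HTML page. It is only an illustration, not what we actually ran: the log layout is an assumption modeled on the sample line quoted further down in this thread, and the names LINE_RE, IMAGE_EXT, PAGE_EXT and image_only_clients are made up for the example.

```python
import re
from collections import defaultdict

# Assumed log layout, modeled on the sample line quoted in this thread:
# status ip - - [time] "METHOD path proto" bytes "referer" "user agent" "-"
LINE_RE = re.compile(
    r'^(?P<status>\d{3}) (?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Hypothetical extension lists; adjust to whatever the site really serves.
IMAGE_EXT = (".jpg", ".jpeg", ".png", ".gif")
PAGE_EXT = (".html", ".htm", "/")


def image_only_clients(lines):
    """Return {ip: set of user agents} for IPs that requested images
    but never requested an HTML page in the given log lines."""
    page_readers = set()
    image_agents = defaultdict(set)
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip lines that do not fit the assumed layout
        # Drop any query string before looking at the extension.
        path = m.group("path").split("?", 1)[0].lower()
        if path.endswith(IMAGE_EXT):
            image_agents[m.group("ip")].add(m.group("agent"))
        elif path.endswith(PAGE_EXT):
            page_readers.add(m.group("ip"))
    return {
        ip: agents
        for ip, agents in image_agents.items()
        if ip not in page_readers
    }
```

Feeding it the lines of access.log (for example `image_only_clients(open("access.log"))`) gives a starting list to eyeball. It says nothing about intent, so the user agents still need a manual look before blocking anything.
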

Sent from my iPhone

> On Jun 22, 2017, at 11:50 AM, li...@lazygranch.com wrote:
>
> The IP addresses from the Google app aren't those of Google. They are ISPs
> generally.
>
> What bugs me is a fair number of these IP addresses never read my web pages.
> Easy enough to see from access.log. They just look for photos. If I served
> ads, I would be furious. But what I perceive is Google provides hot linking,
> pure and simple. I find it annoying. So now the app is tamed. They can always
> click on visit page.
>
> At one time the Google image search, as run from the browser, would be
> blocked if the user clicked on the image. I have the code to stop hot linking
> in my conf file. But now Google does some weird thing where the image link is
> not to my website, but is some conglomeration of my URL embedded in a google
> URL. I assume there is a redirect scheme going on, but the bottom line is the
> browser gets the full size image without ever clicking on a html file.
>
> I try to be as unobtrusive as possible on my website. I don't use Google
> analytics. I don't serve ads. Most pages have no Javascript, so you can use
> no script if you want. All that said, I'm probably going to set up a scheme
> where if the IP hadn't read an html file within a given time period, I will
> 403 image requests. I'd like to do it without a session cookie.
>
> I don't have an issue with the Google bot reading image files for indexing.
> What I want is for Google to provide links to the relevant page, not serve
> the image directly.
>
> I've used the Google image search from time to time to judge the user
> experience, and it isn't good in general other than finding photos of famous
> people.
>
> Case in point, do a search on the SU-27, which is a plane recently in the
> news. You get a lot of SU-35s. Is this really rocket science? I assume Google
> has no trust in image tags. But many images have SU-35 in text, which could
> be read using openCV, as is done with openALPR.
> But I'm rambling.....
>
>
> From: Richard Stanway
> Sent: Thursday, June 22, 2017 8:03 AM
> To: nginx@nginx.org
> Reply To: nginx@nginx.org
> Subject: Re: block google app
>
> That user agent doesn't belong to a Google crawler - they are end-user
> requests from the Google App (mobile application). I'm not sure what the
> motivation is for blocking them but I wouldn't consider it malicious /
> unwanted traffic.
>
>> On Thu, Jun 22, 2017 at 4:47 PM, Jeff Dyke <jeff.d...@gmail.com> wrote:
>> I'm glad you found the solution, but being a Google crawler, it would likely
>> respect a robots.txt file with Disallow: images/*, which if it worked would
>> allow you to remove an if clause from being evaluated on every page load.
>>
>> You may have already tried it. But i have a feeling you'll start to find
>> more that are after this directory. When i was at an image heavy start up,
>> we had every one imaginable.
>>
>> Best,
>> Jeff
>>
>>> On Wed, Jun 21, 2017 at 3:40 PM, li...@lazygranch.com
>>> <li...@lazygranch.com> wrote:
>>> I'm sending 403 responses now, so I screwed up by mistaking the fields
>>> in the logs. I'm going back to lurking mode again with my tail
>>> shamefully between my legs.
>>>
>>> This code in the image location section will block the google app:
>>> ------------
>>> if ($http_user_agent ~* (com.google.GoogleMobile)) {
>>>     return 403;
>>> }
>>> ---------
>>>
>>> 403 107.2.5.162 - - [21/Jun/2017:07:21:08 +0000] "GET /images/photo.jpg
>>> HTTP/1.1" 140 "-" "com.google.GoogleMobile/28.0.0 iPad/10.3.2 hw/iPad6_7"
>>> "-"
>>>
>>>
>>>
>>> _______________________________________________
>>> nginx mailing list
>>> nginx@nginx.org
>>> http://mailman.nginx.org/mailman/listinfo/nginx
_______________________________________________
nginx mailing list
nginx@nginx.org
http://mailman.nginx.org/mailman/listinfo/nginx