On 4/11/19 10:05 AM, Gene Heskett wrote:
> On Sunday 03 November 2019 12:37:14 john doe wrote:
>
>> On 11/3/2019 6:26 PM, Gene Heskett wrote:
>>> On Sunday 03 November 2019 11:56:52 Reco wrote:
>>>> On Sun, Nov 03, 2019 at 10:48:58AM -0500, Gene Heskett wrote:
>>>>> On Sunday 03 November 2019 10:23:50 Reco wrote:
>>>>>> On Sun, Nov 03, 2019 at 10:04:46AM -0500, Gene Heskett wrote:
>>>>>>> Greetings all
>>>>>>>
>>>>>>> I am developing a list of broken webcrawlers that are repeatedly
>>>>>>> downloading my entire web site, including the hidden stuff.
>>>>>>>
>>>>>>> These crawlers/bots are ignoring my robots.txt.
>>>>>>
>>>>>> $ wget -O - https://www.shentel.com/robots.txt
>>>>>> --2019-11-03 15:22:35-- https://www.shentel.com/robots.txt
>>>>>> Resolving www.shentel.com (www.shentel.com)... 45.60.160.21
>>>>>> Connecting to www.shentel.com (www.shentel.com)|45.60.160.21|:443...
>>>>>> connected. HTTP request sent, awaiting response... 403 Forbidden
>>>>>> 2019-11-03 15:22:36 ERROR 403: Forbidden.
>>>>>>
>>>>>> Allowing said bots to *see* your robots.txt would be a step in
>>>>>> the right direction.
>>>>>
>>>>> But you are asking for shentel.com/robots.txt, which is my ISP.
>>>>> You should be asking for
>>>>>
>>>>> http://geneslinuxbox.net:6309/gene/robots.txt
>>>>
>>>> Wow. You, sir, owe me a new set of eyes.
>>>
>>> Chuckle :) That was the default I'd picked up from someplace years
>>> ago.
>>>
>>>> I advise you to compare your monstrosity to this (a hint - it does
>>>> work) - [1].
>>>>
>>>> Reco
>>>>
>>>> [1] https://enotuniq.net/robots.txt
>>>
>>> I'll trim mine forthwith to the last entry; I've wondered if that
>>> was too long a list. And restart apache2, of course. But now I see
>>> the next access is not a 200 but a 404, and that's not intended.
>>> From the access log:
>>>
>>> coyote.coyote.den:80 209.197.24.34 - - [03/Nov/2019:12:19:55 -0500]
>>> "GET /gene/lathe-stf/linuxcnc4rpi4 HTTP/1.1" 404 498 "-"
>>> "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
>>>
>>> That directory exists; shouldn't that have been a 200?
>>
>> The directory might exist but it is not accessible.
>>
>> --
>> John Doe
>
> Universal read perms would be 444. Does it need any more than that to
> be downloadable?
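On the robots.txt side, trimming it down as Reco suggests is usually all
it takes. As a rough sketch only (the Disallow path here is made up to
illustrate the shape, it is not your actual layout), a minimal file looks
something like:

  User-agent: *
  Disallow: /gene/private/

Anything not disallowed is fair game for compliant crawlers; the broken
ones will ignore the file no matter what it says.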
As for the 404: IIRC, all the directories back to / need to be executable
as well as readable by the web server.

Richard
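P.S. A quick way to see where the server gets stopped is to walk the
whole path with namei and check the mode of every component. The path
below is only a placeholder for wherever that directory actually lives
on disk:

  # list owner and mode of every component from / down to the target
  $ namei -l /var/www/gene/lathe-stf/linuxcnc4rpi4

  # if a directory in the chain is missing r-x for "other" (or for the
  # user/group Apache runs as), something like this opens it up:
  $ chmod o+rx /var/www/gene/lathe-stf
  $ chmod o+r  /var/www/gene/lathe-stf/linuxcnc4rpi4/*

In short: 644 on files and 755 on every directory in the chain is the
usual arrangement for anything the web server should be able to hand out.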