On Sunday 03 November 2019 12:37:14 john doe wrote:

> On 11/3/2019 6:26 PM, Gene Heskett wrote:
> > On Sunday 03 November 2019 11:56:52 Reco wrote:
> >> On Sun, Nov 03, 2019 at 10:48:58AM -0500, Gene Heskett wrote:
> >>> On Sunday 03 November 2019 10:23:50 Reco wrote:
> >>>> On Sun, Nov 03, 2019 at 10:04:46AM -0500, Gene Heskett wrote:
> >>>>> Greetings all
> >>>>>
> >>>>> I am developing a list of broken webcrawlers who are repeatedly
> >>>>> downloading my entire web site including the hidden stuff.
> >>>>>
> >>>>> These crawlers/bots are ignoring my robots.txt
> >>>>
> >>>> $ wget -O - https://www.shentel.com/robots.txt
> >>>> --2019-11-03 15:22:35-- https://www.shentel.com/robots.txt
> >>>> Resolving www.shentel.com (www.shentel.com)... 45.60.160.21
> >>>> Connecting to www.shentel.com
> >>>> (www.shentel.com)|45.60.160.21|:443... connected. HTTP request
> >>>> sent, awaiting response... 403 Forbidden 2019-11-03 15:22:36
> >>>> ERROR 403: Forbidden.
> >>>>
> >>>> Allowing said bots to *see* your robots.txt would be a step into
> >>>> the right direction.
> >>>
> >>> But you are asking for shentel.com/robots.txt which is my isp.
> >>> You should be asking for
> >>>
> >>> http://geneslinuxbox.net:6309/gene/robots.txt
> >>
> >> Wow. You sir owe me a new set of eyes.
> >
> > Chuckle :) That was the default I'd pickup up from someplace years
> > ago.
> >
> >> I advise you to compare your monstrosity to this (a hint - it does
> >> work) - [1].
> >>
> >> Reco
> >>
> >> [1] https://enotuniq.net/robots.txt
> >
> > I'll trim mine forthwith to the last entry. I've wondered if that
> > was too long a list. And restart apache2 of course. But now I see
> > the next access is not a 200, but a 404, that not intended. From the
> > access log:
> >
> > coyote.coyote.den:80 209.197.24.34 - -
> > [03/Nov/2019:12:19:55 -0500] "GET /gene/lathe-stf/linuxcnc4rpi4
> > HTTP/1.1" 404 498 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64;
> > Trident/7.0; rv:11.0) like Gecko"
> >
> > that directory exists, shouldn't that have been a 200?
>
> The directory might exist but it is not accessible.
>
> --
> John Doe

Universal read perms would be 444. does it need any more than that to be
downloadable?
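
On the perms question: as far as I know, 444 is enough for a regular file, but a directory also needs the execute (search) bit before anything inside it can be opened, which is why 755 is the usual mode for directories. A minimal sketch for checking that along a whole path, assuming a POSIX system and assuming the web server reaches the files through the "other" permission bits (it may instead match the owner or group, which this does not look at). The function name world_access_problems is made up for illustration, and this only inspects mode bits; it says nothing about how apache2 maps the URL to a directory.

```python
import os
import stat


def world_access_problems(path):
    """Return (component, problem) pairs for every part of `path`
    whose "other" permission bits would block an unprivileged reader.

    Assumption: the reader is neither the owner nor in the group.
    """
    problems = []
    path = os.path.abspath(path)

    # Build every prefix of the path: /, /a, /a/b, ...
    current = os.sep
    components = [current]
    for part in path.split(os.sep):
        if part:
            current = os.path.join(current, part)
            components.append(current)

    for comp in components:
        mode = os.stat(comp).st_mode
        if stat.S_ISDIR(mode):
            # Directories need the search (execute) bit to be traversed.
            if not mode & stat.S_IXOTH:
                problems.append((comp, "directory lacks o+x"))
        elif not mode & stat.S_IROTH:
            # Regular files only need the read bit to be served.
            problems.append((comp, "file lacks o+r"))
    return problems


if __name__ == "__main__":
    import sys

    for comp, problem in world_access_problems(sys.argv[1]):
        print(f"{comp}: {problem}")
```

Run it with the full filesystem path of the file or directory in question; any line it prints names a component that a user outside the owner and group could not get through. It has to be run by a user who can itself stat every component, otherwise os.stat raises PermissionError.
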

Thanks John.

Cheers, Gene Heskett
-- 
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
If we desire respect for the law, we must first make the law respectable.
 - Louis D. Brandeis
Genes Web page <http://geneslinuxbox.net:6309/gene>