On Sunday 03 November 2019 16:51:16 Richard Hector wrote:
> On 4/11/19 10:05 AM, Gene Heskett wrote:
> > On Sunday 03 November 2019 12:37:14 john doe wrote:
> >> On 11/3/2019 6:26 PM, Gene Heskett wrote:
> >>> On Sunday 03 November 2019 11:56:52 Reco wrote:
> >>>> On Sun, Nov 03, 2019 at 10:48:58AM -0500, Gene Heskett wrote:
> >>>>> On Sunday 03 November 2019 10:23:50 Reco wrote:
> >>>>>> On Sun, Nov 03, 2019 at 10:04:46AM -0500, Gene Heskett wrote:
> >>>>>>> Greetings all
> >>>>>>>
> >>>>>>> I am developing a list of broken webcrawlers who are
> >>>>>>> repeatedly downloading my entire web site including the
> >>>>>>> hidden stuff.
> >>>>>>>
> >>>>>>> These crawlers/bots are ignoring my robots.txt
> >>>>>>
> >>>>>> $ wget -O - https://www.shentel.com/robots.txt
> >>>>>> --2019-11-03 15:22:35-- https://www.shentel.com/robots.txt
> >>>>>> Resolving www.shentel.com (www.shentel.com)... 45.60.160.21
> >>>>>> Connecting to www.shentel.com
> >>>>>> (www.shentel.com)|45.60.160.21|:443... connected.
> >>>>>> HTTP request sent, awaiting response... 403 Forbidden
> >>>>>> 2019-11-03 15:22:36 ERROR 403: Forbidden.
> >>>>>>
> >>>>>> Allowing said bots to *see* your robots.txt would be a step
> >>>>>> into the right direction.
> >>>>>
> >>>>> But you are asking for shentel.com/robots.txt which is my isp.
> >>>>> You should be asking for
> >>>>>
> >>>>> http://geneslinuxbox.net:6309/gene/robots.txt
> >>>>
> >>>> Wow. You sir owe me a new set of eyes.
> >>>
> >>> Chuckle :) That was the default I'd pickup up from someplace
> >>> years ago.
> >>>
> >>>> I advise you to compare your monstrosity to this (a hint - it
> >>>> does work) - [1].
> >>>>
> >>>> Reco
> >>>>
> >>>> [1] https://enotuniq.net/robots.txt
> >>>
> >>> I'll trim mine forthwith to the last entry. I've wondered if that
> >>> was too long a list. And restart apache2 of course. But now I see
> >>> the next access is not a 200, but a 404, that not intended. From
> >>> the access log:
> >>>
> >>> coyote.coyote.den:80 209.197.24.34 - -
> >>> [03/Nov/2019:12:19:55 -0500] "GET /gene/lathe-stf/linuxcnc4rpi4
> >>> HTTP/1.1" 404 498 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64;
> >>> Trident/7.0; rv:11.0) like Gecko"
> >>>
> >>> that directory exists, shouldn't that have been a 200?
> >>
> >> The directory might exist but it is not accessible.
> >>
> >> --
> >> John Doe
> >
> > Universal read perms would be 444. does it need any more than that
> > to be downloadable?
>
> IIRC all the directories back to / need to be executable as well as
> readable, by the web server.
>
> Richard
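
A quick sanity check on that chain, assuming the file really does live
under /var/www/html, would be namei from util-linux, which walks a path
and prints the permission bits of every component:

  $ namei -l /var/www/html/gene/lathe-stf/linuxcnc4rpi4
  # -l adds ls-style mode, owner and group columns for each component

Any directory in that listing that lacks the x bit for the user or
group apache2 runs as would make everything below it unreachable,
whatever the file's own mode says.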
Making things executable all the way back to the real / isn't possible
here, as it's running in an ownership sandbox owned entirely by
apache2. OTOH, as far as apache2 is concerned, / *is* the /var/www/html
directory, and everything below that is owned by the same group apache2
is a member of.

AIUI, what robots.txt wants is a more permissive rule placed above the
one that apparently has most of them locked out. What I'd need is a
test, probably not User-agent, that would match and let in the
browsers, plus curl and wget from normal users. What would that rule
look like? Or would that come under User-agent too? These bots all seem
to use plain GETs. My best guess at the shape of it is in the P.S.
below.

Thanks Richard.

Cheers, Gene Heskett
--
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
If we desire respect for the law, we must first make the law
respectable.
 - Louis D. Brandeis
Genes Web page <http://geneslinuxbox.net:6309/gene>
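
P.S. Here's my guess, untested, going by my reading of the robots
exclusion standard; Googlebot below is only a stand-in for whichever
well-behaved crawlers I decide to let in:

  # Default group: any crawler without a more specific match below
  # is refused the whole tree.
  User-agent: *
  Disallow: /

  # A trusted bot gets its own group; an empty Disallow means
  # "nothing is disallowed", i.e. full access. Each crawler obeys
  # only the most specific User-agent group that matches it.
  User-agent: Googlebot
  Disallow:

If I read the spec right, ordinary browsers and one-shot curl or wget
fetches never ask for robots.txt at all (wget only checks it when
mirroring with -r), so they don't need a rule; and the broken bots
ignore the file anyway, which is the original problem.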