At 05:04 PM 3/13/2013, Dan McCullough wrote:
>Web bots can ignore the robots.txt file, most scrapers would.

and at 05:06 PM 3/13/2013, Marc Guay wrote:

>These don't sound like robots that would respect a txt file to me.

Dan and Marc are correct. Although I used the terms "spiders" and "pirates," I 
believe that the correct term, as employed by Dan, is "scrapers," and that 
term might be applied to either the robot or the site which displays its 
results. One blogger has called scrapers "the arterial plaque of the Internet." 
I need to implement a solution that allows humans to access my files but 
prevents scrapers from accessing them. I will undoubtedly have to implement 
some type of challenge-and-response in the system (such as a captcha), but as 
long as those files are stored below the web root a scraper that has a valid 
URL can probably grab them. That is part of what the "public" in public_html 
implies.
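
A minimal sketch of one way around that, assuming the files are moved to a directory outside the web root and handed out by a PHP script only after the visitor has passed the challenge. The function names, the 'human_verified' session flag and the storage directory are all hypothetical, and the challenge itself (captcha or otherwise) is not shown:

```php
<?php
// Hypothetical helper: map a requested name to a file inside $storageDir,
// or return null if the file is missing or the path escapes that directory.
function resolve_download(string $storageDir, string $requested): ?string
{
    $base = realpath($storageDir);
    if ($base === false) {
        return null;
    }
    // basename() drops any directory components such as "../".
    $path = realpath($base . DIRECTORY_SEPARATOR . basename($requested));
    if ($path === false || !is_file($path)) {
        return null;
    }
    // Second check: the resolved path must still sit under $base
    // (this also catches symlinks that point elsewhere).
    $prefix = $base . DIRECTORY_SEPARATOR;
    if (strncmp($path, $prefix, strlen($prefix)) !== 0) {
        return null;
    }
    return $path;
}

// Hypothetical gate: only sessions that passed the challenge may download.
function may_download(array $session): bool
{
    return !empty($session['human_verified']);
}

// Send the file, or a 403/404 status if the request is not allowed.
function send_download(string $storageDir, string $requested, array $session): void
{
    if (!may_download($session)) {
        http_response_code(403);
        return;
    }
    $path = resolve_download($storageDir, $requested);
    if ($path === null) {
        http_response_code(404);
        return;
    }
    $name = str_replace('"', '', basename($path));
    header('Content-Type: application/octet-stream');
    header('Content-Disposition: attachment; filename="' . $name . '"');
    header('Content-Length: ' . filesize($path));
    readfile($path);
}
```

A download page would call session_start() and then something like send_download('/home/example/private_files', $_GET['file'], $_SESSION), where the path is only a placeholder. Because the files no longer have URLs of their own, a scraper holding an old link gets nothing unless it also passes the challenge.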

One of the reasons why this irks me is that the scrapers are all commercial 
sites, but they haven't offered me a piece of the action for the use of my 
files. My domain is an entirely non-commercial domain, and I provide free 
hosting for other non-commercial genealogical works, primarily pages that are 
part of the USGenWeb Project, which is perhaps the largest of all 
non-commercial genealogical projects.

Dale H. Cook, Member, NEHGS and MA Society of Mayflower Descendants;
Plymouth Co. MA Coordinator for the USGenWeb Project
Administrator of http://plymouthcolony.net 


-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
