Very interesting.
On 25 Jul 2014, at 20:12, [email protected] wrote:

> On 25/07/2014 14:44, Hugh Glaser wrote:
>> The idea that having a robots.txt that Disallows spiders
>> is a “problem” for a dataset is rather bizarre.
>> It is of course a problem for the spider, but is clearly not a problem for a
>> typical consumer of the dataset.
>> By that measure, serious numbers of the web sites we all use on a daily
>> basis are problematic.
> <snip>
> 
> I think the general interpretation of the robots in "robots.txt" is any
> software agent accessing the site "automatically" (versus a user manually
> entering a URL).
I had never thought this.
My understanding is that the agents that should respect robots.txt are what are 
usually called crawlers or spiders.
Primarily search engines, but also including things that aim to automatically 
fetch a whole chunk of a site.
Of course, there is no de jure standard, but the places I look seem to lean 
towards my view.
http://www.robotstxt.org/orig.html
"WWW Robots (also called wanderers or spiders) are programs that traverse many 
pages in the World Wide Web by recursively retrieving linked pages."
https://en.wikipedia.org/wiki/Web_robot
"Typically, bots perform tasks that are both simple and structurally 
repetitive, at a much higher rate than would be possible for a human alone."
It’s all about scale and query rate.
So a php script that fetches one URI now and then is not the target for the 
restriction - nor indeed is my shell script that daily fetches a common page I 
want to save on my laptop.

So, I confess, when my system trips over a DBpedia (or any other) URI and does 
follow-your-nose to get the RDF, it doesn’t check that the site’s robots.txt 
allows it.
And I certainly don’t expect Linked Data consumers doing simple URI resolution 
to check my robots.txt.
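(For what it’s worth, the check itself is cheap if an agent did want to be 
polite about it. A minimal sketch using Python’s standard library - the rules 
and agent name here are purely illustrative, not from any real site:

```python
# Sketch: a polite agent checking robots.txt before dereferencing a URI.
# Rules and user-agent string are made up for illustration.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())  # in practice, set_url(...) + read() would fetch it

print(rp.can_fetch("MyLDApp/1.0", "http://example.org/resource/1"))  # True
print(rp.can_fetch("MyLDApp/1.0", "http://example.org/private/x"))   # False
```

Whether a one-off dereference *should* bother is, of course, exactly the 
question under discussion.)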

But you are right that, if my interpretation is wrong, robots.txt would make no 
sense in the Linked Data world, since pretty much by definition it will always 
be an agent doing the access.
But then I think we really need a convention (User-agent: ?) that lets me tell 
search engines to stay away, while allowing LD apps to access the stuff they 
want.
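Concretely, such a convention might look something like this - the token 
"LinkedDataAgent" is purely hypothetical (no such shared user-agent exists 
today), and "Allow" is an extension to the original 1994 convention, though 
widely supported:

```
# Hypothetical: let Linked Data applications in, keep general crawlers out.
User-agent: LinkedDataAgent
Allow: /

# Everything else (search-engine spiders etc.) is asked to stay away.
User-agent: *
Disallow: /
```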

Best
Hugh
> 
> If we agree on that interpretation, a robots.txt blacklist prevents
> applications from following links to your site. In that case, my
> counter-question would be: what is the benefit of publishing your content
> as Linked Data (with dereferenceable URIs and rich links) if you
> subsequently prevent machines from discovering and accessing it
> automatically? Essentially you are requesting that humans (somehow) have
> to manually enter every URI/URL for every source, which is precisely the
> document-centric view we're trying to get away from.
> 
> Put simply, as far as I can see, a dereferenceable URI behind a robots.txt
> blacklist is no longer a dereferenceable URI ... at least for a respectful
> software agent. Linked Data behind a robots.txt blacklist is no longer
> Linked Data.
> 
> (This is quite clear in my mind but perhaps others might disagree.)
> 
> Best,
> Aidan
> 
> 

-- 
Hugh Glaser
   20 Portchester Rise
   Eastleigh
   SO50 4QS
Mobile: +44 75 9533 4155, Home: +44 23 8061 5652


