On 11 Aug 2014, at 15:49, Sarven Capadisli <[email protected]> wrote:

> I briefly brought up something like this with Henry Story for WebIDs. That is,
> it'd be cool to encourage the use of WebIDs for crawlers, so that the
> server logs would show them in place of User-Agents. That URI could also say
> something like "we are crawling these domains, so, yes, it is really us if
> you see us in your logs (and not someone pretending)".
> 
> I don't know what the state of that stuff is with WebID. Maybe Kingsley or 
> Henry can comment further.

Yes, that seemed like a good idea in the long term.

For WebID see: http://www.w3.org/2005/Incubator/webid/spec/

The WebID Profile could contain information about the type of agent.
WebID-TLS authentication would allow the robot to authenticate.
( Other WebID-based authentication methods, still to be developed, could be
  used too; from WebID-TLS you can easily work out what another such
  system would look like. )

This would then allow one to create Web Access Control rules that
grant any robot read-only access to a certain type of resource.
One could also attach usage rules ( still to be developed ) to the document.
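
As a rough sketch, a crawler's WebID Profile could look something like the
following Turtle (all the URIs are made up, and ex:Crawler and ex:crawls are
hypothetical placeholders, not existing vocabulary):

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix cert: <http://www.w3.org/ns/auth/cert#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/vocab#> .

<http://crawler.example.org/bot#me>
    a foaf:Agent, ex:Crawler ;              # ex:Crawler is hypothetical
    foaf:name "ExampleBot" ;
    ex:crawls <http://sameas.org/> ;        # "yes, we really crawl this domain"
    cert:key [                              # public key used for WebID-TLS
        a cert:RSAPublicKey ;
        cert:modulus "00cb24ed..."^^xsd:hexBinary ;  # truncated example value
        cert:exponent 65537
    ] .

A Web Access Control rule giving any such robot read-only access could then
look like this (acl: is the WAC vocabulary; ex:Crawler again is made up):

@prefix acl: <http://www.w3.org/ns/auth/acl#> .
@prefix ex:  <http://example.org/vocab#> .

[] a acl:Authorization ;
    acl:accessTo <http://example.org/data/observations.ttl> ;
    acl:mode acl:Read ;                     # read-only
    acl:agentClass ex:Crawler .             # any agent typed as ex:Crawler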

Henry


> On 2014-08-11 12:11, Hugh Glaser wrote:
>> So should we have a class of agents for Linked Data access?
>> Can you have a class of agents, or rather can an agent have more than one ID?
>> (In particular, can a spider identify as both class and spider instance?)
>> Actually ldspider is quite a good class ID :-)
>> 
>> On 10 Aug 2014, at 21:31, Sarven Capadisli <[email protected]> wrote:
>> 
>>> Hi Hugh, just a side discussion.
>>> 
>>> Currently I let all the bots have a field day, unless they are clearly
>>> abusive or some student didn't get the memo about making requests at a
>>> reasonable rate.
>>> 
>>> If I were to start blocking the bigger crawlers, Google would be first to 
>>> go. That's beside the fact that it is possible to control their crawl rate 
>>> through Webmaster tools. The main reason for me is that, I simply don't see 
>>> a "return" from them. They don't mind hammering the site if you let them, 
>>> but try checking all those resources in Google search results - it is a 
>>> gamble. I have a lot of resources which are statistical observations that
>>> don't really differ much from one document to another (at least as most
>>> humans, or Google, would judge). So, anyway, I would give SW/LD crawlers
>>> the VIP line if I can, because they tend to hit sporadically, which is
>>> something I can live with.
>>> 
>>> -Sarven
>>> 
>>> On 2014-08-09 14:17, Hugh Glaser wrote:
>>>> Hi Tobias,
>>>> I have also done the same in
>>>> http://sameas.org/robots.txt
>>>> (Well, Kingsley said “Yes”, when I asked if I should :-)
>>>> 
>>>> I know it is past time for the spider, but it will happen next time, I 
>>>> guess.
>>>> And it will also open up all the sub-stores 
>>>> (http://www.sameas.org/store/), such as Sarven’s 270a.
>>>> I’m not sure how the sameas.org URIs will work in practice - it may be
>>>> that the linkage won’t make it happen, but it will be interesting to see.
>>>> Have at them whenever you like :-)
>>>> 
>>>> Very best
>>>> Hugh
>>>> 
>>>> On 6 Aug 2014, at 00:01, Tobias Käfer <[email protected]> wrote:
>>>> 
>>>>>> :-)
>>>>>> I thought I had done what you suggested:
>>>>>> 
>>>>>> User-agent: ldspider
>>>>>> Disallow:
>>>>>> Allow: /
>>>>>> 
>>>>>> Which should allow ldspider to crawl the site.
>>>>> 
>>>>> OK, then I misread your "No, thank you." line.
>>>>> 
>>>>> But the robots.txt is fine then :) and ldspider will not refrain from 
>>>>> crawling the site any more.
>>>>> 
>>>>> Btw, only one of the two lines - "Allow: /" and "Disallow:" - is needed.
>>>>> The empty "Disallow:" line is the older way of putting it, so you might
>>>>> want to remove the "Allow: /" line again.
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Tobias
>>>>> 
>>>>>> On 5 Aug 2014, at 18:06, Tobias Käfer <[email protected]> wrote:
>>>>>> 
>>>>>>> Hi Hugh,
>>>>>>> 
>>>>>>> sorry for misunderstanding you, but I still do not get what behaviour
>>>>>>> you want. What you are saying looks different from what the robots.txt
>>>>>>> does. If you tell me how you want it, I can hopefully help with the
>>>>>>> robots.txt.
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> 
>>>>>>> Tobias
>>>>>>> 
>>>>>>>> On 05.08.2014 at 19:01, Hugh Glaser wrote:
>>>>>>>> Hi Tobias,
>>>>>>>> On 5 Aug 2014, at 17:33, Tobias Käfer <[email protected]> wrote:
>>>>>>>> 
>>>>>>>>> Hi Hugh,
>>>>>>>>> 
>>>>>>>>>> By the way, have I got my robots.txt right?
>>>>>>>>>> In particular, is the
>>>>>>>>>> User-agent: LDSpider
>>>>>>>>>> correct?
>>>>>>>>>> Should I worry about case-sensitivity?
>>>>>>>>> 
>>>>>>>>> The library (norbert) that LDspider employs matches the user agent
>>>>>>>>> case-insensitively. The user-agent string it sends is "ldspider".
>>>>>>>>> 
>>>>>>>>> I suppose you want ldspider to crawl your site (highly appreciated),
>>>>>>>> No, thank you.
>>>>>>>>> so you should change the line in your robots.txt for LDspider to:
>>>>>>>>> a) Disallow:
>>>>>>>>> b) Allow: /
>>>>>>>>> And not leave it with:
>>>>>>>>> c) Allow: *
>>>>>>>>> The star there does not produce the desired behaviour (and I have not
>>>>>>>>> found it in the spec for the path field either); in fact, it keeps
>>>>>>>>> LDspider from crawling the very folders you excluded for the other
>>>>>>>>> crawlers.
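>>>>>>>>> 
>>>>>>>>> For instance (a sketch; the /private/ path is made up), a robots.txt
>>>>>>>>> like:
>>>>>>>>> 
>>>>>>>>> User-agent: *
>>>>>>>>> Disallow: /private/
>>>>>>>>> 
>>>>>>>>> User-agent: ldspider
>>>>>>>>> Disallow:
>>>>>>>>> 
>>>>>>>>> keeps the other crawlers out of /private/ while letting ldspider
>>>>>>>>> crawl everything.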
>>>>>>>> Hopefully it is OK now:
>>>>>>>> http://ibm.rkbexplorer.com/robots.txt
>>>>>>>> 
>>>>>>>> Cheers
>>>>>>>>> 
>>>>>>>>> Cheers,
>>>>>>>>> 
>>>>>>>>> Tobias
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>> 
>>> 
>>> 
>> 
> 
> 

Social Web Architect
http://bblfish.net/

