On 11 Aug 2014, at 15:49, Sarven Capadisli <[email protected]> wrote:

> I briefly brought up something like this to Henry Story for WebIDs. That
> is, it would be cool to encourage the use of WebIDs for crawlers, so that
> the server logs would show them in place of User-Agents. That URI could
> also say something like "we are crawling these domains, so, yes, it is
> really us if you see us in your logs (and not someone pretending)".
>
> I don't know what the state of that stuff is with WebID. Maybe Kingsley
> or Henry can comment further.

Yes, in the long term that seemed like a good idea. For WebID see:

  http://www.w3.org/2005/Incubator/webid/spec/

The WebID Profile could contain information about the type of agent, and
WebID-TLS auth would allow the robot to authenticate. (Other WebID-based
authentication methods, yet to be developed, could be used too; from
WebID-TLS you can easily work out what another such system could look
like.)
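A crawler's profile might then look something like this. It is only a
sketch: the ex: terms for the agent type and for the "we crawl these
domains" claim are made up here, since no standard vocabulary exists for
them yet.

    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix cert: <http://www.w3.org/ns/auth/cert#> .
    @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
    @prefix ex:   <http://example.org/vocab#> .   # hypothetical vocabulary

    <http://crawler.example.org/profile#me>
        a foaf:Agent, ex:Crawler ;                # the "type of Agent" info
        foaf:name "ExampleBot" ;
        ex:crawls <http://dbpedia.org/> ;         # "we crawl these domains"
        cert:key [                                # key used for WebID-TLS
            a cert:RSAPublicKey ;
            cert:modulus "00cb24ed85d64d794832064ad143f2"^^xsd:hexBinary ;
            cert:exponent 65537                   # shortened dummy key
        ] .

A server that sees a robot authenticate with that WebID can then check the
profile, rather than trust a User-Agent string.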
This would then allow one to write Web Access Control rules that give any
robot read-only access to a certain type of resource. One could also
attach usage rules (to be developed) to the document.
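In WAC terms such a rule could be as simple as the following sketch, again
using the hypothetical ex:Crawler class from the profile above:

    @prefix acl: <http://www.w3.org/ns/auth/acl#> .
    @prefix ex:  <http://example.org/vocab#> .    # hypothetical vocabulary

    # Any agent that authenticates as a member of the (hypothetical)
    # ex:Crawler class gets read-only access to the resource.
    [] a acl:Authorization ;
       acl:accessTo <http://example.org/data/observations.ttl> ;
       acl:agentClass ex:Crawler ;
       acl:mode acl:Read .

The usage rules would still need a vocabulary of their own.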
Henry

> On 2014-08-11 12:11, Hugh Glaser wrote:
>> So should we have a class of agents for Linked Data access?
>> Can you have a class of agents, or rather can an agent have more than
>> one ID? (In particular, can a spider identify as both class and spider
>> instance?)
>> Actually ldspider is quite a good class ID :-)
>>
>> On 10 Aug 2014, at 21:31, Sarven Capadisli <[email protected]> wrote:
>>
>>> Hi Hugh, just a side discussion.
>>>
>>> Currently I let all the bots have a field day, unless they are clearly
>>> abusing the site or some student didn't get the memo on keeping
>>> request rates reasonable.
>>>
>>> If I were to start blocking the bigger crawlers, Google would be the
>>> first to go. That's beside the fact that it is possible to control
>>> their crawl rate through Webmaster Tools. The main reason for me is
>>> that I simply don't see a "return" from them. They don't mind
>>> hammering the site if you let them, but try checking all those
>>> resources in Google search results - it is a gamble. I have a lot of
>>> resources which are statistical observations that don't really differ
>>> much from one document to another (at least as most humans, or Google,
>>> would see it). So, anyway, I would give SW/LD crawlers the VIP line if
>>> I can, because they tend to hit only sporadically. Which is something
>>> I can live with.
>>>
>>> -Sarven
>>>
>>> On 2014-08-09 14:17, Hugh Glaser wrote:
>>>> Hi Tobias,
>>>> I have also done the same in
>>>> http://sameas.org/robots.txt
>>>> (Well, Kingsley said "Yes" when I asked if I should :-)
>>>>
>>>> I know it is past time for the spider, but it will happen next time,
>>>> I guess.
>>>> And it will also open up all the sub-stores
>>>> (http://www.sameas.org/store/), such as Sarven's 270a.
>>>> I'm not sure how the sameas.org URIs will work in fact - it may be
>>>> that the linkage won't make it happen, but it will be interesting to
>>>> see.
>>>> Have at them whenever you like :-)
>>>>
>>>> Very best
>>>> Hugh
>>>>
>>>> On 6 Aug 2014, at 00:01, Tobias Käfer <[email protected]> wrote:
>>>>
>>>>>> :-)
>>>>>> I thought I had done what you suggested:
>>>>>>
>>>>>> User-agent: ldspider
>>>>>> Disallow:
>>>>>> Allow: /
>>>>>>
>>>>>> Which should allow ldspider to crawl the site.
>>>>>
>>>>> OK, then I got your "No, thank you." line wrong.
>>>>>
>>>>> But the robots.txt is fine then :) and ldspider will no longer
>>>>> refrain from crawling the site.
>>>>>
>>>>> By the way, one of the two lines - "Allow: /" and "Disallow:" - is
>>>>> sufficient. The Disallow line is the older way of putting it, so you
>>>>> might want to remove the "Allow: /" line again.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Tobias
>>>>>
>>>>>> On 5 Aug 2014, at 18:06, Tobias Käfer <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Hugh,
>>>>>>>
>>>>>>> sorry for getting you wrong, but I still do not get what behaviour
>>>>>>> you want. What you are saying looks different from the robots.txt.
>>>>>>> If you tell me how you want it, I can help with the robots.txt
>>>>>>> (hopefully).
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Tobias
>>>>>>>
>>>>>>> On 05.08.2014 at 19:01, Hugh Glaser wrote:
>>>>>>>> Hi Tobias,
>>>>>>>> On 5 Aug 2014, at 17:33, Tobias Käfer <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Hugh,
>>>>>>>>>
>>>>>>>>>> By the way, have I got my robots.txt right?
>>>>>>>>>> In particular, is the
>>>>>>>>>> User-agent: LDSpider
>>>>>>>>>> correct?
>>>>>>>>>> Should I worry about case-sensitivity?
>>>>>>>>>
>>>>>>>>> The library (norbert) that is employed in LDspider is
>>>>>>>>> case-insensitive for the user agent. The user agent that is sent
>>>>>>>>> is "ldspider".
>>>>>>>>>
>>>>>>>>> I suppose you want ldspider to crawl your site (highly
>>>>>>>>> appreciated),
>>>>>>>> No, thank you.
>>>>>>>>> so you should change the line in your robots.txt for LDspider to:
>>>>>>>>> a) Disallow:
>>>>>>>>> b) Allow: /
>>>>>>>>> and not leave it with:
>>>>>>>>> c) Allow: *
>>>>>>>>> The star there does not bring the desired behaviour (and I have
>>>>>>>>> not found it in the spec for the path either); in fact, it keeps
>>>>>>>>> LDspider from crawling the folders you specified for exclusion
>>>>>>>>> for the other crawlers.
>>>>>>>> Hopefully it is OK now:
>>>>>>>> http://ibm.rkbexplorer.com/robots.txt
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> Tobias
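Putting the robots.txt advice quoted above together: a file that lets
ldspider in while keeping the exclusions for other crawlers would look
roughly like this (/private/ is a made-up stand-in for the folders
actually excluded on the site):

    # An empty Disallow means ldspider may fetch everything;
    # no "Allow: /" line is needed on top of it.
    User-agent: ldspider
    Disallow:

    # Other crawlers keep the existing exclusions.
    # /private/ is a hypothetical example path.
    User-agent: *
    Disallow: /private/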
Social Web Architect
http://bblfish.net/