Apologies for the subject/thread hijacking.

On 9/19/12 10:13 AM, t...@lists.grepular.com wrote:
> On 19/09/12 06:36, grarpamp wrote:
>
> >> People use robots.txt to indicate that they don't want their site
> >> to be added to indexes.
>
> > They use it to indicate that they don't want their site to be
> > crawled.
>
> In almost all cases (99% or higher), robots.txt is used to indicate
> that a site shouldn't be crawled, *because* they don't want it to be
> indexed. The intention is painfully clear...
The point has been integrated into the appropriate ticket:
https://github.com/globaleaks/Tor2web-3.0/issues/19

Please add any ideas or suggestions on the topic there.

However, you should also know that it is already possible today for a Tor HS to block access from Tor2web. Tor2web sends an X-Tor2web header to announce to the Tor HS that the connection comes through Tor2web. We have added a wiki documentation section explaining how to do it:
https://github.com/globaleaks/Tor2web-3.0/wiki/Blocking-access-from-tor2web

Regarding the topic of "robots.txt": in the new Tor2web 3.0, robots.txt is "hijacked" in order to prevent crawling of Tor2web by public search engines. In addition, a list of user agents of internet spiders is blocked by default. Both blocking settings can be disabled from the config file:
https://github.com/globaleaks/Tor2web-3.0/wiki/Configuring-tor2web

These blocks will probably become less annoying once the spidering behavior is configurable directly from Tor HS sites (for example, by providing Tor2web-specific configuration strings in robots.txt).

Fabio

p.s. There's a new tor2web domain running Tor2web 3:
http://eqt5g4fuenphqinx.tor2web.blutmagie.de :-)

_______________________________________________
tor-talk mailing list
tor-talk@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-talk
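As an illustration of the X-Tor2web mechanism mentioned above, here is a minimal sketch of how a hidden service could refuse Tor2web traffic on the application side. It assumes the service runs a Python WSGI app; only the X-Tor2web header name comes from the discussion above — the middleware name, response text, and demo app are hypothetical:

```python
# Sketch: WSGI middleware that rejects requests arriving via Tor2web.
# Tor2web adds an X-Tor2web request header, which WSGI exposes to the
# application as environ["HTTP_X_TOR2WEB"].

def block_tor2web(app):
    """Wrap a WSGI app so any request carrying X-Tor2web gets a 403."""
    def wrapper(environ, start_response):
        if environ.get("HTTP_X_TOR2WEB"):
            start_response("403 Forbidden",
                           [("Content-Type", "text/plain")])
            return [b"Access via Tor2web is not allowed.\n"]
        return app(environ, start_response)
    return wrapper

def hello(environ, start_response):
    # Trivial demo app standing in for the real hidden service.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello from the hidden service.\n"]

application = block_tor2web(hello)
```

A reverse proxy in front of the service (nginx, Apache) could of course do the same check on the header without touching the application, as the wiki page above describes.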