Re: [ANN] Itsy 0.1.0 released, a threaded web spider written in Clojure

2012-06-01 Thread László Török
Hi, I don't want to turn this into a lengthy discussion about crawling, but I'm happy to continue off list. ;) Sitemaps work surprisingly well in certain domains (web shops powered by standard web shop software, large e-commerce sites) and can make life easier, in our experience. Another point: i n

Re: [ANN] Itsy 0.1.0 released, a threaded web spider written in Clojure

2012-06-01 Thread Michael Klishin
László Török: > I was wondering though how do you make sure two crawlers do not crawl the same URL twice if there is no global state? :) By adding shared state: for a single app instance, typically an atom. As for separating different instances, it is not uncommon to hash seed URLs (or domain
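
To make the two ideas in this reply concrete, here is a minimal sketch, written in Python rather than Clojure. Nothing in it comes from Itsy itself: `SeenUrls`, `claim` and `instance_for` are hypothetical names, a lock-guarded set stands in for the atom, and hashing the URL's host with SHA-1 is only one possible reading of "hash seed URLs".

```python
import hashlib
import threading
from urllib.parse import urlsplit


class SeenUrls:
    """Shared in-process record of claimed URLs (stands in for the atom)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._seen = set()

    def claim(self, url):
        """Return True only for the first caller that claims this URL."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            return True


def instance_for(url, instance_count):
    """Pick a crawler instance for a URL by hashing its host (assumed scheme)."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    # A stable digest is used instead of the built-in hash(), which is
    # randomized per process for strings and so differs between instances.
    return int.from_bytes(digest[:8], "big") % instance_count
```

Within one process, each worker thread would call `claim(url)` before fetching and skip the URL when it returns False. Across processes, each instance would keep only the URLs for which `instance_for(url, n)` equals its own index, so no state has to be shared between instances.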

Re: [ANN] Itsy 0.1.0 released, a threaded web spider written in Clojure

2012-06-01 Thread László Török
Hi, interesting project. I was wondering though how do you make sure two crawlers do not crawl the same URL twice if there is no global state? :) If I read it correctly, you're going to have to spawn a lot of threads to have at least a few busy with extraction at any point in time, as most of them