László Török:
> I was wondering though how do you make sure two
> crawlers do not crawl the same URL twice if there is no global state? :)
By adding shared state for a single app instance, typically an atom. As for separating different instances, it is not uncommon to hash seed URLs (or domains) in such a way that two instances simply won't crawl the same site in parallel.

> You may also consider using the sitemap as a source of urls per domain,
> although this depends on the crawling policy.

That does not work in practice. One reason is that sitemaps are often incomplete, out of date, or missing entirely. Another is that, for most news websites and blogs, you will discover site structure much faster by frequently (within reason, of course) recrawling either first-level pages or a seed set of known "section" pages.

There is a really good workshop video on Web mining from Strata Santa Clara 2012 that highlights two dozen more common problems you face when designing Web crawlers: http://my.safaribooksonline.com/video/-/9781449336172 Highly recommended for people who are interested or work in this area. (I think it can be purchased separately; O'Reilly Safari subscribers have access to the entire video set.)

I am by no means an expert (or even very experienced) in this area, but Itsy has features that solve several very common problems out of the box in 0.1.0. Good job.

MK
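To make the atom idea concrete, here is a minimal sketch (not Itsy's actual implementation) of a shared visited-URL set inside one crawler instance. The names `visited` and `claim-url!` are mine; the point is that `compare-and-set!` lets many worker threads race to claim a URL, and exactly one of them wins:

```clojure
;; Shared state for one app instance: the set of URLs already claimed.
(def visited (atom #{}))

(defn claim-url!
  "Atomically claims url for crawling. Returns true exactly once per
  url across all threads sharing this atom; every later caller gets
  false and should skip the url."
  [url]
  (let [seen @visited]
    (if (contains? seen url)
      false
      (if (compare-and-set! visited seen (conj seen url))
        true
        ;; another thread updated the atom first; retry from scratch
        (recur url)))))
```

Workers then crawl a URL only when `(claim-url! url)` returns true, so no URL is fetched twice within the instance.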
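And a sketch of the cross-instance partitioning I mentioned: hash each URL's domain modulo the number of instances, and have each instance only crawl URLs whose domain hashes to its own id. `n-instances` and `my-instance-id` are hypothetical configuration values, not anything Itsy provides:

```clojure
;; Partition domains across N crawler instances so that two
;; instances never crawl the same site in parallel.
(defn domain-of
  "Extracts the host part of a URL, e.g. \"example.com\"."
  [url]
  (.getHost (java.net.URI. url)))

(defn mine?
  "True when this instance is responsible for url's domain.
  Clojure's mod always yields a non-negative result for a positive
  divisor, so negative hash values are handled correctly."
  [url n-instances my-instance-id]
  (= my-instance-id
     (mod (hash (domain-of url)) n-instances)))
```

Since every URL on a given site maps to the same instance, per-site politeness (crawl delay, robots.txt state) also stays local to one process.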