Hi, I don't want to turn this into a lengthy discussion about crawling, but I'm happy to continue off list. ;)

Sitemaps work surprisingly well in certain domains (web shops powered by standard web shop software, large e-commerce sites) and, in our experience, can make life easier.

Another point: a nice addition would be to observe polite crawling (e.g. do not retrieve more than one page per second from a domain). We got banned once due to excessive traffic from a single IP.

Anyway, thanks for sharing. I hope to get some hacking time to implement one of our extractors as a handler in Clojure and take Itsy out for a spin. :)

Las

2012/6/1 Michael Klishin <michael.s.klis...@gmail.com>

> László Török:
>
> > I was wondering though how do you make sure two
> > crawlers do not crawl the same URL twice if there is no global state? :)
>
> By adding shared state; for a single app instance, typically an atom. As
> for separating different instances, it is not uncommon to hash seed URLs
> (or domains) in such a way that two instances simply won't crawl the same
> site in parallel.
>
> > You may also consider using the sitemap as a source of urls per domain,
> > although this depends on the crawling policy.
>
> That does not work in practice. One reason is that sitemaps are often
> incomplete, out of date or missing completely. Another is that, for most
> news websites and blogs, you will discover site structure a lot faster by
> frequently (within reason, of course) recrawling either first level pages
> or a seed of known "section" pages.
>
> There is a really good workshop video on Web mining from Strata Santa
> Clara 2012; it highlights two dozen more common problems you face when
> designing Web crawlers:
>
> http://my.safaribooksonline.com/video/-/9781449336172
>
> Highly recommended for people who are interested or work in this area (I
> think it can be purchased separately; O'Reilly Safari subscribers have
> access to the entire video set).
>
> I am by no means an expert (or even very experienced) in this area, but
> Itsy has features that solve several very common problems out of the box
> in 0.1.0. Good job.
>
> MK
>
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with
> your first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
>
--
László Török
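
P.S. To make the politeness and partitioning points from this thread a bit more concrete, here is a rough sketch. It is in Python rather than Clojure and has nothing to do with Itsy's actual API; every name in it (domain_of, owner_instance, PoliteScheduler) is made up for illustration. It assumes a minimum delay of one second between fetches from the same domain, and it assigns each domain to one of N crawler instances via a stable hash, so that two instances never crawl the same site in parallel.

```python
import hashlib
import time
from urllib.parse import urlsplit


def domain_of(url):
    """Return the lowercased host part of a URL ('' if there is none)."""
    return (urlsplit(url).hostname or "").lower()


def owner_instance(url, num_instances):
    """Map a URL's domain to one crawler instance.

    Uses SHA-1 instead of the built-in hash(), because hash() on strings
    is randomized per process and would differ between instances.
    """
    digest = hashlib.sha1(domain_of(url).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_instances


class PoliteScheduler:
    """Tracks the earliest time each domain may be fetched again."""

    def __init__(self, min_delay=1.0, clock=time.monotonic):
        self.min_delay = min_delay  # assumed politeness interval, in seconds
        self.clock = clock          # injectable so the logic can be tested
        self.next_allowed = {}      # domain -> earliest allowed fetch time

    def wait_time(self, url):
        """Return seconds to wait before fetching url, and reserve the slot."""
        now = self.clock()
        domain = domain_of(url)
        start = max(now, self.next_allowed.get(domain, now))
        self.next_allowed[domain] = start + self.min_delay
        return start - now
```

A crawler loop would skip any URL whose owner_instance is not its own index and call time.sleep(scheduler.wait_time(url)) before each fetch. This only spaces out requests per domain within one process; it does not look at robots.txt, and several domains behind one IP would still need extra care.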