Hi,

I don't want to turn this into a lengthy discussion about crawling, but I'm
happy to continue off-list. ;)

Sitemaps work surprisingly well in certain domains (web shops powered by
standard web shop software, large e-commerce sites) and, in our experience,
can make life easier.
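
To make the sitemap idea concrete, here is a rough sketch of pulling the
URLs out of a sitemap that has already been downloaded. It is Python rather
than Clojure just to keep it short, and sitemap_urls is a made-up helper
name, not something Itsy provides. It assumes the standard sitemaps.org
namespace:

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol (version 0.9).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return every <loc> value found in a sitemap document.

    Hypothetical helper for illustration. For a sitemap index file the
    returned values are the URLs of further sitemaps, not of pages.
    """
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


if __name__ == "__main__":
    sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url><loc>http://shop.example/p/1</loc></url>
      <url><loc>http://shop.example/p/2</loc></url>
    </urlset>"""
    for url in sitemap_urls(sample):
        print(url)
```

Whether the result is used as the only URL source or just as extra seeds
depends on the crawling policy, as mentioned before.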

Another point: a nice addition would be to observe polite crawling (e.g. do
not retrieve more than one page per second from a domain). We got banned
once due to excessive traffic from a single IP.
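
A minimal sketch of what I mean by polite crawling, again in Python and
only as an illustration: PoliteScheduler and its parameters are names I
made up, not part of Itsy. It remembers, per domain, the earliest time the
next request is allowed and sleeps until then. It is single-threaded; a
concurrent crawler would need a lock around the bookkeeping.

```python
import time
from urllib.parse import urlparse


class PoliteScheduler:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay      # seconds between hits on one domain
        self._clock = clock             # injectable for testing
        self._sleep = sleep
        self._next_allowed = {}         # domain -> earliest next request time

    def wait_time(self, url):
        """Return how many seconds to wait before `url` may be fetched."""
        domain = urlparse(url).netloc.lower()
        now = self._clock()
        return max(0.0, self._next_allowed.get(domain, now) - now)

    def acquire(self, url):
        """Block until `url` may be fetched, then reserve the next slot."""
        domain = urlparse(url).netloc.lower()
        delay = self.wait_time(url)
        if delay > 0:
            self._sleep(delay)
        self._next_allowed[domain] = self._clock() + self.min_delay


if __name__ == "__main__":
    scheduler = PoliteScheduler(min_delay=1.0)
    for url in ["http://example.com/a", "http://example.com/b"]:
        scheduler.acquire(url)          # second call waits about one second
        print("fetching", url)
```

The crawler would call acquire(url) right before each fetch; requests to
different domains do not delay each other.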

Anyway, thanks for sharing. I hope to get some hacking time to implement
one of our extractors as a handler in Clojure and take Itsy out for a spin.
:)

Las

2012/6/1 Michael Klishin <michael.s.klis...@gmail.com>

> László Török:
>
> > I was wondering though how do you make sure two
> > crawlers do not crawl the same URL twice if there is no global state? :)
>
> By adding shared state; for a single app instance, typically an atom. As
> for separating different instances, it is not uncommon to hash seed URLs
> (or domains) in such a way that two instances simply won't crawl the same
> site in parallel.
>
>
> > You may also consider using the sitemap as a source of urls per domain,
> > although this depends on the crawling policy.
>
> That does not work in practice. One reason is, sitemaps are often
> incomplete, out of date or missing
> completely. Another one, for most news websites and blogs, you will
> discover site structure a lot
> faster by frequently (within reason, of course) recrawling either first
> level pages or a seed of known
> "section" pages.
>
> There is a really good video of a Web mining workshop from Strata Santa
> Clara 2012; it highlights two dozen more common problems you face when
> designing Web crawlers:
>
> http://my.safaribooksonline.com/video/-/9781449336172
>
> Highly recommended for people who are interested or work in this area (I
> think it can be purchased separately; O'Reilly Safari subscribers have
> access to the entire video set).
>
> I am by no means an expert (or even very experienced) in this area but
> Itsy has features that solve several very common
> problems out of the box in 0.1.0. Good job.
>
> MK
>
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with
> your first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
>



-- 
László Török

