Hi,

Interesting project. I was wondering, though: how do you make sure two
crawlers don't crawl the same URL twice if there is no global state? :)
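
(To make the question concrete: I could imagine each crawler keeping its
own seen-set in an atom, along these lines -- just a sketch of what I mean
by per-crawler state, not a claim about Itsy's internals:

  ;; hypothetical per-crawler state: each crawler owns its own seen set
  (defn make-crawler []
    {:seen (atom #{})})

  (defn mark-seen!
    "Atomically add url to this crawler's seen set (swap-vals! needs
     Clojure 1.9+). Returns true only if this crawler had not seen url."
    [{:keys [seen]} url]
    (let [[old-set _] (swap-vals! seen conj url)]
      (not (contains? old-set url))))

With that design a single crawler never revisits a URL, but two separate
crawlers pointed at overlapping sites would still fetch the same pages,
which is the case I'm curious about.)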

If I read it correctly, you're going to have to spawn a lot of threads to
have at least a few busy with extraction at any point in time, since most
of them will be blocked most of the time waiting for a page to be
retrieved.
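
(Roughly what I have in mind -- a plain blocking worker loop, not Itsy's
code:

  ;; crude illustration of a blocking worker; slurp parks the thread
  ;; until the whole response has arrived
  (defn crawl-one [url extract]
    (let [body (slurp url)]   ; network IO, typically hundreds of ms
      (extract body)))        ; CPU-bound extraction, typically a few ms

  (defn start-worker [next-url! extract]
    ;; next-url! supplies the next URL, or nil when the queue is drained
    (future
      (loop []
        (when-let [url (next-url!)]
          (crawl-one url extract)
          (recur)))))

Since almost every iteration is spent inside slurp, keeping even a handful
of threads busy with extraction means running a much larger pool of them.)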

You may also consider using the sitemap as a source of URLs per domain,
although this depends on the crawling policy.
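
(For instance, a small helper like this -- hypothetical, and assuming the
site publishes a sitemap at the conventional location -- would give you the
listed URLs for a domain:

  (require '[clojure.xml :as xml])

  (defn sitemap-urls
    "Fetch http://<domain>/sitemap.xml and return the URLs it lists."
    [domain]
    (->> (xml/parse (str "http://" domain "/sitemap.xml"))
         xml-seq                         ; walk every element
         (filter #(= :loc (:tag %)))     ; <loc> holds each page URL
         (mapcat :content)))

  ;; e.g. (take 5 (sitemap-urls "example.com"))

Seeding the queue that way gives broad coverage without link extraction,
though it only fits crawls where honoring the sitemap is the policy you
want.)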

Regards,

Laszlo

2012/6/1 Lee Hinman <matthew.hin...@gmail.com>

>
> Hi all,
> I'm pleased to announce the initial 0.1.0 release of Itsy. Itsy is a
> threaded web spider written in Clojure. A list of some of the Itsy
> features:
>
> - Multithreaded, with the ability to add and remove workers as needed
> - No global state; run multiple crawlers with multiple threads at once
> - Pre-written handlers for writing to text files and ElasticSearch
> - Skips URLs that have been seen before
> - Domain limiting to crawl only pages belonging to a certain domain
>
> You should be able to use it from Clojars[1] with the following:
>
> [itsy "0.1.0"]
>
> Please give it a try and open any issues on the github repo[2] that
> you find. Check out the readme for the full information and usage.
>
> thanks,
> Lee Hinman
>
> [1]: https://clojars.org/itsy
> [2]: https://github.com/dakrone/itsy




-- 
László Török
