Oh yeah, I forgot to mention backing off and such.
When I first did Web crawling of particular sites, around 10 years ago
with PLT Scheme, I had to schedule my visits with delays so as not to
abuse the sites. There was a random component to the scheduling, too.
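
In today's Racket, a sketch along these lines gives the flavor; the
URL list, the fetching idiom, and the particular delay numbers are
just placeholders, not the code I actually used:

  #lang racket
  (require net/url)

  ;; Fetch each URL in order, sleeping between requests so as not to
  ;; hammer the site.
  (define (polite-fetch urls)
    (for/list ((u (in-list urls)))
      (let ((page (call/input-url (string->url u)
                                  get-pure-port
                                  port->string)))
        ;; Sleep a base delay plus some random jitter before the next
        ;; request; 5 seconds base and up to 10 seconds of jitter are
        ;; arbitrary placeholder values.
        (sleep (+ 5 (* 10 (random))))
        page)))
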
Note that an off-the-shelf tool won't necessarily work satisfactorily
for scraping. In one case, I also had to emulate the 2-
or 3-click path that a human user would take through the site to get to
the information, because their URLs (and the info behind them!) would
change potentially a few times a minute. (Any anti-crawler mechanisms
on this particular site were intended to thwart content-stealing
competitors, not me.) So, if you find you need natural sequencing for
time-sensitive HTTP requests within your scheduling, as I did, and you
can think of an easy way to do that, it may be more expedient to hack
up exactly what you need than to evaluate a bunch of off-the-shelf
frameworks to see whether any of them will do it.
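
If it helps to see what I mean by sequencing, here's a rough sketch,
using the same net/url calls as above; extract-next-link and
extract-data are hypothetical stand-ins for whatever site-specific
parsing you'd need:

  ;; Walk the same 2- or 3-click path a person would, re-deriving each
  ;; URL from the page just fetched, since the links (and the pages
  ;; behind them) may change within minutes.  extract-next-link and
  ;; extract-data are hypothetical site-specific parsers.
  (define (follow-click-path entry-url)
    (define (fetch u)
      (call/input-url (string->url u) get-pure-port port->string))
    (let* ((page-1 (fetch entry-url))
           (page-2 (fetch (extract-next-link page-1)))
           (page-3 (fetch (extract-next-link page-2))))
      (extract-data page-3)))
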
Noel Welsh wrote at 03/18/2011 05:06 PM:
It should be fine. Hundreds of sites is not really that many. You just
need to have backoffs etc. to avoid getting blacklisted. Using sync
and friends would make implementing this easy.
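
For what it's worth, a minimal sketch of the sync-based delay Noel
mentions might look like this (plain Racket, no extra libraries); the
shutdown channel is just an illustrative second event, not something
from my old crawler:

  ;; One way to use sync for a polite delay: the delay is just an
  ;; alarm event, so it can be waited on alongside other events.
  (define (wait-backoff delay-secs shutdown-channel)
    (sync (alarm-evt (+ (current-inexact-milliseconds)
                        (* 1000 delay-secs)))
          shutdown-channel))
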
--
http://www.neilvandyke.org/