Oh yeah, I forgot to mention backing off and such.

When I first did Web crawling of particular sites around 10 years ago with PLT Scheme, I had to schedule my visits with delays so as not to abuse any one site. There was a random component to the scheduling, too.
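
Roughly this sort of thing, in current Racket terms (just a sketch; the delay numbers and names are made up for illustration, not what I actually used):

  #lang racket
  (require net/url racket/port)

  (define base-delay 30)  ; seconds to wait between requests
  (define jitter     30)  ; up to this many extra seconds, chosen at random

  ;; Fetch each URL in turn, sleeping a randomized interval between
  ;; requests so the target site isn't hammered.
  (define (polite-fetch urls)
    (for/list ([u (in-list urls)])
      (begin0 (call/input-url (string->url u) get-pure-port port->string)
              (sleep (+ base-delay (* (random) jitter))))))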

Note that an off-the-shelf tool won't necessarily work satisfactorily for scraping. In one case, I also had to emulate the 2- or 3-click path that a human user would take through the site to get to the information, because their URLs (and the info behind them!) would change potentially a few times a minute. (Any anti-crawler mechanisms on this particular site were intended to thwart content-stealing competitors, not me.) So, if you find you need natural sequencing for time-sensitive HTTP requests within your scheduling, like I did, and you can think of a really easy way to do that, you might find it more expedient to hack up exactly what you need, rather than evaluate a bunch of off-the-shelf frameworks to see whether any of them can do it.
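
The shape of that was roughly like the sketch below (not my actual code; find-next-link is a crude stand-in for whatever per-site HTML digging is really required):

  (require net/url racket/port)

  (define (fetch u)
    (call/input-url (string->url u) get-pure-port port->string))

  ;; Crude stand-in: just grabs the first href on the page. The real
  ;; thing had to locate the particular short-lived link for the item.
  (define (find-next-link html)
    (cadr (regexp-match #rx"href=\"([^\"]*)\"" html)))

  ;; Follow the same 2- or 3-step path a human would take, re-reading
  ;; each page for the next URL rather than caching anything, since the
  ;; URLs go stale within minutes.
  (define (scrape-item entry-url)
    (define listing-page (fetch entry-url))
    (define detail-url   (find-next-link listing-page))
    (fetch detail-url))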

Noel Welsh wrote at 03/18/2011 05:06 PM:
It should be fine. Hundreds of sites is not really that many. You just
need to have backoffs etc. to avoid getting blacklisted. Using sync
and friends would make implementing this easy.
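
(For what it's worth, the "sync and friends" idea could look something like this sketch, with illustrative names: an alarm-evt becomes ready at a given wall-clock time, so you can sync on it to wait out a per-site backoff.)

  (require net/url racket/port)

  ;; Wait out delay-ms of backoff for this site, then fetch the URL.
  (define (fetch-after-backoff u delay-ms)
    (sync (alarm-evt (+ (current-inexact-milliseconds) delay-ms)))
    (call/input-url (string->url u) get-pure-port port->string))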

--
http://www.neilvandyke.org/