Oh yeah, I forgot to mention backing off and such.

When I first did Web crawling of particular sites around 10 years ago with PLT Scheme, I had to schedule my visits with delays so as not to abuse any one site. There was a random component to the scheduling, too.
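
Roughly this sort of thing, in current Racket terms (just a sketch; the delay numbers and names are made up for illustration, not what I actually used):

  #lang racket
  (require net/url racket/port)

  (define base-delay 30)  ; seconds to wait between requests
  (define jitter     30)  ; up to this many extra seconds, chosen at random

  ;; Fetch each URL in turn, sleeping a randomized interval between
  ;; requests so the target site isn't hammered.
  (define (polite-fetch urls)
    (for/list ([u (in-list urls)])
      (begin0 (call/input-url (string->url u) get-pure-port port->string)
              (sleep (+ base-delay (* (random) jitter))))))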

Note that an off-the-shelf tool won't necessarily work satisfactorily for scraping. In one case, I also had to emulate the 2- or 3-click path that a human user would take through the site to get to the information, because their URLs (and the info behind them!) would change potentially a few times a minute. (Any anti-crawler mechanisms on this particular site were intended to thwart content-stealing competitors, not me.) So, if you find you need natural sequencing for time-sensitive HTTP requests within your scheduling, like I did, and you can think of a really easy way to do that, you might find it more expedient to hack up exactly what you need, rather than evaluate a bunch of off-the-shelf frameworks to see whether any of them can do it.
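
The shape of that was roughly like the sketch below (not my actual code; find-next-link is a crude stand-in for whatever per-site HTML digging is really required):

  (require net/url racket/port)

  (define (fetch u)
    (call/input-url (string->url u) get-pure-port port->string))

  ;; Crude stand-in: just grabs the first href on the page. The real
  ;; thing had to locate the particular short-lived link for the item.
  (define (find-next-link html)
    (cadr (regexp-match #rx"href=\"([^\"]*)\"" html)))

  ;; Follow the same 2- or 3-step path a human would take, re-reading
  ;; each page for the next URL rather than caching anything, since the
  ;; URLs go stale within minutes.
  (define (scrape-item entry-url)
    (define listing-page (fetch entry-url))
    (define detail-url   (find-next-link listing-page))
    (fetch detail-url))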

Noel Welsh wrote at 03/18/2011 05:06 PM:
It should be fine. Hundreds of sites is not really that many. You just
need to have backoffs etc. to avoid getting blacklisted. Using sync
and friends would make implementing this easy.
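
(For what it's worth, the "sync and friends" idea could look something like this sketch, with illustrative names: an alarm-evt becomes ready at a given wall-clock time, so you can sync on it to wait out a per-site backoff.)

  (require net/url racket/port)

  ;; Wait out delay-ms of backoff for this site, then fetch the URL.
  (define (fetch-after-backoff u delay-ms)
    (sync (alarm-evt (+ (current-inexact-milliseconds) delay-ms)))
    (call/input-url (string->url u) get-pure-port port->string))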

--
http://www.neilvandyke.org/