It should be fine. Hundreds of sites is not really that many. You just
need backoffs etc. to avoid getting blacklisted. Racket's sync and
friends would make implementing the concurrent fetching easy.
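As a rough illustration, a polite fetcher might look something like this
(fetch/backoff, fetch-all, and the retry parameters are hypothetical names
for this sketch, not a stock library API):

```racket
#lang racket
(require net/url)

;; Hypothetical helper: fetch one page, retrying with exponential
;; backoff so we stay polite and avoid getting blacklisted.
(define (fetch/backoff u #:tries [tries 5] #:wait [wait 1])
  (with-handlers ([exn:fail?
                   (lambda (e)
                     (cond [(<= tries 1) (raise e)]
                           [else (sleep wait)          ; back off, then retry
                                 (fetch/backoff u
                                                #:tries (sub1 tries)
                                                #:wait (* 2 wait))]))])
    (port->string (get-pure-port (string->url u)))))

;; One thread per site; threads are events, so sync waits for
;; each one to terminate.
(define (fetch-all urls)
  (define threads
    (for/list ([u (in-list urls)])
      (thread (lambda () (fetch/backoff u)))))
  (for-each sync threads))
```

For hundreds of sites you'd probably also want a per-host delay and a
cap on concurrent connections, but the thread/sync skeleton stays the same.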

If you want to extract unstructured data, there is some good reading here:

  http://metaoptimize.com/qa/questions/3440/text-extraction-from-html-pages

Roping together various existing systems is probably the most efficient
way to get a scalable solution working. See Apache Tika and the projects
referenced above. There is also a surprising amount of work on scalable
web spiders (Google the phrase "scalable web spider" if you're interested).

HTH,
N.

On Fri, Mar 18, 2011 at 7:29 PM, Geoffrey S. Knauth <ge...@knauth.org> wrote:
> I'm evaluating whether to use Racket to data mine hundreds of websites 
> pulling out business information within an industry.  I think Racket is up to 
> it, but I'm wondering if anyone else has had experiences positive or 
> negative.  I've used other tools to do rudimentary digging, but this project 
> is likely to touch AI, which brings me back to the Lisp family.
>
> Geoff

_________________________________________________
  For list-related administrative tasks:
  http://lists.racket-lang.org/listinfo/users
