On Sun, 11 Sep 2005, Mads N. Vestergaard wrote:

> I have a few minor problems.
> I need to get the content of a website, and search a bit in it.
>
> I'm using the package called LWP::Simple
Not to complicate things, but have you looked at WWW::Mechanize?

http://search.cpan.org/~petdance/WWW-Mechanize/lib/WWW/Mechanize.pm

LWP::Simple is a perfectly good tool, but it means you have to do a lot of the work yourself. With Mech, on the other hand, a lot of the work is pre-bundled for you, so you can work at a higher level without getting involved in as many of the details.

The Mech documentation also points out the book _Spidering Hacks_, which gets into ways you can automate this kind of work.

Related modules:

http://search.cpan.org/search?query=www%3A%3Amechanize&mode=all

* * *

Also, spidering can, by nature, be a bit slow. In fact, that's a good thing -- it's generally considered rude to write a script that quickly swarms over all the pages on a remote web server, sucking up all its bandwidth. It's more polite to force your script to run at a slower pace, grabbing URLs no faster than once per minute or so.

If you want to be more productive, you can parallelize things by downloading from 60 sites at a time, round-robin style, so that each one only gets hit once a minute or so. (Obviously, if there's only one site you're interested in mirroring, this trick won't help you.)

If pacing yourself this way means mirroring a big site takes hours, so be it. The easy way to deal with that is to leave the script running overnight and do the post-processing later, once it's done.
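To make the first point concrete, here's roughly what the fetch-and-search step could look like with Mech. The URL and the pattern below are just placeholders; substitute the real site and whatever text you're actually hunting for:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;

    # autocheck => 1 makes Mech die with a useful message if a fetch fails
    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get('http://www.example.com/');        # placeholder URL

    # content() returns the raw HTML, so you can pattern-match on it directly
    if ( $mech->content =~ /some phrase/i ) {     # placeholder pattern
        print "Found it\n";
    }

    # Mech has already parsed the page's links for you, too
    for my $link ( $mech->links ) {
        print $link->url_abs, "\n";
    }

Doing the same thing with LWP::Simple means writing the link extraction and error handling yourself, which is exactly the work Mech bundles up for you.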
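And the once-per-minute pacing doesn't have to be anything fancy; for the single-site case, a sleep between fetches is enough. Something along these lines -- again just a sketch, with the URL list coming from wherever you happen to build it:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;

    my @urls = @ARGV;    # or however you build your list of pages
    my $mech = WWW::Mechanize->new( autocheck => 1 );

    for my $url (@urls) {
        $mech->get($url);

        # ... save or search $mech->content here ...

        sleep 60;        # be polite: roughly one request per minute per site
    }

The round-robin version is the same idea, except you keep one queue per site and rotate through them, so the waits overlap instead of stacking up.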
--
Chris Devers