On Sun, 11 Sep 2005, Mads N. Vestergaard wrote:

> I have a few minor problems.
> I need to get the content of a website, and search a bit in it.
> 
> I'm using the package called LWP::Simple

Not to complicate things, but have you looked at WWW::Mechanize ?

http://search.cpan.org/~petdance/WWW-Mechanize/lib/WWW/Mechanize.pm

LWP::Simple is a perfectly good tool, but it very much means that you 
have to do a lot of the work yourself. With Mech, on the other hand, a 
lot of the work is pre-bundled for you, so you can work at a higher 
level without getting involved in as many of the details.
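Just to give you an idea, a bare-bones Mech script for "fetch a page and
search in it" might look something like this -- the URL and the pattern
are made up, of course, so substitute your own:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new();
    $mech->get('http://www.example.com/');     # made-up URL
    die "Fetch failed\n" unless $mech->success;

    # content() gives you the fetched page as one string
    if ( $mech->content =~ /some phrase/i ) {  # made-up pattern
        print "Found it!\n";
    }

From there Mech also gives you follow_link(), submit_form(), and so on,
which is where it really starts to pay off over LWP::Simple.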

The Mech documentation also points out the book _Spidering Hacks_, which 
gets into ways you can automate this kind of work.

Related modules:

http://search.cpan.org/search?query=www%3A%3Amechanize&mode=all

 * * *

Also, spidering can, by nature, be a bit slow. In fact, that's a good 
thing -- it's generally considered rude to write a script that quickly 
swarms over all pages on a remote web server, sucking up all their 
bandwidth. 

It's more polite to force your script to run at a slower pace by 
grabbing URLs no faster than once per minute or so. If you want to be 
more productive, you can parallelize things by downloading 60 sites at a 
time, round-robin style, so that each one only gets hit once every 
minute or so. (Obviously, if there's only one site you're interested in 
mirroring, then this trick won't help you.)
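As a rough sketch, the pacing can be as simple as a sleep in the fetch
loop. The URLs here are placeholders, and I'm using LWP::Simple since
that's what you already have, but the same idea works with Mech:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple qw(get);

    # Placeholder URLs -- substitute the pages you actually need
    my @urls = (
        'http://www.example.com/page1.html',
        'http://www.example.com/page2.html',
    );

    for my $url (@urls) {
        my $content = get($url);
        warn "Couldn't fetch $url\n" unless defined $content;
        # ... search / save $content here ...
        sleep 60;    # be polite: wait a minute between requests
    }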

If pacing yourself this way means mirroring a big site takes hours, so 
be it. The easy way to deal with this is to just leave the script 
running overnight and do the post-processing later once it's done.



-- 
Chris Devers
