I do it from Java and not from Perl because I need to perform an
insert into the database each time I process a link, and I also have to
report via RSS on the progress of the overall download process (e.g. 23.343
out of 70.000 files have been downloaded).




On 1/22/07, Igor Sutton <[EMAIL PROTECTED]> wrote:

Hi Tatiana,

2007/1/22, Tatiana Lloret Iglesias <[EMAIL PROTECTED]>:
> Regarding the performance problem:
>
> The schema of my application is:
>
> 1. I execute a Perl script which performs a search in a public database.
> It gets the total results in *several pages*. By pressing the "Next Page"
> button (from the Perl script) I get a list of all the links related to my
> query (70.000, more or less). I write all these links to a single text file.
>
> 2. From Java I read each of the 70.000 links and create a new file
> containing the link I'm currently reading. Then I call a Perl script which
> uses this link as its input parameter. It browses the link and saves the
> website content in a local HTML file.
>
> I'm having performance problems. I've tried not creating a separate file
> containing the URL for each of the 70.000 links and passing the link
> automatically to the Perl script as an input parameter, but it fails...
>
> I've heard about the LWP module. Do you recommend that I use it?
> Have you ever done something similar to this? Can you give me some advice?
> Thanks
>
> T.
>

I can't see why you are using Java for that. If your code with
WWW::Mechanize is already working, why don't you do everything in
Perl?

I would do this:

1. Collect all the links you want to open; it is OK to store them in a
single file, IMHO. You could also use Tie::File to append lines, which can
make your life easier. With a plain filehandle:

open my $links_file, ">", $filename or die $!;

while (my $link = my_mechanize_get_link()) {
   print {$links_file} $link, "\n";
}

close $links_file or warn $!;
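
For reference, the Tie::File variant mentioned in step 1 could look roughly
like the sketch below. The file name and the two example links are
placeholders (they are not from the original script); in the real program
the links would come from the WWW::Mechanize code.

```perl
use strict;
use warnings;
use Tie::File;

# Hypothetical file name; use whatever path holds your link list.
my $filename = 'links.txt';

# Each element of @links corresponds to one line of the file.
tie my @links, 'Tie::File', $filename
    or die "Cannot tie $filename: $!";

# Placeholder data standing in for the links collected with WWW::Mechanize.
push @links, 'http://example.com/page1', 'http://example.com/page2';

# Reading goes through the same array, one line per element.
for my $i (0 .. $#links) {
    print "link $i: $links[$i]\n";
}

# Release the tie so all changes are written out to the file.
untie @links;
```

Note that push appends to whatever the file already contains, whereas
opening with ">" as above truncates it first. For a purely sequential pass
over 70.000 lines a plain filehandle is probably just as good; Tie::File is
mostly a convenience.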

2. After that, you can read that file (or the tied array, if you used
Tie::File) from beginning to end and use LWP::UserAgent or LWP::Simple to
retrieve the data you want to store:

use LWP::Simple;

sub filename_from_url {
   # your code here. logic to compose the filename
   # from url.
}

open my $input, "<", $filename or die $!;
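while (my $url = <$input>) {
   chomp($url);
   my $content = get($url);    # get() returns undef if the request fails
   if (defined $content) {
       open my $output, ">", filename_from_url($url) or die $!;
       print {$output} $content;
       close $output or warn $!;
   }
   else {
       warn "Could not fetch $url\n";
   }
}
close $input or warn $!;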
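
The body of filename_from_url is left open above. One possible way to fill
it in, purely as an illustration, is to hash the URL so that every link
maps to a short, filesystem-safe name. The MD5-based naming and the ".html"
suffix are my assumptions here, not something your application requires.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical implementation: build the local file name from a hash of
# the URL, so characters such as "/", "?" and "&" never end up in the name.
sub filename_from_url {
    my ($url) = @_;
    return md5_hex($url) . '.html';
}

# Example call with a placeholder URL.
print filename_from_url('http://example.com/some/page?id=1'), "\n";
```

The same URL always gives the same name, so a re-run overwrites the earlier
copy instead of creating a duplicate. If you need readable names, you would
have to keep a separate URL-to-file mapping, for example in the database
you already insert into.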


HTH!
--
Igor Sutton Lopes <[EMAIL PROTECTED]>
