Re: Scraping

David Romano Thu, 01 Jun 2006 17:15:04 -0700

Hi kc68,

On 6/1/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

> I'm not getting past printing to the screen and to a file the page in the
> script below but without the list of names in the middle.  Without the if
> line I get an endless scroll.  I want to be able to pull in all names and
> then isolate and print one (e.g. abercrombie).  Guidance and actual script
> appreciated.

I'm not certain of what you're trying to do, but hopefully this helps you:

#!/bin/perl


use strict;

use warnings; # shows that PAGE isn't used at all

use WWW::Mechanize;

my $output_dir = "c:/training/bc";

my $starting_url = "http://clerk.house.gov/members/olmbr.html";

my $browser = WWW::Mechanize->new();

$browser->get( $starting_url );

$browser->submit();

# Looking at the url, there's no form to submit, so I don't think you need
# the line above: go straight to fetching the contents using
# $browser->content. Re-read WWW::Mechanize documentation.

foreach my $line (split(/[\n\r]+/, $browser->content)) {

        if $ line =~ /abercrombie/

        print $browser->content;

}

# The above doesn't even compile for me (there's a space between '$' and
# 'line', the condition isn't in parentheses, and there are no curly
# brackets to say you want to print $browser->content when $line matches).
# Your regular expression (looking at the page you're scraping) needs the
# 'i' modifier so that letter case doesn't matter. A great resource is
# http://perldoc.perl.org/perlretut.html . perlretut will also show you
# other ways to capture the information you want without having to use a
# split to iterate line by line.

  open OUT, ">$output_dir/simple2.html" or die "Can't open file:$!";

  print OUT $browser->content;

  close OUT;

# I take it this is for debugging purposes, to make sure the webpage you
# scraped was the right one?

close PAGE;

# PAGE isn't used anywhere else.
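
For what it's worth, here is a minimal sketch of how the loop from the original script could be written so that it compiles and prints only the matching lines rather than the whole page each time. It uses a made-up two-line string in place of $browser->content so it runs without a network connection; the sample markup and the @matches array are my own assumptions, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $browser->content, so this runs offline.
my $content = join "\n",
    '<font size=2><i>Abercrombie, Neil</i></font>',
    '<font size=2><i>Ackerman, Gary L.</i></font>',
    '';

my @matches;
foreach my $line (split /[\n\r]+/, $content) {
    # The condition goes in parentheses and the body in a block;
    # the 'i' modifier makes the match case-insensitive.
    if ($line =~ /abercrombie/i) {
        push @matches, $line;    # keep the matching line, not the whole page
    }
}

print "$_\n" for @matches;
```

Printing $line (or collecting it, as here) instead of $browser->content is what stops the whole page from being printed again for every match.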

Below is what I think you're basically trying to get done. See if it
works for you:
#!/bin/perl

use strict;
use warnings;

use WWW::Mechanize;

my $output_dir = "c:/training/bc";

my $starting_url = "http://clerk.house.gov/members/olmbr.html";

my $browser = WWW::Mechanize->new();

$browser->get( $starting_url );

print "$_\n" for ($browser->content =~ /(?<=size=2><i>) [^<]+/gx);
# or maybe
#  for ($browser->content =~ /(?<=size=2><i>) [^<]+/gx) { print "$_\n" if /abercrombie/i }
# ?
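
If you want to try the regex without hitting the site, something along these lines should do. It is only a sketch: the sample HTML is invented, and I'm assuming from the pattern above that each name sits right after "size=2><i>" in the page source. The variables @names and @wanted are just illustrative.

```perl
use strict;
use warnings;

# Hypothetical sample; the real page's markup is assumed, not verified.
my $content = <<'HTML';
<td><font size=2><i>Abercrombie, Neil</i></font></td>
<td><font size=2><i>Ackerman, Gary L.</i></font></td>
HTML

# In list context with /g and no capture groups, the match returns every
# whole match. The lookbehind anchors on the markup without including it,
# and /x lets the pattern contain a space for readability.
my @names = $content =~ /(?<=size=2><i>) [^<]+/gx;

# grep keeps only the names that match, ignoring letter case.
my @wanted = grep { /abercrombie/i } @names;

print "$_\n" for @wanted;
```

With the real page you would use $browser->content in place of $content.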

HTH,
David
