Re: Scraping

kc68 Thu, 01 Jun 2006 18:48:50 -0700

On Thu, 01 Jun 2006 20:14:36 -0400, David Romano <[EMAIL PROTECTED]> wrote:

Hi kc68,


On 6/1/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

I'm not getting past printing the page to the screen and to a file in the script below, but without the list of names in the middle. Without the if line I get an endless scroll. I want to be able to pull in all names and then isolate and print one (e.g. abercrombie). Guidance and actual script appreciated.

I'm not certain of what you're trying to do, but hopefully this helps you:

#!/bin/perl

use strict;

use warnings; # shows that PAGE isn't used at all

use WWW::Mechanize;

my $output_dir = "c:/training/bc";

my $starting_url = "http://clerk.house.gov/members/olmbr.html";

my $browser = WWW::Mechanize->new();

$browser->get( $starting_url );

$browser->submit();

# Looking at the URL, there's no form to submit, so I don't think you
# need the line above: go straight to fetching the contents using
# $browser->content. Re-read the WWW::Mechanize documentation.

foreach my $line (split(/[\n\r]+/, $browser->content)) {

        if $ line =~ /abercrombie/

        print $browser->content;

}

# The above doesn't even compile for me (there's a space between '$'
# and 'line', and there are no curly brackets to say you want to print
# $browser->content when $line matches). Your regular expression
# (looking at the page you're scraping) needs the 'i' modifier so that
# letter case doesn't matter. A great resource is
# http://perldoc.perl.org/perlretut.html . perlretut will also show you
# other ways to capture the information you want without having to
# use a split to iterate line by line.

  open OUT, ">$output_dir/simple2.html" or die "Can't open file:$!";

  print OUT $browser->content;

  close OUT;

# I take it this is for debugging purposes, to make sure the webpage
# you scraped was the right one?

close PAGE;

# PAGE isn't used anywhere else.

Below is what I think you're basically trying to get done. See if it
works for you:
#!/bin/perl

use strict;
use warnings;

use WWW::Mechanize;

my $output_dir = "c:/training/bc";

my $starting_url = "http://clerk.house.gov/members/olmbr.html";

my $browser = WWW::Mechanize->new();

$browser->get( $starting_url );

print "$_\n" for ($browser->content =~ /(?<=size=2><i>) [^<]+/gx);
# or maybe
#  for ($browser->content =~ /(?<=size=2><i>) [^<]+/gx) { print "$_\n" if /abercrombie/i }
# ?
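
If you do want to keep the line-by-line split, here is a minimal sketch of how the braces fit together. It runs on a small hard-coded sample instead of the live page; the sample markup and the second name are made up for illustration (modeled loosely on the size=2><i> pattern above), so the real page may well differ.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for $browser->content; not the real page markup.
my $content = join "\n",
    '<font size=2><i>Abercrombie, Neil</i></font>',
    '<font size=2><i>Example, Someone</i></font>',
    '<p>Some unrelated line</p>';

my @matches;
foreach my $line (split /[\n\r]+/, $content) {
    # Block form: the condition goes in parentheses, and the braces hold
    # everything that should run only when the condition is true.
    if ($line =~ /abercrombie/i) {
        push @matches, $line;
    }
}

# Statement-modifier form: one statement followed by its loop or
# condition, so no braces are needed.
print "$_\n" for @matches;
```

Swapping the hard-coded $content for $browser->content should give the same loop over the fetched page. Note that the original loop printed the whole page on every pass, which would explain the endless scroll; collecting or printing $line instead keeps the output to the matching lines.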

HTH,
David


***********

The second option worked to print Abercrombie, Neil to the screen. Still working on basic concepts. The split construction was suggested by someone as a way to get to pulling in all listings and ultimately all votes. Can you complete that logic to return all lines with representatives' names? Among my points of confusion, when is the print command within braces and when is it outside braces?


Ken


