On Tue, 11 Apr 2006 18:12:16 -0400, <[EMAIL PROTECTED]> wrote:
I am slowly making my way through the process of scraping the data
behind a form and can now get five results plus a series of links using
the script below. I need help in doing the following: 1) Eliminating
all material on the page other than the list and the links (and
ultimately eliminate the link numbers); 2) following the links so that
the five listings behind each link are returned; 3) Returning the
results for all states (i.e. all listings) rather than just Ohio. From
my tutorial it looks like I need foreach my $link
($browser->find_all_links( url_regex => SOMETHING )) { - and that the
SOMETHING is based on the url that appears upon following a link. But
from there I'm stumped. The url of the form is in the script below.
Thanks in advance.
Ken
use strict;
use warnings;
use WWW::Mechanize;
my $output_dir = "c:/training/bc";
my $starting_url =
"http://www.theblackchurchpage.com/modules.php?name=Locator";
my $browser = WWW::Mechanize->new();
$browser->get( $starting_url );
# Select the third form on the page and search for Ohio.
$browser->form_number( 3 );
$browser->field( "church_state", "OH" );
$browser->submit();
# Save the results page (it is HTML, despite the .xls extension).
open my $out, '>', "$output_dir/bc7.xls" or die "Can't open file: $!";
print {$out} $browser->content;
close $out or die "Can't close file: $!";
print $browser->content;
I haven't seen a response to my questions (above) posted yesterday. Not
clear? Not possible? Too hard? Too obvious? To elaborate on the second
question (following the links): in the tutorial example, the foreach my
$link loop uses the regex
qr/cd.asp/ based on a url that is
homepage.com/sub/sub1/sub2/cd.asp?I=500031
The url at a link on the page I'm trying to scrape is
http://www.theblackchurchpage.com/modules.php?name=Locator&op=search&pnum=2&ccount=5&offset=5&church_name=&church_state=oh&church_city=&church_denom=&church_pastor=
I don't see how to make a regex from that. Anything on any of the
questions would help.
Ken
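
On the regex question: url_regex only has to match a distinctive
fragment of the link's URL, not the whole thing, in the same way that
qr/cd.asp/ matches only part of the tutorial's URL. For the pager URL
quoted above, the op=search and pnum=N pieces look distinctive enough.
What follows is a minimal sketch, not checked against the live site:
$pager_re and scrape_state are names made up for illustration, and it
assumes that form 3 is the search form and that the first results page
carries links to all the other pages.

```perl
use strict;
use warnings;

# The pager URL quoted above, used here only to check the pattern offline.
my $sample = 'http://www.theblackchurchpage.com/modules.php?name=Locator'
           . '&op=search&pnum=2&ccount=5&offset=5&church_name='
           . '&church_state=oh&church_city=&church_denom=&church_pastor=';

# Match any results-page link: the search operation plus a numeric page number.
my $pager_re = qr/op=search.*?\bpnum=\d+/;

print "pattern matches the sample URL\n" if $sample =~ $pager_re;

# Hypothetical helper: submit the form for one state, then fetch each
# distinct pager link once. Returns one HTML string per results page.
sub scrape_state {
    my ($state) = @_;
    require WWW::Mechanize;    # loaded only when the helper is called
    my $browser = WWW::Mechanize->new( autocheck => 1 );
    $browser->get('http://www.theblackchurchpage.com/modules.php?name=Locator');
    $browser->form_number(3);
    $browser->field( 'church_state', $state );
    $browser->submit();

    my @pages = ( $browser->content );    # first page of results
    my %seen;
    # The link list is collected before any of the links are followed.
    for my $link ( $browser->find_all_links( url_regex => $pager_re ) ) {
        my $url = $link->url_abs->as_string;
        next if $seen{$url}++;            # skip duplicate pager links
        $browser->get($url);
        push @pages, $browser->content;
    }
    return @pages;
}
```

Calling scrape_state('OH') would return the HTML of each results page
for Ohio. Wrapping that call in a loop over a list of state codes is
one way to approach the all-states question, assuming the form accepts
each code the way it accepts OH.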