hotkitty wrote:
> I ultimately want to go to cnn.com/politics, follow all links under
> the "Election Coverage" headline and, w/in those links, save all the
> links under the "Don't Miss" sections that appear in those stories.
> However, after many hours and trial & error I've yet to complete the
> task. I know mechanize can do this somehow but I've yet to figure out
> how to put it all together.

It's not so much about putting it together; it's more like writing Perl
code step by step...

> Here's the script I have so far, which gets me to only step one:
> http://www.mail-archive.com/beginners%40perl.org/msg93769.html

Actually, I'm not sure that the code you have even gets you to step one.
As a parsing exercise, I wrote the code below. I chose to make use of
LWP::Simple and HTML::TokeParser. Please study the docs for the latter:
http://search.cpan.org/perldoc?HTML::TokeParser
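
To make the token handling in the script easier to follow, here is a minimal sketch of what the parser hands back. The HTML fragment is made up for illustration and is not CNN's actual markup; the array layouts in the comments are as I read them in the HTML::TokeParser docs.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Made-up HTML fragment, for illustration only
my $html = <<'HTML';
<div>Election coverage</div>
<ul>
<li><a href="/2008/POLITICS/story1/">First story</a></li>
<li><a href="/2008/POLITICS/story2/">Second story</a></li>
</ul>
HTML

my $p = HTML::TokeParser->new(\$html);

# get_tag() skips ahead to the next <div> start tag and returns
# [ $tag, \%attr, \@attrseq, $text ]
my $tag = $p->get_tag('div');

# get_text() returns the text found at the current position
my $heading = $p->get_text;
print "Heading: $heading\n";

# get_token() returns every token in turn:
#   start tag: [ 'S', $tag, \%attr, \@attrseq, $text ]
#   end tag:   [ 'E', $tag, $text ]
#   text:      [ 'T', $text, $is_data ]
my @hrefs;
while ( my $token = $p->get_token ) {
    if ( $token->[0] eq 'S' and $token->[1] eq 'a' ) {
        push @hrefs, $token->[2]{href};    # attribute hash is element 2
    }
    last if $token->[0] eq 'E' and $token->[1] eq 'ul';
}
print "$_\n" for @hrefs;
```

This prints the heading followed by the two href values. The full script follows.
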
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use HTML::TokeParser;

my $domain = 'http://edition.cnn.com';
my $uri    = $domain . '/POLITICS/';
my $html   = get($uri) or die "Fetching $uri failed";
my $p      = HTML::TokeParser->new(\$html);

# go to start position in the document
while ( $p->get_tag('div') ) {
    last if $p->get_text eq 'Election coverage';
}

# extract links up to the end of the list; skip anchors without href
my @links;
while ( my $token = $p->get_token ) {
    if ( $token->[0] eq 'S' and $token->[1] eq 'a'
        and defined $token->[2]{href} )
    {
        push @links, $token->[2]{href};
    }
    last if $token->[0] eq 'E' and $token->[1] eq 'ul';
}

# the links in that list are assumed to be relative to $domain
foreach my $uri ( map $domain . $_, @links ) {
    my $html = get($uri) or warn "Fetching $uri failed" and next;
    my $p = HTML::TokeParser->new(\$html);

    # go to start position in the document (only the first <h4> is checked)
    $p->get_tag('h4');
    unless ( $p->get_text eq "Don't Miss" ) {
        warn "Didn't find section \"Don't Miss\"";
        next;
    }
    print "$uri\n";

    # extract links
    while ( my $token = $p->get_token ) {
        if ( $token->[0] eq 'S' and $token->[1] eq 'a'
            and defined $token->[2]{href} )
        {
            print ' ', $p->get_text, "\n";

            # prepend the domain unless the link is already absolute
            my $href = $token->[2]{href};
            my $link = substr($href, 0, 4) eq 'http' ? $href : $domain . $href;
            print " $link\n\n";
        }
        last if $token->[0] eq 'E' and $token->[1] eq 'ul';
    }
}

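
Since the script repeats the same pattern twice (skip to a heading, then collect links until the list ends), it could be folded into one subroutine. What follows is only a sketch under a few assumptions: links_after_heading is a name I made up, the HTML is an invented sample rather than a real CNN page, and the URI module's new_abs is swapped in for the substr() test to build absolute URLs. Unlike the script above, it scans every h4 rather than only the first one.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;
use URI;

# Hypothetical helper: return the absolute URLs of the links in the list
# that follows the first <$tag> element whose text equals $heading.
sub links_after_heading {
    my ( $html, $tag, $heading, $base ) = @_;
    my $p = HTML::TokeParser->new( \$html ) or return;

    # go to start position in the document
    my $found;
    while ( $p->get_tag($tag) ) {
        if ( $p->get_trimmed_text eq $heading ) {
            $found = 1;
            last;
        }
    }
    return unless $found;

    # extract links up to the end of the list
    my @links;
    while ( my $token = $p->get_token ) {
        if ( $token->[0] eq 'S' and $token->[1] eq 'a'
            and defined $token->[2]{href} )
        {
            # new_abs leaves absolute URLs alone and resolves relative ones
            push @links, URI->new_abs( $token->[2]{href}, $base )->as_string;
        }
        last if $token->[0] eq 'E' and $token->[1] eq 'ul';
    }
    return @links;
}

# Invented sample page, for illustration only
my $html = <<'HTML';
<h4>Related</h4>
<ul><li><a href="/other/">Other</a></li></ul>
<h4>Don't Miss</h4>
<ul>
<li><a href="/2008/POLITICS/story1/">Relative link</a></li>
<li><a href="http://www.example.com/story2/">Absolute link</a></li>
</ul>
<a href="/not/in/the/list/">Footer link</a>
HTML

my @links = links_after_heading( $html, 'h4', "Don't Miss",
    'http://edition.cnn.com' );
print "$_\n" for @links;
```

With the real pages you would pass the string returned by get($uri) instead of the sample, and push the returned list onto an array or write it to a file if the goal is to save the links rather than print them.
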
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl