hotkitty wrote:
I ultimately want to go to cnn.com/politics, follow all links under the "Election Coverage" headline and, w/in those links, save all the links under the "Don't Miss" sections that appear in those stories. However, after many hours and trial & error I've yet to complete the task. I know mechanize can do this somehow but I've yet to figure out how to put it all together.

It's not so much about putting it together; it's more like writing Perl code step by step...

Here's the script I have so far, which gets me to only step one:

http://www.mail-archive.com/beginners%40perl.org/msg93769.html

Actually, I'm not sure that the code you have even gets you to step one.

As a parsing exercise, I wrote the code below. I chose to make use of LWP::Simple and HTML::TokeParser. Please study the docs for the latter: http://search.cpan.org/perldoc?HTML::TokeParser


#!/usr/bin/perl
use strict;
use warnings;

use LWP::Simple;
use HTML::TokeParser;

my $domain = 'http://edition.cnn.com';
my $uri = $domain . '/POLITICS/';

my $html = get($uri) or die "Fetching $uri failed";
my $p = HTML::TokeParser->new(\$html);

# go to start position in the document
while ( $p->get_tag('div') ) {
    last if $p->get_text eq 'Election coverage';
}

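# extract links until the list ends
# start tag tokens look like [ 'S', $tag, \%attr, ... ],
# end tag tokens like [ 'E', $tag, ... ]
my @links;
while ( my $token = $p->get_token ) {
    if ( $token->[0] eq 'S' and $token->[1] eq 'a'
        and defined $token->[2]{href} ) {
        push @links, $token->[2]{href};
    }
    last if $token->[0] eq 'E' and $token->[1] eq 'ul';
}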

# prefix relative links with the domain, leave absolute ones alone
foreach my $uri ( map { /^http/ ? $_ : $domain . $_ } @links ) {
    my $html = get($uri) or warn "Fetching $uri failed" and next;
    my $p = HTML::TokeParser->new(\$html);

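    # go to start position in the document; check every h4,
    # not just the first one
    my $found;
    while ( $p->get_tag('h4') ) {
        if ( $p->get_text eq "Don't Miss" ) { $found = 1; last }
    }
    unless ($found) {
        warn "Didn't find section \"Don't Miss\" in $uri";
        next;
    }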

    print "$uri\n";

    # extract links until the list ends
    while ( my $token = $p->get_token ) {
        if ( $token->[0] eq 'S' and $token->[1] eq 'a'
            and defined $token->[2]{href} ) {
            print '  ', $p->get_text, "\n";    # link text
            # prefix relative links with the domain
            my $link = substr($token->[2]{href}, 0, 4) eq 'http' ?
              $token->[2]{href} : $domain . $token->[2]{href};
            print "  $link\n\n";
        }
        last if $token->[0] eq 'E' and $token->[1] eq 'ul';
    }
}
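
The same approach can be tried without touching the network. Below is a minimal sketch that factors the "find the heading, then collect links until the list ends" step into a helper and runs it on made-up markup. The helper name links_after_heading and the sample HTML are assumptions for illustration, not taken from the CNN pages. It also uses URI->new_abs (the URI module, which LWP depends on) to resolve links, as one alternative to the substr check.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TokeParser;
use URI;

# Hypothetical helper: return the links in the first list that follows
# a heading with the given text, resolved against $base.
sub links_after_heading {
    my ( $html, $tag, $heading, $base ) = @_;
    my $p = HTML::TokeParser->new( \$html ) or return;

    # skip ahead to the heading with the wanted text
    my $found;
    while ( $p->get_tag($tag) ) {
        if ( $p->get_trimmed_text eq $heading ) { $found = 1; last }
    }
    return unless $found;

    my @links;
    while ( my $token = $p->get_token ) {
        if ( $token->[0] eq 'S' and $token->[1] eq 'a'
            and defined $token->[2]{href} ) {
            # new_abs resolves relative links and leaves absolute ones alone
            push @links, URI->new_abs( $token->[2]{href}, $base )->as_string;
        }
        last if $token->[0] eq 'E' and $token->[1] eq 'ul';
    }
    return @links;
}

# made-up markup, only for illustration
my $sample = <<'HTML';
<h4>Other</h4>
<ul><li><a href="/skip.html">skip</a></li></ul>
<h4>Don't Miss</h4>
<ul>
  <li><a href="/2008/POLITICS/story.html">Story</a></li>
  <li><a href="http://example.com/abs.html">Elsewhere</a></li>
</ul>
<a href="/after.html">not collected</a>
HTML

print "$_\n"
  for links_after_heading( $sample, 'h4', "Don't Miss",
    'http://edition.cnn.com/' );
```

Only the two links inside the list under "Don't Miss" should be printed; the link under "Other" and the one after the closing ul are skipped.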

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
