RE: HTML::TokeParser

david Wed, 12 Feb 2003 13:14:22 -0800

Dan Muey wrote:

> whatever is in between the <a> tags.
> 
> I wonder if it's possible to do something like this:
> 
> if($token->[0] eq 'a'){
>     print $token->[1]{href} || "what?","\n";
>     my $link_guts = $tok->get_trimmed_text("/a");
> 
> and then somehow grab the 'src' and 'alt' attributes from each img tag in
> $link_guts if it's an image, the regular text if it's not, and probably
> all three if it has both images and text
>


That's why parsing HTML is tricky, and why XML is on its way to the rescue. If
you use get_token() instead of get_tag(), it might be easier. get_token()
returns every token (start tags, end tags, text and so on), and it is the
programmer's responsibility to decide what to do with each one. get_tag()
skips everything that isn't a tag, so the text between the tags is harder to
get at:

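#!/usr/bin/perl -w
use strict;

use HTML::TokeParser;

# the HTML is read from the DATA section that follows __END__
my $tok = HTML::TokeParser->new(*DATA) or die $!;

# get_token() returns undef once the input is exhausted
while(my $token = $tok->get_token()){

        if($token->[0] eq 'T'){
                # text token: [ 'T', $text, ... ]; skip whitespace-only text
                print "Text: $token->[1]\n" if($token->[1] =~ /\S/);
        }elsif($token->[0] eq 'S' && $token->[1] eq 'img'){
                # start tag: [ 'S', $tagname, \%attr, ... ]
                print "IMG ", $token->[2]{src} || '', "\n";
        }elsif($token->[0] eq 'S' && $token->[1] eq 'a'){
                print "LINK ", $token->[2]{href} || '', "\n";
        }
}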

__END__

Every token is returned to you no matter where it appears, so an <img> within
an <a>, an <a> within an <a>, and so on will all show up in the loop. If you
add a little bit more logic (for example, remembering which start tag is
still open), it's easy to work out how the tags nest.
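
Here is a rough sketch of what that extra logic could look like for the
original question: remember the <a> you are currently inside, and attach any
text and <img> attributes to it until the matching </a> arrives. The helper
name links_with_contents() and the sample HTML are made up for illustration,
and the sketch assumes links are not nested inside each other and a Perl
recent enough to have the // operator (5.10 or later).

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TokeParser;

# Hypothetical helper (not part of HTML::TokeParser): returns one hashref
# per <a> element with its href, the text inside it and the images inside it.
sub links_with_contents {
    my ($html) = @_;

    # a reference to a string makes the parser read from that string
    my $tok = HTML::TokeParser->new(\$html)
        or die "cannot create parser";

    my @links;
    my $current;    # record for the <a> we are inside, undef when outside

    while (my $token = $tok->get_token) {
        my $type = $token->[0];

        if ($type eq 'S' && $token->[1] eq 'a') {
            # start of a link: open a new record
            # (assumption: links are not nested, so nothing is open already)
            $current = {
                href   => $token->[2]{href} // '',
                text   => '',
                images => [],
            };
        }
        elsif ($type eq 'E' && $token->[1] eq 'a') {
            # end of a link: collapse whitespace and store the record
            if ($current) {
                for ($current->{text}) { s/\s+/ /g; s/^ //; s/ $//; }
                push @links, $current;
                undef $current;
            }
        }
        elsif ($current && $type eq 'S' && $token->[1] eq 'img') {
            # image inside the open link: keep its src and alt
            push @{ $current->{images} },
                { src => $token->[2]{src} // '', alt => $token->[2]{alt} // '' };
        }
        elsif ($current && $type eq 'T') {
            # text inside the open link
            $current->{text} .= $token->[1];
        }
    }
    return @links;
}

# made-up sample input
my $html = '<a href="/home"><img src="logo.png" alt="Logo"> Home page</a> '
         . '<a href="/faq">FAQ</a>';

for my $link (links_with_contents($html)) {
    print "LINK $link->{href}\n";
    print "  text: $link->{text}\n" if length $link->{text};
    print "  img:  $_->{src} ($_->{alt})\n" for @{ $link->{images} };
}
```

Text and images that fall outside any <a> are simply ignored here; if you
need those too, add a branch for the case where $current is undefined.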

david
