Dan Muey wrote:
> whatever is in between the <a> tags.
>
> I wonder if it's possible to do something like this:
>
> if($token->[0] eq 'a'){
> print $token->[1]{href} || "what?","\n";
> my $link_guts = $tok->get_trimmed_text("/a");
>
> and then somehow grab the 'src' and 'alt' attributes from each img tag in
> $link_guts if it's an image, the regular text if it's not, and probably
> all three if it has both images and text
>
That's why parsing HTML is tricky and XML is on the way to the rescue. If you
use get_token() instead of get_tag(), it might be easier. get_token()
returns every token, and it is the programmer's responsibility to decide
what to do with each one. get_tag() skips the tokens that are not tags
(such as text), so it's trickier to use here:
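
#!/usr/bin/perl -w
use strict;
use HTML::TokeParser;

# Parse the HTML placed after __END__ (read through the DATA filehandle).
my $tok = HTML::TokeParser->new(\*DATA) or die "Can't create parser: $!";

# get_token() returns undef once the input is exhausted.
while (my $token = $tok->get_token()) {
    if ($token->[0] eq 'T') {
        # Text token: ['T', $text, $is_data]; skip whitespace-only text.
        print "Text: $token->[1]\n" if $token->[1] =~ /\S/;
    } elsif ($token->[0] eq 'S' && $token->[1] eq 'img') {
        # Start tag token: ['S', $tag, \%attr, \@attrseq, $text]
        print "IMG $token->[2]{src}\n";
    } elsif ($token->[0] eq 'S' && $token->[1] eq 'a') {
        print "LINK $token->[2]{href}\n";
    }
}
__END__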

All tokens are returned to you no matter where they are, so <img> within <a>,
<a> within <img>, <a> within <a>, etc. will all be returned. If you add a
little more logic to remember which tags are currently open, it's easy to
handle nested tags...
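
For example, here is a rough sketch of that extra logic for the original question: remember when we are inside an <a>, and collect the text plus the 'src' and 'alt' of any <img> seen before the matching </a>. The sample HTML and the names $html, @links and $current are made up for illustration. It assumes links are not nested inside each other; for that you would probably want a stack instead of a single $current.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample markup, made up for this sketch.
my $html = <<'HTML';
<a href="one.html">plain text link</a>
<a href="two.html"><img src="two.gif" alt="second"></a>
<a href="three.html"><img src="three.gif" alt="third"> with a caption</a>
<img src="loose.gif" alt="not inside a link">
HTML

# A reference to a scalar is parsed as the document itself.
my $tok = HTML::TokeParser->new(\$html) or die "Can't create parser";

my @links;      # one record per <a>...</a> pair
my $current;    # record being filled; undef while outside a link

while (my $token = $tok->get_token) {
    my $type = $token->[0];

    if ($type eq 'S' && $token->[1] eq 'a') {
        # Start tag: ['S', $tag, \%attr, \@attrseq, $text]
        $current = { href => $token->[2]{href}, text => '', imgs => [] };
    }
    elsif ($type eq 'E' && $token->[1] eq 'a') {
        # End tag: ['E', $tag, $text]; ignore a stray </a>.
        next unless $current;
        $current->{text} =~ s/^\s+|\s+$//g;    # trim collected text
        push @links, $current;
        undef $current;
    }
    elsif ($current && $type eq 'S' && $token->[1] eq 'img') {
        # Only images inside a link are recorded.
        push @{ $current->{imgs} },
            { src => $token->[2]{src}, alt => $token->[2]{alt} };
    }
    elsif ($current && $type eq 'T') {
        # Text: ['T', $text, $is_data]
        $current->{text} .= $token->[1];
    }
}

for my $link (@links) {
    print "LINK ", (defined $link->{href} ? $link->{href} : 'what?'), "\n";
    print "  text: $link->{text}\n" if length $link->{text};
    for my $img (@{ $link->{imgs} }) {
        printf "  img: src=%s alt=%s\n",
            map { defined $_ ? $_ : '' } @{$img}{qw(src alt)};
    }
}
```

Each record ends up with the href, the trimmed text, and a list of images, so a link with only text, only an image, or both is handled the same way. Note that get_token() hands back text as it appears in the source, so entities like &amp; would still need decoding.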
david