Re: TokeParser help

david Wed, 19 Nov 2003 16:25:29 -0800

Boris Shor wrote:

> Hello,
> 
> I am a Perl newcomer, and I'm trying to use the TokeParser module to
> extract text from an HTML file. Here's the Perl code:
> 
> use HTML::TokeParser;
> my $p = HTML::TokeParser->new("test.htm");
> while ($p -> get_tag('b'))
>     {
>     print $p -> get_text(),"\n";
>     }
> 
> This works only on bold tags that are not 'inside' other tags.


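get_tag moves to the next <b> tag, and get_text (called with no arguments)
returns only the text up to the next tag. when <b> is followed directly by
another tag such as <u>, get_text returns an empty string. it doesn't know
how to look ahead, skip the nested tag and then read the text for you. you
need to do it yourself: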

#!/usr/bin/perl -w
use strict;

use HTML::TokeParser;

#--
#-- slurp all of DATA into one string. the do
#-- block keeps the change to $/ localized.
#--
my $html = do { local $/; <DATA> };

my $bold = 0;   #-- true while inside <b>...</b>
my $text = '';

my $parser = HTML::TokeParser->new(\$html);

while(my $token = $parser->get_token){
        #-- start tag <b>: begin collecting text
        if($token->[0] eq 'S' && $token->[1] eq 'b'){
                $bold = 1;
                $text = '';
        }
        #-- end tag </b>: print what was collected
        if($token->[0] eq 'E' && $token->[1] eq 'b'){
                print $text,"\n";
                $bold = 0;
                $text = '';
        }
        #-- text token: keep it only while inside <b>
        if($bold && $token->[0] eq 'T'){
                $text .= $token->[1];
        }
}

__DATA__

<html>
<body>
<h1>Head 1</h1>
<b>Bolded</b>
<p><b><u>Bolded and underlined</u></b></p>
<p>New line</p>
</body>
</html>

__END__

prints:

Bolded
Bolded and underlined

david