Boris Shor wrote:
> Hello,
>
> I am a Perl newcomer, and I'm trying to use the TokeParser module to
> extract text from an HTML file. Here's the Perl code:
>
> use HTML::TokeParser;
> my $p = HTML::TokeParser->new("test.htm");
> while ($p -> get_tag('b'))
> {
> print $p -> get_text(),"\n";
> }
>
> This works only on bold tags that are not 'inside' other tags.

get_tag moves the parser to the next <b> tag, and get_text, called
without arguments, returns only the text sitting right at that position.
If the next token is another tag (the <u> in your nested case), you get
an empty string back. Called that way, it doesn't look ahead to skip
something and then read the text for you. You can do that yourself by
walking the tokens:

#!/usr/bin/perl -w
use strict;
use HTML::TokeParser;

#-- slurp the sample document; the do block keeps
#-- the change to $/ local to this one read
my $html = do { local $/; <DATA> };

my $bold = 0;    #-- true while we are inside <b>...</b>
my $text = '';
my $parser = HTML::TokeParser->new(\$html);

while (my $token = $parser->get_token) {
    my ($type, $value) = @$token;    #-- $value is the tag name or the text
    if ($type eq 'S' && $value eq 'b') {
        #-- start of a bold section: begin collecting
        $bold = 1;
        $text = '';
    }
    elsif ($type eq 'E' && $value eq 'b') {
        #-- end of a bold section: print what we collected
        print $text, "\n" if $bold;
        $bold = 0;
    }
    elsif ($type eq 'T' && $bold) {
        #-- text inside <b>, including text in nested tags
        $text .= $value;
    }
}

__DATA__
<html>
<body>
<h1>Head 1</h1>
<b>Bolded</b>
<p><b><u>Bolded and underlined</u></b></p>
<p>New line</p>
</body>
</html>

prints:

Bolded
Bolded and underlined

david
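
A footnote to the token loop above: as far as I know, get_text also accepts a list of tags to stop at, so a shorter variant of the original loop may be enough here. The sketch below assumes an HTML::TokeParser version whose get_text takes end-tag arguments such as '/b'; bold_texts is a made-up helper name for illustration, not something the module provides.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# bold_texts() is a hypothetical helper, not part of HTML::TokeParser.
# It returns the text content of every <b> element in an HTML string.
sub bold_texts {
    my ($html) = @_;

    # Passing a scalar reference makes the parser read from the string.
    my $p = HTML::TokeParser->new(\$html);

    my @found;
    while ($p->get_tag('b')) {
        # Given an end-tag argument, get_text keeps reading through
        # nested tags such as <u> and stops only when it reaches </b>.
        push @found, $p->get_text('/b');
    }
    return @found;
}

print "$_\n" for bold_texts('<p><b><u>Bolded and underlined</u></b></p>');
```

The token loop is still the more flexible route when you need to react to the nested tags themselves; this version only helps when all you want is the text between <b> and </b>.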