David Eason wrote:
>
> John W. Krahn wrote:
> > According to HTML::Entities
> >
> > # Some extra Latin 1 chars that are listed in the HTML3.2 draft
> > (21-May-96)
> > copy => '©', # copyright sign
> > reg => '®', # registered sign
> > nbsp => "\240", # non breaking space
>
> Thanks, John, I had no idea where to look. I didn't know a non-breaking
> space was an actual character, I thought it was just a directive to the
> browser.

AFAIK it is.

> I have corrected the code below accordingly and it prints "line
> 1line 3" as desired.

FWIW on my computer "\240" prints a "space". :-)

> use strict;
> use warnings;
> use HTML::TokeParser;
>
> my $p = HTML::TokeParser->new(*DATA) or die "Can't open: $!";
> while (my $tag = $p->get_tag())
> {
>     if ($tag->[0] eq "dd")
>     {
>         my $text = $p->get_trimmed_text();
>         $text =~ s/^[\s\240]*(.*?)[\s\240]*$/$1/;

If you are going to do that then you might as well call get_text and do
all the trimming yourself.

    my $text = $p->get_text();
    for ( $text ) {
        s/^[\s\240]+//;
        s/[\s\240]+$//;
        s/[\s\240]+/ /g;
    }

>         print "$text";
>     }
> }
>
> __DATA__
>
> <DD>line 1</DD>
> <DD>&nbsp;</DD>
> <DD>line 3</DD>

John
-- 
use Perl;
program
fulfillment
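
For anyone who wants to try the three substitutions without an HTML parser, here is a minimal standalone sketch. It assumes the text is a plain byte string in which the non-breaking space is the single Latin-1 character "\240". The sub name squeeze_space and the sample strings are made up for illustration; they are not part of HTML::TokeParser, and the samples merely stand in for whatever get_text() would hand back.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::TokeParser): trims the ends and
# collapses interior runs, treating "\240" (Latin-1 NBSP) as whitespace.
sub squeeze_space {
    my ($text) = @_;
    for ($text) {
        s/^[\s\240]+//;     # strip leading whitespace and NBSPs
        s/[\s\240]+$//;     # strip trailing whitespace and NBSPs
        s/[\s\240]+/ /g;    # collapse each remaining run to one space
    }
    return $text;
}

# Made-up sample strings standing in for get_text() results.
my @samples = ("  line 1\n", "\240", "line\240\240 3 ");
print join("|", map { squeeze_space($_) } @samples), "\n";
# prints: line 1||line 3
```

The explicit \240 in the character class matters here: on a byte string under Perl's default rules, \s alone does not match the non-breaking space.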