Re: HTML::TokeParser and

John W. Krahn Wed, 29 Jan 2003 05:31:06 -0800

David Eason wrote:
> 
> John W. Krahn wrote:
> > According to HTML::Entities
> >
> >  # Some extra Latin 1 chars that are listed in the HTML3.2 draft (21-May-96)
> >  copy   => '©',  # copyright sign
> >  reg    => '®',  # registered sign
> >  nbsp   => "\240", # non breaking space
> 
> Thanks, John, I had no idea where to look. I didn't know a non-breaking
> space was an actual character, I thought it was just a directive to the
> browser.


AFAIK it is.

> I have corrected the code below accordingly and it prints "line
> 1line 3" as desired.

FWIW on my computer "\240" prints a "space".  :-)
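
A quick way to see what "\240" actually is, as a minimal sketch using only core Perl (the variable name $nbsp is just for illustration): the escape is octal, so it denotes character 160 (0xA0), which is the non-breaking space in Latin-1. How it is displayed depends on the terminal.

```perl
use strict;
use warnings;

# "\240" is an octal escape: 0240 octal == 160 decimal == 0xA0 hex.
my $nbsp = "\240";

printf "%d 0x%X\n", ord($nbsp), ord($nbsp);           # prints "160 0xA0"
print $nbsp eq chr(160) ? "same\n" : "different\n";   # prints "same"
```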

> use strict;
> use warnings;
> use HTML::TokeParser;
> 
> my $p = HTML::TokeParser->new(*DATA) or die "Can't open: $!";
> while (my $tag = $p->get_tag())
> {
>     if ($tag->[0] eq "dd")
>     {
>         my $text = $p->get_trimmed_text();
>         $text =~ s/^[\s\240]*(.*?)[\s\240]*$/$1/;

If you are going to do that then you might as well call get_text and do
all the trimming yourself.

          my $text = $p->get_text();
          for ( $text ) {
              s/^[\s\240]+//;
              s/[\s\240]+$//;
              s/[\s\240]+/ /g;
          }

>         print "$text";
>     }
> }
> 
> __DATA__
> 
> <DD>line 1</DD>
> <DD>&nbsp;</DD>
> <DD>line 3</DD>
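
For trying the trimming on its own, without HTML::TokeParser, here is a minimal sketch that wraps the three substitutions above in a sub. The name trim_nbsp is hypothetical, not part of any module, and the input strings are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::TokeParser): strips leading and
# trailing whitespace and collapses inner runs to a single space, treating
# "\240" (the Latin-1 non-breaking space) as whitespace as well.
sub trim_nbsp {
    my ($text) = @_;
    for ($text) {               # alias $_ to our copy so s/// edits it
        s/^[\s\240]+//;         # leading whitespace and NBSPs
        s/[\s\240]+$//;         # trailing whitespace and NBSPs
        s/[\s\240]+/ /g;        # inner runs become one ordinary space
    }
    return $text;
}

print trim_nbsp("\240 line\240\240 1 \n"), "\n";   # prints "line 1"
print length(trim_nbsp("\240")), "\n";             # prints 0
```

A <DD> that holds only &nbsp; comes out as an empty string this way, which is why it contributes nothing to the printed output.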


John
-- 
use Perl;
program
fulfillment
