Parsing HTML

Scott Taylor Mon, 29 Aug 2005 10:31:37 -0700

Hi,

I suck at regex, but getting better. :)


I'm probably reinventing the wheel here, but I tried to get along with
HTML::Parser and just couldn't get it to do anything.  Too confusing, I
think.

I simply want to get a list of real words from an HTML string, minus all
the HTML stuff.  For example:

$a = 'This is a line of HTML:people write strange things here<br>
and hardly ever follow proper<p>
syntax A&amp;B suck at spelling as well<br>
So I need to clean it up and strip out all<br>

words less then 3 characters in length.<p>

Later the words will go into an indexer for<br>
searching a database';

$a =~ s/<[^>]*>//gs;
$a =~ s/&amp;/&/gs;  # probably need to add more like this
@data = split (/ /,$a);
foreach $b (@data) {
  foreach $b (split (/\n/,$b)){
    foreach $b (split (/:/,$b)){
      $b =~ s/^\s+//;
      $b =~ s/\s+$//;
      $b =~ s/\n//g;
      $b =~ s/\r//g;        # strip carriage returns (a bare \c is not a valid escape)
      $b =~ s/[,.;?-]//g;   # hyphen goes last so it is a literal, not the range .-;
      if ($b and (length($b) >= 3)){   # keep words of 3 or more characters
        print "D$b\n";
      }
    }
  }
}
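
On the "probably need to add more like this" line: rather than one
substitution per entity, the decode_entities function from HTML::Entities
(distributed with HTML::Parser on CPAN) may cover it.  A minimal sketch,
assuming that module is installed; the sample string and variable names
are made up for illustration:

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Hypothetical sample input containing a few common entities.
my $text = 'A&amp;B &lt;tag&gt;';

# Called in non-void context, decode_entities returns a decoded copy
# and leaves $text untouched.
my $decoded = decode_entities($text);

print "$decoded\n";
```

Note that decoding should happen after the tags are stripped; otherwise a
decoded &lt; could be mistaken for the start of a tag.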

Is there a better, maybe more elegant, way to do this?  I don't mind
using HTML::Parser if I could only figure out how.
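
For what it's worth, one possible shape for the HTML::Parser route is
sketched below.  It assumes the version 3 API of HTML::Parser (a text
handler with the 'dtext' argspec, which passes text with entities already
decoded).  The html_words name, its minimum-length argument, and the
choice of what counts as a word character are assumptions made for
illustration, not anything the module prescribes:

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return the words of at least $min_len characters
# found in the text portions of an HTML string.
sub html_words {
    my ($html, $min_len) = @_;
    my @words;

    my $parser = HTML::Parser->new(
        api_version => 3,
        # The text handler only sees text between tags; 'dtext' means
        # entities such as &amp; arrive already decoded.
        text_h => [
            sub {
                my ($text) = @_;
                # Split on runs of anything that is not a letter, digit,
                # or apostrophe, then drop short (and empty) pieces.
                push @words,
                    grep { length($_) >= $min_len }
                    split /[^A-Za-z0-9']+/, $text;
            },
            'dtext',
        ],
    );

    $parser->unbroken_text(1);   # deliver each run of text in one piece
    $parser->parse($html);
    $parser->eof;                # flush any text still buffered
    return @words;
}

print "$_\n" for html_words('people write<br>strange A&amp;B things<p>here', 3);
```

The parser takes care of the tags and entities, so the only regex left is
the one that decides what separates two words.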

Cheers.

--
Scott
