Hi, I suck at regex, but getting better. :)
I'm probably reinventing the wheel here, but I tried to get along with HTML::Parser and just couldn't get it to do anything. Too confusing, I think. I simply want to get a list of real words from an HTML string, minus all the HTML stuff. For example:

    $a = 'This is a line of HTML:people write strange things here<br>
    and hardly ever follow proper<p>
    syntax A&amp;B suck at spelling as well<br>
    So I need to clean it up and strip out all<br>
    words less then 3 characters in length.<p>
    Later the words will go into an indexer for<br>
    searching a database';

    $a =~ s/<[^>]*>//gs;     # drop anything that looks like a tag
    $a =~ s/&amp;/&/gs;      # probably need to add more like this
    @data = split (/ /,$a);
    foreach $b (@data) {
        foreach $b (split (/\n/,$b)){
            foreach $b (split (/:/,$b)){
                $b =~ s/^\s+//;          # trim leading whitespace
                $b =~ s/\s+$//;          # trim trailing whitespace
                $b =~ s/\n//g;
                $b =~ s/\r//g;           # strip carriage returns
                $b =~ s/[,.;?-]//gs;     # hyphen last so it is not read as a range
                if ($b and (length($b) >= 3)){
                    print "D$b\n";
                }
            }
        }
    }

Is there a better, maybe more elegant, way to do this? I wouldn't mind using HTML::Parser if I could only figure out how.

Cheers.

--
Scott
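
For comparison, the steps described above (strip tags, decode entities, split into words, drop the short ones) can be collapsed into a single pass. This is only a minimal sketch using core Perl, not the HTML::Parser approach. The helper name html_words, the default minimum of 3 characters, and the definition of a "word" as a run of letters, digits, or apostrophes are all assumptions made for illustration. It decodes only &amp;, like the code above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns the words of at
# least $min characters found in an HTML fragment.
sub html_words {
    my ($html, $min) = @_;
    $min = 3 unless defined $min;

    $html =~ s/<[^>]*>/ /g;   # replace each tag with a space so words stay apart
    $html =~ s/&amp;/&/g;     # decode only &amp;; other entities are left alone

    # Assumption: a word is a run of letters, digits, or apostrophes.
    # In list context the /g match returns every such run.
    return grep { length($_) >= $min } $html =~ /[A-Za-z0-9']+/g;
}

my $html = 'This is a line of HTML:people write<br> strange things A&amp;B<p> ok';
print "$_\n" for html_words($html);
```

With that sample string it prints This, line, HTML, people, write, strange, and things, one per line. Matching the words directly, rather than splitting on spaces, newlines, and colons in turn, removes the nested loops and the separate punctuation cleanup.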