Re: deleting HTML tag...but not everyone

James Edward Gray II Thu, 29 Jul 2004 06:50:35 -0700

On Jul 29, 2004, at 7:52 AM, Francesco del Vecchio wrote:

Hi guys,


Hello.

I have a problem with a Regular expression. I have to delete from a text all HTML tags but not the DIV one (keeping all the parameters in the tag).

This is a complex problem. Your solution is pretty naive and will only work on a tight set of HTML, formatted as you expect it to be.

I'm not saying that's a problem. If you know your HTML will stay simple, it isn't.

However, if you need or even think you may someday need a more robust approach, you should check out the HTML parsing modules on the CPAN.

I've done this:

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#!/usr/bin/perl
use strict;


I would add:

use warnings;

This doesn't do anything for you here, but it's a good habit to build. It often makes finding errors much easier.
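
As a rough illustration of the kind of error warnings can surface, here is a minimal sketch. The hash `%price`, the misspelled key, and the `@caught` array are all made up for the example; the `__WARN__` handler is only there so the warning can be printed alongside the result.

```perl
use strict;
use warnings;

# Collect warnings instead of letting them go straight to STDERR.
my @caught;
local $SIG{__WARN__} = sub { push @caught, $_[0] };

# Hypothetical data: note the typo in the key used below.
my %price = (apple => 3);
my $total = 10 + $price{aple};    # undef is silently treated as 0

print "total: $total\n";
print "warning: $_" for @caught;  # the warning names the uninitialized value
```

Without `use warnings;` the script would print a total of 10 and say nothing about the misspelled key.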

my $test=<<EOS;
<html><head><meta content="MSHTML 6.00.2800.1400" name="GENERATOR">
</head><body><font face="Courier New" size=2>
=========SUPER SAVING========= <br>
-product one <br>
-product two <br><D>
-product three <br><dIV section=true>
============================== <Br></DIV>
<br><br></font></body> </html>
EOS
$test=~s/<br>/\n/ig;


A little less naive might be:

$test =~ s/<\s*br\s*>/\n/ig;

Even that wouldn't catch the now common <br /> though. Again, use a module if this kind of thing is important.
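
A sketch of a pattern that also accepts the self-closing form, assuming the tag carries no attributes. The sample string and the names `$text` and `$converted` are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical input covering several spellings of the tag.
my $text = "one<br>two<BR />three< br/ >four<br/>five";

# Optional whitespace, an optional slash, then optional whitespace again.
# Using s{}{} delimiters avoids having to escape the slash.
(my $converted = $text) =~ s{<\s*br\s*/?\s*>}{\n}ig;

print $converted, "\n";
```

This still misses something like <br clear="all">, which is one more reason to reach for a module when the input is not under your control.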

$test=~s/<^[DIV](.*?)>//ig;

This is currently removing zero tags. You are asking for a <, followed by the beginning of the string (^). That is impossible, and thus never matches. I believe you meant [^DIV]+, which means one or more non D, I, or V characters, but that won't work either for reasons you pointed out.
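
Both failures are easy to see on a small input. The following is only a sketch; the sample HTML and the names `$html`, `$first`, and `$second` are made up for the demonstration.

```perl
use strict;
use warnings;

my $html = '<html><body><font size=2>text<br></font></body></html>';

# '^' right after '<' demands the start of the string, so nothing matches.
(my $first = $html) =~ s/<^[DIV](.*?)>//ig;

# [^DIV]+ refuses any tag with a d, i, or v anywhere inside it,
# so <body> and <font size=2> survive even though they are not DIV tags.
(my $second = $html) =~ s/<[^DIV]+>//ig;

print "first:  $first\n";
print "second: $second\n";
```

The first substitution leaves the string untouched. The second removes only the tags that happen to contain none of the three letters.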

Here's a simple fix:

$test =~ s/<(?!\/?DIV)[^>]+>//ig;

That searches for a <, then uses a negative look-ahead assertion to verify that a DIV or /DIV is not next, and finally grabs everything up to the next >. It works on the example you provided.
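
To see the look-ahead at work in isolation, here is a small sketch on a made-up fragment (the names `$html` and `$stripped` are not from the original script):

```perl
use strict;
use warnings;

my $html = '<font size=2>keep <dIV section=true>this</DIV> text</font><D>';

# (?!\/?DIV) consumes nothing; it only vetoes a match when the text
# right after '<' is DIV or /DIV (case-insensitively, because of /i).
(my $stripped = $html) =~ s/<(?!\/?DIV)[^>]+>//ig;

print "$stripped\n";
```

The <D> tag is removed because the look-ahead needs all three letters to veto the match, while both DIV tags keep their original case and parameters.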

I know I sound like a broken record, but I must again stress how weak this is. If the HTML contains a < DIV> (note the space), it won't work properly. Again, parsing HTML is painful. Use a module and benefit from the suffering of others if you need an intelligent solution.
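
If you stay with a regex anyway, a slightly more tolerant variant might look like the sketch below. It allows whitespace around the tag name and adds a word boundary so that a longer name starting with "div" is not mistaken for DIV. The sample input, including the invented <divider> tag, is hypothetical.

```perl
use strict;
use warnings;

my $html = '<p>a< DIV id=x>b</ div >c<divider>d</p>';

# The look-ahead now skips optional whitespace and an optional slash
# before checking for the whole word "div".
(my $stripped = $html) =~ s{<(?!\s*/?\s*div\b)[^>]+>}{}ig;

print "$stripped\n";
```

This is still a regex pretending to be a parser: a > inside an attribute value, a comment, or a script block will break it.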

print $test;
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


Hope that helps.

James

P.S. You can use whitespace (blank lines and spaces) to pretty up your code a little. Your eyes will thank you. Don't worry, it's free! ;)

