On Jul 29, 2004, at 7:52 AM, Francesco del Vecchio wrote:
Hi guys,
Hello.
I have a problem with a Regular expression.
I have to delete from a text all HTML tags but not the DIV one (keeping all the parameters in the tag).
This is a complex problem. Your solution is pretty naive and will only work on a tight set of HTML, formatted as you expect it to be.
I'm not saying that's a problem. If you know your HTML will stay simple, it isn't.
However, if you need or even think you may someday need a more robust approach, you should check out the HTML parsing modules on the CPAN.
I've done this:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ #!/usr/bin/perl use strict;
I would add:
use warnings;
This doesn't do anything for you here, but it's a good habit to build. It often makes finding errors much easier.
my $test=<<EOS;
<html><head><meta content="MSHTML 6.00.2800.1400" name="GENERATOR">
</head><body><font face="Courier New" size=2>
=========SUPER SAVING=========
<br>
-product one
<br>
-product two
<br><D>
-product three
<br><dIV section=true>
==============================
<Br></DIV>
<br><br></font></body>
</html>
EOS
$test=~s/<br>/\n/ig;
A little less naive might be:
$test =~ s/<\s*br\s*>/\n/ig;
Even that wouldn't catch the now common <br /> though. Again, use a module if this kind of thing is important.
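To make that concrete, here is a rough sketch of a pattern that also tolerates the self-closing form. The sample string and variable names are made up for illustration, and it still ignores tags that carry attributes, such as <br clear="all">, so treat it as a small step up rather than a real fix.

```perl
use strict;
use warnings;

# Hypothetical sample input covering several spellings of the line-break tag.
my $sample = "one<br>two<BR/>three<br />four< br >five";

# Allow optional whitespace after <, an optional self-closing slash,
# and optional whitespace on either side of that slash.
(my $converted = $sample) =~ s/<\s*br\s*\/?\s*>/\n/ig;

print $converted, "\n";
```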
$test=~s/<^[DIV](.*?)>//ig;
This is currently removing zero tags. You are asking for a <, followed by the beginning of the string (^). That is impossible, and thus never matches. I believe you meant [^DIV]+, which is a character class meaning one or more characters that are not D, I, or V (not "anything but the word DIV"), but that won't work either for reasons you pointed out.
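If you want to see both failures for yourself, a quick experiment along these lines should show them. The sample HTML and variable names here are invented for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical test string: one tag containing an "i", two without, and a DIV pair.
my $html = '<font size=2><b>bold</b><div id=x>kept</div>';

# The original pattern: ^ right after < can only match at the start of the
# string, where a < has already been consumed, so it never matches.
my $caret_hits = () = $html =~ /<^[DIV](.*?)>/ig;

# The character-class version: [^DIV]+ rejects any d, i, or v anywhere in
# the tag, so <font size=2> survives just because "size" contains an i.
(my $stripped = $html) =~ s/<[^DIV]+>//ig;

print "caret pattern matched $caret_hits time(s)\n";
print "$stripped\n";
```

Note that the character class also happily matches > and <, so on other input it can swallow the text between two tags.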
Here's a simple fix:
$test =~ s/<(?!\/?DIV)[^>]+>//ig;
That searches for a <, then uses a negative look-ahead assertion to verify that a DIV or /DIV is not next, and finally grabs everything up to the next >. It works on the example you provided.
I know I sound like a broken record, but I must again stress how weak this is. If the HTML contains a < DIV> (note the space), it won't work properly. Again, parsing HTML is painful. Use a module and benefit from the suffering of others if you need an intelligent solution.
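If you do stay with a regex, a slightly more forgiving variant might look like the sketch below. It allows whitespace around the slash and the tag name, and adds \b so that a tag which merely starts with "div" is not kept by accident. The sample input is hypothetical, and this still falls over on comments or on attribute values that contain a >.

```perl
use strict;
use warnings;

# Hypothetical input: a spaced-out DIV, a closing DIV with a trailing space,
# and a made-up <divider> tag that only starts with "div".
my $html = '<p>< DIV class=a>body</DIV ><divider>x</p>';

# The negative look-ahead skips optional whitespace and an optional slash,
# then requires "div" to end at a word boundary before sparing the tag.
(my $kept = $html) =~ s/<(?!\s*\/?\s*div\b)[^>]+>//ig;

print "$kept\n";
```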
print $test; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Hope that helps.
James
P.S. You can use whitespace (blank lines and spaces) to pretty up your code a little. Your eyes will thank you. Don't worry, it's free! ;)