From: Ben Siders <[EMAIL PROTECTED]>
> I've got a real easy one here (in theory).  I have some XML files that
> were generated by a program, but generated imperfectly.  There's some
> naked ampersands that need to be converted to &amp;.  I need a regexp
> that will detect them and change them.  Sounds easy enough.
> 
> The pattern I want to match is an ampersand that is NOT immediately
> followed by a few characters and then a semicolon.  Any ideas?
> 
> This is the best I've come up with so far.  It should match an
> ampersand whose following characters, up to five, are not semicolons. 
> I don't feel that this is a great solution.  I'm hoping the community
> can think of a better one.
> 
> $line =~ s/\&[^;]{,5}/\&amp;/g;
> 
> I'm hoping that'll match something like:  "<tag>Blah data &</tag>",
> but NOT match "<tag>Blah &amp;</tag>".
> 
> I'm not sure if I'm on the right track here.  I also can't match other
> escaped characters such as: "<tag>Copyright &copy; 2003</tag>".

For something similar I use this (I have it inside a module):

use HTML::Entities;
sub PolishHTML {
        my $str = shift;
        if ($AllowXHTML) {
                # XHTML/XML: a tag may end in "/>" (note the /? before the closing >)
                $str =~ s{(.*?)(&\w+;|&#\d+;|<\w[\w\d]*(?:\s+\w[\w\d]*(?:\s*=\s*(?:[^" '><\s]+|(?:'[^']*')+|(?:"[^"]*")+))?)*\s*/?>|</\w[\w\d]*>|$)}
                         {HTML::Entities::encode($1, '^\r\n\t !\#\$%\"\'-;=?-~').$2}gem;
        } else {
                # Plain HTML: same pattern, but self-closing tags are not accepted
                $str =~ s{(.*?)(&\w+;|&#\d+;|<\w[\w\d]*(?:\s+\w[\w\d]*(?:\s*=\s*(?:[^" '><\s]+|(?:'[^']*')+|(?:"[^"]*")+))?)*\s*>|</\w[\w\d]*>|$)}
                         {HTML::Entities::encode($1, '^\r\n\t !\#\$%\"\'-;=?-~').$2}gem;
        }
        return $str;
}

It escapes any &, < and > that do not appear to belong to HTML 
entities or tags.
If you use this on XML, you will want to set $AllowXHTML 
(or just use the first branch).
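
To see what the second argument to encode() selects, here is a small sketch in core Perl only. It assumes, as the HTML::Entities documentation describes, that the string is used as the body of a character class, so the leading ^ means "encode everything that is NOT listed". The names $safe and @escaped are made up for the illustration.

```perl
use strict;
use warnings;

# The "safe" set from the encode() call above, without the leading ^.
# In a single-quoted string \' becomes ', the other backslashes are kept
# and are interpreted later by the regex engine.
my $safe = '\r\n\t !\#\$%\"\'-;=?-~';

# Printable ASCII characters that fall outside the safe set,
# i.e. the ones that would be turned into entities.
my @escaped = grep { !/[${safe}]/ } map { chr } 32 .. 126;

print "@escaped\n";    # prints: & < >
```

Characters outside printable ASCII (control characters other than CR, LF and tab, and anything above ~) fall outside the safe set as well, so they get encoded too.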


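If all you want is to fix the bare ampersands, something like this 
should do. The negative lookahead (?!...) skips any & that already 
starts a named (&copy;) or decimal numeric (&#169;) entity: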

        $line =~ s/&(?!\w+;|#\d+;)/&amp;/g;
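
A minimal sketch of that substitution applied to the examples from the question. The sub name escape_bare_ampersands is made up here, and the extra #x... alternative for hexadecimal references (&#x26;) is my assumption about what the data may contain; drop it if the files never use them.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of any module.
sub escape_bare_ampersands {
    my ($line) = @_;
    # Replace & unless it starts "&name;", "&#123;" or "&#x1F;"
    # (the hex form is an added assumption, see above).
    $line =~ s/&(?!\w+;|#\d+;|#x[0-9A-Fa-f]+;)/&amp;/g;
    return $line;
}

for my $line (
    '<tag>Blah data &</tag>',
    '<tag>Blah &amp;</tag>',
    '<tag>Copyright &copy; 2003</tag>',
    '<tag>AT&T &#38; &#x26;</tag>',
) {
    print escape_bare_ampersands($line), "\n";
}
```

Only the first and last lines change (to "Blah data &amp;" and "AT&amp;T"). Note the limitation: text like "R&D;" looks like an entity to this pattern and is left alone, since the regexp does not check the name against a list of known entities.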


Jenda
===== [EMAIL PROTECTED] === http://Jenda.Krynicky.cz =====
When it comes to wine, women and song, wizards are allowed 
to get drunk and croon as much as they like.
        -- Terry Pratchett in Sourcery

