From: Ben Siders <[EMAIL PROTECTED]>
> I've got a real easy one here (in theory). I have some XML files that
> were generated by a program, but generated imperfectly. There's some
> naked ampersands that need to be converted to &amp;. I need a regexp
> that will detect them and change them. Sounds easy enough.
>
> The pattern I want to match is an ampersand that is NOT immediately
> followed by a few characters and then a semicolon. Any ideas?
>
> This is the best I've come up with so far. It should match an
> ampersand whose following characters, up to five, are not semicolons.
> I don't feel that this is a great solution. I'm hoping the community
> can think of a better one.
>
> $line =~ s/\&[^;]{,5}/\&amp;/g;
>
> I'm hoping that'll match something like: "<tag>Blah data &</tag>",
> but NOT match "<tag>Blah &amp;</tag>".
>
> I'm not sure if I'm on the right track here. I also can't match other
> escaped characters such as: "<tag>Copyright &copy; 2003</tag>".
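To see why the quoted attempt is not quite there, here is a minimal sketch. The helper name naive_fix is made up for illustration, and I have spelled the quantifier {0,5} instead of {,5}, because as far as I know older Perls treat {,5} as literal text rather than as a quantifier. Even with a working quantifier, the pattern consumes whatever follows the ampersand, so the replacement eats part of the closing tag and puts a second semicolon after an existing entity.

```perl
use strict;
use warnings;

# Hypothetical helper: the quoted attempt, with the quantifier written
# as {0,5} so it is read as a quantifier on any Perl version.
sub naive_fix {
    my ($line) = @_;
    # Matches '&' plus up to five non-semicolon characters and replaces
    # the whole match, including those trailing characters.
    $line =~ s/&[^;]{0,5}/&amp;/g;
    return $line;
}

# The naked ampersand is fixed, but "</tag" is swallowed with it.
print naive_fix('<tag>Blah data &</tag>'), "\n";   # <tag>Blah data &amp;>

# An already escaped ampersand picks up a second semicolon.
print naive_fix('<tag>Blah &amp;</tag>'), "\n";    # <tag>Blah &amp;;</tag>
```

What is wanted instead is a test that looks at the text after the ampersand without consuming it.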
For something similar I use this (I have it inside a module):

    use HTML::Entities;

    sub PolishHTML {
        my $str = shift;
        if ($AllowXHTML) {
            # Same as the branch below, but also accepts self-closing tags (<br />).
            $str =~ s{(.*?)(&\w+;|&#\d+;|<\w[\w\d]*(?:\s+\w[\w\d]*(?:\s*=\s*(?:[^" '><\s]+|(?:'[^']*')+|(?:"[^"]*")+))?)*\s*/?>|</\w[\w\d]*>|$)}
                     {HTML::Entities::encode($1, '^\r\n\t !\#\$%\"\'-;=?-~').$2}gem;
        } else {
            $str =~ s{(.*?)(&\w+;|&#\d+;|<\w[\w\d]*(?:\s+\w[\w\d]*(?:\s*=\s*(?:[^" '><\s]+|(?:'[^']*')+|(?:"[^"]*")+))?)*\s*>|</\w[\w\d]*>|$)}
                     {HTML::Entities::encode($1, '^\r\n\t !\#\$%\"\'-;=?-~').$2}gem;
        }
        return $str;
    }

It escapes any &, < and > that don't seem to belong to HTML entities
or tags. If you use this on XML you will want to set $AllowXHTML
(or just use the first branch).

If all you want is to process the ampersand you may want something
like this:

    $line =~ s/&(?!\w+;|#\d+;)/&amp;/g;

Jenda
===== [EMAIL PROTECTED] === http://Jenda.Krynicky.cz =====
When it comes to wine, women and song, wizards are allowed
to get drunk and croon as much as they like.
  -- Terry Pratchett in Sourcery
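As a quick check of the ampersand-only substitution above, here is a minimal sketch run against the examples from the question. The sub names and the extra test strings are made up for illustration. The second sub is my own assumption, not part of the reply: it adds hexadecimal character references such as &#x26;, which the pattern as posted would escape again.

```perl
use strict;
use warnings;

# The substitution from the reply: escape '&' unless it is followed by
# a named entity (\w+;) or a decimal character reference (#\d+;).
# The negative lookahead (?!...) checks the following text without
# consuming it, so nothing after the ampersand is replaced.
sub escape_bare_ampersands {
    my ($line) = @_;
    $line =~ s/&(?!\w+;|#\d+;)/&amp;/g;
    return $line;
}

# Assumed variant (not in the original reply): also leave hexadecimal
# character references like &#x26; alone.
sub escape_bare_ampersands_hex {
    my ($line) = @_;
    $line =~ s/&(?!\w+;|#\d+;|#x[0-9A-Fa-f]+;)/&amp;/g;
    return $line;
}

print escape_bare_ampersands('<tag>Blah data &</tag>'), "\n";
# <tag>Blah data &amp;</tag>
print escape_bare_ampersands('<tag>Blah &amp;</tag>'), "\n";
# <tag>Blah &amp;</tag>
print escape_bare_ampersands('<tag>Copyright &copy; 2003</tag>'), "\n";
# <tag>Copyright &copy; 2003</tag>
print escape_bare_ampersands('<tag>AT&T &#38; co</tag>'), "\n";
# <tag>AT&amp;T &#38; co</tag>
print escape_bare_ampersands_hex('<tag>&#x26; & x</tag>'), "\n";
# <tag>&#x26; &amp; x</tag>
```

Note that the pattern only looks at the shape of what follows the ampersand. Text such as "AT&T;" would be taken for an entity and left unescaped, so this is a repair heuristic, not a validator.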