On Wed, 13 Feb 2002, John wrote:

> I have a scalar variable containing HTML that needs to be converted
> to XML. It's not the best HTML so it has invalid characters (like
> smart quotes, 1/2 character, etc.). I need to determine if these
> characters exist in the data and throw an error if they do. What
> is the best way to do this? I can't use an XML parser because it's
> not really XML.

But you can use an HTML parser, such as HTML::Parser. There are some
useful subclasses of it, such as HTML::LinkExtor and HTML::TokeParser.

> Also, if I have a block of text like this:
>
> <!-- begin article1 title -->title1<!-- end article1 -->
> <!-- begin article1 body -->body1<!-- end article1 body -->
> ...
> <!-- begin articleN title -->titleN<!-- end articleN title>
> <!-- begin articleN body -->bodyN<!-- end articleN body -->
>
> Where the ... means there could be some number of articles (less
> than 5), can anyone think of a relatively simple regex (I mean I
> don't want to have article1, article2, etc. hard-coded in the regex)

Don't use a regex to pull apart HTML; it'll be more trouble than it's
worth.

-- Brett
http://www.chapelperilous.net/
------------------------------------------------------------------------
/* And you'll never guess what the dog had */
/*   in its mouth... */
        -- Larry Wall in stab.c from the perl source code
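
A minimal sketch of the parser-based approach suggested above, assuming
HTML::TokeParser is installed from CPAN and that the begin/end markers
are well-formed HTML comments. The sample data, the %articles hash, and
the marker patterns are illustrative assumptions, not something from the
original question. The small patterns only inspect the text of a comment
that the parser has already isolated; they do not pull apart the HTML
itself.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample input, modelled on the block in the question.
my $html = <<'END';
<!-- begin article1 title -->title1<!-- end article1 title -->
<!-- begin article1 body -->body1<!-- end article1 body -->
<!-- begin article2 title -->title2<!-- end article2 title -->
<!-- begin article2 body -->body2<!-- end article2 body -->
END

# HTML::TokeParser accepts a reference to a scalar holding the document.
my $p = HTML::TokeParser->new(\$html)
    or die "Could not create parser\n";

my (%articles, $current);
while (my $token = $p->get_token) {
    my $type = $token->[0];
    if ($type eq 'C') {
        # Comment token: $token->[1] holds the comment text.
        if ($token->[1] =~ /\bbegin\s+(\w+)\s+(\w+)/) {
            $current = [$1, $2];    # [article name, field name]
        }
        elsif ($token->[1] =~ /\bend\b/) {
            undef $current;         # leave the current field
        }
    }
    elsif ($type eq 'T' && $current) {
        # Text token inside a begin/end pair. Append, because the
        # parser may deliver one run of text in several pieces.
        $articles{ $current->[0] }{ $current->[1] } .= $token->[1];
    }
}

for my $name (sort keys %articles) {
    print "$name: title=$articles{$name}{title} body=$articles{$name}{body}\n";
}
```

Nothing about article1, article2, and so on is hard-coded: the article
and field names are read from each begin marker, so the same loop
handles however many articles the block contains.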