Re: Scan data for XML invalid characters and parse articles

Brett W. McCoy Wed, 13 Feb 2002 08:40:30 -0800

On Wed, 13 Feb 2002, John wrote:

> I have a scalar variable containing HTML that needs to be converted
> to XML.  It's not the best HTML so it has invalid characters (like
> smart quotes, 1/2 character, etc.).  I need to determine if these
> characters exist in the data and throw an error if they do.  What
> is the best way to do this?  I can't use an XML parser because it's
> not really XML.


But you can use an HTML parser, such as HTML::Parser.  There are some
useful subclasses of it, like HTML::LinkExtor and HTML::TokeParser.
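
As a rough sketch of the character check: walk the text tokens with
HTML::TokeParser and collect anything suspicious.  Two assumptions here
that are mine, not yours: "invalid" means any character outside tab,
newline, carriage return and printable ASCII, and the sub name
find_suspect_chars is made up for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: returns every suspect character found in the
# text portions of the HTML (tags and attributes are not checked).
sub find_suspect_chars {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html)
        or die "Can't set up the parser";
    my @found;
    while (my $token = $p->get_token) {
        next unless $token->[0] eq 'T';    # text tokens only
        # Assumption: anything outside tab, LF, CR and printable
        # ASCII counts as invalid for your purposes.
        push @found, $token->[1] =~ /([^\x09\x0A\x0D\x20-\x7E])/g;
    }
    return @found;
}

# Sample data with Windows-style smart quotes (0x93, 0x94) and a 1/2 (0xBD)
my $html = "<p>He said \x93hello\x94 and ate \xBD a pie.</p>";
my @bad  = find_suspect_chars($html);
printf "Found %d suspect character(s); first is 0x%02X\n",
    scalar @bad, ord $bad[0]
    if @bad;
```

Swap the printf for a die if you want a hard error.  If attribute
values matter too, the same test can be applied to the 'S' (start tag)
tokens.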

> Also, if I have a block of text like this:
>
> <!-- begin article1 title -->title1<!-- end article1 -->
> <!-- begin article1 body -->body1<!-- end article1 body -->
> ...
> <!-- begin articleN title -->titleN<!-- end articleN title>
> <!-- begin articleN body -->bodyN<!-- end articleN body -->
>
> Where the ... means there could be some number of articles (less
> than 5), can anyone think of a relatively simple regex (I mean I
> don't want to have article1, article2, etc. hard-coded in the regex)

Don't use a regex to pull apart HTML; it'll be more trouble than it's worth.
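
Something along these lines might do for the articles.  It is only a
sketch: it assumes every marker is a well-formed comment of the form
"begin articleN title" or "begin articleN body", and extract_articles
is a name I made up.  The parser finds the comments; the small regex
only looks inside the comment text, never at the markup itself, so no
article numbers are hard-coded.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Where each token type keeps its original source text
my %raw_index = (T => 1, S => 4, E => 2);

# Hypothetical helper: returns { 1 => { title => ..., body => ... }, ... }
sub extract_articles {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html)
        or die "Can't set up the parser";
    my (%articles, $num, $field);
    while (my $token = $p->get_token) {
        my ($type, $text) = @$token;
        if ($type eq 'C') {
            if ($text =~ /begin\s+article(\d+)\s+(title|body)/) {
                # Start collecting into this article's title or body
                ($num, $field) = ($1, $2);
                $articles{$num}{$field} = '';
            }
            elsif ($text =~ /end\s+article\d+/) {
                undef $field;              # stop collecting
            }
        }
        elsif (defined $field && exists $raw_index{$type}) {
            # Keep text and tags between the markers unchanged
            $articles{$num}{$field} .= $token->[ $raw_index{$type} ];
        }
    }
    return \%articles;
}

my $html = <<'END';
<!-- begin article1 title -->title1<!-- end article1 title -->
<!-- begin article1 body --><p>body1</p><!-- end article1 body -->
<!-- begin article2 title -->title2<!-- end article2 title -->
<!-- begin article2 body -->body2<!-- end article2 body -->
END

my $articles = extract_articles($html);
for my $n (sort { $a <=> $b } keys %$articles) {
    print "$n: $articles->{$n}{title} / $articles->{$n}{body}\n";
}
```

Note that a marker that isn't closed with "-->" won't be seen as a
separate comment by the parser, so the markers in the real data need to
be well formed for this to work.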

-- Brett

                                          http://www.chapelperilous.net/
------------------------------------------------------------------------
/* And you'll never guess what the dog had */
/*   in its mouth... */
             -- Larry Wall in stab.c from the perl source code


