> I have a scalar variable containing HTML that needs to be converted
> to XML.  It's not the best HTML so it has invalid characters (like
> smart quotes, 1/2 character, etc.).  I need to determine if these
> characters exist in the data and throw an error if they do.  What
> is the best way to do this?  I can't use an XML parser because it's
> not really XML.

Welp, ultimately, if you were using an XML parser, it would
choke on the bad data. For instance, this code:

  use XML::Simple;
  my $data = eval { XMLin( $xml_data ); };
  if ($@) { print $@; }

Would produce error messages like:

   There was an error loading morelikethisweblog.xml: not
   well-formed at line 72, column 28, byte 4001 at
   C:/Perl/site/lib/XML/Parser.pm line 168.

   There was an error loading hackintheboxorg.xml: mismatched tag
   at line 168, column 2, byte 4751 at
   C:/Perl/site/lib/XML/Parser.pm line 168.

A cheat would be to wrap your data in a dummy root element:

   my $invalid_data_check = "<data>$real_data</data>";

Then run XMLin on $invalid_data_check inside an eval, as above. Another
option is to HTML-encode all the data before passing it off to the XML
creation/parsing code:

   use HTML::Entities qw( %char2entity );
   $real_data =~ s/([^\s!\#\$&%\'-;=?-~<>"])/$char2entity{$1}/g;

(Note: in this example, I'm importing the %char2entity hash myself, which
allows me to define exactly which characters I DO NOT want turned into
entities; that's the negated character class at the start of the regexp.
Check the HTML::Entities man page for the defaults.)
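
One caveat with named entities: plain XML only predefines &amp;, &lt;,
&gt;, &quot; and &apos;, so something like &frac12; can still trip the
parser unless a DTD declares it. A rough sketch of an alternative is to
emit numeric character references instead. The numeric_encode() helper
below is made up for illustration (it is not part of HTML::Entities), and
it assumes the string has already been decoded to characters. It also
makes no attempt to deal with control characters, which XML 1.0 rejects
even as references.

```perl
use strict;
use warnings;

# Hypothetical helper, not from any module: replace every character outside
# tab, newline, carriage return and printable ASCII with a numeric
# character reference such as &#189;.
sub numeric_encode {
    my ($text) = @_;
    $text =~ s/([^\t\n\r\x20-\x7E])/sprintf('&#%d;', ord($1))/ge;
    return $text;
}

# Character 0xBD is the 1/2 character in Latin-1.
print numeric_encode("1\xBD cups"), "\n";
```

Markup characters such as < and > are left alone here, same as in the
regexp above, so the tags themselves survive.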

With the above in hand, my XML parsing usually runs like this:

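  use XML::Simple;
  use HTML::Entities qw( %char2entity );

  my $data = eval { XMLin( $xml_data ); };
  if ($@) {
   print "$@, attempting to repair.\n";
   # encode everything outside the "safe" character class, then retry.
   $xml_data =~ s/([^\s!\#\$&%\'-;=?-~<>"])/$char2entity{$1}/g;
   $data = eval { XMLin( $xml_data ); };
   if ($@) { print "Nope. Still an error.\n"; }
  }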

You can probably adapt that to your needs.
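
If all you need is the "find them and throw an error" part of the original
question, a parser isn't strictly required. Here's a minimal sketch in core
Perl; find_bad_chars() and assert_clean() are names I made up, and "bad" is
assumed to mean anything outside tab, newline, carriage return and printable
ASCII, so adjust the character class to taste.

```perl
use strict;
use warnings;

# Hypothetical helper: return an [offset, code point] pair for every
# character outside tab, newline, CR and printable ASCII.
sub find_bad_chars {
    my ($text) = @_;
    my @bad;
    while ($text =~ /([^\t\n\r\x20-\x7E])/g) {
        push @bad, [ pos($text) - 1, ord($1) ];
    }
    return @bad;
}

# Hypothetical helper: die with a report if any bad characters turn up,
# otherwise return true.
sub assert_clean {
    my ($text) = @_;
    my @bad = find_bad_chars($text);
    return 1 unless @bad;
    my $report = join ', ', map { sprintf 'offset %d (0x%02X)', @$_ } @bad;
    die "Invalid characters at $report\n";
}

# Windows-1252 smart quotes, as raw bytes, around the word "hi".
my $html = "He said \x93hi\x94";
eval { assert_clean($html); 1 } or print "Error: $@";
```

Wrapping the call in eval, as with XMLin above, lets you decide whether to
bail out or attempt a repair.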


-- 
Morbus Iff ( softcore vulcan pr0n rulezzzzz )
http://www.disobey.com/ && http://www.gamegrene.com/
please me: http://www.amazon.com/exec/obidos/wishlist/25USVJDH68554
icq: 2927491 / aim: akaMorbus / yahoo: morbus_iff / jabber.org: morbus

