Re: Validating XML email

Kenneth Porter Fri, 24 Oct 2008 12:21:46 -0700

I found that "tidy -eq" gives a pretty good result. To normalize the score, I figure it makes sense to divide the number of lines tidy outputs by the byte count of the input file.
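A rough sketch of that normalization in Python, for illustration only. The function names error_density and score_file are made up here, and the sketch assumes tidy is on the PATH, reads the message from stdin, and that every non-empty line it prints with -eq is one complaint.

```python
import subprocess


def error_density(tidy_output: str, input_size: int) -> float:
    """Return non-empty diagnostic lines per input byte (hypothetical helper)."""
    if input_size <= 0:
        return 0.0
    lines = [ln for ln in tidy_output.splitlines() if ln.strip()]
    return len(lines) / input_size


def score_file(path: str) -> float:
    """Run `tidy -eq` on a file and normalize its line count by file size."""
    with open(path, "rb") as f:
        data = f.read()
    # tidy exits non-zero when it finds warnings or errors, so the return
    # code is deliberately ignored. stderr is merged into stdout so that
    # diagnostics are captured wherever this build of tidy writes them
    # (an assumption, not something the original post states).
    proc = subprocess.run(
        ["tidy", "-eq"],
        input=data,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    return error_density(proc.stdout.decode("utf-8", "replace"), len(data))
```

Keeping the arithmetic in error_density separate from the subprocess call means the scoring can be checked without tidy installed.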

I ran some MS Outlook output through and the most frequent complaint was the unknown tag <o:p>, but there was also a nesting issue involving <span>, <font>, and <hr>. (I'm guessing tidy doesn't understand namespaces or how to load the MS Office namespace needed to resolve <o:p>.)

Some known spam generated a much higher result, about 0.003 errors/character versus 0.001 for the Outlook email. But this wasn't a real sample. For that I'd need to generate a plugin wrapper for tidy and run it over a corpus. (I've got the beginnings of such a plugin coded, based on the PDFassassin plugin, which in turn is based on the Ocr plugin.)
