Re: [Pharo-users] How should XMLHTMLParser handle strange HTML?

Michal Balda Fri, 03 Apr 2020 08:47:12 -0700

Hello Peter,

Those are called conditional comments. They come from MS Word, which is used as the HTML rendering engine for MS Outlook. There is not much documentation available online specifically for MS Word, but they were also implemented in older versions of MS Internet Explorer and commonly used by web designers to work around bugs and quirks in IE's rendering. See Wikipedia:

<https://en.wikipedia.org/wiki/Conditional_comment>


Or just search for "internet explorer conditional comments"; you will find plenty of resources.

The ones in your example are the "downlevel-revealed" sort of conditional comments, meaning that the content between the "if" and "endif" is visible to all browsers. The "if" and "endif" themselves are recognized by MS Word (and MS Internet Explorer) and evaluated as conditions, while they are ignored by other web browsers.

The syntax is based on the original SGML syntax, which is the precursor to HTML. In this form it is invalid in HTML, but standard browsers can handle it and do the meaningful thing. There exists an alternative form (also described by the Wikipedia page) which is valid HTML and still works as a conditional comment:


<!--[if !supportLists]><!-->
<!--<![endif]-->

Just converting it to an ordinary comment (by inserting the dashes) causes it to lose its meaning: it won't be recognized as a conditional any more, but if you don't need to open it in MS Word again, it doesn't matter.
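As a rough illustration of the two rewrites discussed above, here is a minimal Python sketch that preprocesses the HTML text before it reaches a parser. It is not part of XMLHTMLParser; the function names and the regular expression are assumptions made for this example, and the pattern only covers markers shaped like the two in this thread.

```python
import re

# Matches downlevel-revealed markers such as <![if !supportLists]> and
# <![endif]>. The pattern is an assumption based on the examples in this
# thread, not a complete grammar for conditional comments.
_MARKER = re.compile(r"<!\[(if\b[^\]>]*|endif)\]>", re.IGNORECASE)


def to_plain_comments(html: str) -> str:
    """Insert the dashes: <![if X]> becomes the ordinary comment <!--[if X]-->.

    The result is valid HTML, but the markers lose their conditional meaning.
    """
    return _MARKER.sub(lambda m: "<!--[" + m.group(1) + "]-->", html)


def to_valid_conditionals(html: str) -> str:
    """Rewrite markers to the valid form that still works as a conditional."""
    def repl(m: re.Match) -> str:
        body = m.group(1)
        if body.lower() == "endif":
            return "<!--<![endif]-->"
        return "<!--[" + body + "]><!-->"
    return _MARKER.sub(repl, html)


if __name__ == "__main__":
    sample = "<p><![if !supportLists]>1.<![endif]>Item</p>"
    print(to_plain_comments(sample))
    print(to_valid_conditionals(sample))
```

Markers already written in the valid form are left alone by both functions, so running the rewrite twice should be harmless.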

To answer your question: What should an HTML parser do? I think it depends on the use case. What XMLHTMLParser does now is wrong. To be correct, it could signal an error since it's invalid HTML (like an HTML validator would), or it could ignore the syntax error in an unknown element and continue parsing (like a browser would). Standard HTML processors choose the second approach and try to fix what they can to produce what they think is most meaningful. In this case they are smart enough to realize that it's probably meant to be a comment. To me, something like a resumable exception would be acceptable: one could make two wrappers, a strict one and a loose one, and choose the one that better fits the situation.

(An XML parser, on the other hand, must always signal an exception and abort parsing in case of a syntax error, as per the specification.)
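The strict/loose idea could look roughly like the Python sketch below. Python has no resumable exceptions, so a handler callback stands in for the resumption: the scanner reports each malformed declaration to the handler, which either raises (strict) or returns replacement text (loose). Every name here is hypothetical, and the scanner is a simple regular-expression pass over the text, not how XMLHTMLParser works internally.

```python
import re


class MalformedMarkup(Exception):
    """Raised by the strict policy when a malformed declaration is found."""

    def __init__(self, text: str, offset: int):
        super().__init__(f"malformed declaration {text!r} at offset {offset}")
        self.text = text
        self.offset = offset


# First alternative: a well-formed comment, consumed so its contents are
# never inspected. Second alternative: any other "<!...>" that is not a
# doctype. This is a simplification for the sketch (CDATA is not handled).
_DECL = re.compile(r"<!--.*?-->|<!(?!doctype\b)[^>]*>", re.IGNORECASE | re.DOTALL)


def raise_error(text: str, offset: int) -> str:
    # Strict policy: behave like a validator and abort.
    raise MalformedMarkup(text, offset)


def as_comment(text: str, offset: int) -> str:
    # Loose policy: behave like a browser and treat the text as a comment.
    return "<!--" + text[2:-1] + "-->"


def scan(html: str, on_error) -> str:
    """Pass well-formed comments through; hand anything else to on_error."""
    def repl(m: re.Match) -> str:
        text = m.group(0)
        if text.startswith("<!--"):
            return text
        return on_error(text, m.start())
    return _DECL.sub(repl, html)


def check_strict(html: str) -> str:
    return scan(html, raise_error)


def repair_loose(html: str) -> str:
    return scan(html, as_comment)
```

A caller would pick `check_strict` when invalid input should be rejected and `repair_loose` when the goal is to get a usable parse out of whatever arrives.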



Michal



On 2.4.2020 19:16, PBKResearch wrote:

Hello
I have come across a strange problem in using XMLHTMLParser to parse some HTML files which use strange constructions. The input files have been generated by using MS Outlook to translate incoming messages, stored in .msg files, into HTML. The translated files display normally in Firefox, and the XMLHTMLParser appears to generate a normal parse, but examination of the parse output shows that the structure is distorted, and about half the input text has been put into one string node.
Hunting around, I am convinced that the trouble lies in the presence in the HTML source of pairs of comment-like tags, with this form:
<![if !supportLists]>

<![endif]>
since the distorted parse starts at the first occurrence of one of these tags.
I don’t know whether these are meant to be a structure in some programming language – there is no reference to supportLists anywhere in the source code. When it is displayed in Firefox, use of the ‘Inspect Element’ option shows that the browser has treated them as comments, displaying them with the necessary dashes as e.g. <!--[if !supportLists]-->. I edited the source code by inserting the dashes, and XMLHTMLParser parsed everything correctly.
I have a workaround, therefore; either edit in the dashes to make them into legitimate comments, or equivalently edit out these tags completely. The only question of general interest is whether XMLHTMLParser should be expected to handle these in some other way, rather than produce a distorted parse without comment. The Firefox approach, turning them into comments, seems sensible. It would also be interesting if anyone has any idea what is going on in the source code.
Thanks for any help

Peter Kenny
