Hi! I just ran into an issue with stand-alone less-than characters when a document is parsed by libxml2's HTML parser module. Suppose your HTML document contains text like:
a < b

The less-than sign in this case is interpreted by the HTML parser module as the start of a tag, causing the subsequent text (in this case "< b") to be dropped. Raw less-than signs like this are not well-formed HTML; in practice, however, they often occur in text sections of HTML files, and browsers cope with them.

If that is welcome, I would provide a patch to address this issue. My suggestion: if the character following the less-than character is one of (' ' | '\n' | '\r' | '\t' | 0 | '='), the token is interpreted as text, not as an element.
Relevant code section: HTMLparser.c -> htmlParseContent()

Another option would be to restore the original read position if htmlParseHTMLName() fails. Currently the entire supposed element is dropped.
Relevant code section: HTMLparser.c -> htmlParseStartTag()

Best regards,
Christian Schoenebeck
_______________________________________________
xml mailing list, project page http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml
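
For illustration, the first suggestion (deciding by the character that follows the less-than sign) could be sketched as a small stand-alone predicate. This is only a sketch and not the proposed patch: the function name lt_is_text() is hypothetical, it does not use libxml2's parser context, and it assumes that the "0" in the character set above means a NUL byte, i.e. end of input.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of libxml2.
 * Returns 1 if the '<' at p[0] should be kept as literal text,
 * judging only by the character that follows it; 0 otherwise.
 */
int lt_is_text(const char *p)
{
    /* Only applies when we are actually looking at a '<'. */
    if (p == NULL || p[0] != '<')
        return 0;

    switch (p[1]) {
    case ' ':
    case '\n':
    case '\r':
    case '\t':
    case '\0':  /* assumption: "0" in the list means end of input */
    case '=':
        return 1;   /* treat "<" as text */
    default:
        return 0;   /* leave it to the normal tag parsing */
    }
}
```

With the example from above, lt_is_text("< b") returns 1, so the less-than sign would be kept as text, while lt_is_text("<b>") returns 0 and would still be handled as a tag start.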