> On 14 Apr 2015, at 15:24, Christian Schoenebeck <schoeneb...@crudebyte.com> wrote:
>
> On Tuesday 14 April 2015 09:31:25 Alex Bligh wrote:
>> On 13 Apr 2015, at 22:43, Christian Schoenebeck <schoeneb...@crudebyte.com> wrote:
>>> I just encountered an issue with stand-alone less-than characters if the
>>> document is parsed by libxml2's HTML parser module. Consider you have a
>>> text in your HTML document like:
>>>
>>> a < b
>>>
>>> The less-than sign in this case is interpreted by the HTML parser module
>>> as tag start, causing subsequent text (in this case "< b") to be
>>> dropped.
>>
>> Isn't that correct? Shouldn't your document have
>>
>> a &lt; b
>
> If it was a well-formed HTML document, then yes. But as said, in reality there
> are a load of HTML documents which contain text with raw less-than characters,
> supported by the fact that all major HTML browsers can handle it. libxml's
> HTML parser is yet an exception here.
>
> Attached you find a patch, suggesting a fix for this issue.
If anything like this does get put in, it should only be as a configurable
option that is disabled by default: an XML parser should only accept a
strictly conforming document by default. Adding support for ‘broken’ HTML
because other (weak) parsers allow it is not a good plan, as it causes
divergence from the standard.

--
Chris Tapp
opensou...@keylevel.com
www.keylevel.com

----
You can tell you're getting older when your car insurance gets real cheap!
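
To illustrate the opt-in idea, here is a minimal sketch in Python of a pre-pass that rejects a stray less-than sign by default and escapes it only when leniency is explicitly requested. The function name `prepare_text`, the `lenient_lt` parameter, and the rule for what counts as a stray `<` are all hypothetical; none of this is libxml2 API or the behaviour of the attached patch.

```python
import re

# Assumption for this sketch: a '<' is "stray" when it is not followed by
# a letter, '/', '!' or '?', i.e. it cannot open a tag, comment or PI.
_STRAY_LT = re.compile(r"<(?![A-Za-z/!?])")


def prepare_text(markup: str, lenient_lt: bool = False) -> str:
    """Hypothetical pre-pass, not part of libxml2.

    Strict by default: a stray '<' raises ValueError.
    With lenient_lt=True, each stray '<' is rewritten to '&lt;'.
    """
    if not lenient_lt:
        if _STRAY_LT.search(markup):
            raise ValueError("stray '<' in input; escape it as '&lt;'")
        return markup
    return _STRAY_LT.sub("&lt;", markup)
```

With the default settings, `prepare_text("a < b")` raises `ValueError`; only a caller that passes `lenient_lt=True` gets `"a &lt; b"` back, so the strict behaviour stays the default.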
_______________________________________________
xml mailing list, project page http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml