Re: [xml] [PATCH] less-than character and HTML parser module

Chris Tapp Tue, 14 Apr 2015 09:35:26 -0700

> On 14 Apr 2015, at 15:24, Christian Schoenebeck <schoeneb...@crudebyte.com> 
> wrote:
> 
> On Tuesday 14 April 2015 09:31:25 Alex Bligh wrote:
>> On 13 Apr 2015, at 22:43, Christian Schoenebeck <schoeneb...@crudebyte.com>
>> wrote:
>>> I just encountered an issue with stand-alone less-than characters if the
>>> document is parsed by libxml2's HTML parser module. Consider you have a
>>> text in your HTML document like:
>>> 
>>>     a < b
>>> 
>>> The less-than sign in this case is interpreted by the HTML parser module
>>> as tag start, causing subsequent text (in this case "< b") to be
>>> dropped.
>> 
>> Isn't that correct? Shouldn't your document have
>> 
>>     a &lt; b
> 
> If it was a well-formed HTML document, then yes. But as said, in reality there
> are a load of HTML documents which contain text with raw less-than characters,
> supported by the fact that all major HTML browsers can handle it. libxml's
> HTML parser is yet an exception here.
> 
> Attached you find a patch, suggesting a fix for this issue.


If anything like this does get put in, it should only be as a configurable
option that is disabled by default. An XML parser should accept only strictly
conforming documents by default. Adding support for ‘broken’ HTML because
other (weak) parsers allow it is not a good plan, as it causes divergence from
the standard.
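
For reference, here is a minimal sketch of the two inputs under discussion: the
raw form "a < b" and the escaped form "a &lt; b". It uses Python's standard
library html.parser, not libxml2, so it says nothing about what libxml2 itself
does; it only illustrates how a lenient parser can treat a stand-alone
less-than sign as text. The names TextCollector and text_of are hypothetical
helpers made up for this illustration.

```python
from html import escape
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects character data and ignores tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Character data may arrive in several pieces, so keep them all.
        self.chunks.append(data)


def text_of(markup):
    """Return the concatenated text content of an HTML fragment."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


# The escaped form that a conforming document would contain.
escaped = escape("a < b")

# Text recovered from the raw and the escaped variants of the same paragraph.
raw_text = text_of("<p>a < b</p>")
escaped_text = text_of("<p>" + escaped + "</p>")

print(escaped)
print(repr(raw_text))
print(repr(escaped_text))
```

With this particular parser both variants should yield the same text, which is
the lenient behaviour the patch asks for; a stricter parser would only be
expected to handle the escaped variant.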

--

Chris Tapp
opensou...@keylevel.com
www.keylevel.com

----
You can tell you're getting older when your car insurance gets real cheap!


_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml
