Re: [xml] [PATCH] less-than character and HTML parser module

Daniel Veillard Mon, 29 Jun 2015 20:44:36 -0700

On Thu, Apr 16, 2015 at 04:32:32PM +0800, Daniel Veillard wrote:
> On Tue, Apr 14, 2015 at 05:43:42PM +0200, Christian Schoenebeck wrote:
> > On Tuesday 14 April 2015 17:50:51 you wrote:
> > > If anything like this does get put in, it should only be if it is a
> > > configurable option that is disabled by default - an xml parser should
> > > only accept a strictly-conforming document by default. Adding support for
> > > ‘broken’ html because other (weak) parsers allow it is not a good plan as
> > > it causes divergence from the standard.
> > 
> > There you go; you find the updated patch attached. It now requires the
> > HTML_PARSE_RECOVER option to be set for recovering from stand-alone
> > less-than characters.
> 
> That sounds fine *except* it doesn't raise an error.
> The parser knows it's a broken construct that must be pointed out.
> 
> thinkpad2:~/XML -> ./xmllint --html tst.html
> tst.html:3: HTML parser error : htmlParseStartTag: invalid element name
> <p> blah < booh </p>
>           ^
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
> <html>
> <body>
> <p> blah 
> </p>
> </body>
> </html>
> thinkpad2:~/XML -> ./xmllint --html --recover tst.html
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
> <html>
> <body>
> <p> blah &lt; booh </p>
> </body>
> </html>
> thinkpad2:~/XML -> 
> 
>  the fact that we worked around a broken start tag construct must be reported.
> Whether we do that with the recovery option or not is less important IMHO.
> 
>  It sounds a bit weird to handle that error case as one of the main content
> cases, I would still be tempted to go into htmlParseStartTag, get the
> error reported, but push corrective data instead in recover mode.
> 
>  Can we get a v3 ? :-)
> 
>   thanks
> 
> Daniel


  Okay, I did it.
It does what you expect: it doesn't rewind the input, it doesn't
modify the main content loop routine, and it raises the same error
message as when processed in non-recovery mode:

 
https://git.gnome.org/browse/libxml2/commit/?id=140c251e8e5653572edcca91b9d675f871735cb4

thinkpad:~/XML -> cat tst.html
<body>
<p>  a <b </p>
<p>  a < b </p>
<p> a < b> </p>
<p> a <0 </p>
<p> a <=0 </p>
</body>
thinkpad:~/XML -> ./xmllint --html --recover tst.html
tst.html:2: HTML parser error : error parsing attribute name
<p>  a <b </p>
          ^
tst.html:3: HTML parser error : htmlParseStartTag: invalid element name
<p>  a < b </p>
        ^
tst.html:4: HTML parser error : htmlParseStartTag: invalid element name
<p> a < b> </p>
       ^
tst.html:5: HTML parser error : htmlParseStartTag: invalid element name
<p> a <0 </p>
       ^
tst.html:6: HTML parser error : htmlParseStartTag: invalid element name
<p> a <=0 </p>
       ^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<p>  a <b>
</b></p>
<p>  a &lt; b </p>
<p> a &lt; b&gt; </p>
<p> a &lt;0 </p>
<p> a &lt;=0 </p>
</body>
</html>
thinkpad:~/XML -> 
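
The recovery behaviour visible in the transcript above (report the error, then
emit the stray '<' as character data) can be illustrated with a small
stand-alone sketch. This is not the libxml2 code from the commit: the function
name recover_lone_lt, the helper names, the error text and the simplified rule
for what may follow '<' are all assumptions made for illustration only.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Append src to dst if it fits; dst always stays NUL-terminated. */
static void append(char *dst, size_t dstsz, const char *src)
{
    size_t used = strlen(dst);
    size_t need = strlen(src);

    if (used + need < dstsz)
        memcpy(dst + used, src, need + 1);
}

/* Toy rule (an assumption, much simpler than the real parser):
 * '<' opens markup only when followed by a letter, '/', '!' or '?'. */
static int starts_markup(const char *p)
{
    unsigned char next = (unsigned char)p[1];

    return isalpha(next) || next == '/' || next == '!' || next == '?';
}

/* Hypothetical helper: copy `in` to `out`, reporting every stand-alone
 * '<' on stderr and replacing it with "&lt;". A '>' found outside a tag
 * is written as "&gt;". Returns the number of errors reported. */
int recover_lone_lt(const char *in, char *out, size_t outsz)
{
    int errors = 0;
    int in_tag = 0;

    assert(outsz > 0);
    out[0] = '\0';

    for (const char *p = in; *p != '\0'; p++) {
        char one[2] = { *p, '\0' };

        if (!in_tag && *p == '<') {
            if (starts_markup(p)) {
                in_tag = 1;
                append(out, outsz, one);
            } else {
                /* The broken construct is always reported ... */
                fprintf(stderr, "error: invalid element name at offset %ld\n",
                        (long)(p - in));
                errors++;
                /* ... and corrective data is pushed instead. */
                append(out, outsz, "&lt;");
            }
        } else if (*p == '>') {
            if (in_tag) {
                in_tag = 0;
                append(out, outsz, one);
            } else {
                append(out, outsz, "&gt;");
            }
        } else {
            append(out, outsz, one);
        }
    }
    return errors;
}
```

Called on the string "<p> a < b> </p>" with a large enough buffer, the sketch
reports one error and writes "<p> a &lt; b&gt; </p>". It does not attempt the
unterminated-tag case from line 2 of tst.html, which the real parser handles
separately.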

  Thanks for raising the issue and for the initial patches!

Daniel

-- 
Daniel Veillard      | Open Source and Standards, Red Hat
veill...@redhat.com  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | virtualization library  http://libvirt.org/
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml
