On Tue, Apr 14, 2015 at 05:43:42PM +0200, Christian Schoenebeck wrote:
> On Tuesday 14 April 2015 17:50:51 you wrote:
> > If anything like this does get put in, it should only be if it is a
> > configurable option that is disabled by default - an xml parser should
> > only accept a strictly-conforming document by default. Adding support for
> > ‘broken’ html because other (weak) parsers allow it is not a good plan as
> > it causes divergence from the standard.
>
> There you go; you find the updated patch attached. It now requires
> HTML_PARSE_RECOVER option to be set for recovering from stand-alone less-than
> characters.
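
For readers following along, the check that the quoted patch adds in front of the "sub-element" case can be illustrated in isolation. This is only a minimal sketch: `is_blank_ch` and `is_standalone_lt` are hypothetical helpers written for this note, not libxml2 functions, and the `recovery` argument merely stands in for the parser's recover flag.

```c
#include <stddef.h>

/* Hypothetical stand-in for a blank test: space, tab, LF, CR. */
static int is_blank_ch(char c) {
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

/*
 * Hypothetical helper, not part of libxml2.
 * Returns 1 when p points at a '<' that would be kept as text under the
 * rule in the patch: recovery is enabled and the character after '<' is
 * blank or '='. Returns 0 in every other case.
 */
static int is_standalone_lt(const char *p, int recovery) {
    if (!recovery || p == NULL || p[0] != '<')
        return 0;
    return is_blank_ch(p[1]) || p[1] == '=';
}
```

With recovery enabled, `is_standalone_lt("< booh", 1)` and `is_standalone_lt("<= 3", 1)` are true while `is_standalone_lt("<p>", 1)` is not; with recovery disabled the helper never matches.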

  That sounds fine *except* it doesn't raise an error. The parser
knows it's a broken construct, and that must be pointed out.

thinkpad2:~/XML -> ./xmllint --html tst.html
tst.html:3: HTML parser error : htmlParseStartTag: invalid element name
<p> blah < booh </p>
          ^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<p> blah </p>
</body>
</html>
thinkpad2:~/XML -> ./xmllint --html --recover tst.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<p> blah < booh </p>
</body>
</html>
thinkpad2:~/XML ->

  The fact that we worked around a broken start tag construct must be
reported. Whether we do that with the recovery option or not is less
important IMHO.
  It sounds a bit weird to handle that error case as one of the main
content cases. I would still be tempted to go into htmlParseStartTag,
get the error reported, but push corrective data instead in recover mode.

  Can we get a v3? :-)

   thanks,

Daniel

> Best regards,
> Christian Schoenebeck

> diff -u libxml2-2.9.1+dfsg1.orig/HTMLparser.c libxml2-2.9.1+dfsg1/HTMLparser.c
> --- libxml2-2.9.1+dfsg1.orig/HTMLparser.c 2015-04-14 13:05:01.000000000 +0200
> +++ libxml2-2.9.1+dfsg1/HTMLparser.c 2015-04-14 18:22:41.143973776 +0200
> @@ -2948,8 +2948,10 @@
>
>
>  /**
> - * htmlParseCharData:
> + * htmlParseCharDataInternal:
>   * @ctxt: an HTML parser context
> + * @prep: optional character to be prepended to text, 0 if no character
> + *        shall be prepended
>   *
>   * parse a CharData section.
>   * if we are within a CDATA section ']]>' marks an end of section.
> @@ -2958,12 +2960,15 @@
>   */
>
>  static void
> -htmlParseCharData(htmlParserCtxtPtr ctxt) {
> -    xmlChar buf[HTML_PARSER_BIG_BUFFER_SIZE + 5];
> +htmlParseCharDataInternal(htmlParserCtxtPtr ctxt, char prep) {
> +    xmlChar buf[HTML_PARSER_BIG_BUFFER_SIZE + 6];
>      int nbchar = 0;
>      int cur, l;
>      int chunk = 0;
>
> +    if (prep)
> +        buf[nbchar++] = prep;
> +
>      SHRINK;
>      cur = CUR_CHAR(l);
>      while (((cur != '<') || (ctxt->token == '<')) &&
> @@ -3043,6 +3048,21 @@
>  }
>
>  /**
> + * htmlParseCharData:
> + * @ctxt: an HTML parser context
> + *
> + * parse a CharData section.
> + * if we are within a CDATA section ']]>' marks an end of section.
> + *
> + * [14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
> + */
> +
> +static void
> +htmlParseCharData(htmlParserCtxtPtr ctxt) {
> +    htmlParseCharDataInternal(ctxt, 0);
> +}
> +
> +/**
>   * htmlParseExternalID:
>   * @ctxt: an HTML parser context
>   * @publicID: a xmlChar** receiving PubidLiteral
> @@ -4157,14 +4177,24 @@
>              }
>
>              /*
> -             * Third case : a sub-element.
> +             * Third case : (unescaped) stand-alone less-than character.
> +             * Only if HTML_PARSE_RECOVER option is set.
> +             */
> +            else if (ctxt->recovery && (CUR == '<') &&
> +                     (IS_BLANK_CH(NXT(1)) || (NXT(1) == '='))) {
> +                NEXT;
> +                htmlParseCharDataInternal(ctxt, '<');
> +            }
> +
> +            /*
> +             * Fourth case : a sub-element.
>               */
>              else if (CUR == '<') {
>                  htmlParseElement(ctxt);
>              }
>
>              /*
> -             * Fourth case : a reference. If if has not been resolved,
> +             * Fifth case : a reference. If if has not been resolved,
>               * parsing returns it's Name, create the node
>               */
>              else if (CUR == '&') {
> @@ -4172,7 +4202,7 @@
>              }
>
>              /*
> -             * Fifth case : end of the resource
> +             * Sixth case : end of the resource
>               */
>              else if (CUR == 0) {
>                  htmlAutoCloseOnEnd(ctxt);
> @@ -4567,7 +4597,17 @@
>              }
>
>              /*
> -             * Third case : a sub-element.
> +             * Third case : (unescaped) stand-alone less-than character.
> +             * Only if HTML_PARSE_RECOVER option is set.
> +             */
> +            else if (ctxt->recovery && (CUR == '<') &&
> +                     (IS_BLANK_CH(NXT(1)) || (NXT(1) == '='))) {
> +                NEXT;
> +                htmlParseCharDataInternal(ctxt, '<');
> +            }
> +
> +            /*
> +             * Fourth case : a sub-element.
>               */
>              else if (CUR == '<') {
>                  htmlParseElementInternal(ctxt);
> @@ -4578,7 +4618,7 @@
>              }
>
>              /*
> -             * Fourth case : a reference. If if has not been resolved,
> +             * Fifth case : a reference. If if has not been resolved,
>               * parsing returns it's Name, create the node
>               */
>              else if (CUR == '&') {
> @@ -4586,7 +4626,7 @@
>              }
>
>              /*
> -             * Fifth case : end of the resource
> +             * Sixth case : end of the resource
>               */
>              else if (CUR == 0) {
>                  htmlAutoCloseOnEnd(ctxt);
> _______________________________________________
> xml mailing list, project page http://xmlsoft.org/
> xml@gnome.org
> https://mail.gnome.org/mailman/listinfo/xml

-- 
Daniel Veillard      | Open Source and Standards, Red Hat
veill...@redhat.com  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | virtualization library  http://libvirt.org/
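
The alternative suggested in the reply (report the error from the start-tag path in every mode, and push corrective data only in recover mode) might look roughly like the toy sketch below. Everything in it is an assumption made for illustration: `toy_ctxt`, its fields, and `toy_bad_start_tag` are hypothetical and do not correspond to libxml2 structures or to any actual v3 of the patch.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical toy context; field names are illustrative, not libxml2's. */
typedef struct {
    int recovery;   /* non-zero when recover mode was requested */
    int errors;     /* number of errors reported so far */
    char text[64];  /* character data pushed to the toy output */
} toy_ctxt;

/*
 * Hypothetical handler for a '<' that is not followed by a valid element
 * name. The error is always counted and printed. Only in recover mode is
 * a literal '<' appended to the text as corrective data.
 * Returns 1 if corrective data was pushed, 0 otherwise.
 */
static int toy_bad_start_tag(toy_ctxt *ctxt) {
    size_t len;

    /* Report the broken construct regardless of mode. */
    ctxt->errors++;
    fprintf(stderr, "toy parser error: invalid element name\n");

    if (!ctxt->recovery)
        return 0;

    /* Recover mode: keep the '<' as character data if there is room. */
    len = strlen(ctxt->text);
    if (len + 1 >= sizeof(ctxt->text))
        return 0;
    ctxt->text[len] = '<';
    ctxt->text[len + 1] = '\0';
    return 1;
}
```

Starting from a context whose `text` is `"blah "`, a call in recover mode leaves `"blah <"` and one recorded error; the same call without recover mode leaves the text untouched but still records the error, which is the reporting behaviour asked for above.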