On Mon, Jun 29, 2015 at 05:36:58PM +0200, Stefan Behnel wrote:
> Bruce Miller wrote on 28.05.2015 at 18:37:
> > On 05/28/2015 12:29 PM, Noam Postavsky wrote:
> >> On Thu, May 28, 2015 at 12:13 PM, Frank Gross wrote:
> >>> Are there any plans to support parsing of HTML V5 in libxml? I tried
> >>> function htmlCtxtReadMemory(), but it raises an error for an HTML document
> >>> containing tags introduced in HTML V5 such as: Tag header invalid.
> >
> > I'd love to see this happen! I'm so used to the libxml2 tools,
> > and the tools built upon them, it would SO simplify my life.
> >
> >> I think the same question has already been asked, and answered at
> >> https://mail.gnome.org/archives/xml/2013-April/msg00006.html
> >
> > Sorta, yes. But HTML5 is essentially _defined_ by its parser rather than
> > by its spec. In particular the (annoying) way that it rewrites the DOM
> > to turn what you wrote into what it wants. That being the case, there's
> > more to adapting libxml's HTML parser than just being more forgiving about
> > the unrecognized tags --- the resulting DOM might not be quite what HTML5
> > specifies!
>
> I think most people would be happy if the new tags were recognised
> correctly, e.g. the self-closing ones. Whether or not the resulting DOM
> tree strictly conforms to HTML5 parsing - does it really matter that
> much?

I assume that would not make us conformant, but that would make us less bad :-)

> > Which is all to say that it's not quite trivial; would probably amount to
> > importing the "official" parser and modifying it to create libxml's internal
> > structure. Sadly, Daniel doesn't have the time. Nor, alas, do I.
>
> There's a long list of tag metadata in the HTMLparser.c file. I'm sure a
> patch that adds just a couple of the new tags would be warmly appreciated.
> As long as everyone just goes "*I* don't have time ATM, not even to add one
> little tag", nothing's going to change here.

Agreed, that's one way to do it, and based on my current work status
I don't see any "ample free time" coming any time soon, so we'd better be
very practical. Recognizing that a document is HTML5, then extending the list
of tag names (did HTML5 deprecate some of those in HTML4?) and associated
attributes, would be a relatively simple first step.
Is someone up to the task, or is there a list somewhere of the HTML5
extensions compared to HTML4?

thanks,

Daniel

-- 
Daniel Veillard      | Open Source and Standards, Red Hat
veill...@redhat.com  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | virtualization library  http://libvirt.org/
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml
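
To make the "extend the list of tag names" first step a bit more concrete, here is a minimal sketch of what a table of HTML5 additions with a case-insensitive lookup could look like. Everything in it is an assumption made for illustration: the struct `html5_tag_info`, the table `html5_new_tags` and the function `html5_tag_lookup` are hypothetical names, the entry layout does not reproduce the actual table in HTMLparser.c, and the list of tags is only a small sample of the elements HTML5 added, not a complete inventory.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical, simplified tag metadata entry (NOT libxml2's real layout). */
typedef struct {
    const char *name;  /* element name, stored in lower case */
    int is_void;       /* 1 if the element never takes an end tag */
} html5_tag_info;

/* A small sample of elements introduced by HTML5 (deliberately incomplete). */
static const html5_tag_info html5_new_tags[] = {
    { "article", 0 }, { "aside", 0 },  { "audio", 0 },  { "canvas", 0 },
    { "figcaption", 0 }, { "figure", 0 }, { "footer", 0 }, { "header", 0 },
    { "main", 0 },    { "mark", 0 },   { "nav", 0 },    { "section", 0 },
    { "time", 0 },    { "video", 0 },
    /* void ("self-closing") elements */
    { "source", 1 },  { "track", 1 },  { "wbr", 1 },
};

/* Compare two tag names ignoring ASCII case; returns 1 when they match. */
static int names_equal_nocase(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/* Return the entry for a tag name, or NULL if it is not in the sample table. */
static const html5_tag_info *html5_tag_lookup(const char *name)
{
    const size_t count = sizeof(html5_new_tags) / sizeof(html5_new_tags[0]);
    size_t i;

    if (name == NULL)
        return NULL;
    for (i = 0; i < count; i++) {
        if (names_equal_nocase(name, html5_new_tags[i].name))
            return &html5_new_tags[i];
    }
    return NULL;
}
```

With a table like this, a parser could call `html5_tag_lookup("header")` before reporting an unknown tag, and consult `is_void` to decide whether to expect an end tag. A real patch would presumably add entries to the existing table in HTMLparser.c instead of keeping a separate one.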