On Mon, Jun 29, 2015 at 05:36:58PM +0200, Stefan Behnel wrote:
> Bruce Miller schrieb am 28.05.2015 um 18:37:
> > On 05/28/2015 12:29 PM, Noam Postavsky wrote:
> >> On Thu, May 28, 2015 at 12:13 PM, Frank Gross wrote:
> >>>   Are there any plans to support parsing of HTML V5 in libxml ? I tried
> >>> function htmlCtxtReadMemory(), but it raises an error for HTML document
> >>> containing tags introduced in HTML V5 such as : Tag header invalid.
> > 
> > I'd love to see this happen!  I'm so used to the libxml2 tools,
> > and the tools built upon them, it would SO simplify my life.
> > 
> >> I think the same question has already been asked, and answered at
> >> https://mail.gnome.org/archives/xml/2013-April/msg00006.html
> > 
> > Sorta, yes. But HTML5 is essentially _defined_ by its parser rather than
> > by its spec. In particular the (annoying) way that it rewrites the DOM
> > to turn what you wrote into what it wants.  That being the case, there's
> > more to adapting libxml's HTML parser than just being more forgiving about
> > the unrecognized tags --- the resulting DOM might not be quite what HTML5
> > specifies!
> 
> I think most people would be happy if the new tags were recognised
> correctly, e.g. the self-closing ones. Whether or not the resulting DOM
> tree strictly conforms to HTML5 parsing - does it really matter that
> much?

 I assume that would not make us conformant, but that would make us less bad :-)

> 
> > Which is all to say that it's not quite trivial; would probably amount to
> > importing the "official" parser and modifying it to create libxml's internal
> > structure.  Sadly, Daniel doesn't have the time.   Nor, alas, do I.
> 
> There's a long list of tag metadata in the HTMLparser.c file. I'm sure a
> patch that adds just a couple of the new tags would be warmly appreciated.
> As long as everyone just goes "*I* don't have time ATM, not even to add one
> little tag", nothing's going to change here.

  Agreed, that's one way to do it, and based on my current work status
I don't see any "ample free time" coming any time soon, so we'd better be
very practical.
  Recognizing that a document is HTML5, extending the list of tag names
(did HTML5 deprecate some of those in HTML4?) and associated attributes
would be a relatively simple first step.
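
  To make that first step concrete, here is a rough standalone sketch, not
a patch against HTMLparser.c. Everything in it is an assumption made for
illustration: the html5_tag struct and the names looks_like_html5() and
html5_lookup() are hypothetical and are not libxml2 APIs, and the tag list
is only a partial sample of the elements HTML5 added. The doctype check
only accepts the short "<!DOCTYPE html>" form and ignores the legacy
"about:legacy-compat" variant.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical table entry; this is NOT libxml2's htmlElemDesc. */
typedef struct {
    const char *name;   /* lowercase element name */
    int is_void;        /* 1 if the element never takes an end tag */
} html5_tag;

/* Partial sample of elements introduced by HTML5 (assumed list, not complete). */
static const html5_tag html5_new_tags[] = {
    {"article", 0}, {"aside", 0},  {"audio", 0},      {"canvas", 0},
    {"details", 0}, {"embed", 1},  {"figcaption", 0}, {"figure", 0},
    {"footer", 0},  {"header", 0}, {"main", 0},       {"mark", 0},
    {"nav", 0},     {"section", 0},{"source", 1},     {"summary", 0},
    {"time", 0},    {"track", 1},  {"video", 0},      {"wbr", 1},
};

/* Case-insensitive equality of two NUL-terminated strings. */
static int ci_equal(const char *a, const char *b) {
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;   /* equal only if both ended together */
}

/* Case-insensitive test that s starts with prefix. */
static int ci_has_prefix(const char *s, const char *prefix) {
    while (*prefix) {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;  /* also covers s ending before prefix does */
        s++;
        prefix++;
    }
    return 1;
}

/* Return 1 if buf starts (after whitespace) with the short HTML5 doctype. */
int looks_like_html5(const char *buf) {
    assert(buf != NULL);
    while (isspace((unsigned char)*buf)) buf++;
    if (!ci_has_prefix(buf, "<!doctype")) return 0;
    buf += 9;                                  /* length of "<!doctype" */
    if (!isspace((unsigned char)*buf)) return 0;
    while (isspace((unsigned char)*buf)) buf++;
    if (!ci_has_prefix(buf, "html")) return 0;
    buf += 4;                                  /* length of "html" */
    while (isspace((unsigned char)*buf)) buf++;
    return *buf == '>';    /* a PUBLIC/SYSTEM identifier here means "no" */
}

/* Look up a tag name in the sample table; NULL if it is not listed. */
const html5_tag *html5_lookup(const char *name) {
    assert(name != NULL);
    size_t n = sizeof(html5_new_tags) / sizeof(html5_new_tags[0]);
    for (size_t i = 0; i < n; i++) {
        if (ci_equal(name, html5_new_tags[i].name))
            return &html5_new_tags[i];
    }
    return NULL;
}
```

  With something like this, a parser could call looks_like_html5() on the
start of the input and, when it matches, consult html5_lookup("header")
before reporting an unknown tag. The is_void flag is there to cover the
self-closing case mentioned above.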

  Is someone up to the task, or is there a list somewhere of HTML5 extensions
compared to HTML4?

   thanks,

Daniel

-- 
Daniel Veillard      | Open Source and Standards, Red Hat
veill...@redhat.com  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | virtualization library  http://libvirt.org/
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml
