[PHP-DEV] Re: [RFC] [Discussion] DOM HTML5 parsing and serialization support

Dennis Snell via internals Mon, 04 Sep 2023 12:55:19 -0700

Thanks for the proposal Niels,

I’ve dealt with my own grief working through issues in DOMDocument and wanting 
it to work but finding it inadequate.

> HTML5

This would be a great starting point; I would love it if we took the 
opportunity to fix named character reference decoding, as PHP has (to my 
knowledge) never respected (at least for HTML5) that they decode differently 
inside attributes than they do inside markup, considering rules such as the 
ambiguous ampersand and decode errors.
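
To illustrate the attribute rule mentioned above, here is a minimal sketch. It 
assumes a tiny three-entry subset of the named character reference table, and 
the helper name decode_refs_sketch() is hypothetical, not an existing PHP or 
WordPress function. In HTML5, a reference that lacks its trailing semicolon and 
is followed by "=" or an ASCII alphanumeric is left untouched inside an 
attribute value, but is still decoded in ordinary markup text.

```php
<?php
// Hypothetical helper: illustrates the HTML5 rule for three legacy names only.
// A real decoder needs the full reference table and longest-match lookup.
function decode_refs_sketch(string $input, bool $in_attribute): string
{
    // Illustrative subset of named character references.
    $legacy = ['amp' => '&', 'copy' => "\u{A9}", 'not' => "\u{AC}"];

    return preg_replace_callback(
        // Group 2: optional semicolon. Group 3: next char if it is [A-Za-z0-9=].
        '/&(amp|copy|not)(;?)(?=([A-Za-z0-9=]?))/',
        function (array $m) use ($legacy, $in_attribute): string {
            $missing_semicolon = $m[2] === '';
            $followed_by_alnum_or_equals = isset($m[3]) && $m[3] !== '';

            // Attribute-only exception: leave the text exactly as written.
            if ($in_attribute && $missing_semicolon && $followed_by_alnum_or_equals) {
                return $m[0];
            }

            return $legacy[$m[1]];
        },
        $input
    );
}

// Same input, two contexts.
echo decode_refs_sketch('?a=1&not=2', true), "\n";  // attribute value context
echo decode_refs_sketch('?a=1&not=2', false), "\n"; // markup text context
```

With this sketch the query string survives unchanged when treated as an 
attribute value, while the same bytes in markup text have "&not" replaced by 
the NOT SIGN character.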

It’s also been frustrating that DOMDocument parses tags inside RCDATA elements 
such as TITLE or TEXTAREA, where tags cannot exist; escapes certain types of 
invalid comments so that they appear rendered in the saved document; and misses 
basic semantic rules (e.g. creating a BUTTON element as a child of a BUTTON 
element instead of closing out the already-open BUTTON).
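
One way to check the RCDATA behavior on a given installation is sketched below. 
It uses only DOMDocument calls from the dom extension; the sample markup is 
made up for illustration, and the result may vary with the libxml2 version PHP 
is linked against, so no particular output is claimed here.

```php
<?php
// Count element nodes that DOMDocument creates inside TITLE.
libxml_use_internal_errors(true); // keep parser warnings out of the output

// Made-up sample: the <b> inside TITLE is plain text under HTML5 rules.
$html = '<html><head><title>a <b>bold</b> title</title></head><body></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$title = $doc->getElementsByTagName('title')->item(0);

$element_children = 0;
foreach ($title->childNodes as $child) {
    if ($child instanceof DOMElement) {
        $element_children++;
    }
}

// An HTML5 parser yields 0 here, because TITLE holds a single text node.
echo "element children inside TITLE: {$element_children}\n";
echo $doc->saveHTML($title), "\n";
```

Any count above zero means the parser treated the text inside TITLE as markup.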

I’d like to share some of what a few of us have been working on inside 
WordPress, which is to build a conformant streaming HTML5 parser:
 - https://developer.wordpress.org/reference/classes/wp_html_tag_processor/
 - https://make.wordpress.org/core/2023/08/19/progress-report-html-api/

It’s just food for thought right now because adding HTML5 support to 
DOMDocument would benefit everyone, but we decided we had a common need in PHP 
to work with HTML not in a DOM, but in a streaming fashion, one with very 
little runtime overhead. My long-term plan has been to get a good grasp of the 
interface needs and thoroughly test it within the WordPress community and then 
propose its inclusion into PHP. It’s been incredibly handy so far, and on my 
laptop runs at around 20 MB/s, which is not great, but good enough for many 
needs. My naive C port runs on the same laptop at around 80 MB/s and I believe 
that we could likely triple or quadruple that speed again if any of us working 
on it knew how to take advantage of SIMD intrinsics.

It tries to accomplish a few goals:
 - be fast enough
 - interpret HTML as an HTML5-compliant browser will
 - find specific locations within an HTML document and then read or modify them
 - pass through any invalid HTML it encounters for the browser to resolve/fix 
unless modifying the part of the document containing those invalid constructions

I only bring up this different interface because once we started digging deep 
into DOMDocument we found that the problems with it were far from superficial; 
that there is a host of problems and a mismatched interface to our common 
needs. It has surprised me that PHP, the language of the web, has had such 
trouble handling HTML, the language of the web, and we wanted to completely 
resolve this issue once and for all within WordPress so we can clean up 
decades-old problems with encoding, decoding, security, and sanitization.

Warmly,
Dennis Snell

> On Sep 2, 2023, at 12:41 PM, Niels Dossche <dossche.ni...@gmail.com> wrote:
> 
> I'm opening the discussion for my RFC "DOM HTML5 parsing and serialization 
> support".
> https://wiki.php.net/rfc/domdocument_html5_parser
> 
> Kind regards
> Niels
