On 23.08.2024 at 01:02, Dennis Snell wrote:

>> If we could have a single implementation, that would be great. I do 
>> understand of course your concern that DOM is not a required extension, and 
>> therefore basing the internals on Lexbor makes it tied to the DOM extension 
>> which may not be available. I however suspect that a large chunk of people 
>> needing a function like this have DOM available (as DOM is required by many 
>> HTML-processing-related packages). I can also look into it sometime soon if 
>> you want; anyway feel free to ping me.
>
> I’m also very open to lexbor-based approaches but I’ve so far found it more 
> complicated than I expected. In some part this is because it involves setting 
> up the parser and state machine for the HTML specification and much of the 
> actual decoding can be safely done without this.
>
> The other part is the extension aspect. I hear you, that you would expect 
> calling code to have the DOM extensions available, but that’s simply not the 
> case when developing a platform like WordPress, which I do. We don’t have 
> control over the servers or environments where people are deploying this, and 
> the availability of the DOM extensions is low enough that WordPress code 
> simply cannot use `DOMDocument` (though it shouldn’t anyway, given the wild 
> problems `DOMDocument` has when parsing HTML).
>
> People resort to `html_entity_decode()` because that’s the only option. In 
> WordPress we now have a spec-compliant decoder, but as it’s in user-space PHP 
> its performance is far below what’s possible.
>
> I’d love your help in setting up lexbor’s state machine to decode text nodes. 
> I’d love it even more if this could be part of the PHP language. It 
> constantly surprises me that _the language of the web_ (PHP) doesn’t have the 
> tools to speak _the language of the web_ (HTML). This RFC is all about taking 
> a step towards ensuring that PHP developers can rely on PHP to be a reliable 
> middle-man between the HTML domain and the PHP domain.
>
> In other words, requiring the DOM extension or `DOM\HtmlDocument` would be 
> such a non-starter for WordPress (accounting for 43% of the web today) that 
> it would be completely unavailable.

Well, I don't think it would be a big deal to move the bundled lexbor
somewhere where it is always available.  I mean, so far it's only used
by ext/dom, so it's bundled there, but if other parts of the php-src code
base were to use it, we could put it elsewhere.

Christoph
