Re: [PHP-DEV] [RFC] Decoding HTML and the Ambiguous Ampersand

Dennis Snell Sat, 24 Aug 2024 13:38:53 -0700

> On Aug 24, 2024, at 7:47 AM, Christoph M. Becker <cmbecke...@gmx.de> wrote:
> 
> On 23.08.2024 at 01:02, Dennis Snell wrote:
> 
>>> If we could have a single implementation, that would be great. I do 
>>> understand of course your concern that DOM is not a required extension, and 
>>> therefore basing the internals on Lexbor makes it tied to the DOM extension 
>>> which may not be available. I however suspect that a large chunk of people 
>>> needing a function like this have DOM available (as DOM is required by many 
>>> HTML-processing-related packages). I can also look into it sometime soon if 
>>> you want; anyway feel free to ping me.
>> 
>> I’m also very open to lexbor-based approaches but I’ve so far found it more 
>> complicated than I expected. In some part this is because it involves 
>> setting up the parser and state machine for the HTML specification and much 
>> of the actual decoding can be safely done without this.
>> 
>> The other part is the extension aspect. I hear you, that you would expect 
>> calling code to have the DOM extensions available, but that’s simply not the 
>> case when developing a platform like WordPress, which I do. We don’t have 
>> control over the servers or environments where people are deploying this, 
>> and the availability of the DOM extensions is low enough that WordPress code 
>> simply cannot use `DOMDocument` (not that it should, given the serious 
>> problems `DOMDocument` has when attempting to parse HTML).
>> 
>> People resort to `html_entity_decode()` because that’s the only option. In 
>> WordPress we now have a spec-compliant decoder, but as it’s in user-space 
>> PHP its performance is far below what’s possible.
>> 
>> I’d love your help in setting up lexbor’s state machine to decode text 
>> nodes. I’d love it even more if this could be part of the PHP language. It 
>> constantly surprises me that _the language of the web_ (PHP) doesn’t have 
>> the tools to speak _the language of the web_ (HTML). This RFC is all about 
>> taking a step towards ensuring that PHP developers can rely on PHP to be a 
>> reliable middle-man between the HTML domain and the PHP domain.
>> 
>> In other words, requiring the DOM extension or `DOM\HtmlDocument` would be 
>> such a non-starter for WordPress (accounting for 43% of the web today) that 
>> it would be completely unavailable.
> 
> Well, I don't think it would be a big deal to move the bundled lexbor to
> somewhere where it is always available.  I mean, so far it's only used
> by ext/dom so it's bundled there, but if other parts of the php-src code
> base would use it, we could put it elsewhere.

Having a DOM parser for HTML in PHP itself without requiring an extension would 
open up many new possibilities. For example, WordPress test suites don’t have 
any functional “assertEquivalentMarkup()” functions because there’s no 
spec-compliant parser in PHP. We finally wrote our own user-space HTML parser, 
but relying on `DOM\HtmlDocument` would be much easier.

These test suites need to run on a variety of environments and PHP versions, so 
we could not switch to a native class quickly in any case. But if that class 
remains locked inside an optional extension, it may be borderline impossible to 
ever migrate to it.

> 
> Christoph
> 

Dennis Snell