On 23.08.2024 at 01:02, Dennis Snell wrote:
>> If we could have a single implementation, that would be great. I do
>> understand of course your concern that DOM is not a required extension, and
>> therefore basing the internals on Lexbor makes it tied to the DOM extension
>> which may not be available. I however suspect that a large chunk of people
>> needing a function like this have DOM available (as DOM is required by many
>> HTML-processing-related packages). I can also look into it sometime soon if
>> you want; anyway feel free to ping me.
>
> I’m also very open to lexbor-based approaches but I’ve so-far found it more
> complicated than I expected. In some part this is because it involves setting
> up the parser and state machine for the HTML specification and much of the
> actual decoding can be safely done without this.
>
> The other part is the extension aspect. I hear you, that you would expect
> calling code to have the DOM extensions available, but that’s simply not the
> case when developing a platform like WordPress, which I do. We don’t have
> control over the servers or environments where people are deploying this, and
> the availability of the DOM extensions is low enough that WordPress code
> simply cannot use `DOMDocument` (even though it shouldn’t because of the wild
> problems that has for attempting to parse HTML).
>
> People resort to `html_entity_decode()` because that’s the only option. In
> WordPress we now have a spec-compliant decoder, but as it’s in user-space PHP
> its performance is far below what’s possible.
>
> I’d love your help in setting up lexbor’s state machine to decode text nodes.
> I’d love it even more if this could be part of the PHP language. It
> constantly surprises me that _the language of the web_ (PHP) doesn’t have the
> tools to speak _the language of the web_ (HTML). This RFC is all about taking
> a step towards ensuring that PHP developers can rely on PHP to be a reliable
> middle-man between the HTML domain and the PHP domain.
>
> In other words, requiring the DOM extension or `DOM\HtmlDocument` would be
> such a non-starter for WordPress (accounting for 43% of the web today) that
> it would be completely unavailable.

Well, I don't think it would be a big deal to move the bundled lexbor to
somewhere where it is always available. I mean, so far it's only used by
ext/dom so it's bundled there, but if other parts of the php-src code base
would use it, we could put it elsewhere.

Christoph