Hi internals

The DOM extension in PHP is used to parse, query and manipulate XML/HTML 
documents. The DOM extension is based on the DOM specification.
Originally this was the DOM Core Level 3 specification, but nowadays, that 
specification has evolved into the current "Living Specification" maintained by 
WHATWG.

Unfortunately, there are many bugs in PHP's DOM extension. Most of those bugs 
are related to namespace and attribute handling. People then try to 
work around those bugs by relying on other bugs, or on undocumented side effects 
of incorrect behaviour, which causes even more issues in the end. Furthermore, 
some of these bugs may have security implications [1].

Some of these bugs exist because a method or property was implemented 
incorrectly back in the day, or because the original specification was 
unclear. A smaller share exists because the specification made breaking 
changes when HTML 5 first came along and the specification authors had to 
unify what browsers implemented into a single specification that everyone 
agreed on.

It's not possible to "just fix" these bugs because people actually _rely_ on 
these bugs. They are also often unaware that what they're doing is actually 
incorrect or causes the internal document state to be inconsistent. We 
therefore have to fix this in a backwards-compatible way: i.e. a hard 
requirement is that all code written for the current DOM extension keeps 
working without requiring changes.
In short: the main problem is that 20 years of buggy behaviour means that the 
bugs have become ingrained into the system.

Some people have implemented userland DOM libraries on top of the existing DOM 
extension. However, even userland solutions can't fully work around issues 
caused by PHP's DOM extension. The real solution is to provide a BC-preserving 
fix at PHP's side.

Roughly 1.5 months ago I merged my HTML 5 RFC [2] into the PHP 8.4 development 
branch. This RFC introduced new document classes: DOM\HTMLDocument and 
DOM\XMLDocument. The idea here was to preserve backwards compatibility: if the 
user wants to keep using HTML 4, they can keep using the DOMDocument class. 
Users who want to work with HTML 5 and currently rely on workarounds can 
migrate to the new classes at their own pace (without deprecations or 
anything). New code can use DOM\{HTML,XML}Document from the 
start without touching the old classes.

The HTML 5 RFC has left us with an interesting opportunity to also introduce 
the spec bugfixes in a BC-preserving way. The idea is that when the new 
DOM\{HTML,XML}Document classes are used, then the DOM extension will follow the 
DOM specification and therefore get rid of bugs. When you are using the 
DOMDocument class, the old implementations will be used. This means that 
backwards compatibility is kept.
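
To illustrate the opt-in nature of this, here is a minimal sketch. It assumes the `createFromString()` factory from the HTML 5 RFC is available on `DOM\XMLDocument`; on PHP versions without the new classes the second half is simply skipped.

```php
<?php
// Existing code path: DOMDocument keeps its current (legacy) behaviour.
$legacy = new DOMDocument();
$legacy->loadXML('<root/>');

// Opt-in code path: only taken when the new document classes exist.
// Assumption: createFromString() is the factory introduced by the HTML 5 RFC.
$modern = null;
if (class_exists('DOM\XMLDocument')) {
    $modern = DOM\XMLDocument::createFromString('<root/>');
}

echo $legacy->documentElement->nodeName, "\n";
```

Nothing changes for code that never touches the new classes; spec-compliant behaviour is selected purely by which document class is instantiated.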

For the past 2.5 weeks I've been working on getting all spec bugs that I know 
of fixed. The full list of bugs that this proposal fixes can be found here: 
https://github.com/nielsdos/php-src/blob/dom-spec-compliance-pub/bugs.md. I 
also found some discussion [3] from some years ago where C. Scott shared a list 
of problems they encountered at Wikimedia [4]. All behavioural issues are fixed 
in my PR [5], although it could always use more testing. So far I have 
verified that existing DOM code does not break (tested against veewee's XML 
library, the Mensbeam library, and some SimpleSAML libraries). I have also added 
tests for the new spec-compliant behaviour, and I ported some of the WHATWG's WPT 
DOM tests (the DOM spec-compliance test suite) to PHP; those that I've ported all 
pass [6].

Implementation PR can be found here: https://github.com/php/php-src/pull/13031

Note that this is not a new extension, but an improvement to the existing DOM 
extension. As for "why not an entirely new extension?", please see the 
reasoning in my HTML 5 RFC. All interactions with SimpleXML, XSL, XPath etc 
will remain possible like you are used to. Implementation-wise, a lot of code 
internally is shared between the spec-compliant and old implementations.

I intend to put this up for RFC. There is however one last detail that needs to 
be cleared up: what about "type issues"?
To give an example of a "type issue": there is a `string DOMNode::$prefix` 
property. The DOM spec tells us that this should be nullable: when a node has 
no prefix, the property should be NULL. However, because the property is typed 
as string, PHP currently returns an empty string instead. Not a big deal 
maybe, but there are many of these subtle inconsistencies: null vs false 
return values, arguments that should accept `?string` instead of `string`, etc.
Sadly, it's not possible to fix the typing issues for properties and methods 
for DOMNode, DOMElement, ... because of BC: properties and methods can be 
overridden.
Or is it?
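
To make the `$prefix` example concrete, this small sketch shows what the current DOMDocument classes return. The document content and the `urn:a` namespace are made up for illustration.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<root xmlns:a="urn:a"><a:child/></root>');

$root  = $doc->documentElement; // element without a prefix
$child = $root->firstChild;     // element with the prefix "a"

var_dump($root->prefix);  // string(0) "" (the DOM spec says null here)
var_dump($child->prefix); // string(1) "a"
```

Code that checks `$node->prefix === ''` today would behave differently if the property suddenly returned null, which is why the type cannot simply be changed on DOMNode.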

Currently, as a result of the HTML 5 RFC, the new DOM\{HTML,XML}Document 
classes keep using the DOMNode, DOMElement, ... classes.
For consistency, DOMNode and the other classes were aliased into the DOM 
namespace, i.e. DOM\Node is an alias for DOMNode, DOM\Element is an alias for 
DOMElement, etc.
Because DOM\Node is an alias, fixing its types is not possible: it's really 
just another name for DOMNode, so changing it for DOM\Node means changing it 
for DOMNode.
_Unless_ we no longer alias the classes but make them proper classes instead. 
This means we can fix the typing for DOM\Node while keeping DOMNode untouched, 
preserving BC. The downside is that interoperability becomes more 
difficult. One of the reasons the HTML 5 RFC introduced aliases instead 
of proper classes is so that code taking a DOMNode as an argument could also be 
passed a DOM\Node. However, if we make it a proper class instead, such code has 
to either transition fully to the new DOM classes _or_ use a type union, e.g. 
DOMNode|DOM\Node.
My preference is to make them proper classes instead of aliases: either we fix 
everything in one go now while we have the opportunity, or never.
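
As a sketch of what the type union would look like in userland code that wants to accept both class hierarchies: the function name `describeNode` is hypothetical, and this assumes PHP 8.0+ for union types. If DOM\Node does not exist as a class, the union still works for DOMNode arguments.

```php
<?php
// Hypothetical helper that accepts nodes from either class hierarchy.
function describeNode(\DOMNode|\DOM\Node $node): string
{
    // nodeName is defined by the DOM spec, so both hierarchies expose it.
    return $node->nodeName;
}

$doc = new DOMDocument();
$doc->loadXML('<root/>');

echo describeNode($doc->documentElement), "\n"; // prints "root"
```

Libraries that only ever receive nodes from one document class would not need the union and could keep their existing `DOMNode` type declarations.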

Let me know what you think, especially regarding the type issues.

Kind regards
Niels

[1] https://github.com/php/php-src/issues/8388
[2] https://wiki.php.net/rfc/domdocument_html5_parser
[3] https://externals.io/message/104687
[4] https://www.mediawiki.org/wiki/Parsoid/PHP/Help_wanted
[5] https://github.com/php/php-src/pull/13031
[6] https://github.com/nielsdos/wpt/tree/master/dom/php-out (yes, this is a 
dirty port)

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php
