> On Sep 19, 2023, at 12:30 AM, Tim Düsterhus <t...@bastelstu.be> wrote:
> 
> From the perspective of the user of the API, I like the symmetry between all 
> the named constructors:
> 
> Whenever I want to create a new document, I use one of the fromXyz() methods. 
> And when I use those, I get exactly what it says on the tin.
> 
> Making the regular constructor available for use would effectively make 
> whatever that constructor does a special case / the default case.


Just wanted to voice my support for this too. In the WordPress HTML API I 
regret exposing a public constructor on the underlying lexer because that 
prevented me from hiding the constructor on the HTML parsing class, leading to 
an awkward convention to try and beg people not to use it.

Why? the constructor performs various state initialization important for HTML, 
not only for the encoding, but in our case I felt it was important enough that 
we provide both a full-document interface and also the HTML fragment parser, 
since almost all of the actual HTML parsing we do is one fragments. This 
_could_ be something to consider here in the HTMLDocument API as well.

The use of the constructor directly opens up opportunities to bypass that state 
initialization which could lead to incorrect parse results, not only for 
encoding issues, but also in getting way off track in parsing. E.g. if we’re 
parsing the inner contents of a TEMPLATE element then a closing BODY tag should 
be ignored whereas in other parsing modes it closes the BODY element.

> For the HtmlDocument I find it less obvious that creating an empty document 
> would be the default case compared to loading an existing document from a 
> file.


To hit the point home, “creating an empty document” seems a little unclear in 
terms of HTML as well. Are we creating am empty DOM class instance without any 
actual DOM? Are we creating a <!DOCTYPE html> DOM with no nodes? If we proceed 
by feeding it an HTML string and parsing then we can resolve the ambiguity, but 
if we start inserting DOM nodes directly that leaves us in an undefined 
starting point.

Presumably we don’t need this empty method because we could `loadString( ‘’ )`, 
but that leaves open the ambiguities of what kind of empty we’re making. In my 
experience with HTML, any kind of “empty” is going to need a definition of its 
own, and a static creator method is the right place to define and parameterize 
that, as in the updated RFC, though maybe with an additional clarification of 
what “empty” means and what implications it carries for ongoing manipulation.

—

Having static creator methods with docblocks and named arguments to me is a 
great way to not only guide people to use these classes safely, but also to 
teach some of the nuances that have historically been overlooked in PHP’s HTML 
handling, e.g. `html_entity_decode()` is unaware of of the ambiguous ampersand 
and offers no way to correctly parse certain kinds of invalid named character 
references.

Kindly,
Dennis Snell
--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php

Reply via email to