> On Sep 19, 2023, at 12:30 AM, Tim Düsterhus <t...@bastelstu.be> wrote:
>
> From the perspective of the user of the API, I like the symmetry between all
> the named constructors:
>
> Whenever I want to create a new document, I use one of the fromXyz() methods.
> And when I use those, I get exactly what it says on the tin.
>
> Making the regular constructor available for use would effectively make
> whatever that constructor does a special case / the default case.
Just wanted to voice my support for this too. In the WordPress HTML API I
regret exposing a public constructor on the underlying lexer because that
prevented me from hiding the constructor on the HTML parsing class, leading to
an awkward convention to try and beg people not to use it.
Why? the constructor performs various state initialization important for HTML,
not only for the encoding, but in our case I felt it was important enough that
we provide both a full-document interface and also the HTML fragment parser,
since almost all of the actual HTML parsing we do is one fragments. This
_could_ be something to consider here in the HTMLDocument API as well.
The use of the constructor directly opens up opportunities to bypass that state
initialization which could lead to incorrect parse results, not only for
encoding issues, but also in getting way off track in parsing. E.g. if we’re
parsing the inner contents of a TEMPLATE element then a closing BODY tag should
be ignored whereas in other parsing modes it closes the BODY element.
> For the HtmlDocument I find it less obvious that creating an empty document
> would be the default case compared to loading an existing document from a
> file.
To hit the point home, “creating an empty document” seems a little unclear in
terms of HTML as well. Are we creating am empty DOM class instance without any
actual DOM? Are we creating a <!DOCTYPE html> DOM with no nodes? If we proceed
by feeding it an HTML string and parsing then we can resolve the ambiguity, but
if we start inserting DOM nodes directly that leaves us in an undefined
starting point.
Presumably we don’t need this empty method because we could `loadString( ‘’ )`,
but that leaves open the ambiguities of what kind of empty we’re making. In my
experience with HTML, any kind of “empty” is going to need a definition of its
own, and a static creator method is the right place to define and parameterize
that, as in the updated RFC, though maybe with an additional clarification of
what “empty” means and what implications it carries for ongoing manipulation.
—
Having static creator methods with docblocks and named arguments to me is a
great way to not only guide people to use these classes safely, but also to
teach some of the nuances that have historically been overlooked in PHP’s HTML
handling, e.g. `html_entity_decode()` is unaware of of the ambiguous ampersand
and offers no way to correctly parse certain kinds of invalid named character
references.
Kindly,
Dennis Snell
--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php