Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API

Máté Kocsis Sun, 02 Mar 2025 14:02:13 -0800

Hi Tim,

Thank you again for the thorough review!




> The naming of these methods seems to be a little inconsistent. It should
> either be:
>
>      ->getHostForDisplay()
>      ->toStringForDisplay()
>
> or
>
>      ->getDisplayHost()
>      ->toDisplayString()
>
> but not a mix between both of them.
>

Yes, I completely agree with your concern. I'm just not sure yet which
combination I'd prefer.
Probably the latter one?


> Yes. Besides the remark above, my previous arguments still apply (e.g.
> `with()`ers not being able to construct instances for subclasses,
> requiring to override all of them). I'm also noticing that serialization
> is unsafe with subclasses that add a `$__uri` property (or perhaps any
> property at all?).
>

Hm, yes, you are right indeed that withers cannot really create new
instances on
their own because the whole URI string is needed to instantiate a
new object... which is only
accessible if it's reconstructed by swapping the relevant component with
its new value.

Please note that trying to serialize a $__uri property will result in an
exception.

1.
>
> The `toDisplayString()` method that you mentioned above is not in the
> RFC. Did you mean `toHumanFriendlyString()`? Which one is correct?
>

The toHumanFriendlyString() method stuck there from a previous version of
the proposal,
since then I converted it to toDisplayString().


> 2.
>
> The example output of the `$errors` array does not match the stub. It
> contains a `failure` property, should that be `softError` instead?
>

The $softError property is also an outdated name: I recently changed it to
$failure
to be consistent with the wording that the WHATWG specification uses.


> 3.
>
> The RFC states "When trying to instantiate a WHATWG Url via its
> constructor, a Uri\InvalidUriException is thrown when parsing results in
> a failure."
>
> What happens for Rfc3986 when passing an invalid URI to the constructor?
> Will an exception be thrown? What will the error array contain? Is it
> perhaps necessary to subclass Uri\InvalidUriException for use with
> WhatWgUrl, since `$errors` is not applicable for 3986?
>

The first two questions are answered right at the top of the parsing
section:

"the constructor: It expects a URI, and optionally, a base URL in order to
support reference resolution.
When parsing is unsuccessful, a Uri\InvalidUriException is thrown."

The $errors property will contain an empty array though, as you supposed. I
don't see much problem
with using the same exception in both cases, however I'm also fine
with making the $errors property
nullable in order to indicate that returning errors is not supported by the
implementation triggering
the error.


>
> 4.
>
> The RFC does not specify when `UninitializedUriException` is thrown.
>

That's a very good catch! I completely forgot about some exceptions. This
one is used
for indicating that an URI is not correctly initialized: when a URI
instance is created
without actually invoking the constructor, or the parse method, or
__unserialize(),
then any methods that try to use the internally stored URI will trigger
this exception.


> 5.
>
> The RFC does not specify when `UriOperationException` is thrown.


> 6.
>
> Generally speaking I believe it would help understanding if you would
> add a `/** @throws InvalidUriException */` to each of the methods in the
> stub to make it clear which ones are able to throw (e.g. resolve(), or
> the withers). It's harder to find this out from “English” rather than
> “code” :-)
>
>
Good idea! I've added the PHPDoc as well as created a dedicated "Exceptions"
section.

7.
>
> In the “Component retrieval” section: Please add even more examples of
> what kind of percent-decoding will happen. For example, it's important
> to know if `%26` is decoded to `&` in a query-string. Or if `%3D` is
> decoded to `=`. This really is the same case as with `%2F` in a path.
> The explanation
>

Thanks for calling these cases out, I've significantly reworked the
relevant sections.
First of all, I added much more details to the general overview about
percent-encoding:
https://wiki.php.net/rfc/url_parsing_api#percent-encoding_decoding as well
as extended
the https://wiki.php.net/rfc/url_parsing_api#component_retrieval section
with more information
about the two component representations, and added a general clarification
related to reserved
characters. Additionally, the
https://wiki.php.net/rfc/url_parsing_api#component_modification
section
makes it clear how percent-encoding is performed when the withers are used.

After thinking about the question a lot, finally the current
encoding-decoding rules seem
logical to me, but please double-check them. It's easy to misinterpret such
long and complex
specifications.

Long story short: when parsing an URI or modifying a component, RFC 3986
fails hard if
an invalid character is found, while WHATWG implementation automatically
percent-encodes
it while also triggering a soft-error.

While retrieving the "normalized-decoded" representation of a URI component,
percent-decoding is
performed when possible:
- in case of RFC3986: reserved and invalid characters are not
percent-decoded (only unreserved ones are)
- in case of WHATWG: invalid characters and characters with special meaning
(that fall into the percent-encode set
of the given component) are not percent-decoded

The relevant sections will give a little more reasoning why I went with
these rules.

"the URI is normalized (when applicable), and then the reserved
> characters in the context of the given component are percent-decoded.
> This means that only those reserved characters are percent-decoded that
> are not allowed in a component. This behavior is needed to be able to
> unambiguously retrieve components."
>
> alone is not clear to me. “reserved characters that are not allowed in a
> component”. I assume this means that `%2F` (/) in a path will not be
> decoded, but `%3F` (?) will, because a bare `?` can't appear in a path?
>

I hope that this question is also clear after my clarifications + the
reconsidered logic.


> 8.
>
> In the “Component retrieval” section: You compare the behavior of
> WhatWgUrl and Rfc3986Uri. It would be useful to add something like:
>
>      $url->getRawScheme() // does not exist, because WhatWgUrl always
> normalizes the scheme
>

Done.


>
> to better point out the differences between the two APIs with regard to
> normalization (it's mentioned, but having it in the code blocks would
> make it more visible).
>

Done.


>
> 9.
>
> In the “Component Modification” section, the RFC states that WhatWgUrl
> will automatically encode `?` and `#` as necessary. Will the same happen
> for Rfc3986? Will the encoding of `#` also happen for the query-string
> component? The RFC only mentions the path component.


The above referenced sections will give a clear answer for this question as
well.
TLDR: after your message, I realized that automatic percent-encoding also
triggers a (soft)
error case for WHATWG, so I changed my mind with regards to Uri\Rfc3986\Uri,
so it won't do any automatic percent-encoding. It's unfortunate, because
this behavior is not
consistent with WHATWG, but it's more consistent with the parsing rules of its
own specification,
where there are only hard errors, and there's no such thing as "automatic
correction".


> I'm also wondering if there are cases where the withers would not
> round-trip, i.e. where `$url->withPath($url->getPath())` would not
> result in the original URL?
>

I am currently not aware of any such situation... I even wrote about this
aspect fairly
long, because I think "roundtripability" is a very important attribute.
Thank you for
raising awareness of this!


>
> 10.
>
> Can you add examples where the authority / host contains IPv6 literals?
> It would be useful to specifically show whether or not the square
> brackets are returned when using the getters. It would also be
> interesting to see whether or not IPv6 addresses are normalized (e.g.
> shortening `2001:db8:0:0:0:0:0:1` to `2001:db8::1`).
>

Good idea again! I've added an example containing an IPv6 host at the very
end of the component retrieval section.  And yes, they will be
enclosed within a [] pair as
per the spec.

It also surprised me, but IP address normalization is only performed by
WHATWG
during recomposition! But nowhere else...


>
> 11.
>
> In “Component Recomposition” the RFC states "The
> Uri\Rfc3986\Uri::toString() returns the unnormalized URI string".
>
> Does this mean that toString() for Rfc3986 will always return the
> original input?
>

Yes, effectively that's the case, only WHATWG modifies the input according
to my knowledge.
In the past, I had the impression that RFC 3986 also did a few changes,
but then I had to realize that it was not the case after I had dug deep
into the code of uriparser.


>
> 12.
>
> It would be useful to know whether or not the classes implement
> `__debugInfo()` / how they appear when `var_dump()`ing them.
>

I've added an example.

That's all I managed to write for now, but I'll try to answer the rest of
the messages and feedback
as soon as possible. :)

Regards,
Máté

Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API

Reply via email to