On 21/02/2025 13:06, Tim Düsterhus wrote:
Hi
Am 2025-02-16 23:01, schrieb Máté Kocsis:
I only harp on the WhatWG spec so much because for many people this
will
be the only one they are aware of, if they are aware of any spec at
all,
and this is a sizable vector of attack targeting servers from
user-supplied
content. I’m curious to hear from folks here hat fraction of the
actual PHP
code deals with RFC3986 URLs, and of those, if the systems using
them are
truly RFC3986 systems or if the common-enough URLs are valid in both
specs.
I think Ignace's examples already highlighted that the two
specifications
differ in nuances so much that even I had to admit after months of
trying
to squeeze them into the same interface that doing so would be
irresponsible.
I think this is also a good argument in favor of finally making the
classes final. Not making them final would allow for irresponsible
sub-classes :-)
echo $url->getHost(); // xn--go8h.com
echo $url->getHostForDisplay(); // 🐘.com
echo $url->toString(); //
https://xn--go8h.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98
echo $url->toDisplayString(); /
https://🐘.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98
The naming of these methods seems to be a little inconsistent. It
should either be:
->getHostForDisplay()
->toStringForDisplay()
or
->getDisplayHost()
->toDisplayString()
but not a mix between both of them.
I think the RFC is now mature enough to consider voting in the
foreseeable future, since most of the concerns which came up until
now are
addressed some way or another. However, the only remaining question
that I
still have is whether the Uri\Rfc3986\Uri and the Uri\WhatWg\Url classes
should be final? Personally, I don't see much problem with opening
them for
Yes. Besides the remark above, my previous arguments still apply (e.g.
`with()`ers not being able to construct instances for subclasses,
requiring to override all of them). I'm also noticing that
serialization is unsafe with subclasses that add a `$__uri` property
(or perhaps any property at all?).
--------------------
We already had extensive off-list discussion about the RFC and I agree
it's in a good shape now. I've given it another read and here's my
remarks:
1.
The `toDisplayString()` method that you mentioned above is not in the
RFC. Did you mean `toHumanFriendlyString()`? Which one is correct?
2.
The example output of the `$errors` array does not match the stub. It
contains a `failure` property, should that be `softError` instead?
3.
The RFC states "When trying to instantiate a WHATWG Url via its
constructor, a Uri\InvalidUriException is thrown when parsing results
in a failure."
What happens for Rfc3986 when passing an invalid URI to the
constructor? Will an exception be thrown? What will the error array
contain? Is it perhaps necessary to subclass Uri\InvalidUriException
for use with WhatWgUrl, since `$errors` is not applicable for 3986?
4.
The RFC does not specify when `UninitializedUriException` is thrown.
5.
The RFC does not specify when `UriOperationException` is thrown.
6.
Generally speaking I believe it would help understanding if you would
add a `/** @throws InvalidUriException */` to each of the methods in
the stub to make it clear which ones are able to throw (e.g.
resolve(), or the withers). It's harder to find this out from
“English” rather than “code” :-)
7.
In the “Component retrieval” section: Please add even more examples of
what kind of percent-decoding will happen. For example, it's important
to know if `%26` is decoded to `&` in a query-string. Or if `%3D` is
decoded to `=`. This really is the same case as with `%2F` in a path.
The explanation
"the URI is normalized (when applicable), and then the reserved
characters in the context of the given component are percent-decoded.
This means that only those reserved characters are percent-decoded
that are not allowed in a component. This behavior is needed to be
able to unambiguously retrieve components."
alone is not clear to me. “reserved characters that are not allowed in
a component”. I assume this means that `%2F` (/) in a path will not be
decoded, but `%3F` (?) will, because a bare `?` can't appear in a path?
8.
In the “Component retrieval” section: You compare the behavior of
WhatWgUrl and Rfc3986Uri. It would be useful to add something like:
$url->getRawScheme() // does not exist, because WhatWgUrl always
normalizes the scheme
to better point out the differences between the two APIs with regard
to normalization (it's mentioned, but having it in the code blocks
would make it more visible).
9.
In the “Component Modification” section, the RFC states that WhatWgUrl
will automatically encode `?` and `#` as necessary. Will the same
happen for Rfc3986? Will the encoding of `#` also happen for the
query-string component? The RFC only mentions the path component.
I'm also wondering if there are cases where the withers would not
round-trip, i.e. where `$url->withPath($url->getPath())` would not
result in the original URL?
10.
Can you add examples where the authority / host contains IPv6
literals? It would be useful to specifically show whether or not the
square brackets are returned when using the getters. It would also be
interesting to see whether or not IPv6 addresses are normalized (e.g.
shortening `2001:db8:0:0:0:0:0:1` to `2001:db8::1`).
11.
In “Component Recomposition” the RFC states "The
Uri\Rfc3986\Uri::toString() returns the unnormalized URI string".
Does this mean that toString() for Rfc3986 will always return the
original input?
12.
It would be useful to know whether or not the classes implement
`__debugInfo()` / how they appear when `var_dump()`ing them.
Best regards
Tim Düsterhus
Hi Maté I just read the final proposal and here's my quick remarks it
may be possible other have already highlighted some of those remarks:
I believe there's a typo in the RFC
> All URI components - with the exception of the host - can be
retrieved in two formats:
I believe you mean - with the excepotion of the Port
0 - It is a unfortunate that there's no IDNA support for RFC3986, I
understand the reasoning behind that decision but I was wondering if it
was possible to optin its use when the ext-intl extension is present ?
1 - Does it means that if/when Rfc3986/Uri get Rfc3987 supports they
will also get a `Uri::toDisplayString` and `Uri::getHostForDisplay`
maybe this should be stated in the Futurscope ?
2 - I would go with final classes for both classes and promote
decoration for extension. This would reduce security issues a lot.
3 - I would make the constructor private using a `from` , `tryFrom` or
`parse` and `tryParse` methods to highlight the difference in result
4 - For consistency I would use toRawString and toString just like it is
done for components.
5 - Can the returned array from __debugInfo be used in a "normal" method
like `toComponents` naming can be changed/improve to ease migration from
parse_url or is this left for userland library ?