Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API

Máté Kocsis Sun, 16 Feb 2025 14:03:17 -0800

Hi Dennis,

>
> I only harp on the WhatWG spec so much because for many people this will
> be the only one they are aware of, if they are aware of any spec at all,
> and this is a sizable vector of attack targeting servers from user-supplied
> content. I’m curious to hear from folks here what fraction of the actual PHP
> code deals with RFC3986 URLs, and of those, if the systems using them are
> truly RFC3986 systems or if the common-enough URLs are valid in both specs.
>


I think Ignace's examples already highlighted that the two specifications
differ in nuances so much that even I had to admit after months of trying
to squeeze them into the same interface that doing so would be
irresponsible.
The Uri\Rfc3986\Uri will be useful for many use cases (e.g. representing
URNs or URIs with scheme-specific behavior - like ldap apparently), and
even the UriInterface of PSR-7 can build upon it. On the other hand,
Uri\WhatWg\Url will be useful for representing browser links and any other
URLs for the web (e.g. an HTTP application router component should use this
class).


> Just to enlighten me and possibly others with less familiarity, how and
> when are RFC3986 URLs used and what are those systems supposed to do when
> an invalid URL appears, such as when dealing with percent-encodings as you
> brought up in response to Tim?
>

I am not 100% sure what I brought up to Tim, but certainly, the biggest
difference between the two specs regarding percent-encoding was recently
documented in the RFC:
https://wiki.php.net/rfc/url_parsing_api#percent-encoding . The other main
difference is how the host component is stored: WHATWG automatically
percent-decodes it, while RFC3986 doesn't. This is summarized in the
https://wiki.php.net/rfc/url_parsing_api#component_retrieval section (a bit
below).
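
To make the host difference concrete, here is a minimal sketch. It assumes nothing beyond the statement above that one spec stores the host percent-decoded and the other keeps it as written. storedHostSketch() is a hypothetical helper, not part of the proposed API, and a real WHATWG host parser does considerably more (IDNA processing, validation) than this single decoding step.

```php
<?php
// Hypothetical helper for illustration; not part of the proposed API.
// $percentDecode = true mimics the WHATWG behavior described above,
// false mimics RFC 3986 keeping the host exactly as written.
function storedHostSketch(string $hostAsWritten, bool $percentDecode): string
{
    return $percentDecode ? rawurldecode($hostAsWritten) : $hostAsWritten;
}

$written = 'ex%61mple.com';
echo storedHostSketch($written, true), "\n";  // example.com
echo storedHostSketch($written, false), "\n"; // ex%61mple.com
```

The practical consequence is that two hosts which compare equal after decoding may still differ as raw strings, so comparisons should be made on the same representation.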


> This would be fine, knowing in hindsight that it was originally a relative
> path. Of course, this would mean that it’s critical that
> `https://example.com` does not replace the actual host part if one is
> provided in `$url`. For example, this code should work.
>
> ```
>     $url = Uri\WhatWgUri::parse( 'https://wiki.php.net/rfc', 'https://example.com' );
>     $url->domain === 'wiki.php.net'
> ```
>

Yes, that's the case. Both classes only use the base URL for relative URIs.
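
The rule can be sketched roughly as follows. This is a simplified illustration built on PHP's existing parse_url(), not the resolution algorithm of either specification. resolveSketch() is a hypothetical helper that only handles absolute inputs and absolute-path references such as "/rfc".

```php
<?php
// Simplified illustration only; NOT the resolution algorithm of either spec.
// resolveSketch() is a hypothetical helper, not part of the proposed API.
function resolveSketch(string $input, string $base): string
{
    // An input that carries its own scheme is absolute: the base is ignored.
    if (parse_url($input, PHP_URL_SCHEME) !== null) {
        return $input;
    }

    // Only absolute-path references such as "/rfc" are handled here;
    // network-path ("//host/...") and other relative forms are rejected.
    if (!str_starts_with($input, '/') || str_starts_with($input, '//')) {
        throw new InvalidArgumentException('Unsupported reference in this sketch.');
    }

    // Borrow scheme and host from the base, keep the path from the input.
    return parse_url($base, PHP_URL_SCHEME) . '://' . parse_url($base, PHP_URL_HOST) . $input;
}

echo resolveSketch('https://wiki.php.net/rfc', 'https://example.com'), "\n"; // https://wiki.php.net/rfc
echo resolveSketch('/rfc', 'https://example.com'), "\n";                     // https://example.com/rfc
```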


> Hopefully this won’t be too controversial, even though the concept was new
> to me when I started having to reliably work with URLs. I chose the
> example I did because of human risk factors in security exploits.
> "xn--google.com" is not in fact a Google domain, but an IDNA domain
> decoding to "䕮䕵䕶䕱.com".
>

I got your point, so I implemented your suggestion. Actually, I made yet
another larger API change in the meantime, but in any case, the WHATWG
implementation now supports IDNA in the following way:

$url = Uri\WhatWg\Url::parse("https://🐘.com/🐘?🐘=🐘", null);

echo $url->getHost();                // xn--go8h.com
echo $url->getHostForDisplay();      // 🐘.com
echo $url->toString();               // https://xn--go8h.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98
echo $url->toDisplayString();        // https://🐘.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98
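
For readers who want to see where the encoded forms above come from, here is a small sketch using functions that already exist in PHP. It is only an approximation. rawurlencode() agrees with the output above for this non-ASCII character, but its treatment of ASCII characters is not the same as the WHATWG percent-encode sets. The host conversion relies on the intl extension, so it is guarded and its output is not asserted here.

```php
<?php
// The elephant emoji (U+1F418), as used in the example above.
$elephant = "\u{1F418}";

// Path and query: the character is percent-encoded as its UTF-8 bytes.
$encoded = rawurlencode($elephant);
echo $encoded, "\n"; // %F0%9F%90%98

// Decoding restores the original character.
var_dump(rawurldecode($encoded) === $elephant); // bool(true)

// Host: Punycode conversion needs the intl extension, which may be absent.
if (function_exists('idn_to_ascii')) {
    var_dump(idn_to_ascii($elephant . '.com', IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46));
}
```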

Unfortunately, RFC3986 doesn't support IDNA (as Ignace already pointed out
at the end of https://externals.io/message/126182#126184), and adding
support for RFC3987 (and therefore IRIs) would be a huge amount of work; it's
just not feasible within this RFC :( To make things worse, the code would have
to be written from scratch, since I haven't found any suitable C library yet
for this purpose. That's why I'll leave them for

On another note, let me share some of the changes since my previous message
to the mailing list:

- First and foremost, I removed the Uri\Rfc3986\Uri::normalize() method
from the proposal after Arnaud's feedback. Now both the normalized (and
decoded) and the non-normalized representations can be retrieved from the
same URI instance. This change was necessary so that users can work with
URIs consistently. If someone needs an exact URI component value, they can
use the getRaw*() getter. If they want the normalized and percent-decoded
form, then a get*() getter should be used. For more information, the
https://wiki.php.net/rfc/url_parsing_api#component_retrieval section should
be consulted.
- I made a few less important API changes, like converting the WhatWgError
class to an enum, adding a Uri\Rfc3986\Uri::getUserInfo() method, changing
the return type of some getters (removing nullability) etc.
- I fixed quite a few smaller details of the implementation along with a
very important spec incompatibility: until now, the "path" component didn't
contain the leading "/" character when it should have. Now, both classes
conform to their respective specifications with regard to path handling.
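
The raw versus normalized getter pair from the first item can be pictured with the following sketch. PathSketch is a hypothetical class, not the proposed implementation. Its getPath() applies only the percent-encoding normalization of RFC 3986 section 6.2.2 (decode unreserved characters, uppercase the remaining hex digits), which is an assumption about what "normalized" covers. The authoritative rules are in the component_retrieval section linked above.

```php
<?php
// Hypothetical class for illustration; not the proposed implementation.
final class PathSketch
{
    public function __construct(private string $rawPath)
    {
    }

    // Returns the path exactly as it was given.
    public function getRawPath(): string
    {
        return $this->rawPath;
    }

    // Returns the path with percent-encoded unreserved characters decoded
    // and the hex digits of all other escapes uppercased (RFC 3986, 6.2.2).
    public function getPath(): string
    {
        return preg_replace_callback(
            '/%([0-9A-Fa-f]{2})/',
            static function (array $m): string {
                $char = chr(hexdec($m[1]));
                return preg_match('/^[A-Za-z0-9\-._~]$/', $char) === 1
                    ? $char
                    : '%' . strtoupper($m[1]);
            },
            $this->rawPath
        );
    }
}

$path = new PathSketch('/%7Efoo/%2fbar');
echo $path->getRawPath(), "\n"; // /%7Efoo/%2fbar
echo $path->getPath(), "\n";    // /~foo/%2Fbar
```

Note that the leading "/" is preserved in both forms, in line with the path handling fix described in the last item.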

I think the RFC is now mature enough to consider voting in the
foreseeable future, since most of the concerns which came up until now are
addressed one way or another. The only remaining question that I
still have is whether the Uri\Rfc3986\Uri and the Uri\WhatWg\Url classes
should be final. Personally, I don't see much problem with opening them for
extension (other than some technical challenges that I already shared a few
months ago), and I think people will have legitimate use cases for
extending these classes. On the other hand, having final classes may allow
us to make slightly more significant changes without BC concerns until we
have a more battle-tested API, and of course completely eliminate the need
to overcome the said technical challenges. According to Tim, it may also
result in safer code because spec-compliant base classes cannot be extended
by possibly non-spec compliant/buggy children. I don't necessarily fully
agree with this specific concern, but here it is.

Regards,
Máté
