Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API

Dennis Snell Wed, 05 Mar 2025 14:47:47 -0800

> On Feb 16, 2025, at 3:01 PM, Máté Kocsis <kocsismat...@gmail.com> wrote:
> 
> 
> Hi Dennis,
>> 
>> I only harp on the WhatWG spec so much because for many people this will be 
>> the only one they are aware of, if they are aware of any spec at all, and 
>> this is a sizable vector of attack targeting servers from user-supplied 
>> content. I’m curious to hear from folks here hat fraction of the actual PHP 
>> code deals with RFC3986 URLs, and of those, if the systems using them are 
>> truly RFC3986 systems or if the common-enough URLs are valid in both specs.
>> 
> 
> I think Ignace's examples already highlighted that the two specifications 
> differ in nuances so much that even I had to admit after months of trying to 
> squeeze them into the same interface that doing so would be irresponsible.
> The Uri\Rfc3986\Uri will be useful for many use-case (i.e. representing URNs 
> or URIs with scheme-specific behavior - like ldap apparently), but even the 
> UriInterface of PSR-7 can build upon it. On the other hand, Uri\WhatWg\Url 
> will be useful for representing browser links and any other URLs for the web 
> (i.e. an HTTP application router component should use this class).
>  
>> Just to enlighten me and possibly others with less familiarity, how and when 
>> are RFC3986 URLs used and what are those systems supposed to do when an 
>> invalid URL appears, such as when dealing with percent-encodings as you 
>> brought up in response to Tim?
>> 
> 
> I am not 100% sure what I brought up to Tim, but certainly, the biggest 
> difference between the two specs regarding percent-encoding was recently 
> documented in the RFC: 
> https://wiki.php.net/rfc/url_parsing_api#percent-encoding
> . The other main difference is how the host component is stored: WHATWG 
> automatically percent-decodes it, while RFC3986 doesn't. This is summarized 
> in the https://wiki.php.net/rfc/url_parsing_api#component_retrieval
> section (a bit below).
>   
>> This would be fine, knowing in hindsight that it was originally a relative 
>> path. Of course, this would mean that it’s critical that `https://example.com
>> ` does not replace the actual host part if one is provided in `$url`. For 
>> example, this code should work.
>> 
>> ```
>>     $url = Uri\WhatWgUri::parse( 'https://wiki.php.net/rfc
>> ’, ‘https://example.com
>> ’ );
>>     $url->domain === 'wiki.php.net
>> '
>> 
> 
> Yes. it's the case. Both classes only use the base URL for relative URIs.
>  
>> Hopefully this won’t be too controversial, even though the concept was new 
>> to me when I started having to reliably work with URLs. I choose the example 
>> I did because of human risk factors in security exploits.  "xn--google.com
>> " is not in fact a Google domain, but an IDNA domain decoding to "䕮䕵䕶䕱.com
>> ”
>> 
> 
> I got your point, so I implemented your suggestion. Actually, I made yet 
> another larger API change in the meanwhile, but in any case, the WHATWG 
> implementation now supports IDNA the following way:
> $url = Uri\WhatWg\Url::parse("https://🐘.com/🐘?🐘=🐘";, null);
> 
> echo $url->getHost();                // xn--go8h.com
> 
> echo $url->getHostForDisplay();      // 🐘.com
> echo $url->toString();               // 
> https://xn--go8h.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98
> 
> echo $url->toDisplayString();        / 
> https://🐘.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98 
> 
> 
> Unfortunately, RFC3986 doesn't support IDNA (as Ignace already pointed out at 
> the end of https://externals.io/message/126182#126184
> ), and adding support for RFC3987 (therefore IRIs) would be a very heavy 
> amount of work, it's just not feasible within this RFC :( To make things 
> worse, its code should be written from scratch, since I haven't found any 
> suitable C library yet for this purpose. That's why I'll leave them for
> 
> 
> On other notes, let me share some of the changes since my previous message to 
> the mailing list:
> 
> 
> - First and foremost, I removed the Uri\Rfc3986\Uri::normalize() method from 
> the proposal after Arnaud's feedback. Now, both the normalized (and decoded), 
> as well as the non-normalized representation can equally be retrieved from 
> the same URI instance. This was necessary to change in order for users to be 
> able to consistently use URIs. Now, if someone needs an exact URI component 
> value, they can use the getRaw*() getter. If they want the normalized and 
> percent-decoded form then a get*() getter should be used. For more 
> information, the
> 
> >  https://wiki.php.net/rfc/url_parsing_api#component_retrieval
> section should be consulted.
> 
>


This seems like a good change.


> - I made a few less important API changes, like converting the WhatWgError 
> class to an enum, adding a Uri\Rfc3986\IUri::getUserInfo() method, changing 
> the return type of some getters (removing nullability) etc.
> 
> 

Love this.


> - I fixed quite some smaller details of the implementation along with a very 
> important spec incompatibility: until now, the "path" component didn't 
> contain the leading "/" character when it should have. Now, both classes 
> conform to their respective specifications with regards to path handling.
> 
> 

This is a late thought, and surely amenable to a later RFC, but I was thinking 
about the get/set path methods and the issue of the / and %2F.


 - If we exposed `getPathIterator()` or `getPathSegments()` could we not report 
these in their fully-decoded forms? That is, because the path segments are 
separated by some invocation or array element, they could be decoded?
 - Probably more valuably, if `withPath()` accepted an array, could we not 
allow fully non-escaped PHP strings as path segments which the URL class could 
safely and by-default handle the escaping for the caller?



Right now, if someone haphazardly joins path segments in order to set 
`withPath()` they will likely be unaware of that nuance and get the path wrong. 
On the grand scale of things, I suspect this is a really minor risk. However, 
if they could send in an array then they would never need to be aware of that 
nuance in order to provide a fully-reliable URL, up to the class rejecting path 
segments which cannot be represented.



> 
> 
> I think the RFC is now mature enough to consider voting in the foreseeable 
> future, since most of the concerns which came up until now are addressed some 
> way or another. However, the only remaining question that I still have is 
> whether the Uri\Rfc3986\Uri and the Uri\WhatWg\Url classes should be final? 
> Personally, I don't see much problem with opening them for extension (other 
> than some technical challenges that I already shared a few months ago), and I 
> think people will have legitimate use cases for extending these classes. On 
> the other hand, having final classes may allow us to make slightly more 
> significant changes without BC concerns until we have a more battle-tested 
> API, and of course completely eliminate the need to overcome the said 
> technical challenges. According to Tim, it may also result in safer code 
> because spec-compliant base classes cannot be extended by possibly non-spec 
> compliant/buggy children. I don't necessarily fully agree with this specific 
> concern, but here it is.
> 
> 

I’ve taken another fresh and full review of the RFC and I just want to share my 
appreciation for how well-written it seems, and how meticulously you have taken 
everyone’s feedback and incorporated it. It seems mature enough to me as well, 
and I think it’s in a good place. Still, here are some additional thoughts (and 
a previous one again) related to some of aspects, mostly naming.


The HTML5 library has `::createFromString()` instead of `parse()`. Did you 
consider following this form? It doesn’t seem that important, but could be a 
nice improvement in consistency among the newer spec-compliant APIs. Further, I 
think `createFromString()` is a little more obvious in intent, as `parse()` is 
so generic.


Given the issues around equivalence, what about `isEquivalent()` instead of 
`equals()`? In the RFC I think you have been careful to use the “equivalence” 
terminology, but then in the actual interface we fall back to `equals()` and 
lose some of the nuance.


Something about not implementing `getRawScheme()` and friends in the WHATWG 
class seems off. Your rationale makes sense, but then I wonder what the problem 
is in exposing the raw untranslated components, particularly since the “raw” 
part of the name already suggests some kind of danger or risk in using it as 
some semantic piece.


Tim brought up the naming of `getHost()` and `getHostForDisplay()` as well as 
the correspondence with the `toString()` methods. I’m not sure if it was 
overlooked or I missed the followup, but I wonder what your thoughts are on 
passing an enum to these methods indicating the rendering context. Here’s why: 
I see developers reach for the first method that looks right. In this case, 
that would almost always be `getHost()`, yet `getHost()` or `toString()` or 
whatever is going to be inappropriate in many common cases. I see two ways of 
baking in education into the API surface: creating two symmetric methods (e.g. 
`getDisplayableHost()` and `getNonDisplayableHost()`); or requiring an enum 
forcing the choice (e.g. `getHost( ForDisplay | ForNonDisplay )`). In the case 
on an enum this could be equally applied across all of the relevant methods 
where such a distinction exists. On one hand this could be seen as forcing 
callers to make a choice, but on the other hand it can also be seen as a 
safeguard against an extremely-common foot-gun, making such an easy oversight 
impossible.


Just this week I stumbled upon an issue with escaping the hash/fragment part of 
a URL. I think that browsers used to decode percent-encodings in the fragment 
but they all stopped and this was removed from the WHATWG HTML spec 
[no-percent-escaping]. The RFC currently shows `getFragment()` decoding 
percent-encoded fragments, However, I believe that the WHATWG URL spec only 
indicates percent-encoding when _setting_ the fragment. You can test this in a 
browser with the following example: Chrome, Firefox, and Safari exhibit the 
same behavior.


    u = new URL(window.location)
    u.hash = ‘one and two’;
    u.hash === ‘#one%20and%20two’;
    u.toString() === ‘….#one%20and%20two’;


So I think it may be more accurate and consistent to handle 
`Whatwg\Url::getFragment` in the same way as `getScheme()`. When setting a 
fragment we should percent-encode the appropriate characters, but when reading 
it, we should never interpret those characters — it should always return the 
“raw” value of the fragment.


[no-percent-escaping]: https://github.com/whatwg/url/issues/344



Once again, thank you for the great work you’ve put into this. I’m so excited 
to have it. All my comments should be understood exclusively within the WHATWG 
domain as I don’t have the same experience with the RFC3986 side.


Dennis Snell


> 
> 
> Regards,
> Máté
> 
> 
>

Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API

Reply via email to