> On Feb 16, 2025, at 3:01 PM, Máté Kocsis <kocsismat...@gmail.com> wrote: > > > Hi Dennis, >> >> I only harp on the WhatWG spec so much because for many people this will be >> the only one they are aware of, if they are aware of any spec at all, and >> this is a sizable vector of attack targeting servers from user-supplied >> content. I’m curious to hear from folks here hat fraction of the actual PHP >> code deals with RFC3986 URLs, and of those, if the systems using them are >> truly RFC3986 systems or if the common-enough URLs are valid in both specs. >> > > I think Ignace's examples already highlighted that the two specifications > differ in nuances so much that even I had to admit after months of trying to > squeeze them into the same interface that doing so would be irresponsible. > The Uri\Rfc3986\Uri will be useful for many use-case (i.e. representing URNs > or URIs with scheme-specific behavior - like ldap apparently), but even the > UriInterface of PSR-7 can build upon it. On the other hand, Uri\WhatWg\Url > will be useful for representing browser links and any other URLs for the web > (i.e. an HTTP application router component should use this class). > >> Just to enlighten me and possibly others with less familiarity, how and when >> are RFC3986 URLs used and what are those systems supposed to do when an >> invalid URL appears, such as when dealing with percent-encodings as you >> brought up in response to Tim? >> > > I am not 100% sure what I brought up to Tim, but certainly, the biggest > difference between the two specs regarding percent-encoding was recently > documented in the RFC: > https://wiki.php.net/rfc/url_parsing_api#percent-encoding > . The other main difference is how the host component is stored: WHATWG > automatically percent-decodes it, while RFC3986 doesn't. This is summarized > in the https://wiki.php.net/rfc/url_parsing_api#component_retrieval > section (a bit below). > >> This would be fine, knowing in hindsight that it was originally a relative >> path. Of course, this would mean that it’s critical that `https://example.com >> ` does not replace the actual host part if one is provided in `$url`. For >> example, this code should work. >> >> ``` >> $url = Uri\WhatWgUri::parse( 'https://wiki.php.net/rfc >> ’, ‘https://example.com >> ’ ); >> $url->domain === 'wiki.php.net >> ' >> > > Yes. it's the case. Both classes only use the base URL for relative URIs. > >> Hopefully this won’t be too controversial, even though the concept was new >> to me when I started having to reliably work with URLs. I choose the example >> I did because of human risk factors in security exploits. "xn--google.com >> " is not in fact a Google domain, but an IDNA domain decoding to "䕮䕵䕶䕱.com >> ” >> > > I got your point, so I implemented your suggestion. Actually, I made yet > another larger API change in the meanwhile, but in any case, the WHATWG > implementation now supports IDNA the following way: > $url = Uri\WhatWg\Url::parse("https://🐘.com/🐘?🐘=🐘", null); > > echo $url->getHost(); // xn--go8h.com > > echo $url->getHostForDisplay(); // 🐘.com > echo $url->toString(); // > https://xn--go8h.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98 > > echo $url->toDisplayString(); / > https://🐘.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98 > > > Unfortunately, RFC3986 doesn't support IDNA (as Ignace already pointed out at > the end of https://externals.io/message/126182#126184 > ), and adding support for RFC3987 (therefore IRIs) would be a very heavy > amount of work, it's just not feasible within this RFC :( To make things > worse, its code should be written from scratch, since I haven't found any > suitable C library yet for this purpose. That's why I'll leave them for > > > On other notes, let me share some of the changes since my previous message to > the mailing list: > > > - First and foremost, I removed the Uri\Rfc3986\Uri::normalize() method from > the proposal after Arnaud's feedback. Now, both the normalized (and decoded), > as well as the non-normalized representation can equally be retrieved from > the same URI instance. This was necessary to change in order for users to be > able to consistently use URIs. Now, if someone needs an exact URI component > value, they can use the getRaw*() getter. If they want the normalized and > percent-decoded form then a get*() getter should be used. For more > information, the > > > https://wiki.php.net/rfc/url_parsing_api#component_retrieval > section should be consulted. > >
This seems like a good change. > - I made a few less important API changes, like converting the WhatWgError > class to an enum, adding a Uri\Rfc3986\IUri::getUserInfo() method, changing > the return type of some getters (removing nullability) etc. > > Love this. > - I fixed quite some smaller details of the implementation along with a very > important spec incompatibility: until now, the "path" component didn't > contain the leading "/" character when it should have. Now, both classes > conform to their respective specifications with regards to path handling. > > This is a late thought, and surely amenable to a later RFC, but I was thinking about the get/set path methods and the issue of the / and %2F. - If we exposed `getPathIterator()` or `getPathSegments()` could we not report these in their fully-decoded forms? That is, because the path segments are separated by some invocation or array element, they could be decoded? - Probably more valuably, if `withPath()` accepted an array, could we not allow fully non-escaped PHP strings as path segments which the URL class could safely and by-default handle the escaping for the caller? Right now, if someone haphazardly joins path segments in order to set `withPath()` they will likely be unaware of that nuance and get the path wrong. On the grand scale of things, I suspect this is a really minor risk. However, if they could send in an array then they would never need to be aware of that nuance in order to provide a fully-reliable URL, up to the class rejecting path segments which cannot be represented. > > > I think the RFC is now mature enough to consider voting in the foreseeable > future, since most of the concerns which came up until now are addressed some > way or another. However, the only remaining question that I still have is > whether the Uri\Rfc3986\Uri and the Uri\WhatWg\Url classes should be final? > Personally, I don't see much problem with opening them for extension (other > than some technical challenges that I already shared a few months ago), and I > think people will have legitimate use cases for extending these classes. On > the other hand, having final classes may allow us to make slightly more > significant changes without BC concerns until we have a more battle-tested > API, and of course completely eliminate the need to overcome the said > technical challenges. According to Tim, it may also result in safer code > because spec-compliant base classes cannot be extended by possibly non-spec > compliant/buggy children. I don't necessarily fully agree with this specific > concern, but here it is. > > I’ve taken another fresh and full review of the RFC and I just want to share my appreciation for how well-written it seems, and how meticulously you have taken everyone’s feedback and incorporated it. It seems mature enough to me as well, and I think it’s in a good place. Still, here are some additional thoughts (and a previous one again) related to some of aspects, mostly naming. The HTML5 library has `::createFromString()` instead of `parse()`. Did you consider following this form? It doesn’t seem that important, but could be a nice improvement in consistency among the newer spec-compliant APIs. Further, I think `createFromString()` is a little more obvious in intent, as `parse()` is so generic. Given the issues around equivalence, what about `isEquivalent()` instead of `equals()`? In the RFC I think you have been careful to use the “equivalence” terminology, but then in the actual interface we fall back to `equals()` and lose some of the nuance. Something about not implementing `getRawScheme()` and friends in the WHATWG class seems off. Your rationale makes sense, but then I wonder what the problem is in exposing the raw untranslated components, particularly since the “raw” part of the name already suggests some kind of danger or risk in using it as some semantic piece. Tim brought up the naming of `getHost()` and `getHostForDisplay()` as well as the correspondence with the `toString()` methods. I’m not sure if it was overlooked or I missed the followup, but I wonder what your thoughts are on passing an enum to these methods indicating the rendering context. Here’s why: I see developers reach for the first method that looks right. In this case, that would almost always be `getHost()`, yet `getHost()` or `toString()` or whatever is going to be inappropriate in many common cases. I see two ways of baking in education into the API surface: creating two symmetric methods (e.g. `getDisplayableHost()` and `getNonDisplayableHost()`); or requiring an enum forcing the choice (e.g. `getHost( ForDisplay | ForNonDisplay )`). In the case on an enum this could be equally applied across all of the relevant methods where such a distinction exists. On one hand this could be seen as forcing callers to make a choice, but on the other hand it can also be seen as a safeguard against an extremely-common foot-gun, making such an easy oversight impossible. Just this week I stumbled upon an issue with escaping the hash/fragment part of a URL. I think that browsers used to decode percent-encodings in the fragment but they all stopped and this was removed from the WHATWG HTML spec [no-percent-escaping]. The RFC currently shows `getFragment()` decoding percent-encoded fragments, However, I believe that the WHATWG URL spec only indicates percent-encoding when _setting_ the fragment. You can test this in a browser with the following example: Chrome, Firefox, and Safari exhibit the same behavior. u = new URL(window.location) u.hash = ‘one and two’; u.hash === ‘#one%20and%20two’; u.toString() === ‘….#one%20and%20two’; So I think it may be more accurate and consistent to handle `Whatwg\Url::getFragment` in the same way as `getScheme()`. When setting a fragment we should percent-encode the appropriate characters, but when reading it, we should never interpret those characters — it should always return the “raw” value of the fragment. [no-percent-escaping]: https://github.com/whatwg/url/issues/344 Once again, thank you for the great work you’ve put into this. I’m so excited to have it. All my comments should be understood exclusively within the WHATWG domain as I don’t have the same experience with the RFC3986 side. Dennis Snell > > > Regards, > Máté > > >