Hi Dennis,

> I only harp on the WhatWG spec so much because for many people this will
> be the only one they are aware of, if they are aware of any spec at all,
> and this is a sizable vector of attack targeting servers from user-supplied
> content. I’m curious to hear from folks here what fraction of the actual PHP
> code deals with RFC3986 URLs, and of those, if the systems using them are
> truly RFC3986 systems or if the common-enough URLs are valid in both specs.
>
I think Ignace's examples already highlighted that the two specifications differ so much in their nuances that even I had to admit, after months of trying to squeeze them into the same interface, that doing so would be irresponsible. The Uri\Rfc3986\Uri class will be useful for many use cases (e.g. representing URNs or URIs with scheme-specific behavior, like ldap apparently), and even the UriInterface of PSR-7 can build upon it. On the other hand, Uri\WhatWg\Url will be useful for representing browser links and any other URLs for the web (e.g. an HTTP application router component should use this class).

> Just to enlighten me and possibly others with less familiarity, how and
> when are RFC3986 URLs used and what are those systems supposed to do when
> an invalid URL appears, such as when dealing with percent-encodings as you
> brought up in response to Tim?
>

I am not 100% sure what I brought up to Tim, but certainly, the biggest difference between the two specs regarding percent-encoding was recently documented in the RFC: https://wiki.php.net/rfc/url_parsing_api#percent-encoding . The other main difference is how the host component is stored: WHATWG automatically percent-decodes it, while RFC3986 doesn't. This is summarized in the https://wiki.php.net/rfc/url_parsing_api#component_retrieval section (a bit below).

> This would be fine, knowing in hindsight that it was originally a relative
> path. Of course, this would mean that it’s critical that
> `https://example.com` does not replace the actual host part if one is
> provided in `$url`. For example, this code should work.
>
> ```
> $url = Uri\WhatWgUri::parse( 'https://wiki.php.net/rfc', 'https://example.com' );
> $url->domain === 'wiki.php.net'
> ```
>

Yes, that's the case. Both classes only use the base URL for relative URIs.

> Hopefully this won’t be too controversial, even though the concept was new
> to me when I started having to reliably work with URLs.
> I chose the example I did because of human risk factors in security
> exploits. "xn--google.com" is not in fact a Google domain, but an IDNA
> domain decoding to "䕮䕵䕶䕱.com"
>

I got your point, so I implemented your suggestion. Actually, I made yet another larger API change in the meantime, but in any case, the WHATWG implementation now supports IDNA in the following way:

    $url = Uri\WhatWg\Url::parse("https://🐘.com/🐘?🐘=🐘", null);

    echo $url->getHost();            // xn--go8h.com
    echo $url->getHostForDisplay();  // 🐘.com
    echo $url->toString();           // https://xn--go8h.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98
    echo $url->toDisplayString();    // https://🐘.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98

Unfortunately, RFC3986 doesn't support IDNA (as Ignace already pointed out at the end of https://externals.io/message/126182#126184), and adding support for RFC3987 (and therefore IRIs) would be a very large amount of work; it's just not feasible within this RFC :( To make things worse, its code would have to be written from scratch, since I haven't found any suitable C library for this purpose yet. That's why I'll leave them for

On another note, let me share some of the changes since my previous message to the mailing list:

- First and foremost, I removed the Uri\Rfc3986\Uri::normalize() method from the proposal after Arnaud's feedback. Now both the normalized (and decoded) and the non-normalized representations can be retrieved from the same URI instance. This change was necessary so that users can work with URIs consistently. If someone needs an exact URI component value, they can use the getRaw*() getter. If they want the normalized and percent-decoded form, then a get*() getter should be used. For more information, see the https://wiki.php.net/rfc/url_parsing_api#component_retrieval section.
- I made a few less important API changes, like converting the WhatWgError class to an enum, adding a Uri\Rfc3986\IUri::getUserInfo() method, changing the return type of some getters (removing nullability), etc.
- I fixed quite a few smaller details of the implementation, along with a very important spec incompatibility: until now, the "path" component didn't contain the leading "/" character when it should have. Now both classes conform to their respective specifications with regard to path handling.

I think the RFC is now mature enough to consider voting in the foreseeable future, since most of the concerns that have come up so far have been addressed one way or another. The only remaining question I still have is whether the Uri\Rfc3986\Uri and the Uri\WhatWg\Url classes should be final.

Personally, I don't see much of a problem with opening them for extension (other than some technical challenges that I already shared a few months ago), and I think people will have legitimate use cases for extending these classes. On the other hand, having final classes may allow us to make slightly more significant changes without BC concerns until we have a more battle-tested API, and would of course completely eliminate the need to overcome those technical challenges. According to Tim, it may also result in safer code, because spec-compliant base classes could not be extended by possibly non-spec-compliant or buggy children. I don't necessarily fully agree with this specific concern, but here it is.

Regards,
Máté