Hi Dennis,

Even though I didn't answer for a long time, I was improving my RFC
implementation in the meantime, as well as evaluating your suggestions.

> I’m worried about the side-effects that having a global
> uri.default_handler could
> have with code running differently for no apparent reason, or differently
> based on what is calling it. If someone is writing code for a controlled
> system I could see this being valuable, but if someone is writing a
> framework like WordPress and has no control over the environments in which
> code runs, it seems dangerous to hope that every plugin and every host runs
> compatible system configurations. Nobody is going to check `ini_get(
> ‘uri.default_handler’ )` before every line that parses URLs. Beyond this,
> even just *allowing* a pluggable parser invites broken deployments
> because PHP code that is reading from a browser or sending output to one
> needs to speak the language the browser is speaking, not some arbitrary
> language that’s similar to it.
>

Your arguments convinced me regarding the issues a global
uri.default_handler INI setting can cause, especially after reading a blog
post by Daniel Stenberg on the topic (
https://daniel.haxx.se/blog/2022/01/10/dont-mix-url-parsers/). That's why I
removed it from the RFC in favor of configuring the parser at the
individual feature level. However, I don't agree with removing support for
a pluggable parser, for the following reasons:

- the current parse_url()-based parser is already doomed: it isn't
compliant with any spec, so it already doesn't speak the language the
browser is speaking
- even though the majority does, not everyone builds a browser-facing
application with PHP, especially because URIs are not necessarily
accessible on the web
- in addition, there are tools which aren't compliant with the WHATWG spec
but with some other one. Most prominently, cURL is mostly RFC 3986
compliant with some additional WHATWG flavour, according to
https://everything.curl.dev/cmdline/urls/browsers.html

That's why I intend to keep support for pluggability.
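
Just to make the feature-level configuration idea more tangible, here's a
rough sketch of what I have in mind; the Uri\WhatWgUri and Uri\Rfc3986Uri
class names follow the current draft of the RFC and may still change, and
the fetchResource() function is purely made up for illustration:

    <?php

    // Each feature picks its parser explicitly instead of relying on an
    // ambient uri.default_handler INI value that other code could change.
    function fetchResource(string $url, string $parserClass = Uri\WhatWgUri::class): void
    {
        // The choice of parser is visible right at the call site.
        $uri = $parserClass::parse($url);

        // ... work with $uri ...
    }

    // A browser-facing feature opts into the WHATWG parser...
    fetchResource('https://example.com/index.html', Uri\WhatWgUri::class);

    // ...while a cURL-oriented integration can stay closer to RFC 3986.
    fetchResource('https://example.com/index.html', Uri\Rfc3986Uri::class);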


> Being able to parse a relative URL and know if a URL is relative or
> absolute would help WordPress, which often makes decisions differently
> based on this property (for instance, when reading an `href` property of a
> link). I know these aren’t spec-compliant URLs, but they  still represent
> valid values for URL fields in HTML and knowing if they are relative or not
> requires some amount of parsing specific details everywhere, vs. in a class
> that already parses URLs. Effectively, this would imply that PHP’s new URL
> parser decodes  `document.querySelector( ‘a’ ).getAttribute( ‘href’ )`,
> which should be the same as `document.querySelector( ‘a’ ).href`, and
> indicates whether it found a full URL or only a portion of one.
>
>   * `$url->is_relative` or `$url->is_absolute`
>   * `$url->specificity = URL::Relative | URL::Absolute`
>

The Uri\WhatWgUri::parse() method accepts a relative URI as its first
parameter when the second (base URI) parameter is provided. So essentially
you'd have to use this variant of parse() to handle a relative
WHATWG-compliant URL, and then WhatWgUri should let you know whether the
originally passed-in URI was relative or not. Did I get you right? This
feature is certainly possible with RFC 3986 URIs (even without the base
parameter), but WHATWG requires the above-mentioned workaround for parsing,
and I still have to look into how this can be implemented...
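
To make the workaround concrete, here's a minimal sketch based on the
current draft API; I'm assuming here that the base URI can be passed as a
plain string, and the isRelative() flag is hypothetical (it's not part of
the RFC yet):

    <?php

    // A relative reference cannot be parsed by the WHATWG parser on its own,
    // but it can be resolved against a base URI:
    $url = Uri\WhatWgUri::parse('../search?q=php', 'https://example.com/articles/');

    // A hypothetical $url->isRelative() could then record whether the *input*
    // string was a relative reference, even though the resulting object is
    // always an absolute URL.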

> Having methods to add query arguments, change the path, etc… would be a
> great way to simplify user-space code working with URLs. For instance, read
> a URL and then add a query argument if some condition within the URL
> warrants it (for example, the path ends in `.png`).
>

I managed to retain support for the "wither" methods that were originally
part of the proposal. This required writing custom code for the uriparser
library, while the maintainer of Lexbor was kind enough to add native
support for modification after I submitted a feature request. However,
convenience methods for manipulating query parameters are still not part of
the RFC, because they would increase its scope even more, and because of
other issues highlighted by Ignace in his earlier email:
https://externals.io/message/123997#124077. Since I really want such a
feature, I'd be eager to create a follow-up RFC dedicated to handling query
strings.
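
For illustration, your `.png` example could look roughly like this with the
retained withers (the exact method names, such as getPath() and withQuery(),
follow the draft and may still change):

    <?php

    $url = Uri\WhatWgUri::parse('https://example.com/images/logo.png');

    // URI instances are immutable, so each "wither" returns a new object.
    if (str_ends_with($url->getPath(), '.png')) {
        $url = $url->withQuery('format=webp');
    }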

> My counter-point to this argument is that I see security exploits appear
> everywhere that functions which implement specifications are pluggable and
> extendable. It’s easy to see the need to create a class that *limits* possible
> URLs, but that also doesn’t require extending a class. A class can wrap a
> URL parser just as it could extend one. Magic methods would make it even
> easier.
>

Right now, it's only possible to plug internal URI implementations into
PHP; userland classes cannot be used, so this probably reduces the issue.
However, I recently bumped into a technical issue caused by the URI classes
not being final, and I am currently assessing how to solve it. More
information is available in one of my comments on my PR:
https://github.com/php/php-src/pull/14461/commits/8e21e6760056fc24954ec36c06124aa2f331afa8#r1847316607
As I currently see the situation, it would probably be better to make these
classes final so that similar unforeseen issues and inconsistencies cannot
happen again (we can always unfinalize them later anyway).
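
Even with final classes, the wrapping approach you describe keeps working,
which is part of why I'm leaning towards final. A purely illustrative
sketch (only Uri\WhatWgUri::parse() comes from the RFC, everything else is
made up):

    <?php

    // A userland class that *limits* accepted URLs by wrapping the parser
    // instead of extending the (final) URI class.
    final class HttpsOnlyUrl
    {
        private function __construct(private readonly Uri\WhatWgUri $url) {}

        public static function parse(string $input): self
        {
            $url = Uri\WhatWgUri::parse($input);

            if ($url === null || $url->getScheme() !== 'https') {
                throw new InvalidArgumentException('Only https URLs are accepted');
            }

            return new self($url);
        }

        // Expose only the subset of the URI API the wrapper wants to allow.
        public function getHost(): ?string
        {
            return $this->url->getHost();
        }
    }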


> Finally, I frequently find the need to be able to consider a URL in both
> the *display* context and the *serialization* context. With Ada we have
> `normalize_url()`, `parse_search_params()`, and the IDNA functions to
> convert between the two representations. In order to keep strong boundaries
> between security domains, it would be nice if PHP could expose the two
> variations: one is an encoded form of a URL that machines can easily parse
> while the other is a “plain string” in PHP that’s easier for humans to
> parse but which might not even be a valid URL. Part of the reason for this
> need is that I often see user-space code treating an entire URL as a single
> text span that requires one set of rules for full decoding; it’s multiple
> segments that each have their own decoding rules.
>
>  - Original [ https://xn--google.com/secret/../search?q=🍔 ]
>  - `$url->normalize()` [ https://xn--google.com/search?q=%F0%9F%8D%94 ]
>  - `$url->for_display()` [ https://䕮䕵䕶䕱.com/search?q=🍔 ]
>

Even though I didn't entirely implement this suggestion, I added
normalization support (a short sketch follows the list below):
- the normalize() method can be used to create a new URI instance whose
components are normalized, based on the current object
- the toNormalizedString() method can be used when only the normalized
string representation is needed
- the newly added equalsTo() method also makes use of normalization to
better identify equal URIs
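
A short sketch of how these fit together (the exact normalization rules are
described in the RFC):

    <?php

    $a = Uri\WhatWgUri::parse('https://example.com/a/./b/../c?q=1');
    $b = Uri\WhatWgUri::parse('https://example.com/a/c?q=1');

    $normalizedUri    = $a->normalize();            // new URI instance with normalized components
    $normalizedString = $a->toNormalizedString();   // normalized string representation only

    var_dump($a->equalsTo($b));                     // comparison relies on the normalized forms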

For more information, please refer to the relevant section of the RFC:
https://wiki.php.net/rfc/url_parsing_api#api_design. The forDisplay()
method also seems useful at first glance, but since it may be a
controversial optional feature, I'd defer it for later...

Regards,
Máté
