> Hi Everyone,
>
> I've been working on a new RFC for a while now, and time has come to
> present it to a wider audience.
>
> Last year, I learnt that PHP doesn't have built-in support for parsing URLs
> according to any well established standards (RFC 1738 or the WHATWG URL
> living standard), since the parse_url() function is optimized for
> performance instead of correctness.
>
> In order to improve compatibility with external tools consuming URLs (like
> browsers), my new RFC would add a WHATWG compliant URL parser functionality
> to the standard library. The API itself is not final by any means, the RFC
> only represents how I imagined it first.
>
> You can find the RFC at the following link:
> https://wiki.php.net/rfc/url_parsing_api
>
> Regards,
> Máté

Máté, thanks for putting this together.
Whenever I need to work with URLs there are a few things missing that I would love to see incorporated into any change in PHP that brings us a spec-compliant parsing class.

First of all, I typically care most about WhatWG URLs because the PHP code I'm working with is making decisions about HTML that a browser will interpret. Paramount above all other concerns is that code on the server can understand content in the same way that the browsers will; otherwise we invite security issues. People may have valid critiques of the WhatWG specification, but it's also the most relevant specification for users of much or most of the PHP code we write, and it's valuable because it allows us to talk about URLs in the same way a browser would.

I'm worried about the side-effects that a global `uri.default_handler` could have, with code running differently for no apparent reason, or differently based on what is calling it. If someone is writing code for a controlled system I could see this being valuable, but if someone is writing a framework like WordPress and has no control over the environments in which the code runs, it seems dangerous to hope that every plugin and every host runs compatible system configurations. Nobody is going to check `ini_get( 'uri.default_handler' )` before every line that parses URLs. Beyond this, even just allowing a pluggable parser invites broken deployments, because PHP code that is reading from a browser or sending output to one needs to speak the language the browser is speaking, not some arbitrary language that's similar to it.

> One thing I feel is missing, is a method to parse a (partial) URL relative to
> another

Being able to parse a relative URL and know whether a URL is relative or absolute would help WordPress, which often makes decisions differently based on this property (for instance, when reading an `href` property of a link). I know these aren't spec-compliant URLs, but they still represent valid values for URL fields in HTML, and knowing whether they are relative currently requires parsing specific details everywhere in the code, vs. handling it in a class that already parses URLs. Effectively, this would imply that PHP's new URL parser decodes `document.querySelector( 'a' ).getAttribute( 'href' )`, which should be the same as `document.querySelector( 'a' ).href`, and indicates whether it found a full URL or only a portion of one.

* `$url->is_relative` or `$url->is_absolute`
* `$url->specificity = URL::Relative | URL::Absolute`

> the URI parser libraries used don't support modification of the URI

Having methods to add query arguments, change the path, etc. would be a great way to simplify user-space code working with URLs. For instance, read a URL and then add a query argument if some condition within the URL warrants it (for example, the path ends in `.png`). Was it intended to add this to the RFC before it's finalized?

> I would not make Url final. "OMG but then people can extend it!" Exactly.

My counter-point to this argument is that I see security exploits appear wherever functions that implement specifications are made pluggable and extensible. It's easy to see the need to create a class that limits possible URLs, but that also doesn't require extending a class. A class can wrap a URL parser just as it could extend one. Magic methods would make it even easier.
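
As a rough sketch of what I mean — the `Url` class name is from the RFC, but its constructor and the `scheme`/`path` properties below are my own assumptions, since the API isn't final — a wrapper can restrict which URLs are acceptable without subclassing the parser:

```php
<?php

// Sketch only: the Url API used here is assumed, not the RFC's final shape.
final class TrustedImageUrl
{
    private Url $url;

    public function __construct( string $raw_url )
    {
        // Assumed to throw or otherwise fail on unparseable input.
        $this->url = new Url( $raw_url );

        if ( 'https' !== $this->url->scheme || ! str_ends_with( $this->url->path, '.png' ) ) {
            throw new InvalidArgumentException( 'Not a trusted image URL.' );
        }
    }

    // Magic methods forward property reads and method calls to the wrapped
    // parser, so this object can be handed to most code expecting the parsed
    // URL itself.
    public function __get( string $name ): mixed
    {
        return $this->url->$name;
    }

    public function __call( string $name, array $arguments ): mixed
    {
        return $this->url->$name( ...$arguments );
    }
}
```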
A problem that can arise with adding additional rules onto a specification like this is that the subclass gets used in more places than it should, and then somewhere some PHP code allows a malicious URL through because parsing failed and the inspection rules were never applied.

----

Finally, I frequently find the need to be able to consider a URL in both the display context and the serialization context. With Ada we have `normalize_url()`, `parse_search_params()`, and the IDNA functions to convert between the two representations. In order to keep strong boundaries between security domains, it would be nice if PHP could expose the two variations: one is an encoded form of a URL that machines can easily parse, while the other is a "plain string" in PHP that's easier for humans to read but which might not even be a valid URL. Part of the reason for this need is that I often see user-space code treating an entire URL as a single text span that requires one set of rules for full decoding; in reality it's multiple segments that each have their own decoding rules.

- Original [ https://xn--google.com/secret/../search?q=🍔 ]
- `$url->normalize()` [ https://xn--google.com/search?q=%F0%9F%8D%94 ]
- `$url->for_display()` [ https://䕮䕵䕶䕱.com/search?q=🍔 ]

Having this in the RFC would give everyone the tools they need to effectively and safely set links within an HTML document.

----

All the best,
Dennis Snell
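
P.S. To put the two representations above into code form — the constructor and method names here are purely illustrative, since the RFC's API is not final:

```php
<?php

// Hypothetical methods for the two contexts; only the distinction matters,
// not the names.
$url = new Url( 'https://xn--google.com/secret/../search?q=🍔' );

echo $url->normalize();   // https://xn--google.com/search?q=%F0%9F%8D%94  (machine-friendly serialization)
echo $url->for_display(); // https://䕮䕵䕶䕱.com/search?q=🍔              (human-friendly display form)
```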