> Hi Everyone,
>
> I've been working on a new RFC for a while now, and time has come to
> present it to a wider audience.
>
> Last year, I learnt that PHP doesn't have built-in support for parsing URLs
> according to any well established standards (RFC 1738 or the WHATWG URL
> living standard), since the parse_url() function is optimized for
> performance instead of correctness.
>
> In order to improve compatibility with external tools consuming URLs (like
> browsers), my new RFC would add a WHATWG compliant URL parser functionality
> to the standard library. The API itself is not final by any means, the RFC
> only represents how I imagined it first.
>
> You can find the RFC at the following link:
> https://wiki.php.net/rfc/url_parsing_api
>
> Regards,
> Máté

Máté, thanks for putting this together.
Whenever I need to work with URLs there are a few things missing that I would love to see incorporated into any change in PHP that brings us a spec-compliant parsing class.

First of all, I typically care most about WhatWG URLs because the PHP code I'm working with is making decisions about HTML that a browser will interpret. Paramount above all other concerns is that code on the server can understand content in the same way that the browsers will; otherwise we invite security issues. People may have valid critiques of the WhatWG specification, but it's also the most relevant specification for users of much or most of the PHP code we write, and it's valuable because it allows us to talk about URLs in the same way a browser would.

I'm worried about the side-effects that a global `uri.default_handler` could have, with code running differently for no apparent reason, or differently based on what is calling it. If someone is writing code for a controlled system I could see this being valuable, but if someone is writing a framework like WordPress and has no control over the environments in which the code runs, it seems dangerous to hope that every plugin and every host runs compatible system configurations. Nobody is going to check `ini_get( 'uri.default_handler' )` before every line that parses URLs. Beyond this, even just allowing a pluggable parser invites broken deployments, because PHP code that is reading from a browser or sending output to one needs to speak the language the browser is speaking, not some arbitrary language that's similar to it.

> One thing I feel is missing, is a method to parse a (partial) URL relative to
> another

Being able to parse a relative URL and know whether a URL is relative or absolute would help WordPress, which often makes decisions differently based on this property (for instance, when reading an `href` property of a link). I know these aren't spec-compliant URLs, but they still represent valid values for URL fields in HTML, and knowing whether they are relative currently requires parsing specific details everywhere in the code, vs. handling it in a class that already parses URLs. Effectively, this would imply that PHP's new URL parser decodes `document.querySelector( 'a' ).getAttribute( 'href' )`, which should be the same as `document.querySelector( 'a' ).href`, and indicates whether it found a full URL or only a portion of one.

* `$url->is_relative` or `$url->is_absolute`
* `$url->specificity = URL::Relative | URL::Absolute`

> the URI parser libraries used don't support modification of the URI

Having methods to add query arguments, change the path, etc. would be a great way to simplify user-space code working with URLs. For instance, read a URL and then add a query argument if some condition within the URL warrants it (for example, the path ends in `.png`). Was it intended to add this to the RFC before it's finalized?

> I would not make Url final. "OMG but then people can extend it!" Exactly.

My counter-point to this argument is that I see security exploits appear wherever functions that implement specifications are made pluggable and extensible. It's easy to see the need to create a class that limits possible URLs, but that also doesn't require extending a class. A class can wrap a URL parser just as it could extend one. Magic methods would make it even easier.
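
As a rough sketch of what I mean — the `Url` class name is from the RFC, but its constructor and the `scheme`/`path` properties below are my own assumptions, since the API isn't final — a wrapper can restrict which URLs are acceptable without subclassing the parser:

```php
<?php

// Sketch only: the Url API used here is assumed, not the RFC's final shape.
final class TrustedImageUrl
{
    private Url $url;

    public function __construct( string $raw_url )
    {
        // Assumed to throw or otherwise fail on unparseable input.
        $this->url = new Url( $raw_url );

        if ( 'https' !== $this->url->scheme || ! str_ends_with( $this->url->path, '.png' ) ) {
            throw new InvalidArgumentException( 'Not a trusted image URL.' );
        }
    }

    // Magic methods forward property reads and method calls to the wrapped
    // parser, so this object can be handed to most code expecting the parsed
    // URL itself.
    public function __get( string $name ): mixed
    {
        return $this->url->$name;
    }

    public function __call( string $name, array $arguments ): mixed
    {
        return $this->url->$name( ...$arguments );
    }
}
```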
A problem that can arise with adding additional rules onto a specification like this is that the subclass gets used in more places than it should, and then somewhere some PHP code allows a malicious URL through because parsing failed and the inspection rules were never applied.

----

Finally, I frequently find the need to be able to consider a URL in both the display context and the serialization context. With Ada we have `normalize_url()`, `parse_search_params()`, and the IDNA functions to convert between the two representations. In order to keep strong boundaries between security domains, it would be nice if PHP could expose the two variations: one is an encoded form of a URL that machines can easily parse, while the other is a "plain string" in PHP that's easier for humans to read but which might not even be a valid URL. Part of the reason for this need is that I often see user-space code treating an entire URL as a single text span that requires one set of rules for full decoding; in reality it's multiple segments that each have their own decoding rules.

- Original [ https://xn--google.com/secret/../search?q=🍔 ]
- `$url->normalize()` [ https://xn--google.com/search?q=%F0%9F%8D%94 ]
- `$url->for_display()` [ https://䕮䕵䕶䕱.com/search?q=🍔 ]

Having this in the RFC would give everyone the tools they need to effectively and safely set links within an HTML document.

----

All the best,
Dennis Snell
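
P.S. To put the two representations above into code form — the constructor and method names here are purely illustrative, since the RFC's API is not final:

```php
<?php

// Hypothetical methods for the two contexts; only the distinction matters,
// not the names.
$url = new Url( 'https://xn--google.com/secret/../search?q=🍔' );

echo $url->normalize();   // https://xn--google.com/search?q=%F0%9F%8D%94  (machine-friendly serialization)
echo $url->for_display(); // https://䕮䕵䕶䕱.com/search?q=🍔              (human-friendly display form)
```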