Hi Vivien,
> This pushes the limits of my understanding of URIs, as I did not know
> we had to consider '%2E%2E' the same as '..'. However, the RFC is not
> very clear:

I wasn't able to find anything that MANDATED any normalization at all,
either before or after Relative Resolution.

It is possible that treating %2E as a literal dot in
resolve-relative-reference could count as unwanted normalization.
But it's a safe operation in terms of URI equivalence* and I think users
would be less confused to have %2E%2E disappear than to have it remain.

Also, what if the resolve-relative-reference procedure didn't treat %2E
as a dot?
There isn't a uri-normalize procedure users can call afterwards to fix
that.
And there isn't a version of uri-decode that allows selectively decoding
JUST the dot characters.
Users would have to write a lot of code themselves to get proper
relative-resolution, so we should do it for them.

- Nathan

*References for the claim that treating %2E as a literal dot is always
okay:
- Section 2.3: percent-encoded unreserved characters are always
  equivalent to decoded ones.
- Section 2.4: unreserved characters can be percent-decoded at any time.
- Section 6.2.2.3: dot-segments should be removed during normalization
  even if found outside of a relative-reference.

Vivien Kraus <viv...@planete-kraus.eu> writes:

> Hello Nathan!
>
> On Thursday, 2 November 2023 at 16:00 -0400, Nathan wrote:
>> There is a problem and I fixed it by rewriting a bunch of code myself
>> because I need similar code.
>
> Thank you!
>
>> remove-dot-segments:
>> You cannot split-and-decode-uri-path and then
>> encode-and-join-uri-path.
>> Those are terrible functions that don't work on all URIs.
>> URI schemes are allowed to specify that certain reserved characters
>> (sub-delims) are special.
>> In that case, a sub-delim that IS escaped is different from a
>> sub-delim that IS NOT escaped.
>>
>> Example input to your remove-dot-segments:
>> (resolve-relative-reference
>>  (string->uri-reference "/")
>>  (string->uri-reference "excitement://a.com/a!a!%21!"))
>> Your wrong output:
>> excitement://a.com/a%21a%21%21%21
>
> I see.
>
>>
>> One solution would be to only percent-decode dots. Because dot is
>> unreserved, that solution doesn't have any URI equivalence issues.
>> But I still think decoding dots automatically is a bad, unexpected
>> side-effect to have.
>> I rewrote this function so that it:
>> - works on both escaped and unescaped dots
>> - doesn't unescape any unnecessary characters
>
> This pushes the limits of my understanding of URIs, as I did not know
> we had to consider '%2E%2E' the same as '..'. However, the RFC is not
> very clear:
>
> 2.3: Unreserved Characters:
>    For consistency, percent-encoded octets in the ranges of ALPHA
>    (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E),
>    underscore (%5F), or tilde (%7E) should not be created by URI
>    producers and, when found in a URI, should be decoded to their
>    corresponding unreserved characters by URI normalizers.
>
> 5.2.1: Pre-parse the Base URI:
>    Normalization of the base URI, as described in Sections 6.2.2 and
>    6.2.3, is optional. A URI reference must be transformed to its
>    target URI before it can be normalized.
>
> Did you find something more precise than that? In any case, decoding
> the dots is probably the least unsafe thing to do.
>
>>
>> The test suite no longer needs to check for incorrect output either:
>> > ;; The test suite checks for ';' characters, but Guile escapes
>> > ;; them in URIs. Same for '='.
>>
>> ----
>>
>> resolve-relative-reference:
>> I rewrote this procedure so it is shorter.
>> I also added #:strict? to toggle "strict parser" as mentioned in the
>> RFC.
>
> As far as I understand, your code is correct. The tests pass.
>
> Thank you again!
>
> Best regards,
>
> Vivien
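
P.S. To make the "escaped dots, nothing else" idea concrete, here is a
minimal sketch of dot-segment removal that recognizes %2E as a dot but
never decodes or re-encodes any other character. It is not the code from
the patch: the names dot-segment?, dot-dot-segment? and
remove-dot-segments/keep-escapes are hypothetical, and the sketch assumes
the path is absolute (starts with "/"). It uses only core Guile string
procedures.

```scheme
;; Sketch only.  `dot-segment?', `dot-dot-segment?' and
;; `remove-dot-segments/keep-escapes' are hypothetical names, not the
;; procedures from the patch.  Assumes PATH starts with "/".

(define (dot-segment? segment)
  ;; "." or its percent-encoded form, compared case-insensitively.
  (member (string-downcase segment) '("." "%2e")))

(define (dot-dot-segment? segment)
  ;; ".." where either dot may be percent-encoded.
  (member (string-downcase segment) '(".." ".%2e" "%2e." "%2e%2e")))

(define (remove-dot-segments/keep-escapes path)
  (let loop ((in (cdr (string-split path #\/))) ; drop the "" before the first "/"
             (out '()))                         ; kept segments, most recent first
    (cond
     ((null? in)
      (string-append "/" (string-join (reverse out) "/")))
     ((dot-segment? (car in))
      ;; A trailing "." leaves a trailing "/" behind.
      (loop (if (null? (cdr in)) '("") (cdr in)) out))
     ((dot-dot-segment? (car in))
      ;; Drop the previous segment, if any; a trailing ".." also
      ;; leaves a trailing "/" behind.
      (loop (if (null? (cdr in)) '("") (cdr in))
            (if (null? out) out (cdr out))))
     (else
      ;; Ordinary segment: keep it byte-for-byte, escapes included.
      (loop (cdr in) (cons (car in) out))))))

(remove-dot-segments/keep-escapes "/a/%2E%2E/b!%21") ; => "/b!%21"
(remove-dot-segments/keep-escapes "/a!a!%21!")       ; => "/a!a!%21!"
```

Working on raw segments keeps "!" and "%21" distinct, which is the
property the split-and-decode approach loses.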