Hi Timothy,

Timothy Sample <samp...@ngyro.com> wrote:

> Here’s a patch for Guile that uses explicit lists of characters in the
> ‘(web uri)’ module instead of character ranges.  It includes two tests
> that are pretty verbose, but seem to do the trick.
>
> I have a bit more background on the problem, mostly coming from a Glibc
> bug report: <https://sourceware.org/bugzilla/show_bug.cgi?id=23393>.
>
> It turns out that it is well-known upstream, and avoiding character
> ranges is the recommended approach for now.  Some other GNU tools have
> adopted what is being called the “Rational Range Interpretation”
> <https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html>.
> AIUI, this means they use the underlying encoding numbers for ranges (I
> checked the source, but I’m only mostly sure I read it right).  It looks
> like the Glibc folks are unsure how to proceed on this (but are maybe
> slightly leaning towards the “rational” approach).

Great that you gleaned good references on this topic!

> It’s all a pretty big mess, really.  I was hoping there would be some
> obvious thing that would fix the problem more generally.  Short of
> pulling in the Gnulib regex code or writing something in Scheme, it
> looks like Guile is stuck where it is now.

Yeah.  The alternative would be to not use regexps in this context, I
guess.

> I’m unsure if the changes are considered “trivial” from a copyright
> perspective.  It’s pretty close, but I think programmers tend to
> underestimate here.  I’ve started the FSF copyright assignment process
> either way, since this is likely not my last Guile patch.  :)

If the process is already underway, I think it’s fine to commit this
patch (I would rather wait if it were longer and/or if we didn’t know
each other already).

> From 7b02be4c050c7b17a0e2685e8e453295f798c360 Mon Sep 17 00:00:00 2001
> From: Timothy Sample <samp...@ngyro.com>
> Date: Sun, 2 Jun 2019 14:41:20 -0400
> Subject: [PATCH] Make URI handling locale independent.
>
> Fixes <https://bugs.gnu.org/35785>.
>
> * module/web/uri.scm (digits, hex-digits, letters): New variables.
> (ipv4-regexp, ipv6-regexp, domain-label-regexp, top-label-regexp,
> userinfo-pat, host-pat, ipv6-host-pat, port-pat, scheme-pat): Explicitly
> list each character instead of using character ranges.
> * test-suite/tests/web-uri.test: Add corresponding tests.

[...]

> +    (pass-if "http://www.example.com (sv_SE)"
> +      (dynamic-wind
> +        (lambda () #t)
> +        (lambda ()
> +          (with-locale "sv_SE.utf8"
> +            (reload-module (resolve-module '(web uri)))
> +            (uri=? (string->uri "http://www.example.com")
> +                   #:scheme 'http #:host "www.example.com" #:path "")))

Aren’t ‘reload-module’ calls a leftover that can now be removed (also in
the other test)?

For the sv_SE test, what about taking a host name with a ‘w’, since
that’s the use case that allowed us to uncover this bug?

Apart from that it LGTM, thank you!

Ludo’.