> On 06.11.2016 at 22:06, Baptiste Daroussin <b...@freebsd.org> wrote:
>
> On Sun, Nov 06, 2016 at 09:57:00PM +0100, Stefan Bethke wrote:
>>
>>> On 06.11.2016 at 12:07, Baptiste Daroussin <b...@freebsd.org> wrote:
>>>
>>> On Sat, Nov 05, 2016 at 08:23:25PM -0500, Greg Rivers wrote:
>>>> I happened to run an old script today that uses sed(1) to extract the
>>>> system boot time from the kern.boottime sysctl MIB. On 11.0 this no
>>>> longer works as expected:
>>>>
>>>> $ sysctl kern.boottime
>>>> kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov 5 16:18:34 2016
>>>> $ sysctl kern.boottime | sed -e 's/.*\([A-Z].*\)$/\1/'
>>>> v 5 16:18:34 2016
>>>>
>>>> sed passes over 'S' and 'N' until it hits 'v', which it considers uppercase
>>>> apparently. This is with LANG=en_US.UTF-8. If I set LANG=C, it works as
>>>> expected:
>>>>
>>>> $ sysctl kern.boottime | LANG=C sed -e 's/.*\([A-Z].*\)$/\1/'
>>>> Nov 5 16:18:34 2016
>>>>
>>>> Testing every lowercase character separately gives even more inconsistent
>>>> results:
>>>>
>>>> $ cat <<! | LANG=en_US.UTF-8 sed -n -e '/^[A-Z]$/p'
>>>> [...]
>>>> Here sed thinks every lowercase character except for 'a' is uppercase! This
>>>> differs from the first test where sed did not think 'o' is uppercase.
>>>> Again, the above behaves as expected with LANG=C.
>>>>
>>>> Does anyone have any insight into this? This is likely to break a lot of
>>>> existing code.
>>>>
>>>
>>> Yes, A-Z only means uppercase in an ASCII-only world. In a Unicode world it
>>> means AaBb... Z, because there are way more characters than simple A-Z. In
>>> FreeBSD 11 we have a Unicode collation instead of falling back on
>>> LC_COLLATE=C, which means ASCII only.
>>>
>>> For regexps, for example, one should use the classes :upper: or :lower:.
>>
>> That is rather surprising. Is there a normative reference for the treatment
>> of bracket expressions and character classes when using locales other than C
>> and/or encodings like UTF-8?
>
> http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html
>
> For example:
>
> "Regular expressions are a context-independent syntax that can represent a
> wide variety of character sets and character set orderings, where these
> character sets are interpreted according to the current locale. While many
> regular expressions can be interpreted differently depending on the current
> locale, many features, such as character class expressions, provide for
> contextual invariance across locales."

Sorry, maybe I wasn’t clear enough with my question. When a character class fits the problem, it is clearly advantageous.

But under what circumstances would [A-Z] mean anything other than a character whose Unicode codepoint is between U+0041 and U+005A, inclusive? Especially given the locale in the example is en_US.UTF-8.

Or, put another way, why would an implementation interpret [A-Z] as anything other than [ABCDE…XYZ]?

From reading your reference, I can see in 9.3.5.7:
> In the POSIX locale, a range expression represents the set of collating
> elements that fall between two elements in the collation sequence, inclusive.
> In other locales, a range expression has unspecified behavior[…]

So even if the observed behaviour is conforming, I’d think it’s still highly undesirable.


Stefan

-- 
Stefan Bethke <s...@lassitu.de>   Fon +49 151 14070811

_______________________________________________
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
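
For scripts like the one that started this thread, two locale-independent variants can be sketched as follows. This is only an illustration, not a tested fix for any particular script: the sample line is copied from the sysctl output quoted above, and the variable names (line, out_c, out_class) are made up for the example. The first variant pins the locale to C for the sed call, so that [A-Z] is evaluated in the POSIX collation sequence; the second uses the [[:upper:]] character class, which does not depend on collation order.

```shell
# Sample input, copied from the kern.boottime output quoted in the thread.
line='kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov 5 16:18:34 2016'

# Variant 1: force the POSIX locale for sed only, so [A-Z] is the ASCII range.
# LC_ALL overrides both LANG and LC_COLLATE.
out_c=$(printf '%s\n' "$line" | LC_ALL=C sed -e 's/.*\([A-Z].*\)$/\1/')

# Variant 2: keep the current locale, but match on the character class
# instead of a range expression.
out_class=$(printf '%s\n' "$line" | sed -e 's/.*\([[:upper:]].*\)$/\1/')

printf '%s\n' "$out_c" "$out_class"
```

Both variants should print "Nov 5 16:18:34 2016" for this input, matching the LANG=C result reported above. Variant 1 changes how sed interprets every byte of the input, which matters for non-ASCII data; variant 2 leaves the locale alone and only avoids the range expression.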