Re: Uppercase RE matching problems in FreeBSD 11

Baptiste Daroussin Sun, 06 Nov 2016 13:06:58 -0800

On Sun, Nov 06, 2016 at 09:57:00PM +0100, Stefan Bethke wrote:
> 
> > On 06.11.2016 at 12:07, Baptiste Daroussin <b...@freebsd.org> wrote:
> > 
> > On Sat, Nov 05, 2016 at 08:23:25PM -0500, Greg Rivers wrote:
> >> I happened to run an old script today that uses sed(1) to extract the system
> >> boot time from the kern.boottime sysctl MIB. On 11.0 this no longer works as
> >> expected:
> >> 
> >> $ sysctl kern.boottime
> >> kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov  5 16:18:34 2016
> >> $ sysctl kern.boottime | sed -e 's/.*\([A-Z].*\)$/\1/'
> >> v  5 16:18:34 2016
> >> 
> >> sed passes over 'S' and 'N' until it hits 'v', which it considers uppercase
> >> apparently. This is with LANG=en_US.UTF-8. If I set LANG=C, it works as
> >> expected:
> >> 
> >> $ sysctl kern.boottime | LANG=C sed -e 's/.*\([A-Z].*\)$/\1/'
> >> Nov  5 16:18:34 2016
> >> 
> >> Testing every lowercase character separately gives even more inconsistent
> >> results:
> >> 
> >> $ cat <<! | LANG=en_US.UTF-8 sed -n -e '/^[A-Z]$/p'
> 
> >> Here sed thinks every lowercase character except for 'a' is uppercase! This
> >> differs from the first test where sed did not think 'o' is uppercase. Again,
> >> the above behaves as expected with LANG=C.
> >> 
> >> Does anyone have any insight into this? This is likely to break a lot of
> >> existing code.
> >> 
> > 
> > Yes, A-Z only means uppercase in an ASCII-only world. In a Unicode world it
> > means AaBb...Z, because there are far more characters than simple A-Z. In
> > FreeBSD 11 we have Unicode collation instead of falling back on
> > LC_COLLATE=C, which means ASCII only.
> > 
> > For regexps, for example, one should use the character classes [:upper:] or
> > [:lower:].
> 
> That is rather surprising.  Is there a normative reference for the treatment 
> of bracket expressions and character classes when using locales other than C 
> and/or encodings like UTF-8?


http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html

For example:

"Regular expressions are a context-independent syntax that can represent a wide
variety of character sets and character set orderings, where these character
sets are interpreted according to the current locale. While many regular
expressions can be interpreted differently depending on the current locale, many
features, such as character class expressions, provide for contextual invariance
across locales."
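
As a rough sketch of the character-class approach applied to the sed command
from the original report (the variable names line, result and result_c are
only illustrative, and the sample line is copied from the output quoted
above rather than read from sysctl):

```shell
# Sample line copied from the kern.boottime output quoted earlier in the thread.
line='kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov  5 16:18:34 2016'

# [[:upper:]] matches uppercase letters as classified by the locale (LC_CTYPE),
# so it does not depend on the collation order the way the range [A-Z] does.
result=$(printf '%s\n' "$line" | sed -e 's/.*\([[:upper:]].*\)$/\1/')
printf '%s\n' "$result"

# Alternative: keep [A-Z] but force the C locale for this one command.
# LC_ALL overrides LANG and LC_COLLATE.
result_c=$(printf '%s\n' "$line" | LC_ALL=C sed -e 's/.*\([A-Z].*\)$/\1/')
printf '%s\n' "$result_c"
```

Both forms should yield the date starting at "Nov". The first is probably
the better fit for scripts that have to run under arbitrary locales.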

Best regards,
Bapt
