Re: Uppercase RE matching problems in FreeBSD 11

Baptiste Daroussin Sun, 06 Nov 2016 03:08:36 -0800

On Sat, Nov 05, 2016 at 08:23:25PM -0500, Greg Rivers wrote:
> I happened to run an old script today that uses sed(1) to extract the system
> boot time from the kern.boottime sysctl MIB. On 11.0 this no longer works as
> expected:
> 
> $ sysctl kern.boottime
> kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov  5 16:18:34 2016
> $ sysctl kern.boottime | sed -e 's/.*\([A-Z].*\)$/\1/'
> v  5 16:18:34 2016
> 
> sed passes over 'S' and 'N' until it hits 'v', which it considers uppercase
> apparently. This is with LANG=en_US.UTF-8. If I set LANG=C, it works as
> expected:
> 
> $ sysctl kern.boottime | LANG=C sed -e 's/.*\([A-Z].*\)$/\1/'
> Nov  5 16:18:34 2016
> 
> Testing every lowercase character separately gives even more inconsistent
> results:
> 
> $ cat <<! | LANG=en_US.UTF-8 sed -n -e '/^[A-Z]$/'p
> > a
> > b
> > c
> > d
> > e
> > f
> > g
> > h
> > i
> > j
> > k
> > l
> > m
> > n
> > o
> > p
> > q
> > r
> > s
> > t
> > u
> > v
> > w
> > x
> > y
> > z
> > !
> b
> c
> d
> e
> f
> g
> h
> i
> j
> k
> l
> m
> n
> o
> p
> q
> r
> s
> t
> u
> v
> w
> x
> y
> z
> 
> Here sed thinks every lowercase character except for 'a' is uppercase! This
> differs from the first test where sed did not think 'o' is uppercase. Again,
> the above behaves as expected with LANG=C.
> 
> Does anyone have any insight into this? This is likely to break a lot of
> existing code.
>


Yes, A-Z only means uppercase in an ASCII-only world. In a Unicode world it
means AaBb...Z, because there are many more characters than simply A-Z. In
FreeBSD 11 we have a Unicode collation instead of falling back on
LC_COLLATE=C, which means ASCII only.

For regexps, for example, one should use the character classes [:upper:] or
[:lower:] instead of ranges.
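
As a rough sketch of what that could look like for the script above (the
sample line is copied from the quoted sysctl output; the variable names are
only illustrative), the range can be swapped for a class inside a bracket
expression, or the C collation can be forced for the single command:

```sh
# Sample line copied from the kern.boottime output quoted above.
line='kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov  5 16:18:34 2016'

# [[:upper:]] matches by character class rather than by collation range,
# so the result should not depend on LC_COLLATE.
boot=$(printf '%s\n' "$line" | sed -e 's/.*\([[:upper:]].*\)$/\1/')
printf '%s\n' "$boot"

# Alternative: keep [A-Z] but force the C locale for this one command.
boot_c=$(printf '%s\n' "$line" | LC_ALL=C sed -e 's/.*\([A-Z].*\)$/\1/')
printf '%s\n' "$boot_c"
```

Both commands should print the same "Nov  5 16:18:34 2016" that the LANG=C
run produced in the original report.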

Best regards,
Bapt

