bug#33837: Unexpected result for regex with non-ascii range

Reinis Danne Sat, 22 Dec 2018 13:35:00 -0800

Hi!

grep-3.3 and sed-4.6 seem to have fixed the issue with incorrect collation
of yY in the lv_LV.UTF-8 locale (by implementing rational range
interpretation?) [1].


[1] https://sourceware.org/bugzilla/show_bug.cgi?id=23774

However, the ranges [a-ž] and [A-Ž] give unexpected results:
$ echo aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ | LC_COLLATE=lv_LV.UTF-8 grep -Eo '[A-Ž]*'
aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZ
Ž
$ echo aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ | LC_COLLATE=lv_LV.UTF-8 grep -Eo '[a-ž]*'
a
āĀb
c
čČd
e
ēĒf
g
ģĢh
i
īĪy
j
k
ķĶl
ļĻm
n
ņŅo
ōŌp
q
r
ŗŖs
šŠt
u
ūŪv
w
x
z
žŽ

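For the uppercase range the result is completely bogus: every letter except
ž matches, lowercase included. For the lowercase range, the accented
uppercase letters seem to be interleaved with the lowercase ones, so they
match while the unaccented uppercase letters do not.

I would expect the uppercase variants of all letters to be de-interleaved
from the lowercase ones here.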
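
As a cross-check, the sketch below reproduces both outputs above using plain
code point comparison. It is only an illustration, assuming Python's re
module (which compares the endpoints of a range by Unicode code point for
str patterns); it says nothing about how grep implements ranges internally.
The helper name only_matching is made up for this example.

```python
import re
import unicodedata

# The test string from the report, normalized so every letter is one code point.
LETTERS = unicodedata.normalize(
    "NFC",
    "aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ",
)


def only_matching(pattern, text):
    """Roughly mimic `grep -o`: return every non-empty match, in order.

    Hypothetical helper for this sketch; ranges are evaluated by code point
    because that is how Python's re treats str patterns.
    """
    pattern = unicodedata.normalize("NFC", pattern)
    return [m.group(0) for m in re.finditer(pattern, text) if m.group(0)]


for bracket in ("[A-Ž]*", "[a-ž]*"):
    print(bracket)
    for piece in only_matching(bracket, LETTERS):
        print(piece)
```

Both lists agree with the grep output quoted above, which would be
consistent with the ranges being evaluated by code point (U+0041..U+017D and
U+0061..U+017E) rather than by the locale's collation order.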

I don't know whether grep alters the collation rules or whether this is done
by glibc (2.28). strxfrm() gives me this result:
Using LC_COLLATE=lv_LV.UTF-8
char    strxfrm
i    c2b7010201020101e29b96
I    c2b7010201070101e2afb7
ī    c2b70102140102020101e29bb7
Ī    c2b70102140107020101e2b096
y    c2b701030102
Y    c2b701030107
j    c382010201020101e29c96
J    c382010201070101e2b0a4
Using LC_COLLATE=C.UTF-8
char    strxfrm
i    6b
I    4b
ī    c4ad
Ī    c4ac
y    7b
Y    5b
j    6c
J    4c
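
A table like the one above can be produced with a few lines of code. The
following is a minimal sketch, not the program used for the report: it
assumes a libc that ctypes can load by name and that the named locales are
installed. The function names strxfrm_hex and dump are made up for this
example.

```python
import ctypes
import ctypes.util
import locale

# Assumption: the C library can be found by name (true on typical glibc systems).
libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6")
libc.strxfrm.restype = ctypes.c_size_t
libc.strxfrm.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_size_t]


def strxfrm_hex(text):
    """Return the strxfrm() sort key of text (UTF-8 encoded) as hex digits."""
    src = text.encode("utf-8")
    # First call with a NULL destination only reports the key length.
    needed = libc.strxfrm(None, src, 0)
    buf = ctypes.create_string_buffer(needed + 1)
    libc.strxfrm(buf, src, needed + 1)
    return buf.raw[:needed].hex()


def dump(locale_name, chars):
    """Print one 'char    strxfrm' table for the given collation locale."""
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        print(f"locale {locale_name} is not available, skipping")
        return
    print(f"Using LC_COLLATE={locale_name}")
    print("char    strxfrm")
    for ch in chars:
        print(f"{ch}    {strxfrm_hex(ch)}")


if __name__ == "__main__":
    for name in ("lv_LV.UTF-8", "C.UTF-8"):
        dump(name, "iIīĪyYjJ")
```

The exact bytes depend on the libc version and locale data, so the output
may differ from the table above on other systems.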


Reinis
