Hi!

grep-3.3 and sed-4.6 seem to have fixed the issue with incorrect collation of yY for the lv_LV.UTF-8 locale (by implementing rational range interpretation?) [1].
[1] https://sourceware.org/bugzilla/show_bug.cgi?id=23774

However, it seems that the ranges [a-ž] and [A-Ž] give unexpected results:

$ echo aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ | LC_COLLATE=lv_LV.UTF-8 grep -Eo '[A-Ž]*'
aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZ
Ž

$ echo aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ | LC_COLLATE=lv_LV.UTF-8 grep -Eo '[a-ž]*'
a
āĀb
c
čČd
e
ēĒf
g
ģĢh
i
īĪy
j
k
ķĶl
ļĻm
n
ņŅo
ōŌp
q
r
ŗŖs
šŠt
u
ūŪv
w
x
z
žŽ

For the uppercase range the result is completely bogus, but for the lowercase range it seems that accented uppercase letters are interleaved with the lowercase ones. I would expect all letters to have their uppercase variants de-interleaved here.

I don't know whether grep alters the collation rules or whether this is done by glibc (2.28). strxfrm() gives me this result:

Using LC_COLLATE=lv_LV.UTF-8
char  strxfrm
i     c2b7010201020101e29b96
I     c2b7010201070101e2afb7
ī     c2b70102140102020101e29bb7
Ī     c2b70102140107020101e2b096
y     c2b701030102
Y     c2b701030107
j     c382010201020101e29c96
J     c382010201070101e2b0a4

Using LC_COLLATE=C.UTF-8
char  strxfrm
i     6b
I     4b
ī     c4ad
Ī     c4ac
y     7b
Y     5b
j     6c
J     4c

Reinis
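
P.S. A minimal sketch that might help narrow this down: instead of reading the runs printed by grep -o, it tests each character on its own against a bracket range. The function name range_members is hypothetical, and it sets LC_ALL rather than only LC_COLLATE (an assumption, so that LC_CTYPE and LC_COLLATE agree); the named locale must be installed for the result to be meaningful.

```shell
# range_members LOCALE RANGE CHAR...
# Print, one per line, every CHAR that matches the bracket
# expression [RANGE] by itself under LOCALE.
# Assumption: LC_ALL is set (not just LC_COLLATE as in the report),
# so that character decoding and collation use the same locale.
range_members() {
    loc=$1
    range=$2
    shift 2
    for ch in "$@"; do
        # -x: the whole line must match, -q: report via exit status only
        if printf '%s\n' "$ch" | LC_ALL=$loc grep -Eqx "[$range]"; then
            printf '%s\n' "$ch"
        fi
    done
}

# Example invocation (needs the lv_LV.UTF-8 locale to be available):
#   range_members lv_LV.UTF-8 'A-Ž' a A ā Ā y Y z Z ž Ž
```

Comparing the output for lv_LV.UTF-8 against C.UTF-8 should show which individual letters each range includes, independent of how grep -o splits the input into runs.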