tags 33873 notabug
close 33873
stop

On Sat, Dec 22, 2018 at 1:34 PM Reinis Danne <rei4...@gmail.com> wrote:
> grep-3.3 and sed-4.6 seem to have fixed issue with incorrect collation
> of yY for lv_LV.UTF-8 locale (by implementing rational range
> interpretation?) [1].
>
> [1] https://sourceware.org/bugzilla/show_bug.cgi?id=23774
>
> However, it seems that for ranges [a-ž] and [A-Ž] there are unexpected
> results:
> $ echo
> aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ
> | LC_COLLATE=lv_LV.UTF-8 grep -Eo '[A-Ž]*'
> aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZ
> Ž
> $ echo
> aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ
> | LC_COLLATE=lv_LV.UTF-8 grep -Eo '[a-ž]*'
> a
> āĀb
> c
> čČd
...
>
> For the uppercase the result is completely bogus, but for the lowercase range
> it seems that accented uppercase letters are interleaved with the
> lowercase ones.
>
> I would expect all letters to have their uppercase variants de-interleaved
> here.
>
> I don't know if grep alters the collation rules or it is done by glibc (2.28).
> strxfrm() gives me this result:
> Using LC_COLLATE=lv_LV.UTF-8
> char strxfrm
> i c2b7010201020101e29b96
> I c2b7010201070101e2afb7
...

Thanks for the report. However, ... Using a multi-byte character as a
range endpoint elicits what the standards documents call "unspecified
behavior". Quoting grep's own manual,

> Within a bracket expression, a "range expression" consists of two characters
> separated by a hyphen. It matches any single character that sorts between
> the two characters, inclusive. In the default C locale, the sorting sequence
> is the native character order; for example, '[a-d]' is equivalent to
> '[abcd]'. In other locales, the sorting sequence is not specified, and
> '[a-d]' might be equivalent to '[abcd]' or to '[aBbCcDd]', or it might fail
> to match any character, or the set of characters that it matches might even
> be erratic. To obtain the traditional interpretation of bracket expressions,
> you can use the 'C' locale by setting the 'LC_ALL' environment variable to
> the value 'C'.

For the record, POSIX says this:
http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html:

> Range expressions are, historically, an integral part of REs. However, the
> requirements of "natural language behavior" and portability do conflict. In
> the POSIX locale, ranges must be treated according to the collating sequence
> and include such characters that fall within the range based on that
> collating sequence, regardless of character values. In other locales, ranges
> have unspecified behavior.

I am marking the auto-created issue as "not-a-bug", and can't even
(reasonably) label it as "wishlist", because allowing what your usage
implies is fundamentally contradictory.

You're welcome to continue the discussion here.
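
As a small illustration of the manual excerpt quoted above, here is a minimal sketch of the one case it does pin down: in the C locale, '[a-d]' and '[abcd]' select the same characters. The sample string and the variable names range_out and list_out are made up for this example; only the ASCII case is shown, since that is the only one the manual specifies.

```shell
# Sketch only: compare a range with an explicit list in the C locale.
# The input string and variable names are illustrative assumptions.
sample='aBbCcDdEe'

# LC_ALL=C selects the native character order, so the range covers a..d only.
range_out=$(printf '%s\n' "$sample" | LC_ALL=C grep -o '[a-d]' | tr -d '\n')

# An explicit list names each character and does not rely on sort order.
list_out=$(printf '%s\n' "$sample" | LC_ALL=C grep -o '[abcd]' | tr -d '\n')

printf 'range: %s\nlist:  %s\n' "$range_out" "$list_out"
```

Both commands should select the same four lowercase letters. For non-ASCII letters such as Ž, listing every wanted character explicitly in a UTF-8 locale should likewise avoid any dependence on collation order, though that case is not exercised in this sketch.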