bug#33837: Unexpected result for regex with non-ascii range

Jim Meyering Sun, 23 Dec 2018 14:02:12 -0800

tags 33873 notabug
close 33873
stop

On Sat, Dec 22, 2018 at 1:34 PM Reinis Danne <rei4...@gmail.com> wrote:
> grep-3.3 and sed-4.6 seem to have fixed issue with incorrect collation
> of yY for lv_LV.UTF-8 locale (by implementing rational range
> interpretation?) [1].
>
> [1] https://sourceware.org/bugzilla/show_bug.cgi?id=23774
>
> However, it seems that for ranges [a-ž] and [A-Ž] there are unexpected results:
> $ echo aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ | LC_COLLATE=lv_LV.UTF-8 grep -Eo '[A-Ž]*'
> aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZ
> Ž
> $ echo aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ | LC_COLLATE=lv_LV.UTF-8 grep -Eo '[a-ž]*'
> a
> āĀb
> c
> čČd
...
>
> For the uppercase the result is completely bogus, but for the lowercase range
> it seems that accented uppercase letters are interleaved with the
> lowercase ones.
>
> I would expect all letters to have their uppercase variants de-interleaved here.
>
> I don't know if grep alters the collation rules or it is done by glibc (2.28).
> strxfrm() gives me this result:
> Using LC_COLLATE=lv_LV.UTF-8
> char    strxfrm
> i    c2b7010201020101e29b96
> I    c2b7010201070101e2afb7
...


Thanks for the report. However, using a multi-byte character as a
range endpoint elicits what the standards documents call
"unspecified behavior".

Quoting grep's own manual,

> Within a bracket expression, a "range expression" consists of two characters 
> separated by a hyphen.  It matches any single character that sorts between 
> the two characters, inclusive.  In the default C locale, the sorting sequence 
> is the native character order; for example, '[a-d]' is equivalent to 
> '[abcd]'.  In other locales, the sorting sequence is not specified, and 
> '[a-d]' might be equivalent to '[abcd]' or to '[aBbCcDd]', or it might fail 
> to match any character, or the set of characters that it matches might even 
> be erratic.  To obtain the traditional interpretation of bracket expressions, 
> you can use the 'C' locale by setting the 'LC_ALL' environment variable to 
> the value 'C'.
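
As a minimal sketch of the traditional interpretation the manual
describes, the command below forces the C locale for a single grep
invocation. The input string is an arbitrary ASCII illustration, not
taken from the report.

```shell
# Force the C locale so the range follows native character order.
# In this locale '[a-d]' is equivalent to '[abcd]', so the uppercase
# letters in the (made-up) input split it into four separate matches.
printf 'aBbCcDd\n' | LC_ALL=C grep -o '[a-d]*'
```

Each of a, b, c and d should appear on its own line, because -o prints
every non-empty match separately and the uppercase letters fall
outside the range.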

For the record, POSIX says this:
http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html:

> Range expressions are, historically, an integral part of REs. However, the 
> requirements of "natural language behavior" and portability do conflict. In 
> the POSIX locale, ranges must be treated according to the collating sequence 
> and include such characters that fall within the range based on that 
> collating sequence, regardless of character values. In other locales, ranges 
> have unspecified behavior.
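
One rough way to see the collating sequence that a range would follow
in the POSIX locale is to sort a few characters under LC_ALL=C. This
is only a sketch: the four letters are an arbitrary sample, and sort
is used here as a stand-in for inspecting collation, not as a
description of how grep evaluates ranges internally.

```shell
# In the POSIX (C) locale the collating sequence for these ASCII
# letters is their native character order, so both uppercase
# letters sort ahead of both lowercase ones.
printf '%s\n' b A a B | LC_ALL=C sort
```

The letters come out as A, B, a, b. In other locales the order may
differ, which is one reason the same range can match a different set
of characters there.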

I am marking the auto-created issue as "not-a-bug", and can't even
(reasonably) label it as "wishlist", because allowing what your usage
implies is fundamentally contradictory.

You're welcome to continue the discussion here.
