bug#56350: UTF-8 LC_CTYPE bug esp when a certain range of Korean characters

KIM Taeyeob via Bug reports for GNU grep Sat, 02 Jul 2022 02:30:18 -0700

Grep (and also Sed) cannot match a certain range of Korean characters when it operates under LC_CTYPE=C.UTF-8 (or any other locale with UTF-8 encoding, including en_US.UTF-8, ko_KR.UTF-8, or ja_JP.UTF-8).


Reproduce the bug:
$ export LC_CTYPE=C.UTF-8
$ echo 폿 | grep .
폿          <-- a character in the range [가-폿] (<UAC00>~<UD3FF>) is matched without any issue
$ echo 퐀 | grep .
$           <-- but a character in the range [퐀-힣] (<UD400>~<UD7A3>) CANNOT be matched, although it IS SUPPOSED TO be matched
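
To help narrow down where the two ranges diverge, it may be useful to look at the raw UTF-8 bytes of the two boundary characters. The sketch below is not part of the original report; the helper name utf8_hex is made up for illustration, and it assumes the script itself is saved as UTF-8. It only shows how the encodings differ at the boundary and does not by itself establish the cause of the failure.

```shell
# Print the UTF-8 bytes of a string as lowercase hex.
# This works on raw bytes, so the result does not depend on the locale.
utf8_hex() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

last_ok=$(utf8_hex '폿')     # U+D3FF, last character that matches
first_bad=$(utf8_hex '퐀')   # U+D400, first character that fails to match

echo "U+D3FF: $last_ok"      # prints "U+D3FF: ed8fbf"
echo "U+D400: $first_bad"    # prints "U+D400: ed9080"
```

Both characters encode to three bytes starting with 0xED; the second byte changes from 0x8F to 0x90 exactly at the boundary between the matching and non-matching ranges.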


Sed has the same issue with the period regex too.

Example with Sed:
$ export LC_CTYPE=C.UTF-8
$ echo "폿" | sed -e 's/./a/'
a                             <-- matched and replaced without an issue
$ echo "퐀" | sed -e 's/./a/'
퐀                            <-- FAILED to match so it doesn't replace

I think it is related to <regex.h> or <iconv.h> in Glibc, but I couldn't find a way to reproduce the bug with those, so I am reporting it against Grep instead.
