bug#72246: Possible PCRE bug in grep 3.11

Paul Eggert Mon, 22 Jul 2024 12:01:17 -0700

On 2024-07-22 11:25, Glenn Golden wrote:

str=$(printf "begin\xe2\x80\x99end")


#
# grep 3.11 using PCRE '[\x80-\xFF]' doesn't find any of them,
# and exits with 1, indicating no match.
#
printf "Using grep 3.11:\n"
printf "${str}\n" | grep --color=auto -P -e '[\x80-\xFF]'

This asks 'grep' to output all lines containing characters in the range \x80 through \xFF. In a single-byte locale this matches any line containing a byte in that range (i.e., any byte with the top bit set), and 'grep' will output the line and exit with status zero.

However, in a UTF-8 locale this will match any line containing the characters U+0080 (a nameless control character) through U+00FF (LATIN SMALL LETTER Y WITH DIAERESIS, or "ÿ"). Because the bytes E2, 80, 99 in 'str' represent U+2019 RIGHT SINGLE QUOTATION MARK, there is no match so grep doesn't output anything and exits with status 1.
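
As a rough check of the decoding described above, the following sketch converts the three bytes to UTF-16BE and dumps them in hex. It assumes that 'iconv' and 'od' are installed; the variable names 'bytes' and 'codepoint' are only illustrative and do not come from the original report. Octal escapes are used because POSIX 'printf' is not required to understand \x.

```shell
# Bytes E2 80 99, written as octal escapes for portability.
bytes=$(printf '\342\200\231')

# Decode the bytes as UTF-8, re-encode as big-endian UTF-16,
# and print the resulting code unit as hex digits without spaces.
codepoint=$(printf '%s' "$bytes" | iconv -f UTF-8 -t UTF-16BE | od -An -tx1 | tr -d ' \n')

echo "U+$codepoint"   # prints U+2019
```

The result lies well above U+00FF, which is why the bracket expression cannot match it when the pattern is interpreted as characters rather than bytes.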


In short, to get the behavior you want, set LC_ALL="C" in the environment.
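
A minimal sketch of that suggestion, assuming a 'grep' built with PCRE support; the variable 'c_status' is only illustrative. LC_ALL is set for the 'grep' invocation alone, so the rest of the shell session keeps its own locale.

```shell
# Same test string as in the report, with octal escapes for portability.
str=$(printf 'begin\342\200\231end')

# In the C locale every byte is a character, so the bracket
# expression is tested against the raw bytes E2, 80 and 99.
if printf '%s\n' "$str" | LC_ALL=C grep -q -P -e '[\x80-\xFF]'; then
  c_status=0
else
  c_status=$?
fi

echo "exit status in the C locale: $c_status"
```

An exit status of zero here means that 'grep' found a byte with the top bit set.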

If pcregrep finds a match in a UTF-8 locale then that would appear to be a bug in pcregrep; you might report it to the pcregrep maintainer.
