On 2024-07-22 11:25, Glenn Golden wrote:
str=$(printf "begin\xe2\x80\x99end")

#
# grep 3.11 using PCRE '[\x80-\xFF]' doesn't find any of them,
# and exits with 1, indicating no match.
#
printf "Using grep 3.11:\n"
printf "${str}\n" | grep --color=auto -P -e '[\x80-\xFF]'

This asks 'grep' to output all lines containing characters in the range \x80 through \xFF. In a single-byte locale this matches any line containing a byte in that range (i.e., any byte with the top bit set), and 'grep' will output the line and exit with status zero.

However, in a UTF-8 locale this will match any line containing the characters U+0080 (a nameless control character) through U+00FF (LATIN SMALL LETTER Y WITH DIAERESIS, or "ÿ"). Because the bytes E2, 80, 99 in 'str' represent U+2019 RIGHT SINGLE QUOTATION MARK, there is no match, so grep doesn't output anything and exits with status 1.
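To see why the character falls outside the requested range, here is a small sketch that decodes those three bytes. It assumes 'iconv' and 'od' are available, and it writes the bytes with octal escapes because not every 'printf' understands \x; the variable names are only for illustration.

```shell
# The three bytes from 'str' that follow "begin", written in octal
# (\342\200\231 is the same as \xe2\x80\x99).
bytes=$(printf '\342\200\231' | od -An -tx1 | tr -d ' \n')

# Decode them as UTF-8 and re-encode as UTF-16BE, which for this
# character is simply its code point as two bytes.
cp=$(printf '\342\200\231' | iconv -f UTF-8 -t UTF-16BE | od -An -tx1 | tr -d ' \n')

echo "bytes: $bytes"
echo "code point: U+$cp"
```

The code point is above U+00FF, so a UTF-8-aware match against [\x80-\xFF] has nothing to match even though each individual byte is in the range 80 through FF.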

In short, to get the behavior you want, set LC_ALL="C" in the environment.
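A minimal sketch of that, assuming GNU grep built with PCRE support. It sets LC_ALL only for the one 'grep' invocation rather than for the whole shell, and uses octal escapes and -q so that only the exit status is reported; 'c_status' is just an illustrative name.

```shell
# Same test string as in the report, with octal escapes for portability.
str=$(printf 'begin\342\200\231end')

# Run grep in the C locale so that [\x80-\xFF] is matched byte by byte.
if printf '%s\n' "$str" | LC_ALL=C grep -q -P -e '[\x80-\xFF]'; then
    c_status=0
else
    c_status=$?
fi

echo "LC_ALL=C exit status: $c_status"
```

Prefixing the assignment to the command keeps the rest of the script in its original locale, which matters if other commands in it rely on UTF-8 handling.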

If pcregrep finds a match in a UTF-8 locale then that would appear to be a bug in pcregrep; you might report it to the pcregrep maintainer.


