On 2024-07-22 11:25, Glenn Golden wrote:
str=$(printf "begin\xe2\x80\x99end")
#
# grep 3.11 using PCRE '[\x80-\xFF]' doesn't find any of them,
# and exits with 1, indicating no match.
#
printf "Using grep 3.11:\n"
printf "${str}\n" | grep --color=auto -P -e '[\x80-\xFF]'
This asks 'grep' to output all lines containing characters in the range
\x80 through \xFF. In a single-byte locale this matches any line
containing a byte in that range (i.e., any byte with the top bit set),
and 'grep' will output the line and exit with status zero.
However, in a UTF-8 locale this will match any line containing the
characters U+0080 (a nameless control character) through U+00FF (LATIN
SMALL LETTER Y WITH DIAERESIS, or "ÿ"). Because the bytes E2, 80, 99 in
'str' represent U+2019 RIGHT SINGLE QUOTATION MARK, there is no match so
grep doesn't output anything and exits with status 1.
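
To see why those three bytes are a single character outside the requested range, here is a small sketch that dumps the bytes of the test string and decodes the UTF-8 sequence by hand. The variable name 'codepoint' is mine, and I use octal escapes because not every printf understands \x.

```shell
# Dump the bytes of the test string in hex; E2 80 99 sits between
# "begin" and "end".
printf 'begin\342\200\231end' | od -An -tx1

# Decode the 3-byte UTF-8 sequence E2 80 99 by hand:
#   1110xxxx 10yyyyyy 10zzzzzz  ->  xxxx yyyyyy zzzzzz
codepoint=$(( ((0xE2 & 0x0F) << 12) | ((0x80 & 0x3F) << 6) | (0x99 & 0x3F) ))

# Print the result in the usual U+XXXX notation.
printf 'U+%04X\n' "$codepoint"
```

The decoded value is U+2019, which lies well above U+00FF, so a UTF-8-aware matcher has nothing in [\x80-\xFF] to match.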
In short, to get the behavior you want, set LC_ALL="C" in the environment.
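
For instance, something along these lines should do it. This is only a sketch: it assumes a grep built with PCRE support (so that -P works), the variable name 'result' is just for illustration, and LC_ALL is set for the grep command alone so the rest of the shell keeps its locale.

```shell
# Same test string as above, written with octal escapes for portability.
str=$(printf 'begin\342\200\231end')

# Run grep in the C locale so that [\x80-\xFF] is matched byte by byte.
# -q suppresses output; only the exit status is examined.
if printf '%s\n' "$str" | LC_ALL=C grep -q -P -e '[\x80-\xFF]'; then
    result="match"
else
    result="no match"
fi
printf '%s\n' "$result"
```

Drop the -q if you want the matching line itself printed, as in your original command.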
If pcregrep finds a match in a UTF-8 locale then that would appear to be
a bug in pcregrep; you might report it to the pcregrep maintainer.