Paul Eggert <egg...@cs.ucla.edu> [2024-07-22 12:00:21 -0700]:
> On 2024-07-22 11:25, Glenn Golden wrote:
> >      str=$(printf "begin\xe2\x80\x99end")
> >
> >      #
> >      # grep 3.11 using PCRE '[\x80-\xFF]' doesn't find any of them,
> >      # and exits with 1, indicating no match.
> >      #
> >      printf "Using grep 3.11:\n"
> >      printf "${str}\n" | grep --color=auto -P -e '[\x80-\xFF]'
>
> This asks 'grep' to output all lines containing characters in the range \x80
> through \xFF. In a single-byte locale this matches any line containing a
> byte in that range (i.e., any byte with the top bit set), and 'grep' will
> output the line and exit with status zero.
>
> However, in a UTF-8 locale this will match any line containing the
> characters U+0080 (a nameless control character) through U+00FF (LATIN SMALL
> LETTER Y WITH DIAERESIS, or "ÿ"). Because the bytes E2, 80, 99 in 'str'
> represent U+2019 RIGHT SINGLE QUOTATION MARK, there is no match so grep
> doesn't output anything and exits with status 1.
>
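
For reference, the locale dependence described in the quoted text can be reproduced with a small sketch. It assumes GNU grep built with PCRE support, and it assumes that a locale named C.UTF-8 is installed (substitute any UTF-8 locale available on the system; if the named locale is missing, grep falls back to the C locale and the second count will not show the difference). The variable names are illustrative only, and octal escapes are used so that the printf call does not depend on \x support.

```shell
# U+2019 RIGHT SINGLE QUOTATION MARK as the UTF-8 bytes E2 80 99,
# written in octal so any POSIX printf accepts it.
str=$(printf 'begin\342\200\231end')

# Single-byte (C) locale: [\x80-\xFF] is a range of bytes, so the line
# should be counted as a match.
c_count=$(printf '%s\n' "$str" | LC_ALL=C grep -c -P '[\x80-\xFF]' || true)

# UTF-8 locale (ASSUMPTION: C.UTF-8 exists): the same range now means the
# code points U+0080..U+00FF, which do not include U+2019.
utf8_count=$(printf '%s\n' "$str" | LC_ALL=C.UTF-8 grep -c -P '[\x80-\xFF]' || true)

# "Anything outside ASCII" written as a negated class: in byte mode it
# matches the byte E2, in UTF-8 mode it matches the character U+2019.
nonascii_c=$(printf '%s\n' "$str" | LC_ALL=C grep -c -P '[^\x00-\x7F]' || true)
nonascii_utf8=$(printf '%s\n' "$str" | LC_ALL=C.UTF-8 grep -c -P '[^\x00-\x7F]' || true)

printf 'C locale,     [\\x80-\\xFF]:  %s matching line(s)\n' "$c_count"
printf 'UTF-8 locale, [\\x80-\\xFF]:  %s matching line(s)\n' "$utf8_count"
printf 'C locale,     [^\\x00-\\x7F]: %s matching line(s)\n' "$nonascii_c"
printf 'UTF-8 locale, [^\\x00-\\x7F]: %s matching line(s)\n' "$nonascii_utf8"
```

The negated class is one way to search for non-ASCII content without depending on which of the two interpretations is in effect. The `|| true` only keeps a no-match exit status from aborting a script run under `set -e`.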
Ahhhhhhhhhhh... ok, got it, thanks for the explanation. I had not realized
that even literal octet-like specifications (e.g. \xNN) get 'promoted' (so
to speak) to the underlying code points when interpreted in UTF-8 locales.

>
> If pcregrep finds a match in a UTF-8 locale then that would appear to be a
> bug in pcregrep; you might report it to the pcregrep maintainer.
>

In looking just now at the 'pcre' package (which contains pcregrep) it seems
that it is now listed as 'deprecated' in the Arch package list, so probably
not worth reporting.

In any case, thanks for the explanation, and sorry for the noise.

- Glenn