2015-11-22 21:24:05 -0800, Shivanshu Goyal:
[...]
> I think I found a bug which did not exist in version 2.14, but does seem to
> exist in versions 2.16 and 2.22. I have not tested any other versions.
>
> Say there is a file with the following contents:
>
> shivanshu@thetis:tmp$ cat temp | xxd
> 0000000: 68e2 8093 680a                           h...h.
>
> The following is the grep 2.14 command and output:
>
> shivanshu@thetis:tmp$ cat temp | grep -P '\xe2\x80\x93'
> h–h
>
> The following is the grep 2.16/2.22 command and output:
>
> shivanshu@thetis:tmp$ cat temp | grep -P '\xe2\x80\x93'
> shivanshu@thetis:tmp$
[...]

If you read the pcrepattern man page, you'll see that \xe2 doesn't
match the byte e2, but the character with code e2. In a UTF-8 locale,
\xe2 matches the character at Unicode code point U+00E2 (LATIN SMALL
LETTER A WITH CIRCUMFLEX), which is encoded in UTF-8 as the bytes c3 a2.

The byte sequence e2 80 93 is actually the single character U+2013
(EN DASH).

So, here, you either want:

    LC_ALL=C grep -P '\xe2\x80\x93'

That is, use a locale where characters are single-byte and their code
is the byte value. Or, assuming the current locale is UTF-8, use:

    grep -P '\x{2013}'

Or, regardless of the locale:

    grep -P '(*UTF8)\x{2013}'

--
Stephane
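
The byte-versus-character distinction in the reply above can be reproduced outside grep. The sketch below uses Python's `re` module as a stand-in for PCRE, which is an assumption made purely for illustration: matching against `bytes` plays the role of the C locale, and matching against a decoded `str` plays the role of a UTF-8 locale. The variable names are hypothetical.

```python
import re

# The file contents from the report: "h", e2 80 93, "h", newline.
data = b"h\xe2\x80\x93h\n"

# Byte-oriented matching (comparable to LC_ALL=C): each \xNN is one byte.
byte_match = re.search(rb"\xe2\x80\x93", data)

# Character-oriented matching (comparable to a UTF-8 locale): the input
# is decoded first, so each \xNN in the pattern is a code point.
text = data.decode("utf-8")
char_seq_match = re.search(r"\xe2\x80\x93", text)  # U+00E2 U+0080 U+0093
en_dash_match = re.search(r"\u2013", text)         # U+2013 EN DASH

print("bytes pattern on bytes:     ", byte_match is not None)
print("\\xe2\\x80\\x93 on characters: ", char_seq_match is not None)
print("\\u2013 on characters:       ", en_dash_match is not None)

# The UTF-8 encodings mentioned in the reply.
print("U+00E2 ->", "\u00e2".encode("utf-8").hex(" "))
print("U+2013 ->", "\u2013".encode("utf-8").hex(" "))
```

Under these assumptions, the three-escape pattern matches only when the input is treated as bytes, while the single code point U+2013 matches only when the input is treated as characters. That mirrors the difference between the `LC_ALL=C` command and the `\x{2013}` command.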