This one really surprised me. Learning that multibyte \s and \S had been broken since grep-2.6 did not make my day. But fixing it helped.
Here's how it started:

To demonstrate the (first) bug, set up to use a UTF-8 locale:

    export LC_ALL=en_US.UTF-8

then run this and note that it matches:

    $ printf '\x82\n' > in; ./grep -q '\S' in && echo match
    match

Now, require a back-reference (forcing a switch from grep's DFA matcher
to the regex functions), and you see there is no match:

    $ printf '\x82\x82\n' > in; ./grep -qE '(\S)\1' in && echo match
    $

One fix would be to make dfaexec's \S-processing fail to match an
invalid multibyte sequence, just as its "."-processing does.

That led me to this realization:

Uh oh.  This is worse: \s is not multi-byte aware.
The two-byte "NO-BREAK SPACE" character is not matched by \s.
This fails:

    $ printf 'a\xc2\xa0b\n'|./grep 'a\sb'
    $

Yet this matches, even though grep.texi says \s is equivalent
to [[:space:]]:

    $ printf 'a\xc2\xa0b\n'|./grep 'a[[:space:]]b'
    a b

GNU grep fails on this one (but if I do s/\\s/[[:space:]]/ to the RE,
then it does match):

    $ printf 'a\xc2\xa0ba\xc2\xa0b\n'|./grep -E '(a\sb)\1'
    grep:
    $

Patch attached:
0003-dfa-fix-s-and-S-to-work-for-multibyte.patch
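
For anyone without a grep build handy, the byte-versus-character distinction behind both symptoms can be sketched in Python. This is only an illustration using Python's re module, not how dfa.c or the attached patch works, and the helper names are made up for the example. It checks that a lone 0x82 byte is not valid UTF-8, and that \s applied byte by byte misses the two-byte NO-BREAK SPACE while \s applied to decoded characters matches it.

```python
import re

# Byte sequences taken from the commands above.
LONE_CONTINUATION = b"\x82"  # a UTF-8 continuation byte with no lead byte
NBSP_UTF8 = b"\xc2\xa0"      # UTF-8 encoding of U+00A0 NO-BREAK SPACE


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def bytewise_s_matches(line: bytes) -> bool:
    """Byte-oriented matching: a bytes pattern's \\s knows only ASCII whitespace."""
    return re.search(rb"a\sb", line) is not None


def charwise_s_matches(line: bytes) -> bool:
    """Character-oriented matching: decode first, so \\s sees U+00A0 as one character."""
    return re.search(r"a\sb", line.decode("utf-8")) is not None


if __name__ == "__main__":
    line = b"a" + NBSP_UTF8 + b"b\n"
    print("lone 0x82 valid UTF-8:", is_valid_utf8(LONE_CONTINUATION))
    print("0xC2 0xA0 valid UTF-8:", is_valid_utf8(NBSP_UTF8))
    print("byte-wise \\s matches:", bytewise_s_matches(line))
    print("char-wise \\s matches:", charwise_s_matches(line))
```

The byte-wise case mirrors the failing 'a\sb' run, and the character-wise case mirrors what the [[:space:]] form already does.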