bug#15440: [PATCH] dfa: fix \s and \S to work for multibyte

Jim Meyering Sun, 22 Sep 2013 22:18:39 -0700

This one really surprised me.
Learning that multibyte \s and \S had been broken since grep-2.6 did
not make my day. But fixing it helped.


Here's how it started:

To demonstrate the (first) bug, set up a UTF-8 locale:

    export LC_ALL=en_US.UTF-8

then run this and note that it matches:

    $ printf '\x82\n' > in; ./grep -q '\S' in && echo match
    match

Now, require a back-reference (forcing a switch from grep's DFA matcher
to the regex functions), and you see there is no match:

    $ printf '\x82\x82\n' > in; ./grep -qE '(\S)\1' in && echo match
    $
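
For context, here is a quick way to see why 0x82 on its own is not a
character in UTF-8: it is a continuation byte with no lead byte before it.
The Python sketch below is not part of grep or of the patch. It only uses
Python's strict UTF-8 decoder as an independent check, and the helper name
is_valid_utf8 is made up for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


# The inputs from the two commands above, without the trailing newline.
print(is_valid_utf8(b"\x82"))      # lone continuation byte
print(is_valid_utf8(b"\x82\x82"))  # two continuation bytes in a row
# For comparison: a lead byte followed by one continuation byte.
print(is_valid_utf8(b"\xc2\xa0"))
```

The first two inputs are rejected and the third is accepted, which is the
distinction a multibyte-aware matcher has to make before it can decide
whether a "character" is white space or not.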

One fix would be to make it so dfaexec's \S-processing fails to match an
invalid multibyte sequence, just as its "."-processing does.
That led me to this realization:

Uh oh.  This is worse: \s is not multi-byte aware.
The two-byte "NO-BREAK SPACE" character is not matched by \s.

This fails:
    $ printf 'a\xc2\xa0b\n'|./grep 'a\sb'
    $

Yet this matches, even though grep.texi says \s is
     equivalent to [[:space:]] :
    $ printf 'a\xc2\xa0b\n'|./grep 'a[[:space:]]b'
    a b
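
To make the byte-versus-character distinction concrete, here is a small
Python sketch. It is only an analogy: Python's re module is neither grep's
dfa.c nor the libc regex functions, and whether a given libc counts U+00A0
as [[:space:]] depends on its locale data. What it does show is that the two
bytes C2 A0 encode a single character, and that a matcher working one byte
at a time cannot see that character as white space.

```python
import re
import unicodedata

line = b"a\xc2\xa0b"         # same bytes as the printf argument, minus the newline
text = line.decode("utf-8")  # 'a', U+00A0, 'b'

print(len(line), len(text))       # 4 bytes, 3 characters
print(unicodedata.name(text[1]))  # NO-BREAK SPACE

# Byte-oriented matching: \s in a bytes pattern covers only ASCII white space,
# so neither 0xC2 nor 0xA0 can satisfy it.
print(re.search(rb"a\sb", line) is not None)

# Character-oriented matching: \s in a str pattern covers Unicode white space,
# which in Python includes U+00A0.
print(re.search(r"a\sb", text) is not None)
```

The byte-oriented search finds nothing and the character-oriented one
matches, which mirrors the \s versus [[:space:]] difference shown above.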

GNU grep fails:
(but if I do s/\\s/[[:space:]]/ to the RE, then it does match)
    $ printf 'a\xc2\xa0ba\xc2\xa0b\n'|./grep -E '(a\sb)\1'
    $

Patch attached:

0003-dfa-fix-s-and-S-to-work-for-multibyte.patch
