bug#16867: [bug #37600] grep -w cuts words on non-ascii

2014-02-24 Thread Jim Meyering
Re the savannah bug report, http://savannah.gnu.org/bugs/?37600 [Let's continue on the mailing list -- now our preferred medium] On Mon, Feb 24, 2014 at 6:57 AM, Stephane Chazelas wrote: [...] Thanks for the report. I confirm it is still a problem with the latest, grep-2.18: [Note that there's no

bug#16865: grep -wP and backreferences

2014-02-24 Thread Stephane Chazelas
Hello, Backreferences don't work with -w or -x in combination with -P: $ echo aa | grep -Pw '(.)\1' $ Or they work in an unexpected way: $ echo aa | grep -Pw '(.)\2' aa The fix is simple: --- src/pcresearch.c~ 2014-02-24 09:59:56.864374362 + +++ src/pcresearch.c 2014-02-24 07:33:04.
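
The two results reported above are what one would see if the user's pattern were wrapped in an extra capturing group before compilation, since that shifts every user group number up by one. The following is a minimal sketch of that renumbering using Python's re module rather than PCRE. The helper wrap_word and its wrapper strings are assumptions made for illustration; they are not taken from the truncated patch.

```python
import re


def wrap_word(pattern, capturing):
    """Wrap a pattern in word-boundary assertions, as a -w style option might.

    Hypothetical helper: `capturing` selects between "(" and "(?:".
    """
    opener = "(" if capturing else "(?:"
    return r"\b" + opener + pattern + r")\b"


# Capturing wrapper: the user's (.) becomes group 2, so \2 is what matches "aa".
shifted = re.search(wrap_word(r"(.)\2", capturing=True), "aa")

# Non-capturing wrapper: the user's (.) stays group 1, so \1 works as written.
preserved = re.search(wrap_word(r"(.)\1", capturing=False), "aa")

print(bool(shifted), bool(preserved))
```

Under this assumption, a non-capturing wrapper keeps the user's group numbers intact, which is the behavior the report asks for.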

bug#16865: grep -wP and backreferences

2014-02-24 Thread Jim Meyering
On Mon, Feb 24, 2014 at 2:01 AM, Stephane Chazelas wrote: > Hello, > > Backreferences don't work with -w or -x in combination with -P: > > $ echo aa | grep -Pw '(.)\1' > $ > > Or they work in an unexpected way: > > $ echo aa | grep -Pw '(.)\2' > aa > > The fix is simple: > > > --- src/pcresearch.c

bug#16865: grep -wP and backreferences

2014-02-24 Thread Stephane Chazelas
Fine by me, thanks. BTW, as discussed in another bug, the -w/-x invalidate the (*UCP) and other PCRE special sequences. Chances are we can't easily do much about it, but it may still be worth documenting. Like, one should use grep -P '(*UCP)\bword\b' as grep -wP '(*UCP)word' won't work (pcreg

bug#16867: [bug #37600] grep -w cuts words on non-ascii

2014-02-24 Thread Stephane Chazelas
2014-02-24 08:53:17 -0800, Jim Meyering: [...] > This is pretty serious: > > $ printf 'p\xc3\xa8re\n' |LC_ALL=en_US.utf8 grep -w p > père It gets more complicated with combining characters: $ printf 'pe\314\200re\n' | grep -w pe père You can't expect \w to match U+0300 alone. You can't e
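
Both symptoms quoted above can be reproduced outside grep. The sketch below uses Python's re module as a stand-in, so it shows the general mechanism (byte-wise versus character-wise word boundaries, and combining marks not counting as word characters) rather than grep's actual code path. The variable names are illustrative only.

```python
import re
import unicodedata

composed = "p\u00e8re"      # "père" with the precomposed letter U+00E8
decomposed = "pe\u0300re"   # "e" followed by the combining grave accent U+0300

# Byte-oriented matching: only ASCII bytes count as word characters, so the
# first UTF-8 byte of "è" looks like a word boundary right after "p".
bytewise = re.search(rb"\bp\b", composed.encode("utf-8"))

# Character-oriented matching: "è" is a word character, so "p" is not a word.
charwise = re.search(r"\bp\b", composed)

# Python's \w does not cover the combining mark U+0300, so "pe" still matches.
combining = re.search(r"\bpe\b", decomposed)

# Normalizing to NFC first folds the mark into "è", and the match goes away.
normalized = re.search(r"\bpe\b", unicodedata.normalize("NFC", decomposed))

print(bool(bytewise), bool(charwise), bool(combining), bool(normalized))
```

Normalization only helps for sequences that have a precomposed form, so it is a partial workaround at best.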

bug#16865: grep -wP and backreferences

2014-02-24 Thread Jim Meyering
On Mon, Feb 24, 2014 at 1:20 PM, Stephane Chazelas wrote: > A last note: with -w, pcregrep wraps the regexp in \b...\b > instead of \b(?:...)\b, so it could be that those brackets are > not necessary in the first place. > > Sorry I lied, it was not the last note ;-). Note the difference: > > $ ech
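
The message is cut off before its example, so which difference it went on to show is not known here. One difference between \b...\b and \b(?:...)\b that holds in general is operator precedence with alternation. The sketch below shows it with Python's re module; the pattern "a|b" and the input "abc" are hypothetical.

```python
import re

user_pattern = "a|b"  # hypothetical user pattern containing an alternation

# Without a group, alternation binds loosest: this parses as (\ba) | (b\b).
bare = r"\b" + user_pattern + r"\b"

# With a non-capturing group, both boundaries apply to every alternative.
grouped = r"\b(?:" + user_pattern + r")\b"

text = "abc"
bare_match = re.search(bare, text)        # finds "a" at the start of "abc"
grouped_match = re.search(grouped, text)  # no whole word "a" or "b" in "abc"

print(bare_match is not None, grouped_match is not None)
```

In this sketch the brackets are what make the word-boundary wrapper safe for arbitrary user patterns.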

bug#16871: problems about matching newline (with -z)

2014-02-24 Thread Stephane Chazelas
The doc has a confusing statement: > 15. How can I match across lines? > >Standard grep cannot do this, as it is fundamentally line-based. >Therefore, merely using the '[:space:]' character class does not >match newlines in the way you might expect. However, if your grep >is compi