Come to think of it, grep -P misbehaves badly in multibyte locales that
are not UTF-8. It should report an error and exit rather than output
gibberish. I installed the attached patch to catch that.
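
The check the patch adds amounts to a three-way classification of the current locale. Below is a minimal standalone sketch of that idea, not the code in grep itself. The names `locale_kind`, `classify_locale` and `pcre_supports` are hypothetical, and the UTF-8 test via `nl_langinfo (CODESET)` is an assumption standing in for grep's own `using_utf8 ()`, which may be implemented differently.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names, for illustration only.  */
enum locale_kind
{
  LOCALE_UNIBYTE,          /* MB_CUR_MAX == 1, e.g. "C" or ISO-8859-1 */
  LOCALE_UTF8,             /* multibyte, encoding is UTF-8 */
  LOCALE_OTHER_MULTIBYTE   /* multibyte but not UTF-8, e.g. EUC-JP */
};

/* Classify the locale currently in effect.  The caller is expected
   to have called setlocale beforehand.  Assumption: the codeset name
   reported by nl_langinfo is exactly "UTF-8" in UTF-8 locales.  */
static enum locale_kind
classify_locale (void)
{
  if (strcmp (nl_langinfo (CODESET), "UTF-8") == 0)
    return LOCALE_UTF8;
  return MB_CUR_MAX == 1 ? LOCALE_UNIBYTE : LOCALE_OTHER_MULTIBYTE;
}

/* Nonzero if a -P style matcher could be used: libpcre handles only
   unibyte and UTF-8 input, so every other multibyte locale is
   rejected.  */
static int
pcre_supports (enum locale_kind kind)
{
  return kind != LOCALE_OTHER_MULTIBYTE;
}
```

A caller would run `setlocale (LC_ALL, "")` once at startup, then report an error and exit when `pcre_supports (classify_locale ())` is zero, which mirrors the `else if (MB_CUR_MAX != 1)` branch in the patch.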
From cac91e3e233b769d60d7b5d6bc0e8afc67c0c713 Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Fri, 12 Sep 2014 19:06:27 -0700
Subject: [PATCH] grep: diagnose -P in non-UTF-8 multibyte locale
* src/pcresearch.c (Pcompile):
libpcre supports only unibyte and UTF-8 locales,
so report an error and exit if used in other locales.
* NEWS: Mention this.
* tests/euc-mb: Test this.
---
NEWS | 3 +++
src/pcresearch.c | 8 ++++++--
tests/euc-mb | 4 ++++
3 files changed, 13 insertions(+), 2 deletions(-)
diff --git a/NEWS b/NEWS
index 3624b76..36bb48f 100644
--- a/NEWS
+++ b/NEWS
@@ -19,6 +19,9 @@ GNU grep NEWS -*- outline -*-
The GREP_OPTIONS environment variable is now obsolescent, and grep
now warns if it is used. Please use an alias or script instead.
+ In locales with multibyte character encodings other than UTF-8,
+ grep -P now reports an error and exits instead of misbehaving.
+
* Noteworthy changes in release 2.20 (2014-06-03) [stable]
** Bug fixes
diff --git a/src/pcresearch.c b/src/pcresearch.c
index 17e0e32..3475d4a 100644
--- a/src/pcresearch.c
+++ b/src/pcresearch.c
@@ -52,13 +52,17 @@ Pcompile (char const *pattern, size_t size)
char const *ep;
char *re = xnmalloc (4, size + 7);
int flags = (PCRE_MULTILINE
- | (match_icase ? PCRE_CASELESS : 0)
- | (using_utf8 () ? PCRE_UTF8 : 0));
+ | (match_icase ? PCRE_CASELESS : 0));
char const *patlim = pattern + size;
char *n = re;
char const *p;
char const *pnul;
+ if (using_utf8 ())
+ flags |= PCRE_UTF8;
+ else if (MB_CUR_MAX != 1)
+ error (EXIT_TROUBLE, 0, _("-P supports only unibyte and UTF-8 locales"));
+
/* FIXME: Remove these restrictions. */
if (memchr (pattern, '\n', size))
error (EXIT_TROUBLE, 0, _("the -P option only supports a single pattern"));
diff --git a/tests/euc-mb b/tests/euc-mb
index aa254ca..6a9a845 100755
--- a/tests/euc-mb
+++ b/tests/euc-mb
@@ -40,4 +40,8 @@ make_input BABAAB > exp || framework_failure_
compare exp out || fail=1
make_input BABABA |euc_grep AB; test $? = 1 || fail=1
+# -P supports only unibyte and UTF-8 locales.
+LC_ALL=$locale grep -P x /dev/null
+test $? = 2 || fail=1
+
Exit $fail
--
1.9.3