Come to think of it, grep -P misbehaves badly in multibyte locales that are not UTF-8. It should report an error and exit rather than output gibberish. I installed the attached patch to catch that.
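
For anyone who wants to see the three cases the patch distinguishes outside of grep, here is a minimal standalone sketch. It is not code from grep: the enum and function names are hypothetical, and it assumes that comparing nl_langinfo (CODESET) against "UTF-8" is an acceptable stand-in for grep's own using_utf8 () helper, which may be implemented differently.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names for illustration; none of these appear in grep.  */
enum locale_kind
{
  LOCALE_UNIBYTE,          /* -P can run without PCRE_UTF8 */
  LOCALE_UTF8,             /* -P can run with PCRE_UTF8 */
  LOCALE_OTHER_MULTIBYTE   /* the case the patch rejects */
};

/* Classify the locale currently in effect.  The caller is expected to
   have called setlocale beforehand.  The CODESET comparison is an
   assumption standing in for grep's using_utf8 ().  */
static enum locale_kind
classify_current_locale (void)
{
  if (strcmp (nl_langinfo (CODESET), "UTF-8") == 0)
    return LOCALE_UTF8;
  /* MB_CUR_MAX is the maximum number of bytes per character in the
     current locale, so 1 means a unibyte encoding.  */
  return MB_CUR_MAX == 1 ? LOCALE_UNIBYTE : LOCALE_OTHER_MULTIBYTE;
}
```

A caller would run setlocale (LC_ALL, "") first and then treat LOCALE_OTHER_MULTIBYTE as the error case, which is roughly what the new branch in Pcompile does with MB_CUR_MAX != 1.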

From cac91e3e233b769d60d7b5d6bc0e8afc67c0c713 Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Fri, 12 Sep 2014 19:06:27 -0700
Subject: [PATCH] grep: diagnose -P in non-UTF-8 multibyte locale

* src/pcresearch.c (Pcompile):
libpcre supports only unibyte and UTF-8 locales,
so report an error and exit if used in other locales.
* NEWS: Mention this.
* tests/euc-mb: Test this.
---
 NEWS             | 3 +++
 src/pcresearch.c | 8 ++++++--
 tests/euc-mb     | 4 ++++
 3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/NEWS b/NEWS
index 3624b76..36bb48f 100644
--- a/NEWS
+++ b/NEWS
@@ -19,6 +19,9 @@ GNU grep NEWS                                    -*- outline -*-
   The GREP_OPTIONS environment variable is now obsolescent, and grep
   now warns if it is used.  Please use an alias or script instead.
 
+  In locales with multibyte character encodings other than UTF-8,
+  grep -P now reports an error and exits instead of misbehaving.
+
 * Noteworthy changes in release 2.20 (2014-06-03) [stable]
 
 ** Bug fixes
diff --git a/src/pcresearch.c b/src/pcresearch.c
index 17e0e32..3475d4a 100644
--- a/src/pcresearch.c
+++ b/src/pcresearch.c
@@ -52,13 +52,17 @@ Pcompile (char const *pattern, size_t size)
   char const *ep;
   char *re = xnmalloc (4, size + 7);
   int flags = (PCRE_MULTILINE
-               | (match_icase ? PCRE_CASELESS : 0)
-               | (using_utf8 () ? PCRE_UTF8 : 0));
+               | (match_icase ? PCRE_CASELESS : 0));
   char const *patlim = pattern + size;
   char *n = re;
   char const *p;
   char const *pnul;
 
+  if (using_utf8 ())
+    flags |= PCRE_UTF8;
+  else if (MB_CUR_MAX != 1)
+    error (EXIT_TROUBLE, 0, _("-P supports only unibyte and UTF-8 locales"));
+
   /* FIXME: Remove these restrictions.  */
   if (memchr (pattern, '\n', size))
     error (EXIT_TROUBLE, 0, _("the -P option only supports a single pattern"));
diff --git a/tests/euc-mb b/tests/euc-mb
index aa254ca..6a9a845 100755
--- a/tests/euc-mb
+++ b/tests/euc-mb
@@ -40,4 +40,8 @@ make_input BABAAB > exp || framework_failure_
 compare exp out || fail=1
 make_input BABABA |euc_grep AB; test $? = 1 || fail=1
 
+# -P supports only unibyte and UTF-8 locales.
+LC_ALL=$locale grep -P x /dev/null
+test $? = 2 || fail=1
+
 Exit $fail
-- 
1.9.3
