bug#43577: wrong result for grep -io in turkish locale

Paul Eggert Wed, 23 Sep 2020 19:58:50 -0700

On 9/23/20 6:47 PM, Norihiro Tanaka wrote:

I attach the fix for the bug.  Regex is fixed in Paul, thank you.

Thanks, I had written a similar patch, and your patch helped me find a bug inwhat I wrote. The patch I wrote uses an auxiliary ok_fold table that letsfgrep_icase_charlen avoid calling mbrtwoc for single-byte characters in thepattern; this may help performance for long patterns. More important,fgrep_icase_charlen does not return -1 for a character like 'a' in anen_US.UTF-8 locale merely because 'a' has a case folded counterpart 'A'; theidea is that we should be OK if the case folded counterparts are single-byte.

I had added more-extensive tests than were in your patch, and some of them founda crash in kwsinit that indicated a similar change is needed there. I assumethis was because the patch I wrote had a more-generous fgrep_icase_charlen. Asthis simplifies kwsinit, this patch does that too.

While looking into this I found a performance glitch I recently introduced (Idouble-counted some regular expressions, messing up later heuristics). Plus Ichecked on this on our old Solaris 10 box and fixed a couple of portingglitches. I installed the attached patches, into the master branch, to help makeit easier for you to compare your changes to mine. Patch 0003 is the enhancedversion of the patch that you wrote.


Thanks again for working on this.

>From 545bd506efcd6cab4f28c07a438868f14b7dc1d2 Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Wed, 23 Sep 2020 10:52:12 -0700
Subject: [PATCH 1/5] grep: fix recently-introduced performance glitch

* src/grep.c (main): Do not double-increment update_patterns.
update_patterns increments n_patterns now; do not increment it
again, as the incorrect count would hurt performance heuristics later.
---
 src/grep.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/src/grep.c b/src/grep.c
index 1453b14..11856d8 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -2881,7 +2881,6 @@ main (int argc, char **argv)
       ptrdiff_t patlen = strlen (keys);
       keys[patlen] = '\n';
       keycc = update_patterns (keys, 0, patlen + 1, "");
-      n_patterns++;
     }
   else
     usage (EXIT_TROUBLE);
-- 
2.17.1

>From 4af448a142b1f78be4920d2bd2aedd2b748a1289 Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Wed, 23 Sep 2020 17:07:36 -0700
Subject: [PATCH 2/5] build: update gnulib submodule to latest

* NEWS: Mention Bug#43577, which this fixes.
---
 NEWS   | 6 ++++++
 gnulib | 2 +-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/NEWS b/NEWS
index 442d4d2..36e423d 100644
--- a/NEWS
+++ b/NEWS
@@ -32,6 +32,12 @@ GNU grep NEWS                                    -*- outline -*-
   constituent just before what would otherwise be a word match.
   [Bug#43225 introduced in grep 2.28]
 
+  grep -i no longer mishandles ASCII characters that match multibyte
+  characters.  For example, 'LC_ALL=tr_TR.utf8 grep -i i' no longer
+  dumps core merely because 'i' matches 'İ' (U+0130 LATIN CAPITAL
+  LETTER I WITH DOT ABOVE) in Turkish when ignoring case.
+  [Bug#43577 introduced in grep 3.4]
+
   A performance regression with -E and many patterns has been mostly fixed.
   "Mostly" as there is a performance tradeoff between Bug#22357 and Bug#40634.
   [Bug#40634 introduced in grep 2.28]
diff --git a/gnulib b/gnulib
index 4a3aec7..0c487ff 160000
--- a/gnulib
+++ b/gnulib
@@ -1 +1 @@
-Subproject commit 4a3aec702f994f3a16e4bc6c51f2c0ae3dd76a02
+Subproject commit 0c487ff1286660c4d572c3277e73ac6618ba832d
-- 
2.17.1

>From 678f829c869059cd9cb0fe38b87880ef0a78d210 Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Wed, 23 Sep 2020 18:57:57 -0700
Subject: [PATCH 3/5] grep: fix more Turkish-eyes bugs

Fix more bugs recently uncovered by Norihiro Tanaka (Bug#43577).
* NEWS: Mention new bug report.
* src/grep.c (ok_fold): New static var.
(setup_ok_fold): New function.
(fgrep_icase_charlen): Reject single-byte characters
if they match some multibyte characters when ignoring case.
This part of the patch is partly derived from
<https://bugs.gnu.org/43577#14>, which means it is:
Co-authored-by: Norihiro Tanaka <nori...@kcn.ne.jp>
(main): Call setup_ok_fold if ok_fold might be needed.
* src/searchutils.c (kwsinit): With the grep.c changes,
this code can now revert to classic 7th Edition Unix style;
aborting would be wrong.
* tests/turkish-eyes: Add tests for these bugs.
---
 NEWS               |   2 +-
 src/grep.c         | 116 +++++++++++++++++++++++++++++++--------------
 src/searchutils.c  |  23 ++-------
 tests/turkish-eyes |  18 +++++--
 4 files changed, 102 insertions(+), 57 deletions(-)

diff --git a/NEWS b/NEWS
index 36e423d..ab00ff2 100644
--- a/NEWS
+++ b/NEWS
@@ -36,7 +36,7 @@ GNU grep NEWS                                    -*- outline -*-
   characters.  For example, 'LC_ALL=tr_TR.utf8 grep -i i' no longer
   dumps core merely because 'i' matches 'İ' (U+0130 LATIN CAPITAL
   LETTER I WITH DOT ABOVE) in Turkish when ignoring case.
-  [Bug#43577 introduced in grep 3.4]
+  [Bug#43577 introduced partly in grep 2.28 and partly in grep 3.4]
 
   A performance regression with -E and many patterns has been mostly fixed.
   "Mostly" as there is a performance tradeoff between Bug#22357 and Bug#40634.
diff --git a/src/grep.c b/src/grep.c
index 11856d8..1a52c89 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -2300,37 +2300,75 @@ contains_encoding_error (char const *pat, size_t patlen)
   return false;
 }
 
+/* When ignoring case and (-E or -F or -G), then for each single-byte
+   character I, ok_fold[I] is 1 if every case folded counterpart of I
+   is also single-byte, and is -1 otherwise.  */
+static signed char ok_fold[NCHAR];
+static void
+setup_ok_fold (void)
+{
+  for (int i = 0; i < NCHAR; i++)
+    {
+      wint_t wi = localeinfo.sbctowc[i];
+      if (wi == WEOF)
+        continue;
+
+      int ok = 1;
+      wchar_t folded[CASE_FOLDED_BUFSIZE];
+      for (int n = case_folded_counterparts (wi, folded); 0 <= --n; )
+        {
+          char buf[MB_LEN_MAX];
+          mbstate_t s = { 0 };
+          if (wcrtomb (buf, folded[n], &s) != 1)
+            {
+              ok = -1;
+              break;
+            }
+        }
+      ok_fold[i] = ok;
+    }
+}
+
 /* Return the number of bytes in the initial character of PAT, of size
    PATLEN, if Fcompile can handle that character.  Return -1 if
    Fcompile cannot handle it.  MBS is the multibyte conversion state.
-
-   Fcompile can handle a character C if C is single-byte, or if C has no
-   case folded counterparts and toupper translates none of its bytes.  */
+   PATLEN must be nonzero.  */
 
 static int
 fgrep_icase_charlen (char const *pat, size_t patlen, mbstate_t *mbs)
 {
-  int n = localeinfo.sbclen[to_uchar (*pat)];
-  if (n < 0)
+  unsigned char pat0 = pat[0];
+
+  /* If PAT starts with a single-byte character, Fcompile works if
+     every case folded counterpart is also single-byte.  */
+  if (localeinfo.sbctowc[pat0] != WEOF)
+    return ok_fold[pat0];
+
+  wchar_t wc;
+  size_t wn = mbrtowc (&wc, pat, patlen, mbs);
+
+  /* If PAT starts with an encoding error, Fcompile does not work.  */
+  if (MB_LEN_MAX < wn)
+    return -1;
+
+  /* PAT starts with a multibyte character.  Fcompile works if the
+     character has no case folded counterparts and toupper translates
+     none of its encoding's bytes.  */
+  wchar_t folded[CASE_FOLDED_BUFSIZE];
+  if (case_folded_counterparts (wc, folded))
+    return -1;
+  for (int i = wn; 0 < --i; )
     {
-      wchar_t wc;
-      wchar_t folded[CASE_FOLDED_BUFSIZE];
-      size_t wn = mbrtowc (&wc, pat, patlen, mbs);
-      if (MB_LEN_MAX < wn || case_folded_counterparts (wc, folded))
+      unsigned char c = pat[i];
+      if (toupper (c) != c)
         return -1;
-      for (int i = wn; 0 < --i; )
-        {
-          unsigned char c = pat[i];
-          if (toupper (c) != c)
-            return -1;
-        }
-      n = wn;
     }
-  return n;
+  return wn;
 }
 
 /* Return true if the -F patterns PAT, of size PATLEN, contain only
-   single-byte characters or characters not subject to case folding,
+   single-byte characters that case-fold only to single-byte
+   characters, or multibyte characters not subject to case folding,
    and so can be processed by Fcompile.  */
 
 static bool
@@ -2950,26 +2988,34 @@ main (int argc, char **argv)
   if (matcher < 0)
     matcher = G_MATCHER_INDEX;
 
-  /* In a single-byte locale, switch from -F to -G if it is a single
-     pattern that matches words, where -G is typically faster.  In a
-     multi-byte locale, switch if the patterns have an encoding error
-     (where -F does not work) or if -i and the patterns will not work
-     for -iF.  */
   if (matcher == F_MATCHER_INDEX
-      && (! localeinfo.multibyte
-          ? n_patterns == 1 && match_words
-          : (contains_encoding_error (keys, keycc)
-             || (match_icase && !fgrep_icase_available (keys, keycc)))))
+      || matcher == E_MATCHER_INDEX || matcher == G_MATCHER_INDEX)
     {
-      fgrep_to_grep_pattern (&pattern_array, &keycc);
-      keys = pattern_array;
-      matcher = G_MATCHER_INDEX;
+      if (match_icase)
+        setup_ok_fold ();
+
+      /* In a single-byte locale, switch from -F to -G if it is a single
+         pattern that matches words, where -G is typically faster.  In a
+         multibyte locale, switch if the patterns have an encoding error
+         (where -F does not work) or if -i and the patterns will not work
+         for -iF.  */
+      if (matcher == F_MATCHER_INDEX)
+        {
+          if (! localeinfo.multibyte
+              ? n_patterns == 1 && match_words
+              : (contains_encoding_error (keys, keycc)
+                 || (match_icase && !fgrep_icase_available (keys, keycc))))
+            {
+              fgrep_to_grep_pattern (&pattern_array, &keycc);
+              keys = pattern_array;
+              matcher = G_MATCHER_INDEX;
+            }
+        }
+      /* With two or more patterns, if -F works then switch from either -E
+         or -G, as -F is probably faster then.  */
+      else if (1 < n_patterns)
+        matcher = try_fgrep_pattern (matcher, keys, &keycc);
     }
-  /* With two or more patterns, if -F works then switch from either -E
-     or -G, as -F is probably faster then.  */
-  else if ((matcher == G_MATCHER_INDEX || matcher == E_MATCHER_INDEX)
-           && 1 < n_patterns)
-    matcher = try_fgrep_pattern (matcher, keys, &keycc);
 
   execute = matchers[matcher].execute;
   compiled_pattern =
diff --git a/src/searchutils.c b/src/searchutils.c
index c4bb802..aa11063 100644
--- a/src/searchutils.c
+++ b/src/searchutils.c
@@ -48,24 +48,11 @@ kwsinit (bool mb_trans)
   if (match_icase && (MB_CUR_MAX == 1 || mb_trans))
     {
       trans = xmalloc (NCHAR);
-      if (MB_CUR_MAX == 1)
-        for (int i = 0; i < NCHAR; i++)
-          trans[i] = toupper (i);
-      else
-        for (int i = 0; i < NCHAR; i++)
-          {
-            wint_t wc = localeinfo.sbctowc[i];
-            wint_t uwc = towupper (wc);
-            if (uwc != wc)
-              {
-                mbstate_t mbs = { 0 };
-                size_t len = wcrtomb (&trans[i], uwc, &mbs);
-                if (len != 1)
-                  abort ();
-              }
-            else
-              trans[i] = i;
-          }
+      /* If I is a single-byte character that becomes a different
+         single-byte character when uppercased, set trans[I]
+         to that character.  Otherwise, set trans[I] to I.  */
+      for (int i = 0; i < NCHAR; i++)
+        trans[i] = toupper (i);
     }
 
   return kwsalloc (trans);
diff --git a/tests/turkish-eyes b/tests/turkish-eyes
index ba1ea33..879b59d 100755
--- a/tests/turkish-eyes
+++ b/tests/turkish-eyes
@@ -36,11 +36,23 @@ i=$(printf '\304\261') # lowercase dotless i
 
       data="I:$I $i:i"
 search_str="$i:i I:$I"
-printf "$data\n" > in || framework_failure_
+printf "$data\\n" > in || framework_failure_
 
 for opt in -E -F -G; do
-  LC_ALL=$L grep $opt -i "$search_str" in > out || fail=1
-  compare out in || fail=1
+  for pat in i I "$i" "$I" " " : "$search_str"; do
+    LC_ALL=$L grep $opt -i "$pat" in > out || fail=1
+    compare in out || fail=1
+
+    case $pat in
+      i|"$I") printf "$I\\ni\\n";;
+      I|"$i") printf "I\\n$i\\n";;
+      :) printf ":\\n:\\n";;
+      ' ') printf " \\n";;
+      *) cat in;;
+    esac >exp || framework_failure_
+    LC_ALL=$L grep -o $opt -i "$pat" in > out || fail=1
+    compare exp out || fail=1
+  done
 done
 
 Exit $fail
-- 
2.17.1

>From ee6b62007dcbf860f204fbc6921a4d0af74845c3 Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Wed, 23 Sep 2020 19:04:01 -0700
Subject: [PATCH 4/5] grep: pacify Sun C 5.15

This suppresses a false alarm '"grep.c", line 720: warning:
initializer will be sign-extended: -1'.
* src/grep.c (uword_max): New static constant.
(initialize_unibyte_mask): Use it.
---
 src/grep.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/grep.c b/src/grep.c
index 1a52c89..de7616a 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -684,6 +684,7 @@ clean_up_stdout (void)
 
 /* An unsigned type suitable for fast matching.  */
 typedef uintmax_t uword;
+static uword const uword_max = UINTMAX_MAX;
 
 struct localeinfo localeinfo;
 
@@ -717,7 +718,6 @@ initialize_unibyte_mask (void)
   /* Now MASK will detect any encoding-error byte, although it may
      cry wolf and it may not be optimal.  Build a uword-length mask by
      repeating MASK.  */
-  uword uword_max = -1;
   unibyte_mask = uword_max / UCHAR_MAX * mask;
 }
 
-- 
2.17.1

>From 63e1b8a4356957d24bdb6e2235e79ce55990d7f3 Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Wed, 23 Sep 2020 19:15:15 -0700
Subject: [PATCH 5/5] grep: don't assume PCRE in tests

* tests/filename-lineno.pl: Remove invalid-re-P-paren and
invalid-re-P-star-paren as they assume PCRE support, which
causes a false alarm "grep: Perl matching not supported in a
--disable-perl-regexp build" on platforms without PCRE.
---
 tests/filename-lineno.pl | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/tests/filename-lineno.pl b/tests/filename-lineno.pl
index ebd8d1e..be927ef 100755
--- a/tests/filename-lineno.pl
+++ b/tests/filename-lineno.pl
@@ -97,12 +97,6 @@ my @Tests =
    ['invalid-re-G-star-paren', '-G "a.*\\)"', {EXIT=>2},
     {ERR => "$prog: Unmatched ) or \\)\n"},
    ],
-   ['invalid-re-P-paren', '-P ")"', {EXIT=>2},
-    {ERR => "$prog: unmatched parentheses\n"},
-   ],
-   ['invalid-re-P-star-paren', '-P "a.*)"', {EXIT=>2},
-    {ERR => "$prog: unmatched parentheses\n"},
-   ],
   );
 
 my $save_temps = $ENV{DEBUG};
-- 
2.17.1

bug#43577: wrong result for grep -io in turkish locale

Reply via email to