On binary files, it seems that testing the UTF-8 sequences in
pcresearch.c is faster than asking pcre_exec to do that (because
of the retry, I assume); see the attached patch. It actually checks
UTF-8 only once pcre_exec has already found an invalid sequence,
on the assumption that pcre_exec can check the validity of a valid
text file faster.
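
To illustrate the kind of check the patch does, here is a minimal
standalone sketch. The function name valid_utf8_prefix and its
interface are hypothetical (they are not in pcresearch.c), and unlike
the patch it takes an explicit length instead of relying on the newline
at line_end to stop the scan. Like the patch, it only tests the lead
byte and the continuation bytes, so it does not reject everything that
a strict validator would (e.g. surrogates or overlong 3-byte forms).

```c
#include <assert.h>
#include <stddef.h>

/* Return nonzero if C is not a UTF-8 continuation byte (0x80..0xbf).  */
int
not_continuation (unsigned char c)
{
  return c < 0x80 || c > 0xbf;
}

/* Return the number of leading bytes of BUF[0..LEN-1] that form
   structurally well-formed UTF-8 sequences.  The result equals LEN
   if no encoding error was seen.  This is a structural check only,
   with the same lead-byte ranges as in the patch.  */
size_t
valid_utf8_prefix (char const *buf, size_t len)
{
  assert (buf != NULL || len == 0);
  unsigned char const *s = (unsigned char const *) buf;
  size_t i = 0;

  while (i < len)
    {
      unsigned char c = s[i];

      /* Sequence length announced by the lead byte; 0 if invalid.  */
      size_t n = (c < 0x80 ? 1
                  : c < 0xc2 || c > 0xf7 ? 0
                  : c < 0xe0 ? 2
                  : c < 0xf0 ? 3 : 4);

      /* Stop on a bad lead byte or a sequence cut off by the end.  */
      if (n == 0 || len - i < n)
        break;

      /* All the following bytes must be continuation bytes.  */
      size_t k;
      for (k = 1; k < n; k++)
        if (not_continuation (s[i + k]))
          break;
      if (k < n)
        break;

      i += n;
    }

  return i;
}
```

For instance, on the 5 bytes "ab\xff" "cd" this returns 2, so only the
first 2 bytes would be given to the matcher with the UTF-8 check
disabled, and the search would resume after the bad byte.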

On a file similar to a PDF (test 1):

Before: 1.77s
After:  1.38s

But now, the main problem is the large number of pcre_exec calls.
Indeed, if I replace the non-ASCII bytes with \n using:

  LC_ALL=C tr \\200-\\377 \\n

(now, one has a valid file, but with many short lines), the grep -P time
is 1.52s (test 2). And if I replace the non-ASCII bytes with null bytes
using:

  LC_ALL=C tr \\200-\\377 \\000

the grep -P time is 0.30s (test 3), which is much faster.

Note also that libpcre is much slower than normal grep on simple words,
but on "a[0-9]b", it can be faster:

          grep      PCRE   PCRE+patch
test 1    4.31      1.90      1.53
test 2    0.18      1.61      1.63
test 3    3.28      0.39      0.39

I wonder why test 2 is so much faster with grep.

-- 
Vincent Lefèvre <vinc...@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
diff --git a/src/pcresearch.c b/src/pcresearch.c
index 5451029..6bff1e4 100644
--- a/src/pcresearch.c
+++ b/src/pcresearch.c
@@ -38,6 +38,8 @@ static pcre_extra *extra;
 # endif
 #endif
 
+#define INVALID(C) (to_uchar (C) < 0x80 || to_uchar (C) > 0xbf)
+
 /* Table, indexed by ! (flag & PCRE_NOTBOL), of whether the empty
    string matches when that flag is used.  */
 static int empty_match[2];
@@ -156,6 +158,7 @@ Pexecute (char const *buf, size_t size, size_t *match_size,
   char const *line_start = buf;
   int e = PCRE_ERROR_NOMATCH;
   char const *line_end;
+  int invalid = 0;
 
   /* If the input type is unknown, the caller is still testing the
      input, which means the current buffer cannot contain encoding
@@ -212,25 +215,54 @@ Pexecute (char const *buf, size_t size, size_t *match_size,
           if (multiline)
             options |= PCRE_NO_UTF8_CHECK;
 
-          e = pcre_exec (cre, extra, p, search_bytes, 0,
-                         options, sub, NSUB);
-          if (e != PCRE_ERROR_BADUTF8)
+          int valid_bytes = search_bytes;
+          if (invalid)
             {
-              if (0 < e && multiline && sub[1] - sub[0] != 0)
+              /* At least one encoding error was found. Other such errors
+                 are likely to occur, and detecting them here is faster
+                 on average than relying on pcre.  */
+              options |= PCRE_NO_UTF8_CHECK;
+              char const *p2 = p;
+              while (p2 != line_end)
                 {
-                  char const *nl = memchr (p + sub[0], eolbyte,
-                                           sub[1] - sub[0]);
-                  if (nl)
+                  unsigned char c = p2[0];
+                  size_t len =
+                    c < 0x80 ? 1 :
+                    c < 0xc2 || c > 0xf7 || INVALID(p2[1]) ? 0 :
+                    c < 0xe0 ? 2 : INVALID(p2[2]) ? 0 :
+                    c < 0xf0 ? 3 : INVALID(p2[3]) ? 0 : 4;
+                  if (len == 0)
                     {
-                      /* This match crosses a line boundary; reject it.  */
-                      p += sub[0];
-                      line_end = nl;
-                      continue;
+                      valid_bytes = p2 - p;
+                      break;
                     }
+                  p2 += len;
                 }
-              break;
             }
-          int valid_bytes = sub[0];
+
+          if (valid_bytes == search_bytes)
+            {
+              e = pcre_exec (cre, extra, p, search_bytes, 0,
+                             options, sub, NSUB);
+              if (e != PCRE_ERROR_BADUTF8)
+                {
+                  if (0 < e && multiline && sub[1] - sub[0] != 0)
+                    {
+                      char const *nl = memchr (p + sub[0], eolbyte,
+                                               sub[1] - sub[0]);
+                      if (nl)
+                        {
+                          /* This match crosses a line boundary; reject it.  */
+                          p += sub[0];
+                          line_end = nl;
+                          continue;
+                        }
+                    }
+                  break;
+                }
+              invalid = 1;
+              valid_bytes = sub[0];
+            }
 
           /* Try to match the string before the encoding error.
              Again, handle the empty-match case specially, for speed.  */
