Thanks to everyone who reported and fixed this bug. I looked over the fix,
which inspired me to improve on it. I installed the attached patch, which doesn't
fix any functional bugs but does improve performance significantly in some
cases.
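For readers who have not looked at the code, the first hunk's idea can be sketched roughly as follows. This is a simplified, standalone illustration and not the code in the patch: the names cannot_start_utf8 and skip_leading_errors are hypothetical, and the byte ranges are hard-coded for UTF-8 as an assumption, whereas grep consults its mbclen_cache table for the current locale. The point is only that leading bytes which can never begin a character are cheap to count, so the search can start after them instead of having the matcher report them one at a time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of grep: return true if byte B can
   never begin a valid UTF-8 character.  Assumed ranges: continuation
   bytes 0x80-0xBF, overlong lead bytes 0xC0-0xC1, and 0xF5-0xFF.  */
static bool
cannot_start_utf8 (unsigned char b)
{
  return (0x80 <= b && b <= 0xC1) || 0xF5 <= b;
}

/* Hypothetical helper, not part of grep: return the number of leading
   bytes of BUF[0 .. SIZE-1] that are easily-determined encoding
   errors.  A caller would advance both its scan pointer and the start
   of the subject passed to the matcher by this amount.  */
static size_t
skip_leading_errors (char const *buf, size_t size)
{
  assert (buf != NULL || size == 0);
  size_t i = 0;
  while (i < size && cannot_start_utf8 ((unsigned char) buf[i]))
    i++;
  return i;
}
```

On a line like the ones in the benchmark below (a run of \200 bytes followed by "x"), such a helper would step over the whole run in a single pass, leaving only "x\n" for the matcher to examine.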
From 86ec0ec94e175d96a8910acfff8bb31735078ed5 Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Wed, 6 Jan 2016 22:40:23 -0800
Subject: [PATCH] Improve on fix for Bug#22181

* src/pcresearch.c (Pexecute): Update subject when skipping past
easily-determined encoding errors, as this is faster than letting
pcre_exec skip them. On my platform this improves performance
4.7x on a benchmark created via "yes $(printf '\200\200\200\200
\200\200\200\200\200\200\200\200\200\200\200\200\200\200\200\200x\n')
| head -n 1000000 >j; grep -oP y j" in a UTF-8 locale. Rework
code that deals with PCRE_ERROR_BADUTF8 return, to avoid an
incorrect (albeit currently harmless) 'bol = false' assignment.
---
src/pcresearch.c | 40 +++++++++++++++++++++-------------------
1 file changed, 21 insertions(+), 19 deletions(-)
diff --git a/src/pcresearch.c b/src/pcresearch.c
index 8f3d935..c0b8678 100644
--- a/src/pcresearch.c
+++ b/src/pcresearch.c
@@ -229,6 +229,7 @@ Pexecute (char *buf, size_t size, size_t *match_size,
while (mbclen_cache[to_uchar (*p)] == (size_t) -1)
{
p++;
+ subject = p;
bol = false;
}
@@ -269,29 +270,30 @@ Pexecute (char *buf, size_t size, size_t *match_size,
}
int valid_bytes = sub[0];
- /* Try to match the string before the encoding error. */
- if (valid_bytes < search_offset)
- e = PCRE_ERROR_NOMATCH;
- else if (valid_bytes == 0)
+ if (search_offset <= valid_bytes)
{
- /* Handle the empty-match case specially, for speed.
- This optimization is valid if VALID_BYTES is zero,
- which means SEARCH_OFFSET is also zero. */
- sub[1] = 0;
- e = empty_match[bol];
- }
- else
- e = jit_exec (subject, valid_bytes, search_offset,
- options | PCRE_NO_UTF8_CHECK | PCRE_NOTEOL, sub);
+ /* Try to match the string before the encoding error. */
+ if (valid_bytes == 0)
+ {
+ /* Handle the empty-match case specially, for speed.
+ This optimization is valid if VALID_BYTES is zero,
+ which means SEARCH_OFFSET is also zero. */
+ sub[1] = 0;
+ e = empty_match[bol];
+ }
+ else
+ e = jit_exec (subject, valid_bytes, search_offset,
+ options | PCRE_NO_UTF8_CHECK | PCRE_NOTEOL, sub);
- if (e != PCRE_ERROR_NOMATCH)
- break;
+ if (e != PCRE_ERROR_NOMATCH)
+ break;
+
+ /* Treat the encoding error as data that cannot match. */
+ p = subject + valid_bytes + 1;
+ bol = false;
+ }
- /* Treat the encoding error as data that cannot match. */
subject += valid_bytes + 1;
- if (p < subject)
- p = subject;
- bol = false;
}
if (e != PCRE_ERROR_NOMATCH)
--
2.5.0