On binary files, it seems that testing the UTF-8 sequences in pcresearch.c is faster than asking pcre_exec to do it (because of the retries, I assume); see the attached patch. It actually checks the UTF-8 itself only once pcre_exec has already reported an invalid sequence, on the assumption that pcre_exec can check the validity of a fully valid text file faster.
On a file similar to PDF (test 1):

  Before: 1.77s
  After:  1.38s

But now, the main problem is the many pcre_exec calls. Indeed, if I replace the non-ASCII bytes by \n with:

  LC_ALL=C tr \\200-\\377 \\n

(one then has a valid file, but with many short lines), the grep -P time is 1.52s (test 2). And if I replace the non-ASCII bytes by null bytes with:

  LC_ALL=C tr \\200-\\377 \\000

the grep -P time is 0.30s (test 3), thus much faster.

Note also that libpcre is much slower than normal grep on simple words, but on "a[0-9]b", it can be faster:

            grep   PCRE   PCRE+patch
  test 1    4.31   1.90   1.53
  test 2    0.18   1.61   1.63
  test 3    3.28   0.39   0.39

With grep, I wonder why test 2 is much faster.

-- 
Vincent Lefèvre <vinc...@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
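For reference, the test inputs above can be reproduced from any binary file with the quoted tr invocations. A minimal sketch (the sample data and the names file.bin, test2.txt, test3.txt are placeholders of mine, not from the original message):

```shell
# Sketch of the test setup: replace non-ASCII bytes in a binary file.
printf 'abc\200\201def\377ghi' > file.bin              # 12 bytes of sample "binary" data
LC_ALL=C tr '\200-\377' '\n'   < file.bin > test2.txt  # non-ASCII bytes -> newlines (test 2)
LC_ALL=C tr '\200-\377' '\000' < file.bin > test3.txt  # non-ASCII bytes -> NUL bytes (test 3)
```

The byte count is preserved by tr; only the contents change, which is why the two derived files isolate the cost of short lines (test 2) versus NUL-padded lines (test 3).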
diff --git a/src/pcresearch.c b/src/pcresearch.c
index 5451029..6bff1e4 100644
--- a/src/pcresearch.c
+++ b/src/pcresearch.c
@@ -38,6 +38,8 @@ static pcre_extra *extra;
 # endif
 #endif
 
+#define INVALID(C) (to_uchar (C) < 0x80 || to_uchar (C) > 0xbf)
+
 /* Table, indexed by ! (flag & PCRE_NOTBOL), of whether the empty
    string matches when that flag is used.  */
 static int empty_match[2];
@@ -156,6 +158,7 @@ Pexecute (char const *buf, size_t size, size_t *match_size,
   char const *line_start = buf;
   int e = PCRE_ERROR_NOMATCH;
   char const *line_end;
+  int invalid = 0;
 
   /* If the input type is unknown, the caller is still testing the
      input, which means the current buffer cannot contain encoding
@@ -212,25 +215,54 @@ Pexecute (char const *buf, size_t size, size_t *match_size,
       if (multiline)
         options |= PCRE_NO_UTF8_CHECK;
 
-      e = pcre_exec (cre, extra, p, search_bytes, 0,
-                     options, sub, NSUB);
-      if (e != PCRE_ERROR_BADUTF8)
+      int valid_bytes = search_bytes;
+      if (invalid)
         {
-          if (0 < e && multiline && sub[1] - sub[0] != 0)
+          /* At least one encoding error was found.  Other such errors
+             are likely to occur, and detecting them here is faster
+             on average than relying on pcre.  */
+          options |= PCRE_NO_UTF8_CHECK;
+          char const *p2 = p;
+          while (p2 != line_end)
             {
-              char const *nl = memchr (p + sub[0], eolbyte,
-                                       sub[1] - sub[0]);
-              if (nl)
+              unsigned char c = p2[0];
+              size_t len =
+                c < 0x80 ? 1 :
+                c < 0xc2 || c > 0xf7 || INVALID (p2[1]) ? 0 :
+                c < 0xe0 ? 2 : INVALID (p2[2]) ? 0 :
+                c < 0xf0 ? 3 : INVALID (p2[3]) ? 0 : 4;
+              if (len == 0)
                 {
-                  /* This match crosses a line boundary; reject it.  */
-                  p += sub[0];
-                  line_end = nl;
-                  continue;
+                  valid_bytes = p2 - p;
+                  break;
                 }
+              p2 += len;
             }
-          break;
         }
-      int valid_bytes = sub[0];
+
+      if (valid_bytes == search_bytes)
+        {
+          e = pcre_exec (cre, extra, p, search_bytes, 0,
+                         options, sub, NSUB);
+          if (e != PCRE_ERROR_BADUTF8)
+            {
+              if (0 < e && multiline && sub[1] - sub[0] != 0)
+                {
+                  char const *nl = memchr (p + sub[0], eolbyte,
+                                           sub[1] - sub[0]);
+                  if (nl)
+                    {
+                      /* This match crosses a line boundary; reject it.  */
+                      p += sub[0];
+                      line_end = nl;
+                      continue;
+                    }
+                }
+              break;
+            }
+          invalid = 1;
+          valid_bytes = sub[0];
+        }
 
       /* Try to match the string before the encoding error.  Again,
          handle the empty-match case specially, for speed.  */