bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales

Norihiro Tanaka Fri, 28 Nov 2014 06:34:03 -0800

On Fri, 28 Nov 2014 03:59:18 +0100
Vincent Lefevre <[email protected]> wrote:


> On binary files, it seems that testing the UTF-8 sequences in
> pcresearch.c is faster than asking pcre_exec to do that (because
> of the retry I assume); see attached patch. It actually checks
> UTF-8 only if an invalid sequence was already found by pcre_exec,
> assuming that pcre_exec can check the validity of a valid text
> file in a faster way.
> 
> On some file similar to PDF (test 1):
> 
> Before: 1.77s
> After:  1.38s
> 
> But now, the main problem is the many pcre_exec. Indeed, if I replace
> the non-ASCII bytes by \n with:
> 
>   LC_ALL=C tr \\200-\\377 \\n
> 
> (now, one has a valid file but with many short lines), the grep -P time
> is 1.52s (test 2). And if I replace the non-ASCII bytes by null bytes
> with:
> 
>   LC_ALL=C tr \\200-\\377 \\000
> 
> the grep -P time is 0.30s (test 3), thus it is much faster.
> 
> Note also that libpcre is much slower than normal grep on simple words,
> but on "a[0-9]b", it can be faster:
> 
>           grep      PCRE   PCRE+patch
> test 1    4.31      1.90      1.53
> test 2    0.18      1.61      1.63
> test 3    3.28      0.39      0.39
> 
> With grep, I wonder why test 2 is much faster.
> 
> -- 
> Vincent Lefevre <[email protected]> - Web: <https://www.vinc17.net/>
> 100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
> Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

Thanks for the patch.  However, it seems to me that valid_utf() in PCRE also
considers 5- and 6-byte characters.

IMHO, we should assume that grep does not know how valid_utf() checks an
input text, although we know that PCRE checks whether an input text is
valid UTF-8 or not.  That way, grep keeps working even if PCRE changes the
behaviour of valid_utf().

If we do not check for invalid UTF-8 characters with valid_utf8() in
advance, grep may dump core when PCRE_NO_UTF8_CHECK is used.
See http://debbugs.gnu.org/cgi/bugreport.cgi?bug=16586

So we cannot avoid checking for invalid UTF-8 characters with valid_utf8().
Furthermore, the check must be performed exactly as PCRE expects, but grep
does not know how PCRE checks for invalid UTF-8 characters, because of the
assumption above.
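
To illustrate why a separate check in grep can disagree with the library,
here is a minimal sketch in C.  It is not PCRE's or grep's code: seq_len()
and valid_prefix() are hypothetical names, and the check is only structural
(lead byte plus continuation bytes, with no overlong or surrogate tests).
The max_len parameter is an assumption used to model two possible rules:
4 for sequences of at most 4 bytes, 6 to also admit the older 5- and 6-byte
forms.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: expected sequence length for a lead byte, or 0 if
   the byte cannot start a sequence under the chosen rule.  */
static int
seq_len (unsigned char lead, int max_len)
{
  int len;
  if (lead < 0x80)      len = 1;   /* ASCII */
  else if (lead < 0xC0) len = 0;   /* stray continuation byte */
  else if (lead < 0xE0) len = 2;
  else if (lead < 0xF0) len = 3;
  else if (lead < 0xF8) len = 4;
  else if (lead < 0xFC) len = 5;   /* older 5-byte form */
  else if (lead < 0xFE) len = 6;   /* older 6-byte form */
  else                  len = 0;   /* 0xFE and 0xFF never appear */
  return len <= max_len ? len : 0;
}

/* Return the number of leading bytes of BUF that form structurally
   complete sequences under the chosen rule.  */
static size_t
valid_prefix (const unsigned char *buf, size_t n, int max_len)
{
  size_t i = 0;
  assert (buf != NULL || n == 0);
  while (i < n)
    {
      int len = seq_len (buf[i], max_len);
      int k;
      if (len == 0 || (size_t) len > n - i)
        break;                     /* bad lead byte or truncated sequence */
      for (k = 1; k < len; k++)
        if ((buf[i + k] & 0xC0) != 0x80)
          break;                   /* not a continuation byte */
      if (k < len)
        break;
      i += len;
    }
  return i;
}
```

With the same input, the two rules stop at different offsets, so a checker
that does not follow the library's rule exactly would hand it bytes the
library treats differently.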
