On 2014-09-11 19:53:23 -0700, Paul Eggert wrote:
> Vincent Lefevre wrote:
> > Things could be done in grep:
> >
> > 1. Ignore -P when the pattern would have the same meaning without -P
> >    (patterns could also be transformed, e.g. "a\d+b" -> "a[0-9]\+b",
> >    at least for the simplest cases).
> >
> > 2. Call PCRE in the C locale when this is equivalent.
>
> I had already considered these ideas along with several others, but
> they would require grep to parse and analyze the Perl regular
> expression. I don't know the PCRE syntax and it would take some time
> to write a parser. And even if I wrote one, the next PCRE release
> would likely change the syntax. It sounds very painful to maintain.
I think that (1) is rather simple, even though the optimization could
be missed on some patterns: ERE and PCRE have a large equivalent
subclass. The pattern would be examined left to right and would
consist of:

- Normal characters.
- ".", "^" at the beginning, "$" (alone) at the end.
- "[...]" with normal characters inside.
- "*", "+", "?", or the "{...}" form, not followed by one of "*+?{".
- "|" and "(" not followed by one of these 4 characters.
- "\" followed by one of ".^$[*+?{".
- Some "\" + letter sequences could be recognised as well.

Something like that (I haven't checked carefully). There could be
another option to enable or disable such an optimization.

> > 3. Transform invalid bytes to null bytes in-place before the PCRE
> >    call. This changes the current semantics, but:
> >    * the semantics of invalid bytes have never been specified, AFAIK;
> >    * the best *practical* behavior may not be the current one.
>
> As we've already discussed, this would be incompatible with how
> invalid bytes are treated by other matchers.

The same thing could be done with the other matchers (in an optional
way).

> And would have undesirable practical effects, e.g., the pattern
> 'a..*b' would match data that would look like "ab" on many screens
> (because the null byte would vanish). It's a real kludge that will
> bite users.

But this is already the case:

$ printf "a\0b\n"
ab
$ printf "a\0b\n" | grep 'a..*b'
Binary file (standard input) matches

The transformation won't touch null bytes. It would just interpret
invalid bytes as null bytes, so that they get matched by ".".

> Even if we went along with the kludge, grep does not know what bytes
> PCRE considers to be invalid without invoking PCRE, which is what
> it's doing now. (Yes, PCRE says it's parsing UTF-8, but there are
> different ways to do that and they don't all agree.)

That would be a bug in PCRE. Parsing UTF-8 is standard.
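For illustration, a sketch of what "standard" means in practice (this
is illustrative code, not grep's or PCRE's): a validity scan following
the classic utf-8(7) table, combined with the invalid-byte-to-NUL
transformation from point (3).

```c
#include <stddef.h>

/* Minimum code point for each sequence length, to reject overlong
   encodings (e.g. 0xC0 0xAF for '/').  Indices 0 and 1 are unused. */
static const unsigned long utf8_min[7] =
  { 0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };

/* Length of the valid UTF-8 sequence starting at s (at most n bytes),
   or 0 if s[0] starts an invalid sequence.  This follows the encoding
   table from the Linux utf-8(7) man page. */
static size_t
utf8_seq_len (const unsigned char *s, size_t n)
{
  unsigned char c = s[0];
  size_t len;
  unsigned long cp;

  if (c < 0x80)
    return 1;                           /* 0xxxxxxx */
  else if (c < 0xC0)
    return 0;                           /* stray 10xxxxxx continuation */
  else if (c < 0xE0) { len = 2; cp = c & 0x1F; }
  else if (c < 0xF0) { len = 3; cp = c & 0x0F; }
  else if (c < 0xF8) { len = 4; cp = c & 0x07; }
  else if (c < 0xFC) { len = 5; cp = c & 0x03; }
  else if (c < 0xFE) { len = 6; cp = c & 0x01; }
  else
    return 0;                           /* 0xFE and 0xFF never occur */

  if (n < len)
    return 0;                           /* truncated sequence */
  for (size_t i = 1; i < len; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
        return 0;                       /* missing continuation byte */
      cp = (cp << 6) | (s[i] & 0x3F);
    }
  return cp < utf8_min[len] ? 0 : len;  /* reject overlong forms */
}

/* Point (3): overwrite each invalid byte with NUL in place, so that
   "." can match it; valid sequences (and real NULs) are untouched. */
static void
nullify_invalid (unsigned char *s, size_t n)
{
  for (size_t i = 0; i < n; )
    {
      size_t len = utf8_seq_len (s + i, n - i);
      if (len == 0)
        s[i++] = '\0';
      else
        i += len;
    }
}
```

Rejecting overlong forms and the bytes 0xFE/0xFF is part of what
"standard" means here; skipping such checks is where decoders start to
disagree.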
The encoding is summarized by:

0x00000000 - 0x0000007F: 0xxxxxxx
0x00000080 - 0x000007FF: 110xxxxx 10xxxxxx
0x00000800 - 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
0x00010000 - 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0x00200000 - 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0x04000000 - 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

(from the Linux utf-8(7) man page), everything else being invalid.
Note that the pcre_exec(3) man page even says:

  PCRE_NO_UTF8_CHECK
      Do not check the subject for UTF-8 validity

assuming that the check can be done on the user's side, i.e. in a
standard way.

> Here's a different idea. How about invoking grep with the
> --binary-files=without-match option? This should avoid much of the
> libpcre performance problem, without having to change 'grep'.

I often want to take binary files into account, for instance because
executables can contain text I search for (error messages...). There
may be other examples.

-- 
Vincent Lefèvre <vinc...@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
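P.S. To make point (1) concrete, here is a rough sketch of the
left-to-right scan. The accepted subset is illustrative and
deliberately conservative (anything not explicitly recognized is
rejected), not a worked-out implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Point (1): scan a pattern left to right and accept it only if it
   stays inside a subset where ERE and PCRE agree, so that such a -P
   pattern could be handled by the default matcher instead. */
static bool
pattern_in_common_subset (const char *p)
{
  /* After a quantifier, '(' or '|', another quantifier would mean
     something PCRE-specific ("a*+" is possessive, "(*..." starts a
     verb), so bar quantifiers there, and at the very start. */
  bool quant_barred = true;

  for (size_t i = 0; p[i]; i++)
    {
      unsigned char c = p[i];

      if (c == '*' || c == '+' || c == '?' || c == '{')
        {
          if (quant_barred)
            return false;
          if (c == '{')
            {
              while (p[i] && p[i] != '}')
                i++;
              if (!p[i])
                return false;           /* unterminated {...} */
            }
          quant_barred = true;
        }
      else if (c == '\\')
        {
          /* Only escapes of the common metacharacters; \d, \w, ...
             would need translation (e.g. \d -> [0-9]) instead. */
          if (!p[i + 1] || !strchr (".^$[*+?{\\", p[i + 1]))
            return false;
          i++;
          quant_barred = false;
        }
      else if (c == '[')
        {
          /* Bracket expression with normal characters only. */
          i++;
          if (p[i] == '^')
            i++;
          if (p[i] == ']')
            i++;                        /* leading ] is literal */
          for (; p[i] && p[i] != ']'; i++)
            if (p[i] == '[' || p[i] == '\\')
              return false;             /* no classes or escapes */
          if (!p[i])
            return false;               /* unterminated bracket */
          quant_barred = false;
        }
      else
        /* '.', '^', '$', ')' and normal characters pass through;
           '(' and '|' pass too but bar a following quantifier. */
        quant_barred = (c == '(' || c == '|');
    }
  return true;
}
```

A real version would also have to check the contents of "{...}" and
handle "^"/"$" positions, but even this coarse filter shows the scan
is a single pass with constant state.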