On 2014-09-12 09:48:08 -0700, Paul Eggert wrote:
> Vincent Lefevre wrote:
> > I think that (1) is rather simple
>
> You may think it simple for the REs you're interested in, but someone
> else might say "hey! that doesn't cover the REs *I'm* interested
> in!". Solving the problem in general is nontrivial.

This is still better than no optimization at all.

> > But this is already the case:
>
> I was assuming the case where the input data contains an encoding
> error (not a null byte) that is transformed to a null byte before
> the user sees it.
>
> Really, this null-byte-replacement business would be just too weird.
> I don't see it as a viable general-purpose solution.

Anyway, since the problem can exist with null bytes, it needs to be
solved for null bytes. But this is also already the case:

$ printf "a\0b\n" | grep -a 'a..*b'
a^@b

(where the "^@" is in reverse video). So, the only "issue" would be
that

$ printf "a\x91b\n" | grep -a 'a..*b'

would output "a^@b" instead of... possibly something worse. Indeed,
outputting invalid UTF-8 sequences to the terminal is bad. Ideally
you would output "a<91>b" with "<91>" in reverse video, at some price
(this would be slower).

Now, if the behavior is chosen by an option, the user will be aware
of the meaning of the output, so this won't really matter.

> > Parsing UTF-8 is standard.
>
> It's a standard that keeps evolving, different releases of libpcre
> have done it differently, and I expect things to continue to evolve.

Could you give a reference? IMHO, this looks more like a bug. Anyway,
UTF-8 sequences that are valid today will still be valid in the
future; the only possible change is that new sequences become valid.
So, the only possible problem is that such new sequences would be
converted to null bytes when they should not be. This doesn't
introduce undefined behavior, just a different behavior (note that
this difference would also exist between two libpcre versions, so it
is not a big problem, and it will be fixable).

> Have you investigated why libpcre is so *slow* when doing UTF-8
> checking?

AFAIK, this is not due to libpcre's UTF-8 checking itself, otherwise
it would be very slow on valid text files too. I suppose this is due
to the many retries from the pcresearch.c code on binary files (the
line is split into many sublines, often consisting of a single byte
each), i.e. the problem is on the grep side. I don't see how this
could be solved except by doing the UTF-8 check on the grep side.
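To make the idea concrete, here is a rough sketch of what such a
grep-side check could look like (the helper names are hypothetical
and this is not grep's actual code; it follows the RFC 3629
definition of UTF-8): a single left-to-right pass that overwrites
with a null byte each byte at which no valid sequence starts, so that
the matcher only ever sees valid UTF-8 plus null bytes, which grep
already handles as shown above.

#include <stddef.h>

/* Rough sketch, not grep's actual code.  Return the length of the
   valid UTF-8 sequence starting at S (looking at most N bytes
   ahead), or 0 if no valid sequence starts there.  Follows RFC 3629:
   overlong forms, surrogates and values above U+10FFFF are
   rejected.  */
static size_t
utf8_seq_len (unsigned char const *s, size_t n)
{
  if (n == 0)
    return 0;
  if (s[0] < 0x80)                      /* ASCII */
    return 1;
  if (0xc2 <= s[0] && s[0] <= 0xdf)     /* 2-byte sequence */
    return (2 <= n && (s[1] & 0xc0) == 0x80) ? 2 : 0;
  if ((s[0] & 0xf0) == 0xe0)            /* 3-byte sequence */
    {
      if (n < 3 || (s[1] & 0xc0) != 0x80 || (s[2] & 0xc0) != 0x80)
        return 0;
      if (s[0] == 0xe0 && s[1] < 0xa0)  /* overlong */
        return 0;
      if (s[0] == 0xed && 0x9f < s[1])  /* UTF-16 surrogate */
        return 0;
      return 3;
    }
  if (0xf0 <= s[0] && s[0] <= 0xf4)     /* 4-byte sequence */
    {
      if (n < 4 || (s[1] & 0xc0) != 0x80
          || (s[2] & 0xc0) != 0x80 || (s[3] & 0xc0) != 0x80)
        return 0;
      if (s[0] == 0xf0 && s[1] < 0x90)  /* overlong */
        return 0;
      if (s[0] == 0xf4 && 0x8f < s[1])  /* above U+10FFFF */
        return 0;
      return 4;
    }
  return 0;                             /* invalid lead byte */
}

/* Overwrite, in place, every byte at which no valid UTF-8 sequence
   starts with a null byte, then resynchronize on the next byte.  */
static void
replace_invalid_with_nul (unsigned char *buf, size_t n)
{
  size_t i = 0;
  while (i < n)
    {
      size_t len = utf8_seq_len (buf + i, n - i);
      if (len == 0)
        buf[i++] = '\0';
      else
        i += len;
    }
}

With something like this, the matching could run on the whole line in
one go instead of retrying on many sublines; whether the extra pass
is cheap enough in practice is of course a separate question.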
> > I often want to take binary files into account
>
> In those cases I suggest using a unibyte C locale.

But I still want "." to match a single (valid) UTF-8 character. Well,
using the C locale on binary files and UTF-8 on text files might be
acceptable. But how can one do that with a recursive grep?

--
Vincent Lefèvre <vinc...@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)