Hi, this is a very interesting discussion.
>> /(?<=\x9c)#/
>>
>> Does it match \xd5\x9c# starting from #?
>
> No, because the input does not contain a \x9c encoding error. Encoding errors
> match only themselves, not parts of other characters. That is how the glibc
> matchers behave, and it's what users expect.

Why is \x9c part of another character? It depends on how you interpret \xd5.
And this was just a simple example.

>> Noticing errors during a backward scan is complicated.
>
> It's doable, and it's the right thing to do.

The problem is that whichever way you do it, somebody else will need something
different. Just think about the example above.

> Range expressions have implementation-defined semantics in POSIX. For PCRE you
> can do what you like. I suggest mapping encoding-error bytes into characters
> outside the Unicode range; that's what Emacs does, I think, and it's simple and
> easy to explain to users. It's not a big deal either way.

This mapping idea is clever. Basically, invalid bytes are converted to
something valid (see the sketch in the P.S. below).

>> What kind of invalid and valid UTF byte sequences are inside (and outside)
>> the bounds?
>
> Just treat encoding-error bytes like everything else. In effect, extend the
> encoding to allow any byte sequence, and add a few "characters" outside the
> Unicode range, one for each invalid UTF-8 byte.

In other words, \x9c really is an encoding error (since it is an invalid UTF-8
byte following another invalid UTF-8 byte). This is what I have said from the
beginning: depending on the context, people choose different interpretations of
UTF-8 fragments, usually whichever is more convenient from that viewpoint. But
if you put all the pieces together, the result is full of contradictions.

> Sorry, I don't quite follow, but encoding errors aren't letters and don't have
> case. They match only themselves.

Not necessarily. It depends on your mapping: if more than one invalid UTF-8
fragment is mapped to the same codepoint, they will match each other,
especially when you define a character range.

>> What Unicode properties does an invalid codepoint have?
>
> The minimal ones.

We could use the same flags as for the characters in the range
\x{d800}–\x{dfff} (the surrogates).

>> The question is, who would be willing to do this work.
>
> Not me. :-)

I know this would be a lot of work, and I have doubts that slowing down PCRE
would increase grep performance. Regardless, if somebody is willing to work on
this, I can help. Please keep in mind that PCRE1 is considered done, and our
efforts there are limited to bug fixing. We are currently busy with PCRE2, and
such a big change could only go there.

> I'm not sure it'd be popular to add a --drain-battery option to grep. :)

I don't think this really matters in performance-hungry desktop or server
environments, and on a phone you likely don't need this feature.

> I suggested that already, but the user (e.g., see the last paragraph of
> <http://bugs.gnu.org/18454#19>) says he wants to check for more-complicated
> UTF-8 patterns in binary data. For example, I expect the user wants the
> pattern 'Lef.vre' to match the UTF-8 string 'Lefèvre' in a binary file. So he
> can't simply use unibyte processing.

This is exactly the use case where filtering is needed: his input is a mix of
binary and UTF-8 data, and he needs matches only in the UTF-8 parts.

Regards,
Zoltan
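
P.S. To make the mapping idea concrete, here is a minimal sketch in C,
assuming each encoding-error byte B is mapped to the out-of-range "character"
0x110000 + B. The names ERROR_BASE and decode_one are mine, invented for the
example; this is neither PCRE nor Emacs code, and the range Emacs actually
uses internally may differ.

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

#define ERROR_BASE 0x110000u  /* first value past the Unicode range */

/* Decode one scalar value (or one error byte) starting at s[0], with len
   bytes remaining in the buffer. Stores the resulting "codepoint" in *cp
   and returns the number of bytes consumed. */
static size_t decode_one(const uint8_t *s, size_t len, uint32_t *cp)
{
    uint32_t c = s[0];
    size_t need, i;

    if (c < 0x80) { *cp = c; return 1; }
    else if (c >= 0xc2 && c <= 0xdf) { need = 1; c &= 0x1f; }
    else if (c >= 0xe0 && c <= 0xef) { need = 2; c &= 0x0f; }
    else if (c >= 0xf0 && c <= 0xf4) { need = 3; c &= 0x07; }
    else { *cp = ERROR_BASE + s[0]; return 1; }   /* invalid lead byte */

    if (need >= len) { *cp = ERROR_BASE + s[0]; return 1; }  /* truncated */

    for (i = 1; i <= need; i++) {
        if ((s[i] & 0xc0) != 0x80) {              /* bad continuation byte */
            *cp = ERROR_BASE + s[0];
            return 1;
        }
        c = (c << 6) | (s[i] & 0x3f);
    }

    /* Overlongs, surrogates, and values past U+10FFFF count as errors too. */
    if ((need == 2 && c < 0x800) || (need == 3 && c < 0x10000)
        || (c >= 0xd800 && c <= 0xdfff) || c > 0x10ffff) {
        *cp = ERROR_BASE + s[0];
        return 1;
    }

    *cp = c;
    return need + 1;
}

int main(void)
{
    /* The example from the discussion: \xd5\x9c#. */
    const uint8_t input[] = { 0xd5, 0x9c, '#' };
    size_t start;

    /* Decoding from offset 0, \xd5\x9c is the valid character U+055C.
       Decoding from offset 1 (as a backward-scanning lookbehind might
       effectively do), the lone \x9c becomes the error "character"
       0x110000 + 0x9c. */
    for (start = 0; start < 2; start++) {
        size_t i = start;
        printf("from offset %zu:", start);
        while (i < sizeof input) {
            uint32_t cp;
            i += decode_one(input + i, sizeof input - i, &cp);
            printf(" U+%06X", (unsigned)cp);
        }
        printf("\n");
    }
    return 0;
}

Decoding the buffer from offset 0 yields U+055C followed by '#', while
decoding from offset 1 yields the error value U+11009C followed by '#'; this
is exactly the interpretation problem discussed above.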