Zoltán Herczeg wrote:
> Just consider these two examples, where \x9c is an incorrectly encoded
> Unicode codepoint:
> /(?<=\x9c)#/
> Does it match \xd5\x9c# starting from #?
No, because the input does not contain a \x9c encoding error. Encoding errors
match only themselves, not parts of other characters. That is how the glibc
matchers behave, and it's what users expect.
> Noticing errors during a backward scan is complicated.
It's doable, and it's the right thing to do.
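For what it's worth, here is a rough sketch in plain C (not libpcre code) of
the kind of backward check I mean, assuming the match point is already known
to sit on a character boundary; overlong and surrogate rejection are omitted
to keep it short:

#include <stdbool.h>
#include <stddef.h>

static int
utf8_seq_len (unsigned char lead)
{
  if (lead < 0x80) return 1;
  if (0xc2 <= lead && lead <= 0xdf) return 2;
  if (0xe0 <= lead && lead <= 0xef) return 3;
  if (0xf0 <= lead && lead <= 0xf4) return 4;
  return 0;   /* not a valid lead byte */
}

/* Return true if the byte just before offset P in SUBJ is a lone
   encoding-error byte (so a lookbehind for \x9c should match there),
   false if it is the last byte of a valid multibyte character.  */
static bool
byte_before_is_encoding_error (unsigned char const *subj, size_t p)
{
  if (p == 0)
    return false;   /* nothing before the match point */

  /* Walk back over at most 3 continuation bytes to a candidate lead.  */
  size_t start = p - 1;
  for (int k = 0; k < 3 && start > 0 && (subj[start] & 0xc0) == 0x80; k++)
    start--;

  int len = utf8_seq_len (subj[start]);
  if (len == 0 || start + len != p)
    return true;    /* no valid sequence ends exactly at P */

  /* Every byte after the lead must really be a continuation byte.  */
  for (size_t i = start + 1; i < p; i++)
    if ((subj[i] & 0xc0) != 0x80)
      return true;
  return false;
}

So /(?<=\x9c)#/ against \xd5\x9c# walks back from '#', finds the lead byte
\xd5, sees that \xd5\x9c is a complete two-byte character, and correctly
refuses to treat the \x9c as an encoding error.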
> /[\x9c-\x{ffff}]/
> What does this range define, exactly?
Range expressions have implementation-defined semantics in POSIX. For PCRE you
can do what you like. I suggest mapping encoding-error bytes into characters
outside the Unicode range; that's what Emacs does, I think, and it's simple and
easy to explain to users. It's not a big deal either way.
> What kinds of invalid and valid UTF byte sequences are inside (and outside)
> the bounds?
Just treat encoding-error bytes like everything else. In effect, extend the
encoding to allow any byte sequence, and add a few "characters" outside the
Unicode range, one for each invalid UTF-8 byte.
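A decoder along those lines could look roughly like this. The base value
0x110000 for the error characters is an arbitrary choice for illustration;
Emacs puts its raw-byte characters at its own spot above the Unicode range,
and libpcre could pick whatever is convenient:

#include <stddef.h>

#define ERROR_CHAR_BASE 0x110000u   /* U+10FFFF + 1 + the raw byte value */

/* Decode one "character" starting at S (N bytes available, N >= 1).
   Valid UTF-8 decodes to its Unicode code point; any byte that cannot
   start or complete a valid sequence decodes to a one-byte
   pseudo-character above the Unicode range.  Store the value in *PC and
   return the number of bytes consumed (always at least 1).  */
static size_t
decode_extended_utf8 (unsigned char const *s, size_t n, unsigned int *pc)
{
  unsigned int c = s[0];

  if (c < 0x80)
    { *pc = c; return 1; }

  size_t len = (0xc2 <= c && c <= 0xdf) ? 2
             : (0xe0 <= c && c <= 0xef) ? 3
             : (0xf0 <= c && c <= 0xf4) ? 4 : 0;
  if (len == 0 || n < len)
    goto error;

  unsigned int cp = c & (0x7f >> len);
  for (size_t i = 1; i < len; i++)
    {
      if ((s[i] & 0xc0) != 0x80)
        goto error;
      cp = (cp << 6) | (s[i] & 0x3f);
    }

  /* Reject overlong forms and surrogates, so they too become error bytes.  */
  if (cp < (len == 2 ? 0x80u : len == 3 ? 0x800u : 0x10000u)
      || (0xd800 <= cp && cp <= 0xdfff) || 0x10ffff < cp)
    goto error;

  *pc = cp;
  return len;

 error:
  *pc = ERROR_CHAR_BASE + c;   /* each invalid byte maps to its own value */
  return 1;
}

With this, every byte sequence decodes to exactly one sequence of
"characters", so questions like which characters fall inside [\x9c-\x{ffff}]
have a definite (if implementation-defined) answer.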
> Caseless matching is another question: does /\xe9/ match \xc3\x89, or the
> invalid UTF byte sequence \xc9?
Sorry, I don't quite follow, but encoding errors aren't letters and don't have
case. They match only themselves.
> What Unicode properties does an invalid codepoint have?
The minimal ones.
> Depending on their needs, everybody has different answers to these questions.
That's fine. Just implement reasonable defaults, and provide options if people
have needs that differ from the defaults. That's easier for libpcre than for
grep, since libpcre users (who are programmers) can reasonably be expected to be
more sophisticated about this sort of thing than grep users (who are not
necessarily programmers).
> Imagine if you needed to add buffer-end and other bit checks.
Of course it will be more expensive to check for UTF-8 as you go, than to assume
the input is valid UTF-8. But again, we're not talking about the
PCRE_NO_UTF8_CHECK case where libpcre can assume valid UTF-8; we're talking
about the non-PCRE_NO_UTF8_CHECK case, where libpcre must check whether the
input is valid UTF-8, and currently does so inefficiently. In the
non-PCRE_NO_UTF8_CHECK case, it's often cheaper to check for UTF-8 as you go,
than to have a prepass that checks for UTF-8. This is because the prepass must
be stupid (it must check the entire input buffer) whereas the matcher can be
smart (it often can do its work without checking the entire input buffer). This
is one reason libpcre is slower than the glibc matchers.
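Here is a toy illustration of that point, using a fixed-string search for
'foobar'. None of this is libpcre code; decode_extended_utf8 is the sketch
from above, and the point is only the shape of the work, not the details:

#define _GNU_SOURCE            /* for memmem */
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* From the sketch above.  */
size_t decode_extended_utf8 (unsigned char const *, size_t, unsigned int *);

/* Prepass style: decode (and thereby validate) every byte of the buffer
   before matching even starts, so the cost is O(n) no matter what.  */
static bool
search_with_prepass (unsigned char const *buf, size_t n)
{
  unsigned int c;
  for (size_t i = 0; i < n; )
    i += decode_extended_utf8 (buf + i, n - i, &c);
  return memmem (buf, n, "foobar", 6) != NULL;
}

/* Check-as-you-go style: jump straight to candidate positions with memchr
   and look only at the few bytes the match actually needs; most of the
   buffer is never decoded at all.  */
static bool
search_as_you_go (unsigned char const *buf, size_t n)
{
  for (unsigned char const *p = buf;
       (p = memchr (p, 'f', n - (p - buf))) != NULL;
       p++)
    if ((size_t) (p - buf) + 6 <= n && memcmp (p, "foobar", 6) == 0)
      return true;   /* decode/validate here only if the pattern needs
                        character-level semantics at this spot */
  return false;
}

The prepass version pays the full decoding cost on every buffer, even when
the memchr prefilter would have rejected it after touching only a handful of
bytes.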
Obviously it would be some work to build a libpcre that runs faster in the
non-PCRE_NO_UTF8_CHECK case, without hurting performance in the
PCRE_NO_UTF8_CHECK case. But it could be done, if someone had the time to do it.
The question is, who would be willing to do this work.
Not me. :-)
>> That would chew up CPU resources unnecessarily
> Yeah but you could add a flag to enable this :)
I'm not sure it'd be popular to add a --drain-battery option to grep. :)
>> The use case that prompted this bug report is someone using 'grep -r' to
>> search for strings like 'foobar' in binary data, and this use case would
>> not work with this suggested solution.
> In this case, I would simply disable UTF-8 decoding.
I suggested that already, but the user (e.g., see the last paragraph of
<http://bugs.gnu.org/18454#19>) says he wants to check for more-complicated
UTF-8 patterns in binary data. For example, I expect the user wants the pattern
'Lef.vre' to match the UTF-8 string 'Lefèvre' in a binary file. So he can't
simply use unibyte processing.
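For instance, something along these lines (PCRE1 API, minimal error handling,
purely illustrative): compiled in unibyte mode, '.' matches exactly one byte
and cannot cover the two bytes of the UTF-8 'è', whereas with PCRE_UTF8 it
matches one character:

#include <pcre.h>
#include <stdio.h>
#include <string.h>

int
main (void)
{
  char const *subject = "Lef\xc3\xa8vre";   /* "Lefèvre" in UTF-8 */
  char const *err;
  int erroffset, ovec[30];

  /* Unibyte: '.' consumes one byte, so the match fails.  */
  pcre *uni = pcre_compile ("Lef.vre", 0, &err, &erroffset, NULL);
  int rc1 = pcre_exec (uni, NULL, subject, (int) strlen (subject),
                       0, 0, ovec, 30);
  printf ("unibyte: %s\n", 0 <= rc1 ? "match" : "no match");

  /* UTF-8: '.' consumes one character, so the match succeeds.  (On invalid
     input, pcre_exec would instead report a UTF-8 error unless
     PCRE_NO_UTF8_CHECK were passed, which is the whole problem.)  */
  pcre *utf = pcre_compile ("Lef.vre", PCRE_UTF8, &err, &erroffset, NULL);
  int rc2 = pcre_exec (utf, NULL, subject, (int) strlen (subject),
                       0, 0, ovec, 30);
  printf ("utf-8:   %s\n", 0 <= rc2 ? "match" : "no match");

  pcre_free (uni);
  pcre_free (utf);
  return 0;
}

So simply turning off UTF-8 decoding would break exactly the patterns this
user cares about.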