bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales

Paul Eggert Sat, 27 Sep 2014 13:56:10 -0700

Zoltán Herczeg wrote:

> I still want "." to match a single (valid) UTF-8 character.

That's what the GNU matchers do, yes. '.' does not match an invalid byte. It's a reasonable default. If you have some users who want '.' to match an invalid byte, you can add a flag for them, just as there's a PCRE_DOTALL flag for users who want '.' to match newline. That being said, I doubt whether users will care enough to need such a flag. (After all, they're evidently not caring *now*, as libpcre can't search such data at *all*.)
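To make the '.' semantics described above concrete, here is a minimal sketch in Python. It is not how grep or libpcre implement it; the names valid_char_len and dot_match_len are hypothetical, and the sketch simply leans on Python's strict UTF-8 decoder to decide what counts as a valid character.

```python
def valid_char_len(buf: bytes, i: int) -> int:
    """Length of the valid UTF-8 character starting at buf[i], or 0 if none.

    Hypothetical helper: tries 1..4 bytes and lets Python's strict decoder
    reject overlong forms, surrogates, and truncated sequences.
    """
    for n in (1, 2, 3, 4):
        chunk = buf[i:i + n]
        if len(chunk) < n:
            break  # ran off the end of the buffer
        try:
            chunk.decode("utf-8")
        except UnicodeDecodeError:
            continue  # not a complete valid character at this length
        return n
    return 0


def dot_match_len(buf: bytes, i: int) -> int:
    """Bytes consumed by '.' at offset i: one valid non-newline character, else 0."""
    n = valid_char_len(buf, i)
    if n == 1 and buf[i:i + 1] == b"\n":
        return 0  # '.' does not match newline by default
    return n
```

With this sketch, dot_match_len returns 2 at the start of the two-byte encoding of "é" and 0 for a stray byte such as 0xFF, so an invalid byte is simply a position where '.' fails to match.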

> In the regex world, matching performance is the key aspect of an engine

Absolutely. That's why we're having this discussion: libpcre is slow when matching binary data.

> A "simple" change like this would require a major redesign of the engine.


It'd be nontrivial, yes.  But it's clearly doable.  (Not that I'm 
volunteering....)

> What should happen, if the starting offset is inside an otherwise valid UTF
> character?

The same thing that would happen if an input file started with the tail end of a UTF-8 sequence. The leading bytes are invalid. 'grep' deals with this already; it's not a problem.
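As a small illustration of the point above (a sketch using Python's built-in decoder, not grep's code; decode_from is a hypothetical name), starting inside a two-byte character leaves one leading continuation byte, which is treated as invalid while everything after it decodes normally:

```python
def decode_from(buf: bytes, start: int) -> str:
    """Decode buf from a byte offset, marking invalid bytes as U+FFFD."""
    return buf[start:].decode("utf-8", errors="replace")


data = "é".encode("utf-8") + b"abc"  # b'\xc3\xa9abc'
whole = decode_from(data, 0)         # offset on a character boundary
tail = decode_from(data, 1)          # offset inside the two-byte 'é'
```

Here tail begins with a single replacement character for the orphaned byte 0xA9, followed by "abc" unchanged.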

Filtering would not be needed if libpcre were like grep's other matchers
and simply worked with arbitrary binary data.


> This might be efficient for engines which scans the input only forward direction
> and read every character once.

It can also be efficient for matchers, like grep's, that don't necessarily do that. It just takes more implementation work, that's all. It's not rocket science to go backwards through a UTF-8 string and to catch decoding errors as you go.
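A rough sketch of such a backward scan, in Python for brevity. This is an illustration under my own assumptions, not code from grep or libpcre, and prev_char and scan_backwards are hypothetical names. It steps back over at most three continuation bytes (10xxxxxx) to find a candidate lead byte, then validates the candidate with a strict decode; anything that fails is reported as a single invalid byte.

```python
def prev_char(buf: bytes, end: int):
    """Return (start, valid) for the unit that ends just before buf[end].

    Requires end >= 1. A valid unit is one whole UTF-8 character; an
    invalid unit is always the single byte at end - 1.
    """
    start = end - 1
    lowest = max(0, end - 4)  # a character is at most 4 bytes long
    # Skip back over continuation bytes (0b10xxxxxx) to the lead byte.
    while start > lowest and (buf[start] & 0xC0) == 0x80:
        start -= 1
    try:
        buf[start:end].decode("utf-8")  # strict check of the candidate
        return start, True
    except UnicodeDecodeError:
        return end - 1, False


def scan_backwards(buf: bytes):
    """Yield (start, end, valid) for each unit, from the end to the start."""
    end = len(buf)
    while end > 0:
        start, valid = prev_char(buf, end)
        yield start, end, valid
        end = start
```

For the input "aé" followed by a stray 0xFF byte and then "€", the scan yields the three-byte "€", then the invalid byte on its own, then "é", then "a", so the error is caught at the point where it occurs without a forward pass.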
