Zoltán Herczeg wrote:
> I still want "." to match a single (valid) UTF-8 character.
That's what the GNU matchers do, yes. '.' does not match an invalid byte. It's
a reasonable default. If you have some users who want '.' to match an invalid
byte, you can add a flag for them, just as there's a PCRE_DOTALL flag for users
who want '.' to match newline. That being said, I doubt whether users will care
enough to need such a flag. (After all, they're evidently not caring *now*, as
libpcre can't search such data at *all*.)
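To make the '.' semantics concrete, here is a minimal Python sketch, not code from grep or libpcre. The helper names dot_length and classify are hypothetical, and the sketch ignores the separate question of whether '.' matches newline. It leans on Python's strict UTF-8 decoder to decide validity: '.' consumes a whole valid character, and an invalid byte is simply something '.' never matches.

```python
def dot_length(buf: bytes, i: int) -> int:
    """Return the byte length of the valid UTF-8 character starting at
    buf[i], or 0 if buf[i] does not start one (so '.' cannot match here)."""
    lead = buf[i]
    if lead < 0x80:
        n = 1
    elif 0xC2 <= lead <= 0xDF:
        n = 2
    elif 0xE0 <= lead <= 0xEF:
        n = 3
    elif 0xF0 <= lead <= 0xF4:
        n = 4
    else:
        return 0  # stray continuation byte or a byte that is never valid
    try:
        # The strict decoder rejects truncated, overlong and surrogate forms.
        buf[i:i + n].decode("utf-8")
    except UnicodeDecodeError:
        return 0
    return n


def classify(buf: bytes) -> list:
    """Split buf into valid characters (str) and invalid bytes (int)."""
    units = []
    i = 0
    while i < len(buf):
        n = dot_length(buf, i)
        if n:
            units.append(buf[i:i + n].decode("utf-8"))
            i += n
        else:
            units.append(buf[i])  # invalid byte: '.' does not match it
            i += 1
    return units
```

For example, classify(b"a\xffb") yields the characters "a" and "b" with the invalid byte 0xFF between them, and only the two characters are candidates for '.'.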
> In the regex world, matching performance is the key aspect of an engine
Absolutely. That's why we're having this discussion: libpcre is slow when
matching binary data.
> A "simple" change like this would require a major redesign of the engine.
It'd be nontrivial, yes. But it's clearly doable. (Not that I'm
volunteering....)
> What should happen, if the starting offset is inside an otherwise valid UTF
> character?
The same thing that would happen if an input file started with the tail end of a
UTF-8 sequence. The leading bytes are invalid. 'grep' deals with this already;
it's not a problem.
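As an illustration only, using Python's standard codec rather than anything in grep or libpcre: decoding from an offset that falls inside a character reports the leftover continuation bytes as invalid, one byte at a time, and then carries on normally. The helper name decode_from is hypothetical; the "surrogateescape" error handler is the standard one that maps each undecodable byte 0xNN to the lone surrogate U+DCNN.

```python
def decode_from(buf: bytes, offset: int) -> str:
    """Decode buf starting at offset; each invalid byte 0xNN shows up
    in the result as the lone surrogate U+DCNN."""
    return buf[offset:].decode("utf-8", errors="surrogateescape")


data = "x\u20acy".encode("utf-8")   # b'x\xe2\x82\xacy'
# Offset 2 lands on the second byte of the three-byte euro sign.
text = decode_from(data, 2)
invalid = [ch for ch in text if "\udc80" <= ch <= "\udcff"]
print(len(invalid), text[-1])       # two stray bytes, then a normal 'y'
```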
Filtering would not be needed if libpcre were like grep's other matchers
and simply worked with arbitrary binary data.
> This might be efficient for engines that scan the input only in the forward
> direction and read every character once.
It can also be efficient for matchers, like grep's, that don't necessarily do
that. It just takes more implementation work, that's all. It's not rocket
science to go backwards through a UTF-8 string and to catch decoding errors as
you go.
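Here is a rough Python sketch of the backward walk, under my own assumptions rather than taken from any existing matcher; prev_unit and scan_backward are hypothetical names. Stepping back from a position, it skips at most three continuation bytes looking for a lead byte, then lets the strict decoder judge the candidate sequence. If that fails, the last byte is reported as a single invalid byte and the walk continues from there.

```python
def prev_unit(buf: bytes, end: int):
    """Return (start, ok) for the unit ending just before index end.
    ok is True for a valid UTF-8 character, False for one invalid byte."""
    for n in range(1, 5):
        start = end - n
        if start < 0:
            break
        if (buf[start] & 0xC0) != 0x80:   # not a continuation byte
            try:
                buf[start:end].decode("utf-8")
                return start, True
            except UnicodeDecodeError:
                break
    return end - 1, False


def scan_backward(buf: bytes) -> list:
    """Walk buf from the end; return its units in forward order, with
    valid characters as str and invalid bytes as int."""
    units = []
    end = len(buf)
    while end > 0:
        start, ok = prev_unit(buf, end)
        units.append(buf[start:end].decode("utf-8") if ok else buf[start])
        end = start
    units.reverse()
    return units
```

Each step looks at no more than four bytes, so the backward walk stays linear in the input size.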