Zoltán Herczeg wrote:
> Just consider these two examples, where \x9c is an incorrectly encoded
> Unicode codepoint:
> /(?<=\x9c)#/
> Does it match \xd5\x9c# starting from #?
No, because the input does not contain a \x9c encoding error. Encoding errors
match only themselves, not parts of other characters. That is how the glibc
matchers behave, and it's what users expect.
> Noticing errors during a backward scan is complicated.
It's doable, and it's the right thing to do.
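For what it's worth, here is a rough sketch in plain C (not libpcre code) of
the kind of backward check I mean, assuming the match point is already known
to sit on a character boundary; overlong and surrogate rejection are omitted
to keep it short:

#include <stdbool.h>
#include <stddef.h>

static int
utf8_seq_len (unsigned char lead)
{
  if (lead < 0x80) return 1;
  if (0xc2 <= lead && lead <= 0xdf) return 2;
  if (0xe0 <= lead && lead <= 0xef) return 3;
  if (0xf0 <= lead && lead <= 0xf4) return 4;
  return 0;   /* not a valid lead byte */
}

/* Return true if the byte just before offset P in SUBJ is a lone
   encoding-error byte (so a lookbehind for \x9c should match there),
   false if it is the last byte of a valid multibyte character.  */
static bool
byte_before_is_encoding_error (unsigned char const *subj, size_t p)
{
  if (p == 0)
    return false;   /* nothing before the match point */

  /* Walk back over at most 3 continuation bytes to a candidate lead.  */
  size_t start = p - 1;
  for (int k = 0; k < 3 && start > 0 && (subj[start] & 0xc0) == 0x80; k++)
    start--;

  int len = utf8_seq_len (subj[start]);
  if (len == 0 || start + len != p)
    return true;    /* no valid sequence ends exactly at P */

  /* Every byte after the lead must really be a continuation byte.  */
  for (size_t i = start + 1; i < p; i++)
    if ((subj[i] & 0xc0) != 0x80)
      return true;
  return false;
}

So /(?<=\x9c)#/ against \xd5\x9c# walks back from '#', finds the lead byte
\xd5, sees that \xd5\x9c is a complete two-byte character, and correctly
refuses to treat the \x9c as an encoding error.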
> /[\x9c-\x{ffff}]/
> What does this range define, exactly?
Range expressions have implementation-defined semantics in POSIX. For PCRE you
can do what you like. I suggest mapping encoding-error bytes into characters
outside the Unicode range; that's what Emacs does, I think, and it's simple and
easy to explain to users. It's not a big deal either way.
> What kinds of invalid and valid UTF byte sequences are inside (and outside)
> the bounds?
Just treat encoding-error bytes like everything else. In effect, extend the
encoding to allow any byte sequence, and add a few "characters" outside the
Unicode range, one for each invalid UTF-8 byte.
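A decoder along those lines could look roughly like this. The base value
0x110000 for the error characters is an arbitrary choice for illustration;
Emacs puts its raw-byte characters at its own spot above the Unicode range,
and libpcre could pick whatever is convenient:

#include <stddef.h>

#define ERROR_CHAR_BASE 0x110000u   /* U+10FFFF + 1 + the raw byte value */

/* Decode one "character" starting at S (N bytes available, N >= 1).
   Valid UTF-8 decodes to its Unicode code point; any byte that cannot
   start or complete a valid sequence decodes to a one-byte
   pseudo-character above the Unicode range.  Store the value in *PC and
   return the number of bytes consumed (always at least 1).  */
static size_t
decode_extended_utf8 (unsigned char const *s, size_t n, unsigned int *pc)
{
  unsigned int c = s[0];

  if (c < 0x80)
    { *pc = c; return 1; }

  size_t len = (0xc2 <= c && c <= 0xdf) ? 2
             : (0xe0 <= c && c <= 0xef) ? 3
             : (0xf0 <= c && c <= 0xf4) ? 4 : 0;
  if (len == 0 || n < len)
    goto error;

  unsigned int cp = c & (0x7f >> len);
  for (size_t i = 1; i < len; i++)
    {
      if ((s[i] & 0xc0) != 0x80)
        goto error;
      cp = (cp << 6) | (s[i] & 0x3f);
    }

  /* Reject overlong forms and surrogates, so they too become error bytes.  */
  if (cp < (len == 2 ? 0x80u : len == 3 ? 0x800u : 0x10000u)
      || (0xd800 <= cp && cp <= 0xdfff) || 0x10ffff < cp)
    goto error;

  *pc = cp;
  return len;

 error:
  *pc = ERROR_CHAR_BASE + c;   /* each invalid byte maps to its own value */
  return 1;
}

With this, every byte sequence decodes to exactly one sequence of
"characters", so questions like which characters fall inside [\x9c-\x{ffff}]
have a definite (if implementation-defined) answer.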
> Caseless matching is another question: does /\xe9/ match \xc3\x89, or the
> invalid UTF byte sequence \xc9?
Sorry, I don't quite follow, but encoding errors aren't letters and don't have
case. They match only themselves.
> What Unicode properties does an invalid codepoint have?
The minimal ones.
> Depending on their needs, everybody has different answers to these questions.
That's fine. Just implement reasonable defaults, and provide options if people
have needs that differ from the defaults. That's easier for libpcre than for
grep, since libpcre users (who are programmers) can reasonably be expected to be
more sophisticated about this sort of thing than grep users (who are not
necessarily programmers).
> Imagine if you needed to add buffer-end and other bit checks.
Of course it will be more expensive to check for UTF-8 as you go, than to assume
the input is valid UTF-8. But again, we're not talking about the
PCRE_NO_UTF8_CHECK case where libpcre can assume valid UTF-8; we're talking
about the non-PCRE_NO_UTF8_CHECK case, where libpcre must check whether the
input is valid UTF-8, and currently does so inefficiently. In the
non-PCRE_NO_UTF8_CHECK case, it's often cheaper to check for UTF-8 as you go,
than to have a prepass that checks for UTF-8. This is because the prepass must
be stupid (it must check the entire input buffer) whereas the matcher can be
smart (it often can do its work without checking the entire input buffer). This
is one reason libpcre is slower than the glibc matchers.
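Here is a toy illustration of that point, using a fixed-string search for
'foobar'. None of this is libpcre code; decode_extended_utf8 is the sketch
from above, and the point is only the shape of the work, not the details:

#define _GNU_SOURCE            /* for memmem */
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* From the sketch above.  */
size_t decode_extended_utf8 (unsigned char const *, size_t, unsigned int *);

/* Prepass style: decode (and thereby validate) every byte of the buffer
   before matching even starts, so the cost is O(n) no matter what.  */
static bool
search_with_prepass (unsigned char const *buf, size_t n)
{
  unsigned int c;
  for (size_t i = 0; i < n; )
    i += decode_extended_utf8 (buf + i, n - i, &c);
  return memmem (buf, n, "foobar", 6) != NULL;
}

/* Check-as-you-go style: jump straight to candidate positions with memchr
   and look only at the few bytes the match actually needs; most of the
   buffer is never decoded at all.  */
static bool
search_as_you_go (unsigned char const *buf, size_t n)
{
  for (unsigned char const *p = buf;
       (p = memchr (p, 'f', n - (p - buf))) != NULL;
       p++)
    if ((size_t) (p - buf) + 6 <= n && memcmp (p, "foobar", 6) == 0)
      return true;   /* decode/validate here only if the pattern needs
                        character-level semantics at this spot */
  return false;
}

The prepass version pays the full decoding cost on every buffer, even when
the memchr prefilter would have rejected it after touching only a handful of
bytes.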
Obviously it would be some work to build a libpcre that runs faster in the
non-PCRE_NO_UTF8_CHECK case, without hurting performance in the
PCRE_NO_UTF8_CHECK case. But it could be done, if someone had the time to do it.
The question is, who would be willing to do this work.
Not me. :-)
>> That would chew up CPU resources unnecessarily
> Yeah but you could add a flag to enable this :)
I'm not sure it'd be popular to add a --drain-battery option to grep. :)
>> The use case that prompted this bug report is someone using 'grep -r' to
>> search for strings like 'foobar' in binary data, and this use case would
>> not work with this suggested solution.
> In this case, I would simply disable UTF-8 decoding.
I suggested that already, but the user (e.g., see the last paragraph of
<http://bugs.gnu.org/18454#19>) says he wants to check for more-complicated
UTF-8 patterns in binary data. For example, I expect the user wants the pattern
'Lef.vre' to match the UTF-8 string 'Lefèvre' in a binary file. So he can't
simply use unibyte processing.
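For instance, something along these lines (PCRE1 API, minimal error handling,
purely illustrative): compiled in unibyte mode, '.' matches exactly one byte
and cannot cover the two bytes of the UTF-8 'è', whereas with PCRE_UTF8 it
matches one character:

#include <pcre.h>
#include <stdio.h>
#include <string.h>

int
main (void)
{
  char const *subject = "Lef\xc3\xa8vre";   /* "Lefèvre" in UTF-8 */
  char const *err;
  int erroffset, ovec[30];

  /* Unibyte: '.' consumes one byte, so the match fails.  */
  pcre *uni = pcre_compile ("Lef.vre", 0, &err, &erroffset, NULL);
  int rc1 = pcre_exec (uni, NULL, subject, (int) strlen (subject),
                       0, 0, ovec, 30);
  printf ("unibyte: %s\n", 0 <= rc1 ? "match" : "no match");

  /* UTF-8: '.' consumes one character, so the match succeeds.  (On invalid
     input, pcre_exec would instead report a UTF-8 error unless
     PCRE_NO_UTF8_CHECK were passed, which is the whole problem.)  */
  pcre *utf = pcre_compile ("Lef.vre", PCRE_UTF8, &err, &erroffset, NULL);
  int rc2 = pcre_exec (utf, NULL, subject, (int) strlen (subject),
                       0, 0, ovec, 30);
  printf ("utf-8:   %s\n", 0 <= rc2 ? "match" : "no match");

  pcre_free (uni);
  pcre_free (utf);
  return 0;
}

So simply turning off UTF-8 decoding would break exactly the patterns this
user cares about.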