Hi, this is a very interesting discussion.
>> /(?<=\x9c)#/
>>
>> Does it match \xd5\x9c# starting from #?
>
> No, because the input does not contain a \x9c encoding error. Encoding errors
> match only themselves, not parts of other characters. That is how the glibc
> matchers behave, and it's what users expect.

Why is \x9c part of another character? It depends on how you interpret \xd5.
And this was just a simple example.

>> Noticing errors during a backward scan is complicated.
>
> It's doable, and it's the right thing to do.

The problem is that whichever way you do it, somebody else will need something
different. Just think about the example above.

> Range expressions have implementation-defined semantics in POSIX. For PCRE you
> can do what you like. I suggest mapping encoding-error bytes into characters
> outside the Unicode range; that's what Emacs does, I think, and it's simple and
> easy to explain to users. It's not a big deal either way.

This mapping idea is clever. Basically, invalid bytes are converted to
something valid (see the sketch in the P.S. below).

>> What kind of invalid and valid UTF byte sequences are inside (and outside)
>> the bounds?
>
> Just treat encoding-error bytes like everything else. In effect, extend the
> encoding to allow any byte sequence, and add a few "characters" outside the
> Unicode range, one for each invalid UTF-8 byte.

In other words, \x9c really is an encoding error (since it is an invalid UTF-8
byte following another invalid UTF-8 byte). This is what I have said from the
beginning: depending on the context, people choose different interpretations of
UTF-8 fragments, usually whichever is more convenient from that viewpoint. But
if you put all the pieces together, the result is full of contradictions.

> Sorry, I don't quite follow, but encoding errors aren't letters and don't have
> case. They match only themselves.

Not necessarily. It depends on your mapping: if more than one invalid UTF-8
fragment is mapped to the same codepoint, they will match each other,
especially when you define a character range.

>> What Unicode properties does an invalid codepoint have?
>
> The minimal ones.

We could use the same flags as for the characters in the range
\x{d800}–\x{dfff} (the surrogates).

>> The question is, who would be willing to do this work.
>
> Not me. :-)

I know this would be a lot of work, and I have doubts that slowing down PCRE
would increase grep performance. Regardless, if somebody is willing to work on
this, I can help. Please keep in mind that PCRE1 is considered done, and our
efforts there are limited to bug fixing. We are currently busy with PCRE2, and
such a big change could only go there.

> I'm not sure it'd be popular to add a --drain-battery option to grep. :)

I don't think this really matters in performance-hungry desktop or server
environments, and on a phone you likely don't need this feature.

> I suggested that already, but the user (e.g., see the last paragraph of
> <http://bugs.gnu.org/18454#19>) says he wants to check for more-complicated
> UTF-8 patterns in binary data. For example, I expect the user wants the
> pattern 'Lef.vre' to match the UTF-8 string 'Lefèvre' in a binary file. So he
> can't simply use unibyte processing.

This is exactly the use case where filtering is needed: his input is a mix of
binary and UTF-8 data, and he needs matches only in the UTF-8 parts.

Regards,
Zoltan
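
P.S. To make the mapping idea concrete, here is a minimal sketch in C,
assuming each encoding-error byte B is mapped to the out-of-range "character"
0x110000 + B. The names ERROR_BASE and decode_one are mine, invented for the
example; this is neither PCRE nor Emacs code, and the range Emacs actually
uses internally may differ.

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

#define ERROR_BASE 0x110000u  /* first value past the Unicode range */

/* Decode one scalar value (or one error byte) starting at s[0], with len
   bytes remaining in the buffer. Stores the resulting "codepoint" in *cp
   and returns the number of bytes consumed. */
static size_t decode_one(const uint8_t *s, size_t len, uint32_t *cp)
{
    uint32_t c = s[0];
    size_t need, i;

    if (c < 0x80) { *cp = c; return 1; }
    else if (c >= 0xc2 && c <= 0xdf) { need = 1; c &= 0x1f; }
    else if (c >= 0xe0 && c <= 0xef) { need = 2; c &= 0x0f; }
    else if (c >= 0xf0 && c <= 0xf4) { need = 3; c &= 0x07; }
    else { *cp = ERROR_BASE + s[0]; return 1; }   /* invalid lead byte */

    if (need >= len) { *cp = ERROR_BASE + s[0]; return 1; }  /* truncated */

    for (i = 1; i <= need; i++) {
        if ((s[i] & 0xc0) != 0x80) {              /* bad continuation byte */
            *cp = ERROR_BASE + s[0];
            return 1;
        }
        c = (c << 6) | (s[i] & 0x3f);
    }

    /* Overlongs, surrogates, and values past U+10FFFF count as errors too. */
    if ((need == 2 && c < 0x800) || (need == 3 && c < 0x10000)
        || (c >= 0xd800 && c <= 0xdfff) || c > 0x10ffff) {
        *cp = ERROR_BASE + s[0];
        return 1;
    }

    *cp = c;
    return need + 1;
}

int main(void)
{
    /* The example from the discussion: \xd5\x9c#. */
    const uint8_t input[] = { 0xd5, 0x9c, '#' };
    size_t start;

    /* Decoding from offset 0, \xd5\x9c is the valid character U+055C.
       Decoding from offset 1 (as a backward-scanning lookbehind might
       effectively do), the lone \x9c becomes the error "character"
       0x110000 + 0x9c. */
    for (start = 0; start < 2; start++) {
        size_t i = start;
        printf("from offset %zu:", start);
        while (i < sizeof input) {
            uint32_t cp;
            i += decode_one(input + i, sizeof input - i, &cp);
            printf(" U+%06X", (unsigned)cp);
        }
        printf("\n");
    }
    return 0;
}

Decoding the buffer from offset 0 yields U+055C followed by '#', while
decoding from offset 1 yields the error value U+11009C followed by '#'; this
is exactly the interpretation problem discussed above.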