On 2014-09-12 09:48:08 -0700, Paul Eggert wrote:
> Vincent Lefevre wrote:
> > I think that (1) is rather simple
>
> You may think it simple for the REs you're interested in, but someone
> else might say "hey! that doesn't cover the REs *I'm* interested
> in!". Solving the problem in general is nontrivial.

This is still better than no optimization at all.

> > But this is already the case:
>
> I was assuming the case where the input data contains an encoding
> error (not a null byte) that is transformed to a null byte before
> the user sees it.
>
> Really, this null-byte-replacement business would be just too weird.
> I don't see it as a viable general-purpose solution.

Anyway, since the problem can exist with null bytes, it needs to be
solved for null bytes. But this is also already the case:

$ printf "a\0b\n" | grep -a 'a..*b'
a^@b

(where the "^@" is in reverse video). So, the only "issue" would be
that

$ printf "a\x91b\n" | grep -a 'a..*b'

would output "a^@b" instead of... possibly something worse. Indeed,
outputting invalid UTF-8 sequences to the terminal is bad. Ideally
you would output "a<91>b" with "<91>" in reverse video, at some price
(this would be slower).

Now, if the behavior is chosen by an option, the user will be aware
of the meaning of the output, so this won't really matter.

> > Parsing UTF-8 is standard.
>
> It's a standard that keeps evolving, different releases of libpcre
> have done it differently, and I expect things to continue to evolve.

Could you give a reference? IMHO, this looks more like a bug. Anyway,
UTF-8 sequences that are valid today will still be valid in the
future; the only possible change is that new sequences become valid.
So, the only possible problem is that such new sequences would be
converted to null bytes when they should not be. This doesn't
introduce undefined behavior, just a different behavior (note that
this difference would also exist between two libpcre versions, so it
is not a big problem, and it will be fixable).

> Have you investigated why libpcre is so *slow* when doing UTF-8
> checking?

AFAIK, this is not due to libpcre's UTF-8 checking itself, otherwise
it would be very slow on valid text files too. I suppose this is due
to the many retries from the pcresearch.c code on binary files (the
line is split into many sublines, often consisting of a single byte
each), i.e. the problem is on the grep side. I don't see how this
could be solved except by doing the UTF-8 check on the grep side.
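To make the idea concrete, here is a rough sketch of what such a
grep-side check could look like (the helper names are hypothetical
and this is not grep's actual code; it follows the RFC 3629
definition of UTF-8): a single left-to-right pass that overwrites
with a null byte each byte at which no valid sequence starts, so that
the matcher only ever sees valid UTF-8 plus null bytes, which grep
already handles as shown above.

#include <stddef.h>

/* Rough sketch, not grep's actual code.  Return the length of the
   valid UTF-8 sequence starting at S (looking at most N bytes
   ahead), or 0 if no valid sequence starts there.  Follows RFC 3629:
   overlong forms, surrogates and values above U+10FFFF are
   rejected.  */
static size_t
utf8_seq_len (unsigned char const *s, size_t n)
{
  if (n == 0)
    return 0;
  if (s[0] < 0x80)                      /* ASCII */
    return 1;
  if (0xc2 <= s[0] && s[0] <= 0xdf)     /* 2-byte sequence */
    return (2 <= n && (s[1] & 0xc0) == 0x80) ? 2 : 0;
  if ((s[0] & 0xf0) == 0xe0)            /* 3-byte sequence */
    {
      if (n < 3 || (s[1] & 0xc0) != 0x80 || (s[2] & 0xc0) != 0x80)
        return 0;
      if (s[0] == 0xe0 && s[1] < 0xa0)  /* overlong */
        return 0;
      if (s[0] == 0xed && 0x9f < s[1])  /* UTF-16 surrogate */
        return 0;
      return 3;
    }
  if (0xf0 <= s[0] && s[0] <= 0xf4)     /* 4-byte sequence */
    {
      if (n < 4 || (s[1] & 0xc0) != 0x80
          || (s[2] & 0xc0) != 0x80 || (s[3] & 0xc0) != 0x80)
        return 0;
      if (s[0] == 0xf0 && s[1] < 0x90)  /* overlong */
        return 0;
      if (s[0] == 0xf4 && 0x8f < s[1])  /* above U+10FFFF */
        return 0;
      return 4;
    }
  return 0;                             /* invalid lead byte */
}

/* Overwrite, in place, every byte at which no valid UTF-8 sequence
   starts with a null byte, then resynchronize on the next byte.  */
static void
replace_invalid_with_nul (unsigned char *buf, size_t n)
{
  size_t i = 0;
  while (i < n)
    {
      size_t len = utf8_seq_len (buf + i, n - i);
      if (len == 0)
        buf[i++] = '\0';
      else
        i += len;
    }
}

With something like this, the matching could run on the whole line in
one go instead of retrying on many sublines; whether the extra pass
is cheap enough in practice is of course a separate question.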
> > I often want to take binary files into account
>
> In those cases I suggest using a unibyte C locale.

But I still want "." to match a single (valid) UTF-8 character. Well,
using the C locale on binary files and UTF-8 on text files might be
acceptable. But how can one do that with a recursive grep?

--
Vincent Lefèvre <vinc...@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)