On 2014-09-11 19:53:23 -0700, Paul Eggert wrote:
> Vincent Lefevre wrote:
> > Things could be done in grep:
> >
> > 1. Ignore -P when the pattern would have the same meaning without -P
> >    (patterns could also be transformed, e.g. "a\d+b" -> "a[0-9]\+b",
> >    at least for the simplest cases).
> >
> > 2. Call PCRE in the C locale when this is equivalent.
>
> I had already considered these ideas along with several others, but
> they would require grep to parse and analyze the Perl regular
> expression. I don't know the PCRE syntax and it would take some time
> to write a parser. And even if I wrote one, the next PCRE release
> would likely change the syntax. It sounds very painful to maintain.
I think that (1) is rather simple, even though the optimization could
be missed on some patterns: ERE and PCRE have a large equivalent
subclass. The pattern would be examined left to right and would
consist of:

- Normal characters.
- ".", "^" at the beginning, "$" (alone) at the end.
- "[...]" with normal characters inside.
- "*", "+", "?", or the "{...}" form, not followed by one of "*+?{".
- "|" and "(" not followed by one of these 4 characters.
- "\" followed by one of ".^$[*+?{".
- Some "\" + letter sequences could be recognised as well.

Something like that (I haven't checked carefully). There could be
another option to enable or disable such an optimization.

> > 3. Transform invalid bytes to null bytes in-place before the PCRE
> >    call. This changes the current semantics, but:
> >    * the semantics of invalid bytes have never been specified, AFAIK;
> >    * the best *practical* behavior may not be the current one.
>
> As we've already discussed, this would be incompatible with how
> invalid bytes are treated by other matchers.

The same thing could be done with the other matchers (in an optional
way).

> And would have undesirable practical effects, e.g., the pattern
> 'a..*b' would match data that would look like "ab" on many screens
> (because the null byte would vanish). It's a real kludge that will
> bite users.

But this is already the case:

$ printf "a\0b\n"
ab
$ printf "a\0b\n" | grep 'a..*b'
Binary file (standard input) matches

The transformation won't touch null bytes. It would just interpret
invalid bytes as null bytes, so that they get matched by ".".

> Even if we went along with the kludge, grep does not know what bytes
> PCRE considers to be invalid without invoking PCRE, which is what
> it's doing now. (Yes, PCRE says it's parsing UTF-8, but there are
> different ways to do that and they don't all agree.)

That would be a bug in PCRE. Parsing UTF-8 is standard.
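For illustration, a sketch of what "standard" means in practice (this
is illustrative code, not grep's or PCRE's): a validity scan following
the classic utf-8(7) table, combined with the invalid-byte-to-NUL
transformation from point (3).

```c
#include <stddef.h>

/* Minimum code point for each sequence length, to reject overlong
   encodings (e.g. 0xC0 0xAF for '/').  Indices 0 and 1 are unused. */
static const unsigned long utf8_min[7] =
  { 0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };

/* Length of the valid UTF-8 sequence starting at s (at most n bytes),
   or 0 if s[0] starts an invalid sequence.  This follows the encoding
   table from the Linux utf-8(7) man page. */
static size_t
utf8_seq_len (const unsigned char *s, size_t n)
{
  unsigned char c = s[0];
  size_t len;
  unsigned long cp;

  if (c < 0x80)
    return 1;                           /* 0xxxxxxx */
  else if (c < 0xC0)
    return 0;                           /* stray 10xxxxxx continuation */
  else if (c < 0xE0) { len = 2; cp = c & 0x1F; }
  else if (c < 0xF0) { len = 3; cp = c & 0x0F; }
  else if (c < 0xF8) { len = 4; cp = c & 0x07; }
  else if (c < 0xFC) { len = 5; cp = c & 0x03; }
  else if (c < 0xFE) { len = 6; cp = c & 0x01; }
  else
    return 0;                           /* 0xFE and 0xFF never occur */

  if (n < len)
    return 0;                           /* truncated sequence */
  for (size_t i = 1; i < len; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
        return 0;                       /* missing continuation byte */
      cp = (cp << 6) | (s[i] & 0x3F);
    }
  return cp < utf8_min[len] ? 0 : len;  /* reject overlong forms */
}

/* Point (3): overwrite each invalid byte with NUL in place, so that
   "." can match it; valid sequences (and real NULs) are untouched. */
static void
nullify_invalid (unsigned char *s, size_t n)
{
  for (size_t i = 0; i < n; )
    {
      size_t len = utf8_seq_len (s + i, n - i);
      if (len == 0)
        s[i++] = '\0';
      else
        i += len;
    }
}
```

Rejecting overlong forms and the bytes 0xFE/0xFF is part of what
"standard" means here; skipping such checks is where decoders start to
disagree.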
The encoding is summarized by:

0x00000000 - 0x0000007F: 0xxxxxxx
0x00000080 - 0x000007FF: 110xxxxx 10xxxxxx
0x00000800 - 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
0x00010000 - 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0x00200000 - 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0x04000000 - 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

(from the Linux utf-8(7) man page), everything else being invalid.
Note that the pcre_exec(3) man page even says:

  PCRE_NO_UTF8_CHECK
      Do not check the subject for UTF-8 validity

assuming that the check can be done on the user's side, i.e. in a
standard way.

> Here's a different idea. How about invoking grep with the
> --binary-files=without-match option? This should avoid much of the
> libpcre performance problem, without having to change 'grep'.

I often want to take binary files into account, for instance because
executables can contain text I search for (error messages...). There
may be other examples.

-- 
Vincent Lefèvre <vinc...@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
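P.S. To make point (1) concrete, here is a rough sketch of the
left-to-right scan. The accepted subset is illustrative and
deliberately conservative (anything not explicitly recognized is
rejected), not a worked-out implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Point (1): scan a pattern left to right and accept it only if it
   stays inside a subset where ERE and PCRE agree, so that such a -P
   pattern could be handled by the default matcher instead. */
static bool
pattern_in_common_subset (const char *p)
{
  /* After a quantifier, '(' or '|', another quantifier would mean
     something PCRE-specific ("a*+" is possessive, "(*..." starts a
     verb), so bar quantifiers there, and at the very start. */
  bool quant_barred = true;

  for (size_t i = 0; p[i]; i++)
    {
      unsigned char c = p[i];

      if (c == '*' || c == '+' || c == '?' || c == '{')
        {
          if (quant_barred)
            return false;
          if (c == '{')
            {
              while (p[i] && p[i] != '}')
                i++;
              if (!p[i])
                return false;           /* unterminated {...} */
            }
          quant_barred = true;
        }
      else if (c == '\\')
        {
          /* Only escapes of the common metacharacters; \d, \w, ...
             would need translation (e.g. \d -> [0-9]) instead. */
          if (!p[i + 1] || !strchr (".^$[*+?{\\", p[i + 1]))
            return false;
          i++;
          quant_barred = false;
        }
      else if (c == '[')
        {
          /* Bracket expression with normal characters only. */
          i++;
          if (p[i] == '^')
            i++;
          if (p[i] == ']')
            i++;                        /* leading ] is literal */
          for (; p[i] && p[i] != ']'; i++)
            if (p[i] == '[' || p[i] == '\\')
              return false;             /* no classes or escapes */
          if (!p[i])
            return false;               /* unterminated bracket */
          quant_barred = false;
        }
      else
        /* '.', '^', '$', ')' and normal characters pass through;
           '(' and '|' pass too but bar a following quantifier. */
        quant_barred = (c == '(' || c == '|');
    }
  return true;
}
```

A real version would also have to check the contents of "{...}" and
handle "^"/"$" positions, but even this coarse filter shows the scan
is a single pass with constant state.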