bug#18266: Bug#758105: bug#18266: grep -P and invalid exits with error

2014-09-12 Thread Jim Meyering
On Thu, Sep 11, 2014 at 12:10 PM, Paul Eggert wrote: > On 09/11/2014 11:37 AM, Jim Meyering wrote: >> >> Would you mind adding a test to trigger that one? > > Ordinarily I would have done that already but this -P stuff is so buggy and > slow that I got discouraged. (If we keep having trouble with

bug#18266: Bug#758105: bug#18266: grep -P and invalid exits with error

2014-09-11 Thread Vincent Lefevre
On 2014-09-11 10:07:49 -0700, Paul Eggert wrote: > Vincent Lefevre wrote: > >I've just reported a new Debian concerning the performance problem. > > It's not clear from http://bugs.debian.org/761157 that the performance > problem occurs only with -P, but I assume that's what is meant. It's specif

bug#18266: Bug#758105: bug#18266: grep -P and invalid exits with error

2014-09-11 Thread Paul Eggert
On 09/11/2014 11:37 AM, Jim Meyering wrote: Would you mind adding a test to trigger that one? Ordinarily I would have done that already but this -P stuff is so buggy and slow that I got discouraged. (If we keep having trouble with -P I may start lobbying to remove it) Anyway, I gave it a

bug#18266: Bug#758105: bug#18266: grep -P and invalid exits with error

2014-09-11 Thread Jim Meyering
mitting the '-P'. Also, I suggest using the C locale. > > As the GNU bug 18266 "grep -P and invalid exits with error" has been fixed, > I'm closing that bug report. Please feel free to open a separate GNU bug > report for the performance issue. > > PS. W

bug#18266: Bug#758105: bug#18266: grep -P and invalid exits with error

2014-09-11 Thread Paul Eggert
a problem that requires changes to libpcre3 to fix; grep cannot fix it. In the meantime, in order to use 'grep' to search for strings in arbitrary data, I suggest omitting the '-P'. Also, I suggest using the C locale. As the GNU bug 18266 "grep -P and invalid exits with

bug#18266: Bug#758105: bug#18266: grep -P and invalid exits with error

2014-09-11 Thread Vincent Lefevre
On 2014-09-10 13:22:36 +0200, Santiago wrote: > Thanks! I'm including this fix in the current debian package. Unfortunately, it is very slow, with a large slowdown factor. I've just reported a new Debian concerning the performance problem. -- Vincent Lefèvre - Web: 100

bug#18266: grep -P and invalid exits with error

2014-09-10 Thread Norihiro Tanaka
Thanks. I have confirmed that the new version gives the expected response, as follows.
$ env LC_ALL=en_US.utf8 src/grep -P '.?b' in
ab

bug#18266: Bug#758105: bug#18266: grep -P and invalid exits with error

2014-09-10 Thread Santiago
El 10/09/14 a las 00:08, Paul Eggert escribió: > Paul Eggert wrote: > >perhaps there's a PCRE version dependency here? > > I found a PCRE-version-dependent problem that may be relevant, and installed > the attached further patch to fix it. Thanks! I'm including this fix in the current debian pack

bug#18266: grep -P and invalid exits with error

2014-09-10 Thread Paul Eggert
Paul Eggert wrote: perhaps there's a PCRE version dependency here? I found a PCRE-version-dependent problem that may be relevant, and installed the attached further patch to fix it. From dc7d532d16dec740d11b6817c9b558543aca0136 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Wed, 10 Sep 2014

bug#18266: grep -P and invalid exits with error

2014-09-09 Thread Paul Eggert
Norihiro Tanaka wrote: I see that new version has no response for following test which was used previously. printf '\x80ab\n' | env LC_ALL=en_US.utf8 src/grep -P '.?b' Thanks for reporting that. The test case works for me (Fedora 20 x86-64, GCC 4.9.1): $ printf '\x80ab\n' | env LC_AL

bug#18266: grep -P and invalid exits with error

2014-09-09 Thread Norihiro Tanaka
I see that the new version gives no response for the following test, which was used previously:
printf '\x80ab\n' | env LC_ALL=en_US.utf8 src/grep -P '.?b'

bug#18266: grep -P and invalid exits with error

2014-09-09 Thread Paul Eggert
Norihiro Tanaka wrote: I'm worried that to re-run for invalid UTF-8 makes slowness for searching of the large number of binary files. Yes, that could be a problem, but even so it's better for grep to report matches than to give up and fail. Perhaps someone could optimize this better later, b

bug#18266: grep -P and invalid exits with error

2014-09-09 Thread Norihiro Tanaka
I'm worried that re-running the match after each invalid UTF-8 sequence will slow down searches over a large number of binary files.

bug#18266: grep -P and invalid exits with error

2014-09-08 Thread Santiago
Patch updated. Paul, thanks for the previous comments. As you suggested, the attached patch doesn't copy the buffer and splits the input when it finds an invalid character. For the moment, I don't see a cleaner way to avoid the pcre internals. Regards, Santiago From d58b53f86bb3f4b97137f708c159
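The splitting approach described in this message can be sketched as follows. This is a rough illustration only, not Santiago's patch: it uses Python's `re` module in place of PCRE, and the helper names `valid_utf8_chunks` and `line_matches` are hypothetical. It assumes that a line is cut at every invalid byte and that each valid run is matched on its own, without copying or altering the data.

```python
import re


def valid_utf8_chunks(data: bytes):
    """Yield (offset, chunk) for each run of valid UTF-8 between invalid bytes."""
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice passed to decode().
            if err.start > 0:
                yield pos, data[pos:pos + err.start]
            pos += err.end  # skip the invalid byte(s) and continue after them
        else:
            # The rest of the buffer is valid: emit it and stop.
            yield pos, data[pos:]
            return


def line_matches(pattern: str, line: bytes) -> bool:
    """Return True if the pattern matches inside any valid chunk of the line."""
    regex = re.compile(pattern)
    return any(
        regex.search(chunk.decode("utf-8"))
        for _, chunk in valid_utf8_chunks(line)
    )


if __name__ == "__main__":
    # The test input used in this thread: an invalid byte followed by "ab".
    print(line_matches(".?b", b"\x80ab"))
```

One consequence of this design is that a match can never span an invalid byte, so a pattern such as `a.b` would not match the bytes `a\xffb` under this sketch.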

bug#18266: grep -P and invalid exits with error

2014-09-01 Thread Paul Eggert
Vincent Lefevre wrote: [...] Note that this option can also be passed to pcre_exec() and pcre_dfa_exec(), to suppress the validity checking of subject strings only. If the same string is being matched many times, the option can be safely set for the second and

bug#18266: grep -P and invalid exits with error

2014-09-01 Thread Vincent Lefevre
On 2014-08-29 06:43:45 -0700, Paul Eggert wrote: > Thanks, but that patch seems to depend on libpcre internals, in that it > "knows" that pcre_exec cannot possibly succeed without first checking its > entire input buffer for invalid UTF-8 bytes. Even if that's true now, it > reflects a performance

bug#18266: grep -P and invalid exits with error

2014-08-29 Thread Paul Eggert
Thanks, but that patch seems to depend on libpcre internals, in that it "knows" that pcre_exec cannot possibly succeed without first checking its entire input buffer for invalid UTF-8 bytes. Even if that's true now, it reflects a performance bug that might be fixed in a future libpcre version.

bug#18266: grep -P and invalid exits with error

2014-08-29 Thread Eric Blake
On 08/28/2014 11:47 PM, Santiago wrote: > El 16/08/14 a las 11:36, Paul Eggert escribió: >> > Santiago wrote: >>> > >Another solution would be to don't check if binary files are valid >>> > >(passing PCRE_NO_UTF8_CHECK to pcre_exec), but I don't know if that'd >>> > >avoid security holes >> > >> >

bug#18266: grep -P and invalid exits with error

2014-08-28 Thread Santiago
El 16/08/14 a las 11:36, Paul Eggert escribió: > Santiago wrote: > >Another solution would be to don't check if binary files are valid > >(passing PCRE_NO_UTF8_CHECK to pcre_exec), but I don't know if that'd > >avoid security holes > > It wouldn't. (We already tried it.) > Another try. This pat

bug#18266: Bug#758105: bug#18266: Bug#758105: bug#18266: grep -P and invalid exits with error

2014-08-16 Thread Paul Eggert
Santiago wrote:
> Another solution would be not to check whether binary files are valid (passing PCRE_NO_UTF8_CHECK to pcre_exec), but I don't know if that'd avoid security holes
It wouldn't. (We already tried it.)

bug#18266: Bug#758105: bug#18266: Bug#758105: bug#18266: grep -P and invalid exits with error

2014-08-16 Thread Santiago
El 16/08/14 a las 18:26, Vincent Lefevre escribió: > On 2014-08-16 16:01:27 +0200, Santiago wrote: > > Workaround attached. It's too slow against binary files, but I haven't > > found a simpler solution. > > To avoid the slowness, I think that it would be better to detect > (directly, not via PCRE

bug#18266: Bug#758105: bug#18266: Bug#758105: bug#18266: grep -P and invalid exits with error

2014-08-16 Thread Vincent Lefevre
On 2014-08-16 16:01:27 +0200, Santiago wrote: > Workaround attached. It's too slow against binary files, but I haven't > found a simpler solution. To avoid the slowness, I think that it would be better to detect (directly, not via PCRE) invalid UTF-8 sequences and replace them by null bytes *in-pl

bug#18266: Bug#758105: bug#18266: Bug#758105: bug#18266: grep -P and invalid exits with error

2014-08-16 Thread Santiago
El 14/08/14 a las 14:33, Paul Eggert escribió: > Vincent Lefevre wrote: > >On input, using null bytes may be better if one wants to be able to > >match real replacement characters without false positives. > > Maybe, though this is no place to get fancy. It's simple to tell users "an > invalid byt

bug#18266: Bug#758105: bug#18266: grep -P and invalid exits with error

2014-08-14 Thread Paul Eggert
Vincent Lefevre wrote: On input, using null bytes may be better if one wants to be able to match real replacement characters without false positives. Maybe, though this is no place to get fancy. It's simple to tell users "an invalid byte acts like '?'". Simple is good. Anyway, this is a ma

bug#18266: Bug#758105: bug#18266: grep -P and invalid exits with error

2014-08-14 Thread Vincent Lefevre
On 2014-08-14 13:13:45 -0700, Paul Eggert wrote: > Vincent Lefevre wrote: > >The problem with this solution is that it would change the length > >of the text, while replacing invalid bytes by zero bytes could be > >done in place (if allowed), with very little change of the code, > >I think. > > Tr

bug#18266: Bug#758105: bug#18266: grep -P and invalid exits with error

2014-08-14 Thread Paul Eggert
Vincent Lefevre wrote: The problem with this solution is that it would change the length of the text, while replacing invalid bytes by zero bytes could be done in place (if allowed), with very little change of the code, I think. True. Though it might be more user-friendly to use '?' as the re
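The in-place idea discussed here can be illustrated with a small sketch. It is not code from grep: the function name `replace_invalid_utf8` and its `filler` parameter are hypothetical, and Python's UTF-8 decoder stands in for whatever validity check grep would use. The point it shows is that overwriting each invalid byte with a single-byte filler (a zero byte, or `?`) keeps the buffer length unchanged.

```python
def replace_invalid_utf8(buf: bytearray, filler: int = 0) -> int:
    """Overwrite every byte that is not part of valid UTF-8 with `filler`.

    The buffer is modified in place and its length never changes.
    Returns the number of bytes overwritten.
    """
    replaced = 0
    pos = 0
    while pos < len(buf):
        try:
            # Slicing copies, so this sketch is simple rather than fast.
            buf[pos:].decode("utf-8")
            break  # everything from pos onward is valid
        except UnicodeDecodeError as err:
            # err.start and err.end are offsets within the slice.
            for i in range(pos + err.start, pos + err.end):
                buf[i] = filler
                replaced += 1
            pos += err.end
    return replaced


if __name__ == "__main__":
    data = bytearray(b"\x80ab\n")
    replace_invalid_utf8(data)            # zero bytes as filler
    print(data)
    data = bytearray(b"a\xffb")
    replace_invalid_utf8(data, ord("?"))  # '?' as filler
    print(data)
```

Either filler is a single valid byte, so offsets into the original text stay meaningful after the replacement.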

bug#18266: Bug#758105: bug#18266: grep -P and invalid exits with error

2014-08-14 Thread Vincent Lefevre
On 2014-08-14 11:19:28 -0700, Paul Eggert wrote: > grep should work correctly even if the input contains NUL bytes, so perhaps > it would be better to replace an invalid byte by the UTF-8 sequence for > U+FFFD REPLACEMENT CHARACTER, as that's one standard way to deal with this > problem. Or perhap
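To make the length concern about U+FFFD concrete, here is a minimal sketch using Python's built-in `errors="replace"` decoding rather than anything from grep or libpcre. The helper name `replace_with_fffd` is hypothetical. U+FFFD encodes to three bytes in UTF-8, so substituting it for a single invalid byte grows the buffer.

```python
def replace_with_fffd(data: bytes) -> bytes:
    """Return a copy of data with invalid UTF-8 replaced by U+FFFD."""
    # errors="replace" makes the decoder substitute U+FFFD REPLACEMENT CHARACTER.
    return data.decode("utf-8", errors="replace").encode("utf-8")


if __name__ == "__main__":
    raw = b"\x80ab\n"
    fixed = replace_with_fffd(raw)
    print(len(raw), len(fixed))  # the replaced copy is longer than the input
```

Because the result differs in length from the input, this substitution cannot be done in place without shifting the rest of the buffer.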

bug#18266: Bug#758105: bug#18266: grep -P and invalid exits with error

2014-08-14 Thread Paul Eggert
Vincent Lefevre wrote: it would be better to replace invalid UTF-8 sequences by zero bytes before passing them to libpcre. Is it allowed to do that in Pexecute()? Sorry, I don't know. I was hoping that the volunteer (whoever it is) could figure all this stuff out. grep should work correctl

bug#18266: Bug#758105: bug#18266: grep -P and invalid exits with error

2014-08-14 Thread Vincent Lefevre
On 2014-08-14 09:15:58 -0700, Paul Eggert wrote: > That commit was necessary to avoid undefined behavior in libpcre. We can't > simply undo the commit (unless you want to reintroduce security holes into > grep :-). The current behavior is the best we can do, unless someone fixes > libpcre (which

bug#18266: grep -P and invalid exits with error

2014-08-14 Thread Paul Eggert
Santiago wrote: Please, revert ca7868cc27db3d9deafaa2e0ac5a2bb0aa8ef373 That commit was necessary to avoid undefined behavior in libpcre. We can't simply undo the commit (unless you want to reintroduce security holes into grep :-). The current behavior is the best we can do, unless someone

bug#18266: grep -P and invalid exits with error

2014-08-14 Thread Santiago
Hi, Please, revert ca7868cc27db3d9deafaa2e0ac5a2bb0aa8ef373 That commit (re)introduced a regression bug (See http://debbugs.gnu.org/15758). pcresearch checks again if input is UTF-8 valid. The problem is that binary files are utf-8 invalid, so grep -P, in unicode locales, exits with error: LANG=