incorrect output with color and wrapping

2013-04-16 Thread Vincent Lefevre
With grep 2.14 (and previous versions),
  $ echo `seq 0 32` | env -u GREP_COLORS grep --color=always 30
gives in some terminals such as xterm (but *not* GNOME Terminal):
  │[...] 25 26 27 28 293│
  │0 31 32 │
instead of
  │[...] 25 26 27 28 29 │
  │30 31 32 │
Th

bug#18266: Bug#758105: bug#18266: grep -P and invalid exits with error

2014-08-14 Thread Vincent Lefevre
On 2014-08-14 09:15:58 -0700, Paul Eggert wrote: > That commit was necessary to avoid undefined behavior in libpcre. We can't > simply undo the commit (unless you want to reintroduce security holes into > grep :-). The current behavior is the best we can do, unless someone fixes > libpcre (which

bug#18266: Bug#758105: bug#18266: grep -P and invalid exits with error

2014-08-14 Thread Vincent Lefevre
On 2014-08-14 11:19:28 -0700, Paul Eggert wrote: > grep should work correctly even if the input contains NUL bytes, so perhaps > it would be better to replace an invalid byte by the UTF-8 sequence for > U+FFFD REPLACEMENT CHARACTER, as that's one standard way to deal with this > problem. Or perhap

bug#18266: Bug#758105: bug#18266: grep -P and invalid exits with error

2014-08-14 Thread Vincent Lefevre
On 2014-08-14 13:13:45 -0700, Paul Eggert wrote: > Vincent Lefevre wrote: > >The problem with this solution is that it would change the length > >of the text, while replacing invalid bytes by zero bytes could be > >done in place (if allowed), with very little change of

bug#18269: incorrect undossify_input prototype - possible integer overflow

2014-08-14 Thread Vincent Lefevre
In grep 2.20, grep.c contains:
  ssize_t fillsize;
  size_t readsize;
  [...]
  fillsize = safe_read (bufdesc, readbuf, readsize);
  if (fillsize < 0)
    fillsize = cc = 0;
  bufoffset += fillsize;
  fillsize = undossify_input (readbuf, fillsize);
In practice, readsize can be large on a 64-bit mac

bug#18266: Bug#758105: bug#18266: Bug#758105: bug#18266: grep -P and invalid exits with error

2014-08-16 Thread Vincent Lefevre
On 2014-08-16 16:01:27 +0200, Santiago wrote: > Workaround attached. It's too slow against binary files, but I haven't > found a simpler solution. To avoid the slowness, I think that it would be better to detect (directly, not via PCRE) invalid UTF-8 sequences and replace them by null bytes *in-pl

bug#18266: grep -P and invalid exits with error

2014-09-01 Thread Vincent Lefevre
On 2014-08-29 06:43:45 -0700, Paul Eggert wrote: > Thanks, but that patch seems to depend on libpcre internals, in that it > "knows" that pcre_exec cannot possibly succeed without first checking its > entire input buffer for invalid UTF-8 bytes. Even if that's true now, it > reflects a performance

bug#18266: Bug#758105: bug#18266: grep -P and invalid exits with error

2014-09-11 Thread Vincent Lefevre
On 2014-09-10 13:22:36 +0200, Santiago wrote: > Thanks! I'm including this fix in the current debian package. Unfortunately, it is very slow, with a large slowdown factor. I've just reported a new Debian bug concerning the performance problem.

bug#18266: handling bytes not part of the charset, and other garbage (was: grep -P and invalid exits with error)

2014-09-11 Thread Vincent Lefevre
On 2014-09-01 01:31:53 -0700, Paul Eggert wrote: > Vincent Lefevre wrote: > >If there are many invalid UTF8 bytes, this would be slow, IMHO > > That's OK. We don't need grep -P to be fast on invalid input. I can see too large a slowdown in practical cases. > >

bug#18266: handling bytes not part of the charset, and other garbage

2014-09-11 Thread Vincent Lefevre
On 2014-09-11 09:22:49 -0700, Paul Eggert wrote: > Vincent Lefevre wrote: > > >There's no reason that '.' matches something that doesn't belong to > >the charset in C locale, but doesn't match in a UTF-8 locale. > > In the C locale on GNU/Linu

bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales

2014-09-11 Thread Vincent Lefevre
With the patch that fixes bug 18266, grep -P works again on binary files (with invalid UTF-8 sequences), but it is now significantly slower than old versions (which could yield undefined behavior). Timings with the Debian packages on my personal svn working copy (binary + text files): 2.18-2 0.

bug#18266: handling bytes not part of the charset, and other garbage

2014-09-11 Thread Vincent Lefevre
On 2014-09-11 18:16:29 -0700, Paul Eggert wrote: > Vincent Lefevre wrote: > >the C locale corresponds to ANSI_X3.4-1968, > > No it doesn't, at least not on any current platform I'm aware of. It does on Debian: ypig% LC_ALL=C locale charmap ANSI_X3.4-1968 > >I wo

bug#18266: Bug#758105: bug#18266: grep -P and invalid exits with error

2014-09-11 Thread Vincent Lefevre
On 2014-09-11 10:07:49 -0700, Paul Eggert wrote: > Vincent Lefevre wrote: > >I've just reported a new Debian bug concerning the performance problem. > > It's not clear from http://bugs.debian.org/761157 that the performance > problem occurs only with -P, but I assume

bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales

2014-09-12 Thread Vincent Lefevre
On 2014-09-11 19:53:23 -0700, Paul Eggert wrote: > Vincent Lefevre wrote: > >Things could be done in grep: > > > >1. Ignore -P when the pattern would have the same meaning without -P > >(patterns could also be transformed, e.g. "a\d+b" -> "a[0-9

bug#18266: handling bytes not part of the charset, and other garbage

2014-09-12 Thread Vincent Lefevre
On 2014-09-11 20:26:12 -0700, Paul Eggert wrote: > Vincent Lefevre wrote: > > >ypig% LC_ALL=C locale charmap > >ANSI_X3.4-1968 > > That may be what the 'locale' command says, but bytes with the top bit on > are considered to be valid single-byte characters.

bug#18266: handling bytes not part of the charset, and other garbage

2014-09-12 Thread Vincent Lefevre
On 2014-09-12 09:16:45 -0700, Paul Eggert wrote: > Vincent Lefevre wrote: > >I just mean that "grep ." is a method given by some people, that > >was working before UTF-8. > > And it still works, if by "." one means "match one character". No,

bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales

2014-09-12 Thread Vincent Lefevre
On 2014-09-12 09:48:08 -0700, Paul Eggert wrote: > Vincent Lefevre wrote: > >I think that (1) is rather simple > > You may think it simple for the REs you're interested in, but someone else > might say "hey! that doesn't cover the REs *I'm* interested in!

bug#18266: handling bytes not part of the charset, and other garbage

2014-09-12 Thread Vincent Lefevre
On 2014-09-12 14:39:35 -0700, Paul Eggert wrote: > On 09/12/2014 02:29 PM, Vincent Lefevre wrote: > >an option to control what happens on encoding errors would be > >better and sufficient. > > It might suffice for your use cases, but it's more complicated and less >

bug#18266: handling bytes not part of the charset, and other garbage

2014-09-12 Thread Vincent Lefevre
On 2014-09-12 17:57:39 -0700, Paul Eggert wrote: > Currently, for example, the tz package has > a Make rule 'check_character_set' that verifies that the source files are > all properly encoded. It executes this shell command: > > ! grep -nv '^.*$' file names > >
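
For readers who want to see what such an encoding check amounts to, here is a rough Python analogue of listing the lines that are not properly encoded. It is a sketch under assumptions: the function name badly_encoded_lines is hypothetical, the encoding is fixed to UTF-8 rather than taken from the locale, and it does not claim to reproduce what any grep version prints.

```python
def badly_encoded_lines(data: bytes, encoding: str = "utf-8") -> list:
    """Return the 1-based numbers of the lines that fail to decode.

    Hypothetical helper: a line is "bad" when strict decoding with the
    given encoding raises an error.
    """
    bad = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode(encoding)
        except UnicodeDecodeError:
            bad.append(number)
    return bad


sample = b"ok\nbad \xff byte\nfine\n"
print(badly_encoded_lines(sample))
```

An empty result plays the role of the negated grep command succeeding: no offending line was found.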

bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales

2014-11-27 Thread Vincent Lefevre
On binary files, it seems that testing the UTF-8 sequences in pcresearch.c is faster than asking pcre_exec to do that (because of the retry I assume); see attached patch. It actually checks UTF-8 only if an invalid sequence was already found by pcre_exec, assuming that pcre_exec can check the valid

bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales

2014-11-28 Thread Vincent Lefevre
On 2014-11-28 23:31:49 +0900, Norihiro Tanaka wrote: > Thanks for the patch. However, it seems that valid_utf() in PCRE also > considers 5- and 6-byte characters. In any case, even if PCRE considers these sequences as valid UTF-8, they shouldn't match because they are not part of Unicode (i

bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales

2014-12-18 Thread Vincent Lefevre
Sorry for the late reply. On 2014-11-29 11:58:48 +0900, Norihiro Tanaka wrote: > On Fri, 28 Nov 2014 16:50:29 +0100 > Vincent Lefevre wrote: > > What matters is whether a sequence corresponds to a valid UTF-8 > > encoded Unicode character. My patch ensures that pcre_exec

bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales

2014-12-19 Thread Vincent Lefevre
On 2014-12-19 23:00:38 +0900, Norihiro Tanaka wrote: > I got them from pcre_valid_utf8(), but I made some mistakes. They are > as following. > > 0xE0 0xAF 0xBF This one is valid UTF-8 and corresponds to the code point U+0BFF, and the following matches: $ printf "\xE0\xAF\xBF\n" | grep -P . ௿
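
The claim that 0xE0 0xAF 0xBF encodes U+0BFF can be checked by hand: a 3-byte UTF-8 sequence carries 4 payload bits in the lead byte and 6 in each continuation byte. The sketch below does that arithmetic in Python; decode_three_byte is a hypothetical helper written for this note and has nothing to do with PCRE's or grep's own validation code.

```python
def decode_three_byte(b0: int, b1: int, b2: int) -> int:
    """Decode one 3-byte UTF-8 sequence to a code point, or raise ValueError."""
    if not (0xE0 <= b0 <= 0xEF):
        raise ValueError("not a 3-byte lead byte")
    if not (0x80 <= b1 <= 0xBF and 0x80 <= b2 <= 0xBF):
        raise ValueError("bad continuation byte")
    # 4 bits from the lead byte, 6 bits from each continuation byte.
    cp = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)
    if cp < 0x800:
        raise ValueError("overlong encoding")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code point")
    return cp


cp = decode_three_byte(0xE0, 0xAF, 0xBF)
print(f"U+{cp:04X}")  # prints U+0BFF
```

Python's built-in strict decoder agrees: bytes([0xE0, 0xAF, 0xBF]).decode("utf-8") yields the single character U+0BFF.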

bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales

2014-12-19 Thread Vincent Lefevre
On 2014-09-12 03:24:49 +0200, Vincent Lefevre wrote: > Timings with the Debian packages on my personal svn working copy > (binary + text files): > > 2.18-2 0.9s with -P, 0.4s without -P > 2.20-3 11.6s with -P, 0.4s without -P I've done another test on a large PDF file. Le

bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales

2014-12-19 Thread Vincent Lefevre
On 2014-12-20 10:31:46 +0900, Norihiro Tanaka wrote:
> On Fri, 19 Dec 2014 23:00:38 +0900
> Norihiro Tanaka wrote:
> $ printf "\xED\xA0\xBF\n" | LC_ALL=en_US.utf8 src/grep -G .
> Binary file (standard input) matches
> $ printf "\xED\xA0\xBF\n" | LC_ALL=en_US.utf8 src/grep -P .
> $
>
> regex also
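
As background for the \xED\xA0\xBF test case: read with the ordinary 3-byte bit layout, these bytes give a code point in the surrogate range (U+D800 to U+DFFF), which well-formed UTF-8 does not allow. The Python sketch below only shows what the bytes encode and how a strict decoder treats them; it makes no claim about how regex, PCRE, or grep handle the sequence.

```python
seq = b"\xED\xA0\xBF"

# Apply the 3-byte layout: 4 payload bits, then 6, then 6.
cp = ((seq[0] & 0x0F) << 12) | ((seq[1] & 0x3F) << 6) | (seq[2] & 0x3F)
is_surrogate = 0xD800 <= cp <= 0xDFFF

# A strict UTF-8 decoder refuses surrogates.
try:
    seq.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Python's "surrogatepass" error handler decodes them anyway.
lenient = seq.decode("utf-8", errors="surrogatepass")

print(f"U+{cp:04X}", is_surrogate, strict_ok, len(lenient))
```

The gap between the strict and the lenient decoder is the kind of disagreement two matchers can have over the same input.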

bug#22144: --exclude no longer works against arguments with a directory name

2015-12-11 Thread Vincent Lefevre
In grep 2.22, --exclude no longer works in some cases:
  $ cd /usr/share/doc/grep
  $ grep e --exclude README README
is OK, but not:
  $ grep e --exclude README /usr/share/doc/grep/README
  Copyright (C) 1992, 1997-2002, 2004-2015 Free Software Foundation, Inc.
  [...]
This breaks at least one of my sc

bug#22144: --exclude no longer works against arguments with a directory name

2015-12-11 Thread Vincent Lefevre
On 2015-12-11 13:37:46 -0800, Paul Eggert wrote: > The change in grep 2.22 is due to an earlier bug report: > > http://bugs.gnu.org/21027 This one was about --exclude-dir, whose description in grep 2.21 is very unclear and it was already broken anyway: zira:~> grep -rl e --exclude-dir='usr*' /us

bug#22144: --exclude no longer works against arguments with a directory name

2015-12-15 Thread Vincent Lefevre
On 2015-12-11 19:31:46 -0800, Paul Eggert wrote: > On 12/11/2015 05:56 PM, Vincent Lefevre wrote: > >For --exclude, the description is clear: > > The description changed in grep 2.22, to match the 2.22 (also, > 2.6-and-earlier) behavior. My quote was from the grep 2.22 descri

bug#22144: --exclude no longer works against arguments with a directory name

2015-12-15 Thread Vincent Lefevre
On 2015-12-15 15:27:27 -0800, Paul Eggert wrote: > Vincent Lefevre wrote: > >For the "main case", is this > >the canonical name as returned by realpath? > > I don't see why. grep doesn't need to compute anything's realpath. How is the file name defi

bug#22144: --exclude no longer works against arguments with a directory name

2015-12-16 Thread Vincent Lefevre
On 2015-12-15 22:24:25 -0800, Paul Eggert wrote: > Vincent Lefevre wrote: > >How is the file name defined, then? > > It's built as a string, which is passed to 'open' without worrying > about realpath. So, for instance, "foo" and "./foo" are r

bug#22144: --exclude no longer works against arguments with a directory name

2015-12-30 Thread Vincent Lefevre
On 2015-12-28 01:09:40 -0800, Paul Eggert wrote: > Vincent Lefevre wrote: > > The documentation is still ambiguous. For the "main case", is this > > the canonical name as returned by realpath? > > > > > >As you say, the 2.22 behavior does not seem ideal

bug#15444: Bug#456943: (no subject)

2016-07-14 Thread Vincent Lefevre
On 2016-07-14 09:02:48 +0200, Walter Doekes wrote: > Leon Meier wrote: > > As of today, the test case [...] still fails in (u)xterm. > > Any resolution in sight? > > I tried to reproduce, and indeed, it fails on xterm (without the 'ne' grep > option), but not in gnome-terminal. > > Does that mean

bug#15444: Bug#456943: (no subject)

2024-03-25 Thread Vincent Lefevre
On 2016-07-14 14:22:15 -0700, Vincent Lefevre wrote: > On 2016-07-14 09:02:48 +0200, Walter Doekes wrote: > > Leon Meier wrote: > > > As of today, the test case [...] still fails in (u)xterm. > > > Any resolution in sight? > > > > I tried to reproduce, and

bug#15444: Bug#456943: (no subject)

2024-03-25 Thread Vincent Lefevre
On 2024-03-25 15:49:52 +0100, Vincent Lefevre wrote: > A solution for "grep" would be to add a space+backspace before the > escape sequence. An additional note: One of the following is needed: * Detect the end of line (this may be tricky) and split the coloring into 2 part
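
To make the space+backspace suggestion concrete, the sketch below builds the byte stream for one highlighted match with the two guard characters placed immediately before the opening escape sequence. Everything here is an assumption for illustration: the function name highlight is hypothetical, and the SGR_START / SGR_END strings are modelled on a bold-red color followed by an erase-to-end-of-line, not copied from grep's source.

```python
SGR_START = "\x1b[01;31m\x1b[K"  # assumed: select color, then erase to end of line
SGR_END = "\x1b[m\x1b[K"         # assumed: reset color, then erase to end of line


def highlight(line: str, start: int, end: int, guard: bool = True) -> str:
    """Wrap line[start:end] in color sequences.

    With guard=True, a space followed by a backspace is emitted just
    before the opening escape sequence, as proposed in the message above.
    """
    prefix = " \b" if guard else ""
    return (line[:start] + prefix + SGR_START
            + line[start:end] + SGR_END + line[end:])


# repr() shows the control characters instead of sending them to the terminal.
print(repr(highlight("28 29 30 31", 6, 8)))
```

Whether the guard is safe in every terminal is precisely what the thread is debating, so the sketch shows only where the bytes would go, not that the approach is correct.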

bug#15444: One character can be lost if colors are enabled

2024-03-26 Thread Vincent Lefevre
On 2024-03-26 11:47:26 -0600, Paul Eggert wrote: > On 3/25/24 08:49, Vincent Lefevre wrote: > > This works fine in Xterm, giving on an 80-column terminal: > > > > ... > > However, this triggers the bug in GNOME Terminal (and other > > libvte-based terminals): >