bug#13127: [PATCH] cut: use only one data strucutre

2013-05-07 Thread Pádraig Brady
On 05/06/2013 07:54 PM, Cojocaru Alexandru wrote:
> On Mon, 29 Apr 2013 00:59:20 +0100
> Pádraig Brady  wrote:
> 
>> So I reinstated the bit vector which was a little tricky
>> to do while maintaining performance, but it works very well.
> I think it works because we are avoiding a memory access
> inside `next_item' this way.
> 
> With this patch I try to keep the CPU benefits for `--output-d'
> and when large ranges are specified, even without the bitarray.
> 
> Because of the sentinel now the max line len supported will be
> `(size_t)-1 - 1' and no more `(size_t)-1'. Is this an issue?
> 
> PS: This patch also fix a little bug inside `set_fields'.

It's always best to have separate changes.
I've split the fix out (attached) with an associated test.

thanks,
Pádraig.
>From b54b47f954c9b97bdb2dbbf51ead908ccb3a4f13 Mon Sep 17 00:00:00 2001
From: Cojocaru Alexandru 
Date: Tue, 7 May 2013 13:01:46 +0100
Subject: [PATCH] cut: fix handling of overlapping ranges

This issue was introduced in commit v8.21-43-g3e466ad

* src/cut.c (set_fields): Process all range pairs when merging.
* tests/misc/cut-huge-range.sh: Add a test for this edge case.
Also fix an issue where we could miss reported errors due
to truncation of the 'err' file.
---
 src/cut.c|6 +++---
 tests/misc/cut-huge-range.sh |8 +++-
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/src/cut.c b/src/cut.c
index b347b30..9501b3a 100644
--- a/src/cut.c
+++ b/src/cut.c
@@ -496,9 +496,9 @@ set_fields (const char *fieldstr)
   if (rp[j].lo <= rp[i].hi)
 {
   rp[i].hi = MAX (rp[j].hi, rp[i].hi);
-  memmove (rp + j, rp + j + 1,
-   (n_rp - j - 1) * sizeof (struct range_pair));
-  --n_rp;
+  memmove (rp + j, rp + j + 1, (n_rp - j - 1) * sizeof *rp);
+  n_rp--;
+  j--;
 }
   else
 break;
diff --git a/tests/misc/cut-huge-range.sh b/tests/misc/cut-huge-range.sh
index 887197a..9905cd7 100755
--- a/tests/misc/cut-huge-range.sh
+++ b/tests/misc/cut-huge-range.sh
@@ -27,7 +27,13 @@ getlimits_
 
 # Up to and including coreutils-8.21, cut would allocate possibly needed
 # memory upfront.  Subsequently memory is allocated as required.
-(ulimit -v 2; : | cut -b1-$INT_MAX > err 2>&1) || fail=1
+(ulimit -v 2; : | cut -b1-$INT_MAX >> err 2>&1) || fail=1
+
+# Ensure ranges are merged correctly when large range logic is in effect
+echo 1 > exp
+(dd bs=1MB if=/dev/zero count=1; echo '1') |
+cut -b1-100,2-3,4-5,101 2>>err | tail -c2 > out || fail=1
+compare exp out || fail=1
 
 compare /dev/null err || fail=1
 
-- 
1.7.7.6



bug#13127: [PATCH] cut: use only one data strucutre

2013-05-07 Thread Pádraig Brady
On 05/07/2013 01:10 PM, Pádraig Brady wrote:
> On 05/06/2013 07:54 PM, Cojocaru Alexandru wrote:
>> On Mon, 29 Apr 2013 00:59:20 +0100
>> Pádraig Brady  wrote:
>>
>>> So I reinstated the bit vector which was a little tricky
>>> to do while maintaining performance, but it works very well.
>> I think it works because we are avoiding a memory access
>> inside `next_item' this way.
>>
>> With this patch I try to keep the CPU benefits for `--output-d'
>> and when large ranges are specified, even without the bitarray.
>>
>> Because of the sentinel now the max line len supported will be
>> `(size_t)-1 - 1' and no more `(size_t)-1'. Is this an issue?

Not a practical one.
We could bump the types/limits in the range pairs
up to uintmax_t since we're now not allocating
lot of corresponding memory.
Note I added a specific check to make it
explicit that -b$SIZE_MAX is not supported if specified.

I'll do that in a subsequent patch, but it's
not a practical issue for now, as we still allocate mem
for the whole line.

The new patch performs well!

I'll apply the attached in a little while.

thanks!
Pádraig.
>From 1a6fbf21d1a70e8a9b107f2f91188e2d3a4b Mon Sep 17 00:00:00 2001
From: Cojocaru Alexandru 
Date: Tue, 7 May 2013 13:47:15 +0100
Subject: [PATCH] cut: improve performance, especially with --output-delimiter

Use a sentinel value that's checked implicitly, rather than
a bit array, to determine if an item should be output.

Benchmark results for this change are:

$ yes abcdfeg | head -n1MB > big-file

$ for c in orig sentinel; do
src/cut-$c 2>/dev/null
echo -ne "\n== $c =="
time src/cut-$c -b1,3 big-file > /dev/null
  done
== orig ==
real0m0.049s
user0m0.044s
sys 0m0.005s

== sentinel ==
real0m0.035s
user0m0.032s
sys 0m0.002s

 ## Again with --output-delimiter ##
$ for c in orig sentinel; do
src/cut-$c 2>/dev/null
echo -ne "\n== $c =="
time src/cut-$c -b1,3 --output-delimiter=: big-file > /dev/null
  done
== orig ==
real0m0.106s
user0m0.103s
sys 0m0.002s

== sentinel ==
real0m0.055s
user0m0.052s
sys 0m0.003s

eol_range_start: Removed. 'n-' is no longer treated specially,
and instead SIZE_MAX is set for the 'hi' limit, and tested implicitly.
complement_rp: Used to complement 'rp' when '--complement' is specified.
ADD_RANGE_PAIR: Macro renamed to 'add_range_pair' function.
* tests/misc/cut-huge-range.sh: Adjust to the SENTINEL value.
---
 src/cut.c|  234 +++---
 tests/misc/cut-huge-range.sh |   15 ++-
 2 files changed, 95 insertions(+), 154 deletions(-)

diff --git a/src/cut.c b/src/cut.c
index 9501b3a..19ef1d9 100644
--- a/src/cut.c
+++ b/src/cut.c
@@ -80,21 +80,15 @@ static size_t n_rp_allocated;
space if necessary.  Update global variable N_RP.  When allocating,
update global variable N_RP_ALLOCATED.  */
 
-#define ADD_RANGE_PAIR(rp, low, high)			\
-  do			\
-{			\
-  if (low == 0 || high == 0)			\
-FATAL_ERROR (_("fields and positions are numbered from 1")); \
-  if (n_rp >= n_rp_allocated)			\
-{		\
-  (rp) = X2NREALLOC (rp, &n_rp_allocated);	\
-}		\
-  rp[n_rp].lo = (low);\
-  rp[n_rp].hi = (high);\
-  ++n_rp;		\
-}			\
-  while (0)
-
+static void
+add_range_pair (size_t lo, size_t hi)
+{
+  if (n_rp == n_rp_allocated)
+rp = X2NREALLOC (rp, &n_rp_allocated);
+  rp[n_rp].lo = lo;
+  rp[n_rp].hi = hi;
+  ++n_rp;
+}
 
 /* This buffer is used to support the semantics of the -s option
(or lack of same) when the specified field list includes (does
@@ -108,31 +102,6 @@ static char *field_1_buffer;
 /* The number of bytes allocated for FIELD_1_BUFFER.  */
 static size_t field_1_bufsize;
 
-/* The largest field or byte index used as an endpoint of a closed
-   or degenerate range specification;  this doesn't include the starting
-   index of right-open-ended ranges.  For example, with either range spec
-   '2-5,9-', '2-3,5,9-' this variable would be set to 5.  */
-static size_t max_range_endpoint;
-
-/* If nonzero, this is the index of the first field in a range that goes
-   to end of line. */
-static size_t eol_range_start;
-
-/* This is a bit vector.
-   In byte mode, which bytes to output.
-   In field mode, which DELIM-separated fields to output.
-   Both bytes and fields are numbered starting with 1,
-   so the zeroth bit of this array is unused.
-   A field or byte K has been selected if
-   (K <= MAX_RANGE_ENDPOINT && is_printable_field (K))
-|| (EOL_RANGE_START > 0 && K >= EOL_RANGE_START).  */
-static unsigned char *printable_field;
-
-/* The maximum size the printable_field array to allocate.
-   For ranges requiring more than this, we revert to the slightly
-   slower mechanism of inspecting the current range pair limits.  */
-enum { PRINTABLE_ARRAY_MAX = 65536 };
-
 enum operating_mode
   {
 undefined_mode,
@@ -151,7 +120,7 @@ static enum operating_mode operating_mo

bug#13127: [PATCH] cut: use only one data strucutre

2013-05-07 Thread Bernhard Voelker
> On May 7, 2013 at 2:10 PM Pádraig Brady  wrote:
> It's always best to have separate changes.
> I've split the fix out (attached) with an associated test.

The patch looks fine, thanks.

Have a nice day,
Berny





bug#13127: [PATCH] cut: use only one data strucutre

2013-05-07 Thread Bernhard Voelker
> On May 7, 2013 at 3:51 PM Pádraig Brady  wrote:
> I'll apply the attached in a little while.

+1

Have a nice day,
Berny