bug#13127: [PATCH] cut: use only one data structure

2013-04-28 Thread Cojocaru Alexandru
On Sun, 28 Apr 2013 02:51:13 +0100
Pádraig Brady  wrote:
> So looking in detail, this central print_kth function is of most importance 
> to performance.
I came to the same conclusion as you; see:
http://lists.gnu.org/archive/html/bug-coreutils/2012-12/msg00045.html

> I thought that your simplification of it might allow it to be auto inlined.
> but I confirmed that gcc 4.6.0 -O2 does not do this at present by doing:
> 
>   objdump -d src/cut.o | grep -q '<print_kth>:' && echo called || echo inlined
With gcc 4.8.0 -O2, both `print_kth' and `is_range_start' are inlined
even without the `inline' keyword:
nm src/cut | grep print_kth
nm src/cut | grep is_range_start
Both of the above commands give me no output.

> Marking it as inlined gives another gain as shown below.
> 
> Testing these combinations, we have:
> orig = bit array implementation
> split = ditto + simplified print_kth
> split-inline = ditto + inlined print_kth
> mem = no bit array
> mem-split = ditto + simplified print_kth
> mem-split-inline = ditto + inlined print_kth
> 
> $ yes abcdfeg | head -n1MB > big-file
> $ for c in orig split split-inline mem mem-split mem-split-inline; do
>     src/cut-$c 2>/dev/null
>     echo -ne "\n== $c =="
>     time src/cut-$c -b1,3 big-file > /dev/null
>   done
> 
> == orig ==
> real  0m0.084s
> user  0m0.078s
> sys   0m0.006s
> 
> == split ==
> real  0m0.077s
> user  0m0.070s
> sys   0m0.006s
> 
> == split-inline ==
> real  0m0.055s
> user  0m0.049s
> sys   0m0.006s
> 
> == mem ==
> real  0m0.111s
> user  0m0.108s
> sys   0m0.002s
> 
> == mem-split ==
> real  0m0.088s
> user  0m0.081s
> sys   0m0.007s
> 
> == mem-split-inline ==
> real  0m0.070s
> user  0m0.060s
> sys   0m0.009s
> 
> So in summary, removing the bit array does slow things down,
I think the problem lies in `print_kth' again: I wrongly put a
useless branch in it. See the attached patch.
Another problem may be the merging and the call to `xrealloc' that
we do at the end of `set_fields'.
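
For reference, here is a minimal sketch of the kind of merge pass I mean,
assuming the pairs arrive sorted by their lower bound (the identifiers are
illustrative, not the actual cut.c code):

#include <stddef.h>

struct range_pair { size_t lo; size_t hi; };

/* Coalesce overlapping or adjacent pairs in place, assuming RP is
   sorted by LO; returns the new count.  A sketch only: it ignores
   the HI == SIZE_MAX overflow case for brevity.  */
static size_t
merge_range_pairs (struct range_pair *rp, size_t n)
{
  size_t out = 0;
  for (size_t i = 1; i < n; i++)
    {
      if (rp[i].lo <= rp[out].hi + 1)   /* overlaps or touches */
        {
          if (rp[i].hi > rp[out].hi)
            rp[out].hi = rp[i].hi;
        }
      else
        rp[++out] = rp[i];
    }
  return n ? out + 1 : 0;
}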

> but with the advantage of disassociating mem usage from range width.
> I'll split the patch into two for the mem change and the cpu change,
> and might follow up with a subsequent patch to reinstate the bit array
> for the common case of small -[bcf] and no --output-delim.
My primary goal was to simplify the code. Even if the attached patch
doesn't work, I think that detecting small -[bcf] ranges would just make
the code more cumbersome.

> That's a common trend in these mem adjustment patches.
> I.E. Find a point to switch from the more CPU efficient method,
> to one which is more memory efficient.
> 
> thanks,
> Pádraig.

Please could you re-run the benchmarks after applying the patch?
Could you also try with a bigger file (for example 100MB), so as
to make the differences among the various patches clearer?
Unfortunately I'm running under an emulator, so the benchmarks
aren't very reliable.

Best regards,
Cojocaru Alexandru

bug#13127: [PATCH] cut: use only one data structure

2013-04-28 Thread Pádraig Brady
On 04/28/2013 12:44 PM, Cojocaru Alexandru wrote:
> On Sun, 28 Apr 2013 02:51:13 +0100
> Pádraig Brady  wrote:
>> So looking in detail, this central print_kth function is of most importance 
>> to performance.
> I came to the same conclusion as you; see:
>   http://lists.gnu.org/archive/html/bug-coreutils/2012-12/msg00045.html
> 
>> I thought that your simplification of it might allow it to be auto inlined.
>> but I confirmed that gcc 4.6.0 -O2 does not do this at present by doing:
>>
>>   objdump -d src/cut.o | grep -q '<print_kth>:' && echo called || echo inlined
> With gcc 4.8.0 -O2, both `print_kth' and `is_range_start' are inlined
> even without the `inline' keyword:
> nm src/cut | grep print_kth
> nm src/cut | grep is_range_start
> Both of the above commands give me no output.

Good, gcc is getting better.
We'll leave the inline for now at least,
to aid non-bleeding-edge gcc.
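
For such compilers the hint can also be made explicit. A minimal sketch,
assuming the GCC/Clang always_inline extension (whether cut.c ends up
using it is not shown in this thread):

#include <stdbool.h>
#include <stddef.h>

/* Illustrative only: one way to insist on inlining for compilers
   that don't do it automatically.  */
static inline __attribute__ ((always_inline)) bool
index_in_range (size_t k, size_t lo, size_t hi)
{
  return lo <= k && k <= hi;
}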

>> Marking it as inlined gives another gain as shown below.
>>
>> Testing these combinations, we have:
>> orig = bit array implementation
>> split = ditto + simplified print_kth
>> split-inline = ditto + inlined print_kth
>> mem = no bit array
>> mem-split = ditto + simplified print_kth
>> mem-split-inline = ditto + inlined print_kth
>>
>> $ yes abcdfeg | head -n1MB > big-file
>> $ for c in orig split split-inline mem mem-split mem-split-inline; do
>>     src/cut-$c 2>/dev/null
>>     echo -ne "\n== $c =="
>>     time src/cut-$c -b1,3 big-file > /dev/null
>>   done
>>
>> == orig ==
>> real 0m0.084s
>> user 0m0.078s
>> sys  0m0.006s
>>
>> == split ==
>> real 0m0.077s
>> user 0m0.070s
>> sys  0m0.006s
>>
>> == split-inline ==
>> real 0m0.055s
>> user 0m0.049s
>> sys  0m0.006s
>>
>> == mem ==
>> real 0m0.111s
>> user 0m0.108s
>> sys  0m0.002s
>>
>> == mem-split ==
>> real 0m0.088s
>> user 0m0.081s
>> sys  0m0.007s
>>
>> == mem-split-inline ==
>> real 0m0.070s
>> user 0m0.060s
>> sys  0m0.009s
>>
>> So in summary, removing the bit array does slow things down,
> I think the problem lies in `print_kth' again: I wrongly put a
> useless branch in it. See the attached patch.

Did you forget to attach?

> Another problem may be the merging and the call to `xrealloc' that
> we do at the end of `set_fields'.

That's only called at startup so I wouldn't worry too much.
What was your specific concern here?

>> but with the advantage of disassociating mem usage from range width.
>> I'll split the patch into two for the mem change and the cpu change,
>> and might follow up with a subsequent patch to reinstate the bit array
>> for the common case of small -[bcf] and no --output-delim.
> My primary goal was to simplify the code. Even if the attached patch
> doesn't work, I think that detecting small -[bcf] ranges would just make
> the code more cumbersome.

Yes, it's a trade-off. For often-used tools such as coreutils, though,
it's sometimes worth a little extra complexity for performance reasons.
Here we might be able to guide the compiler around the branches like:

print_kth()
{
  if likely(bitarray_used)
 ...
  else
 ...
}
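
For illustration, a compilable sketch of that hint, assuming the
conventional __builtin_expect-based likely() macro; the surrounding
variables are stand-ins rather than the actual cut.c state:

#include <stdbool.h>
#include <stddef.h>

#define likely(x) __builtin_expect (!!(x), 1)

static bool bitarray_used;          /* stand-in: ranges fit the bit vector */
static unsigned char *bit_vector;   /* stand-in: bit K set => item K selected */

static bool
in_range_pairs (size_t k)           /* placeholder for the range-pair scan */
{
  (void) k;
  return false;
}

static bool
print_kth (size_t k)
{
  if (likely (bitarray_used))
    return (bit_vector[k / 8] >> (k % 8)) & 1;   /* predicted-taken path */
  else
    return in_range_pairs (k);
}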

Anyway, I'll wait for your patch before carefully considering
whether to reinstate the bit array.

>> That's a common trend in these mem adjustment patches.
>> I.E. Find a point to switch from the more CPU efficient method,
>> to one which is more memory efficient.
>>
>> thanks,
>> Pádraig.
> 
> Please could you re-run the benchmarks after applying the patch?
> Could you also try with a bigger file (for example 100MB), so as
> to make the differences among the various patches clearer?
> Unfortunately I'm running under an emulator, so the benchmarks
> aren't very reliable.

Sure. Eagerly awaiting the patch :)

Pádraig.

bug#13127: [PATCH] cut: use only one data structure

2013-04-28 Thread Cojocaru Alexandru
On Sun, 28 Apr 2013 15:04:31 +0100
Pádraig Brady  wrote:

> On 04/28/2013 12:44 PM, Cojocaru Alexandru wrote:
> > Another problem may be the merging and the call to `xrealloc' that
> > we do at the end of `set_fields'.
> 
> That's only called at startup so I wouldn't worry too much.
> What was your specific concern here?
The file you used for the benchmarks was quite small, so I was
worried that both the merging loop and the call to `xrealloc'
were skewing the benchmarks too much.

> >> but with the advantage of disassociating mem usage from range width.
> >> I'll split the patch into two for the mem change and the cpu change,
> >> and might follow up with a subsequent patch to reinstate the bit array
> >> for the common case of small -[bcf] and no --output-delim.
> > My primary goal was to simplify the code. Even if the attached patch
> > doesn't work, I think that detecting small -[bcf] ranges would just make
> > the code more cumbersome.
> 
> Yes, it's a trade-off. For often-used tools such as coreutils, though,
> it's sometimes worth a little extra complexity for performance reasons.
> Here we might be able to guide the compiler around the branches like:
> 
> print_kth()
> {
>   if likely(bitarray_used)
>  ...
>   else
>  ...
> }
Ok.

> Anyway, I'll wait for your patch before carefully considering
> whether to reinstate the bit array.
Please give me some more time. I think it may be possible to
use a sentinel to speed things up a bit.
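
Roughly, the idea would be to append an unreachable pair so the advance
loop needs no bounds check; a sketch under that assumption (not actual
code):

#include <stddef.h>
#include <stdint.h>

struct range_pair { size_t lo; size_t hi; };

/* Assumed invariant: rp[n_rp] is set to { SIZE_MAX, SIZE_MAX }, so this
   loop always stops at the sentinel without comparing against n_rp.  */
static struct range_pair *current_rp;

static void
advance_past (size_t idx)
{
  while (idx > current_rp->hi)
    current_rp++;
}
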
[...]
> Sure. Eagerly awaiting the patch :)
Attached.


Best regards,
Cojocaru Alexandru


avoid-branch.patch
Description: Binary data


bug#13127: [PATCH] cut: use only one data structure

2013-04-28 Thread Pádraig Brady
On 04/28/2013 07:14 PM, Cojocaru Alexandru wrote:
> On Sun, 28 Apr 2013 15:04:31 +0100
> Pádraig Brady  wrote:
> 
>> On 04/28/2013 12:44 PM, Cojocaru Alexandru wrote:
>>> Another problem may be the merging and the call to `xrealloc' that
>>> we do at the end of `set_fields'.
>>
>> That's only called at startup so I wouldn't worry too much.
>> What was your specific concern here?
> The file you used for the benchmarks was quite small, so I was
> worried that both the merging loop and the call to `xrealloc'
> were skewing the benchmarks too much.

Ah right. I have enough to see relative differences quite easily,
but will increase benchmark sizes further if needed.

>>>> but with the advantage of disassociating mem usage from range width.
>>>> I'll split the patch into two for the mem change and the cpu change,
>>>> and might follow up with a subsequent patch to reinstate the bit array
>>>> for the common case of small -[bcf] and no --output-delim.
>>> My primary goal was to simplify the code. Even if the attached patch
>>> doesn't work, I think that detecting small -[bcf] ranges would just make
>>> the code more cumbersome.
>>
>> Yes, it's a trade-off. For often-used tools such as coreutils, though,
>> it's sometimes worth a little extra complexity for performance reasons.
>> Here we might be able to guide the compiler around the branches like:
>>
>> print_kth()
>> {
>>   if likely(bitarray_used)
>>  ...
>>   else
>>  ...
>> }
> Ok.
> 
>> Anyway, I'll wait for your patch before carefully considering
>> whether to reinstate the bit array.
> Please give me some more time. I think it may be possible to
> use a sentinel to speed things up a bit.

Sure.

> [...]
>> Sure. Eagerly awaiting the patch :)
> Attached.

That changes the else to an ||.
I thought gcc would optimize that to the same code.
While the generated assembly is a little different,
the performance of both is essentially the same.
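
To make the comparison concrete, here is a sketch of the two shapes
being compared; the actual avoid-branch.patch hunk isn't reproduced in
this archive, so the details are assumed:

#include <stdbool.h>
#include <stddef.h>

/* Stand-in globals; only the control-flow shape matters here.  */
static size_t lo, hi, eol_range_start;

static bool
selected_with_else (size_t k)
{
  if (lo <= k && k <= hi)
    return true;
  else
    return eol_range_start && k >= eol_range_start;
}

static bool
selected_with_or (size_t k)   /* the `else' folded into a `||' */
{
  return (lo <= k && k <= hi)
         || (eol_range_start && k >= eol_range_start);
}

gcc typically generates near-identical code for both forms, which
matches the measurement above.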

thanks,
Pádraig.

bug#13127: [PATCH] cut: use only one data structure

2013-04-28 Thread Pádraig Brady
So I reinstated the bit vector, which was a little tricky
to do while maintaining performance, but it works very well.
In summary, with the attached 3-patch series, the CPU
usage of the common cut path is nearly halved, while the
maximum memory allocated for the bit vector is 64KiB.
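
A minimal sketch of the policy just described; the identifiers and the
exact wiring of the 64KiB cap are assumptions, not an excerpt from the
committed patches:

#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

enum { MAX_BITVEC_BYTES = 64 * 1024 };   /* cap the bit vector at 64KiB */

static unsigned char *bit_vector;        /* NULL => scan range pairs instead */

/* Use the CPU-friendly bit vector only when every closed-range endpoint
   is small enough; otherwise keep memory usage independent of range
   width by falling back to the range-pair scan.  */
static bool
try_alloc_bit_vector (size_t max_range_endpoint)
{
  size_t bytes = max_range_endpoint / CHAR_BIT + 1;
  if (bytes > MAX_BITVEC_BYTES)
    return false;
  bit_vector = calloc (bytes, 1);
  return bit_vector != NULL;
}

This keeps the fast path for typical small -[bcf] lists while bounding
memory for huge ranges.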

I'll apply this series in the morning.

thanks,
Pádraig.

p.s. I doubt adding a sentinel to the range pair structure
would outperform the bit vector approach, given the
significant benefit shown in the benchmark in the commit message.

From 27a9eb50df1f5a49cb4c373de2b24eaa66cc2a11 Mon Sep 17 00:00:00 2001
From: Cojocaru Alexandru 
Date: Sun, 9 Dec 2012 10:43:10 +0100
Subject: [PATCH 1/3] cut: make memory allocation independent of range width

The current implementation of cut uses a bit array,
an array of `struct range_pair's, and (when --output-delimiter
is specified) a hash table.  The new implementation uses
only an array of `struct range_pair's.
The old implementation is memory inefficient because:
 1. When -b with a big number is specified, it allocates a lot of
    memory for `printable_field'; for example, -b1000000000 implies
    about 125MB for the bit array alone.
 2. When --output-delimiter is specified, it allocates 31 hash
    buckets, even if only a few ranges are specified.

Note CPU overhead is increased to determine if an item is to be printed,
as shown by:

$ yes abcdfeg | head -n1MB > big-file
$ for c in with-bitarray without-bitarray; do
    src/cut-$c 2>/dev/null
    echo -ne "\n== $c =="
    time src/cut-$c -b1,3 big-file > /dev/null
  done

== with-bitarray ==
real	0m0.084s
user	0m0.078s
sys	0m0.006s

== without-bitarray ==
real	0m0.111s
user	0m0.108s
sys	0m0.002s

Subsequent patches will reduce this overhead.

* src/cut.c (set_fields): Set and initialize RP
instead of printable_field.
* src/cut.c (is_range_start_index): Use CURRENT_RP rather than a hash.
* tests/misc/cut.pl: Check if `eol_range_start' is set correctly.
* tests/misc/cut-huge-range.sh: Rename from cut-huge-to-eol-range.sh,
and add a test to verify large amounts of mem aren't allocated.
Fixes http://bugs.gnu.org/13127
---
 src/cut.c                                          |  294 +++-
 tests/local.mk                                     |    2 +-
 ...{cut-huge-to-eol-range.sh => cut-huge-range.sh} |    4 +
 tests/misc/cut.pl                                  |    4 +
 4 files changed, 111 insertions(+), 193 deletions(-)
 rename tests/misc/{cut-huge-to-eol-range.sh => cut-huge-range.sh} (84%)

diff --git a/src/cut.c b/src/cut.c
index 494aad7..8738c46 100644
--- a/src/cut.c
+++ b/src/cut.c
@@ -53,8 +53,31 @@
 }	\
   while (0)
 
+
+struct range_pair
+  {
+size_t lo;
+size_t hi;
+  };
+
+/* Array of `struct range_pair' holding all the finite ranges. */
+static struct range_pair *rp;
+
+/* Pointer inside RP.  When checking if a byte or field is selected
+   by a finite range, we check if it is between CURRENT_RP.LO
+   and CURRENT_RP.HI.  If the byte or field index is greater than
+   CURRENT_RP.HI then we make CURRENT_RP to point to the next range pair. */
+static struct range_pair *current_rp;
+
+/* Number of finite ranges specified by the user. */
+static size_t n_rp;
+
+/* Number of `struct range_pair's allocated. */
+static size_t n_rp_allocated;
+
+
 /* Append LOW, HIGH to the list RP of range pairs, allocating additional
-   space if necessary.  Update local variable N_RP.  When allocating,
+   space if necessary.  Update global variable N_RP.  When allocating,
update global variable N_RP_ALLOCATED.  */
 
 #define ADD_RANGE_PAIR(rp, low, high)			\
@@ -72,11 +95,6 @@
 }			\
   while (0)
 
-struct range_pair
-  {
-size_t lo;
-size_t hi;
-  };
 
 /* This buffer is used to support the semantics of the -s option
(or lack of same) when the specified field list includes (does
@@ -90,26 +108,11 @@ static char *field_1_buffer;
 /* The number of bytes allocated for FIELD_1_BUFFER.  */
 static size_t field_1_bufsize;
 
-/* The largest field or byte index used as an endpoint of a closed
-   or degenerate range specification;  this doesn't include the starting
-   index of right-open-ended ranges.  For example, with either range spec
-   '2-5,9-', '2-3,5,9-' this variable would be set to 5.  */
-static size_t max_range_endpoint;
 
 /* If nonzero, this is the index of the first field in a range that goes
to end of line. */
 static size_t eol_range_start;
 
-/* This is a bit vector.
-   In byte mode, which bytes to output.
-   In field mode, which DELIM-separated fields to output.
-   Both bytes and fields are numbered starting with 1,
-   so the zeroth bit of this array is unused.
-   A field or byte K has been selected if
-   (K <= MAX_RANGE_ENDPOINT and is_printable_field(K))
-|| (EOL_RANGE_START > 0 && K >= EOL_RANGE_START).  */
-static unsigned char *printable_field;
-
 enum operating_mode
   {
 undefined_mode,
@@ -148,15 +151,6 @@ static char *output_delimiter_string;
 /* True if we have ever read standard input. */
 sta