On 04/28/2013 12:44 PM, Cojocaru Alexandru wrote:
> On Sun, 28 Apr 2013 02:51:13 +0100
> Pádraig Brady <p...@draigbrady.com> wrote:
>> So looking in detail, this central print_kth function is of most
>> importance to performance.
> I came to the same conclusion; see:
> http://lists.gnu.org/archive/html/bug-coreutils/2012-12/msg00045.html
>
>> I thought that your simplification of it might allow it to be auto
>> inlined, but I confirmed that gcc 4.6.0 -O2 does not do this at
>> present by doing:
>>
>>   objdump -d src/cut.o | grep -q '<print_kth>:' && echo called || echo inlined
> With gcc 4.8.0 -O2, both `print_kth' and `is_range_start' are inlined
> even without the `inline' keyword:
>   nm src/cut | grep print_kth
>   nm src/cut | grep is_range_start
> Both of the above commands give me no output.
Good, gcc is getting better. We'll leave the `inline' for now at least,
to aid non-bleeding-edge gcc.

>> Marking it as inline gives another gain as shown below.
>>
>> Testing these combinations, we have:
>>   orig             = bit array implementation
>>   split            = ditto + simplified print_kth
>>   split-inline     = ditto + inlined print_kth
>>   mem              = no bit array
>>   mem-split        = ditto + simplified print_kth
>>   mem-split-inline = ditto + inlined print_kth
>>
>> $ yes abcdfeg | head -n1MB > big-file
>> $ for c in orig split split-inline mem mem-split mem-split-inline; do
>>     src/cut-$c 2>/dev/null
>>     echo -ne "\n== $c =="
>>     time src/cut-$c -b1,3 big-file > /dev/null
>>   done
>>
>> == orig ==
>> real    0m0.084s
>> user    0m0.078s
>> sys     0m0.006s
>>
>> == split ==
>> real    0m0.077s
>> user    0m0.070s
>> sys     0m0.006s
>>
>> == split-inline ==
>> real    0m0.055s
>> user    0m0.049s
>> sys     0m0.006s
>>
>> == mem ==
>> real    0m0.111s
>> user    0m0.108s
>> sys     0m0.002s
>>
>> == mem-split ==
>> real    0m0.088s
>> user    0m0.081s
>> sys     0m0.007s
>>
>> == mem-split-inline ==
>> real    0m0.070s
>> user    0m0.060s
>> sys     0m0.009s
>>
>> So in summary, removing the bit array does slow things down,
> I think that the problem lies in `print_kth' again. I've wrongly put
> a useless branch in it. See the attachment for a patch.

Did you forget to attach?

> Another problem may be the merging and the call to `xrealloc' that
> we do at the end of `set_fields'.

That's only called at startup, so I wouldn't worry too much.
What was your specific concern here?

>> but with the advantage of disassociating mem usage from range width.
>> I'll split the patch into two, for the mem change and the cpu change,
>> and might follow up with a subsequent patch to reinstate the bit array
>> for the common case of small -[bcf] and no --output-delim.
> My primary goal was to simplify the code. Even if the attached patch
> doesn't work, I think that detecting small -[bcf] ranges would just make
> the code more cumbersome.

Yes, it's a trade-off.
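To make that trade-off concrete, here is a minimal sketch (not the actual
coreutils source; all names are illustrative) of the two membership tests
being compared: a bit array gives an O(1) test per byte/field but needs
memory proportional to the largest range endpoint, while scanning a sorted,
merged range list uses constant memory per range at the cost of an O(n)
scan per call.

```c
/* Sketch only: two ways to answer "is byte/field K selected?".
   Not the coreutils implementation; names are hypothetical.  */
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

struct range { size_t lo, hi; };  /* inclusive endpoints */

/* Bit-array variant: O(1) lookup, memory grows with max endpoint.  */
static bool
bitarray_kth (const unsigned char *bits, size_t k)
{
  return (bits[k / CHAR_BIT] >> (k % CHAR_BIT)) & 1;
}

/* Range-list variant: O(n) scan of sorted, merged ranges,
   O(1) memory per range regardless of range width.  */
static bool
ranges_kth (const struct range *rp, size_t n, size_t k)
{
  for (size_t i = 0; i < n && rp[i].lo <= k; i++)
    if (k <= rp[i].hi)
      return true;
  return false;
}
```

This is why a range such as `-b1,1000000000` is cheap for the range list
but costly for the bit array, while per-byte lookup cost goes the other way.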
For often-used tools such as coreutils though, it's sometimes worth a
little extra complexity for performance reasons.
Here we might be able to guide the compiler around the branches like:

  print_kth ()
  {
    if likely (bitarray_used)
      ...
    else
      ...
  }

Anyway, I'll wait for your patch before considering carefully
whether to reinstate the bit array.

>> That's a common trend in these mem adjustment patches.
>> I.E. find a point to switch from the more CPU-efficient method,
>> to one which is more memory-efficient.
>>
>> thanks,
>> Pádraig.
>
> Please could you re-run the benchmarks after applying the patch?
> Could you also try with a bigger file (for example 100MB)? So as
> to make the difference among the various patches more clear.
> Unfortunately I'm under an emulator and the benchmarks aren't very
> faithful.

Sure. Eagerly awaiting the patch :)

Pádraig.
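For reference, the likely() hint sketched above is usually spelled with
gcc's __builtin_expect. A minimal, compilable illustration follows; the
macro definitions are the conventional Linux-kernel-style spellings, and
print_kth, bitarray_used, and the two lookup stand-ins are hypothetical,
not the coreutils source:

```c
/* Sketch of guiding the compiler toward the expected branch with
   gcc's __builtin_expect (gcc/clang extension, not standard C).  */
#include <stdbool.h>
#include <stddef.h>

#define likely(x)   __builtin_expect (!!(x), 1)
#define unlikely(x) __builtin_expect (!!(x), 0)

/* Hypothetical stand-ins for the two lookup strategies.  */
static bool test_bit (size_t k)    { return k % 2 == 0; }
static bool scan_ranges (size_t k) { return k % 2 == 0; }

/* Hypothetical mode flag: true when the bit array was built.  */
static bool bitarray_used = true;

/* Shape of the guided branch sketched in the mail above.  */
static bool
print_kth (size_t k)
{
  if (likely (bitarray_used))
    return test_bit (k);      /* expected hot path */
  else
    return scan_ranges (k);   /* fallback for wide ranges */
}
```

The hint only reorders the generated code to favor the hot path; both
branches remain correct, so a mispredicted mode costs cycles, not results.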