On 04/28/2013 12:44 PM, Cojocaru Alexandru wrote:
> On Sun, 28 Apr 2013 02:51:13 +0100
> Pádraig Brady <p...@draigbrady.com> wrote:
>> So looking in detail, this central print_kth function is of most
>> importance to performance.
> I came to the same conclusion; see:
> http://lists.gnu.org/archive/html/bug-coreutils/2012-12/msg00045.html
>
>> I thought that your simplification of it might allow it to be auto
>> inlined, but I confirmed that gcc 4.6.0 -O2 does not do this at
>> present by doing:
>>
>>   objdump -d src/cut.o | grep -q '<print_kth>:' && echo called || echo inlined
> With gcc 4.8.0 -O2, both `print_kth' and `is_range_start' are inlined
> even without the `inline' keyword:
>   nm src/cut | grep print_kth
>   nm src/cut | grep is_range_start
> Both of the above commands give me no output.
Good, gcc is getting better. We'll leave the `inline' for now at least,
to aid non-bleeding-edge gcc.

>> Marking it as inline gives another gain as shown below.
>>
>> Testing these combinations, we have:
>>   orig             = bit array implementation
>>   split            = ditto + simplified print_kth
>>   split-inline     = ditto + inlined print_kth
>>   mem              = no bit array
>>   mem-split        = ditto + simplified print_kth
>>   mem-split-inline = ditto + inlined print_kth
>>
>> $ yes abcdfeg | head -n1MB > big-file
>> $ for c in orig split split-inline mem mem-split mem-split-inline; do
>>     src/cut-$c 2>/dev/null
>>     echo -ne "\n== $c =="
>>     time src/cut-$c -b1,3 big-file > /dev/null
>>   done
>>
>> == orig ==
>> real    0m0.084s
>> user    0m0.078s
>> sys     0m0.006s
>>
>> == split ==
>> real    0m0.077s
>> user    0m0.070s
>> sys     0m0.006s
>>
>> == split-inline ==
>> real    0m0.055s
>> user    0m0.049s
>> sys     0m0.006s
>>
>> == mem ==
>> real    0m0.111s
>> user    0m0.108s
>> sys     0m0.002s
>>
>> == mem-split ==
>> real    0m0.088s
>> user    0m0.081s
>> sys     0m0.007s
>>
>> == mem-split-inline ==
>> real    0m0.070s
>> user    0m0.060s
>> sys     0m0.009s
>>
>> So in summary, removing the bit array does slow things down,
> I think that the problem lies in `print_kth' again. I've wrongly put
> a useless branch in it. See the attachment for a patch.

Did you forget to attach?

> Another problem may be the merging and the call to `xrealloc' that
> we do at the end of `set_fields'.

That's only called at startup, so I wouldn't worry too much.
What was your specific concern here?

>> but with the advantage of disassociating mem usage from range width.
>> I'll split the patch into two, for the mem change and the cpu change,
>> and might follow up with a subsequent patch to reinstate the bit array
>> for the common case of small -[bcf] and no --output-delim.
> My primary goal was to simplify the code. Even if the attached patch
> doesn't work, I think that detecting small -[bcf] ranges would just make
> the code more cumbersome.

Yes, it's a trade-off.
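To make that trade-off concrete, here is a minimal sketch (not the actual
coreutils source; all names are illustrative) of the two membership tests
being compared: a bit array gives an O(1) test per byte/field but needs
memory proportional to the largest range endpoint, while scanning a sorted,
merged range list uses constant memory per range at the cost of an O(n)
scan per call.

```c
/* Sketch only: two ways to answer "is byte/field K selected?".
   Not the coreutils implementation; names are hypothetical.  */
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

struct range { size_t lo, hi; };  /* inclusive endpoints */

/* Bit-array variant: O(1) lookup, memory grows with max endpoint.  */
static bool
bitarray_kth (const unsigned char *bits, size_t k)
{
  return (bits[k / CHAR_BIT] >> (k % CHAR_BIT)) & 1;
}

/* Range-list variant: O(n) scan of sorted, merged ranges,
   O(1) memory per range regardless of range width.  */
static bool
ranges_kth (const struct range *rp, size_t n, size_t k)
{
  for (size_t i = 0; i < n && rp[i].lo <= k; i++)
    if (k <= rp[i].hi)
      return true;
  return false;
}
```

This is why a range such as `-b1,1000000000` is cheap for the range list
but costly for the bit array, while per-byte lookup cost goes the other way.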
For often-used tools such as coreutils though, it's sometimes worth a
little extra complexity for performance reasons.
Here we might be able to guide the compiler around the branches like:

  print_kth ()
  {
    if likely (bitarray_used)
      ...
    else
      ...
  }

Anyway, I'll wait for your patch before considering carefully
whether to reinstate the bit array.

>> That's a common trend in these mem adjustment patches.
>> I.E. find a point to switch from the more CPU-efficient method,
>> to one which is more memory-efficient.
>>
>> thanks,
>> Pádraig.
>
> Please could you re-run the benchmarks after applying the patch?
> Could you also try with a bigger file (for example 100MB)? So as
> to make the difference among the various patches more clear.
> Unfortunately I'm under an emulator and the benchmarks aren't very
> faithful.

Sure. Eagerly awaiting the patch :)

Pádraig.
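For reference, the likely() hint sketched above is usually spelled with
gcc's __builtin_expect. A minimal, compilable illustration follows; the
macro definitions are the conventional Linux-kernel-style spellings, and
print_kth, bitarray_used, and the two lookup stand-ins are hypothetical,
not the coreutils source:

```c
/* Sketch of guiding the compiler toward the expected branch with
   gcc's __builtin_expect (gcc/clang extension, not standard C).  */
#include <stdbool.h>
#include <stddef.h>

#define likely(x)   __builtin_expect (!!(x), 1)
#define unlikely(x) __builtin_expect (!!(x), 0)

/* Hypothetical stand-ins for the two lookup strategies.  */
static bool test_bit (size_t k)    { return k % 2 == 0; }
static bool scan_ranges (size_t k) { return k % 2 == 0; }

/* Hypothetical mode flag: true when the bit array was built.  */
static bool bitarray_used = true;

/* Shape of the guided branch sketched in the mail above.  */
static bool
print_kth (size_t k)
{
  if (likely (bitarray_used))
    return test_bit (k);      /* expected hot path */
  else
    return scan_ranges (k);   /* fallback for wide ranges */
}
```

The hint only reorders the generated code to favor the hot path; both
branches remain correct, so a mispredicted mode costs cycles, not results.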