Thanks for the interesting test, David. At Tue, 14 Nov 2017 17:00:12 +1300, David Rowley <david.row...@2ndquadrant.com> wrote in <CAKJS1f-1Lc_b=Y1iicPQzvUgSn1keHSgmRqLuOGq_VR6M==z...@mail.gmail.com> > On 13 November 2017 at 22:46, Amit Langote > <langote_amit...@lab.ntt.co.jp> wrote: > > On 2017/11/10 12:30, Kyotaro HORIGUCHI wrote: > >> Without some aggressive optimization, the loop takes most of the > >> time to check-and-jump for nothing especially with many > >> partitions and somewhat unintuitive. > >> > >> The following uses a bit tricky bitmap operation but > >> is straightforward as a whole. > >> > >> ===== > >> /* fill the bits upper from BITNUM(lower) (0-based) of the first word */ > >> a->workds[wordnum++] += ~(bitmapword)((1 << BITNUM(lower)) - 1);
'+='.. ^^; > >> /* fill up intermediate words */ > >> while (wordnum < uwordnum) > >> a->words[wordnum++] = ~(bitmapword) 0; > >> > >> /* fill up to BITNUM(upper) bit (0-based) of the last word */ > >> a->workds[wordnum++] |= > >> (~(bitmapword) 0) >> (BITS_PER_BITMAPWORD - (BITNUM(upper) - 1)); > >> ===== > > > > Considering also the David's comment downthread, I will try to incorporate > > this into bms_add_range(). > > I've attached an implementation of the patch using this method. > > I've also attached bitset.c which runs each through their paces. I'd > have expected Kyotaro's method to be faster, but gcc 7.2 with -O2 > generates very slightly slower code. I didn't really check why. clang > seems to do a better job with it. .. > $ gcc -O2 bitset.c -o bitset && ./bitset > bms_add_range in 0.694254 (6.94254 ns per loop) > bms_add_range2 in 0.726643 (7.26643 ns per loop) .. > $ gcc --version > gcc (Ubuntu 7.2.0-8ubuntu3) 7.2.0 > Copyright (C) 2017 Free Software Foundation, Inc. > This is free software; see the source for copying conditions. There is NO > warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. Hmm bms_add_range doesn't seem getting so aggressive optimization but I had a similar result. Looking the output of gcc -S, I found that bms_add_range() is embedded in main(). (gcc 7.1.0) It's not surprizing after finding that but.. Anyway I added __attribute((noinline)) to the two functions and got the following result. > bms_add_range in 1.24 (12.4 ns per loop) > bms_add_range2 in 0.8 (8 ns per loop) It seems reasonable. > $ clang -O2 bitset.c -o bitset && ./bitset > bms_add_range in 0.866554 (8.66554 ns per loop) > bms_add_range2 in 0.467138 (4.67138 ns per loop) .. > $ clang --version > clang version 4.0.1-6 (tags/RELEASE_401/final) > Target: x86_64-pc-linux-gnu > Thread model: posix > InstalledDir: /usr/bin > > Probably just go with Kyotaro's idea (v2). I don't think this is worth > debating, I just wanted to show it's not that clear-cut. I agree that it's not so clear-cut. regard, -- Kyotaro Horiguchi NTT Open Source Software Center