> Basically, it resets/sets the borrow bit in the EFLAGS register based on
> the if condition, and runs `outpos = outpos - (-1) - borrow_bit`.
That's clever, and I clearly didn't see that!

On Thu, Jun 24, 2021 at 8:57 PM Yibo Cai <yibo....@arm.com> wrote:

> On 6/25/21 6:58 AM, Nate Bauernfeind wrote:
> > FYI, the bench was slightly broken; but the results stand.
> >
> >> benchmark::DoNotOptimize(output[rand()]);
> > Since rand() has a domain of 0 to MAX_INT, it blows past the output array
> > (of length 4k). It segfaults in GCC; I'm not sure why the Clang benchmark
> > is happy with that.
> >
> > I modified [1] it to:
> >> benchmark::DoNotOptimize(output[rand() % N]);
> >
> > The benchmarks run:
> > The Clang11 -O3 speed-up is 2.3x.
> > The GCC10.2 -O3 speed-up is 2.6x.
> >
> > Interestingly, I added a second branching benchmark. The original
> > branching does this:
> > ```
> > if (bitmap[i/8] & (1 << (i%8))) {
> >   output[outpos++] = input[i];
> > }
> > ```
> >
> > My additional branching benchmark pulls the assignment outside of the
> > branch:
> > ```
> > output[outpos] = input[i];
> > if (bitmap[i/8] & (1 << (i%8))) {
> >   ++outpos;
> > }
> > ```
>
> From the disassembler, the compiler optimizes this C code into the
> branch-less assembly below:
>
> 10.06%  sar  %cl,%edx
> 10.74%  and  $0x1,%edx
>  9.93%  cmp  $0x1,%edx
>         sbb  $0xffffffff,%esi
>
> Basically, it resets/sets the borrow bit in the EFLAGS register based on
> the if condition, and runs `outpos = outpos - (-1) - borrow_bit`.
>
> > The benchmarks run:
> > The GCC10.2 -O3 speed-up compared to the original branching code is 2.3x.
> > (are you as surprised as me?)
> > The GCC10.2 -O3 speed-up compared to the original non-branching code is
> > 0.9x. (yes; it is slightly slower)
> >
> > Point: reducing the footprint of a false branch prediction is a
> > worthwhile investment when you can't simply get rid of the conditional.
> >
> > Nate
> > [1] https://quick-bench.com/q/kDFoF2pOuvPo9aVFufkVJMjWf-g
> >
> > On Thu, Jun 24, 2021 at 1:01 PM Niranda Perera <niranda.per...@gmail.com>
> > wrote:
> >
> >> I created a JIRA for this. I will do the changes in the select kernels
> >> and report back with benchmark results:
> >> https://issues.apache.org/jira/browse/ARROW-13170
> >>
> >> On Thu, Jun 24, 2021 at 12:27 AM Yibo Cai <yibo....@arm.com> wrote:
> >>
> >>> Did a quick test. For random bitmaps and my trivial test code, the
> >>> branch-less code is 3.5x faster than the branching one:
> >>> https://quick-bench.com/q/UD22IIdMgKO9HU1PsPezj05Kkro
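
To make the comparison concrete without opening the quick-bench links, here is a
minimal standalone sketch of the two loop shapes being measured. The function
names are mine and the bodies are illustrative, not the exact benchmark code
behind the links:

```
#include <cstddef>
#include <cstdint>

// Branching form: copy input[i] only when bit i of the selection bitmap is set.
std::size_t FilterBranching(const std::uint8_t* bitmap, const std::int32_t* input,
                            std::int32_t* output, std::size_t n) {
  std::size_t outpos = 0;
  for (std::size_t i = 0; i < n; ++i) {
    if (bitmap[i / 8] & (1 << (i % 8))) {
      output[outpos++] = input[i];
    }
  }
  return outpos;
}

// Branch-less form: store unconditionally, then advance the output cursor by
// the filter bit (0 or 1). Note that `output` must have room for n elements.
std::size_t FilterBranchless(const std::uint8_t* bitmap, const std::int32_t* input,
                             std::int32_t* output, std::size_t n) {
  std::size_t outpos = 0;
  for (std::size_t i = 0; i < n; ++i) {
    output[outpos] = input[i];
    outpos += (bitmap[i / 8] >> (i % 8)) & 1;
  }
  return outpos;
}
```

The arithmetic increment in the second loop is the `outpos = outpos - (-1) -
borrow_bit` effect described above, written out in C++ instead of relying on
the compiler to discover the cmp/sbb form.
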
> >>> On 6/23/21 11:21 PM, Wes McKinney wrote:
> >>>> One project I was interested in getting to but haven't had the time
> >>>> was introducing branch-free code into vector_selection.cc and reducing
> >>>> the use of if-statements to try to improve performance.
> >>>>
> >>>> One way to do this is to take code that looks like this:
> >>>>
> >>>> if (BitUtil::GetBit(filter_data_, filter_offset_ + in_position)) {
> >>>>   BitUtil::SetBit(out_is_valid_, out_offset_ + out_position_);
> >>>>   out_data_[out_position_++] = values_data_[in_position];
> >>>> }
> >>>> ++in_position;
> >>>>
> >>>> and change it to a branch-free version
> >>>>
> >>>> bool advance = BitUtil::GetBit(filter_data_, filter_offset_ + in_position);
> >>>> BitUtil::SetBitTo(out_is_valid_, out_offset_ + out_position_, advance);
> >>>> out_data_[out_position_] = values_data_[in_position];
> >>>> out_position_ += advance;  // may need static_cast<int> here
> >>>> ++in_position;
> >>>>
> >>>> Since more people are working on kernels and computing now, I thought
> >>>> this might be an interesting project for someone to explore and see
> >>>> what improvements are possible (and what the differences between e.g.
> >>>> x86 and ARM architecture are like when it comes to reducing
> >>>> branching). Another thing to look at might be batch-at-a-time
> >>>> bitpacking in the output bitmap versus bit-at-a-time.
> >>
> >> --
> >> Niranda Perera
> >> https://niranda.dev/
> >> @n1r44 <https://twitter.com/N1R44>
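
For anyone picking up ARROW-13170, here is a self-contained sketch of the
branch-free shape Wes describes above. The `GetBit`/`SetBitTo` helpers are
written inline as stand-ins for `arrow::BitUtil` so the snippet compiles on its
own, and the flat-pointer arguments, `int32_t` element type, and zero offsets
are simplifying assumptions rather than the actual kernel signature in
vector_selection.cc:

```
#include <cstddef>
#include <cstdint>

// Stand-ins for arrow::BitUtil::GetBit / SetBitTo (LSB-first bit order).
inline bool GetBit(const std::uint8_t* bits, std::size_t i) {
  return (bits[i / 8] >> (i % 8)) & 1;
}

inline void SetBitTo(std::uint8_t* bits, std::size_t i, bool value) {
  // Branch-free bit update: clear the target bit, then OR in 0 or 1.
  bits[i / 8] = static_cast<std::uint8_t>(
      (bits[i / 8] & ~(1u << (i % 8))) |
      (static_cast<unsigned>(value) << (i % 8)));
}

// Branch-free filter: every iteration stores a value and writes a validity
// bit, and no control flow depends on the filter bit, so there is nothing for
// the branch predictor to mispredict. Output buffers must have room for
// `length` entries. Returns the number of selected values.
std::size_t FilterBranchFree(const std::int32_t* values_data,
                             const std::uint8_t* filter_data,
                             std::size_t length,
                             std::int32_t* out_data,
                             std::uint8_t* out_is_valid) {
  std::size_t out_position = 0;
  for (std::size_t in_position = 0; in_position < length; ++in_position) {
    const bool advance = GetBit(filter_data, in_position);
    SetBitTo(out_is_valid, out_position, advance);
    out_data[out_position] = values_data[in_position];
    out_position += advance;
  }
  return out_position;
}
```

The unconditional store when `advance` is false is what removes the branch; the
trade-off is that the output buffers must be sized for the worst case of one
slot per input (or the loop must run over chunks where that is guaranteed).
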