FYI, the bench was slightly broken; but the results stand. > benchmark::DoNotOptimize(output[rand()]); Since rand() has a domain of 0 to MAX_INT it blows past the output array (of length 4k). It segfaults in GCC; I'm not sure why the Clang benchmark is happy with that.
I modified [1] it to: > benchmark::DoNotOptimize(output[rand() % N]); The benchmarks run: The Clang11 -O3 speed up is 2.3x. The GCC10.2 -O3 speed up is 2.6x. Interestingly, I added a second branching benchmark. The original branching does this: ``` if (bitmap[i/8] & (1 << (i%8))) { output[outpos++] = input[i]; } ``` My additional branching benchmark pulls the assignment outside of the branch: ``` output[outpos] = input[i]; if (bitmap[i/8] & (1 << (i%8))) { ++outpos; } ``` The benchmarks run: The GCC10.2 -O3 speed up compared to the original branching code is 2.3x. (are you as surprised as me?) The GCC10.2 -O3 speed up compared to the original non-branching code is 0.9x. (yes; it is slightly slower) Point: reducing the footprint of a false branch prediction is as worthwhile an investment when you can't simply get rid of the conditional. Nate [1] https://quick-bench.com/q/kDFoF2pOuvPo9aVFufkVJMjWf-g On Thu, Jun 24, 2021 at 1:01 PM Niranda Perera <niranda.per...@gmail.com> wrote: > I created a JIRA for this. I will do the changes in select kernels and > report back with benchmark results > https://issues.apache.org/jira/browse/ARROW-13170 > > > On Thu, Jun 24, 2021 at 12:27 AM Yibo Cai <yibo....@arm.com> wrote: > > > Did a quick test. For random bitmaps and my trivial test code, the > > branch-less code is 3.5x faster than branch one. > > https://quick-bench.com/q/UD22IIdMgKO9HU1PsPezj05Kkro > > > > On 6/23/21 11:21 PM, Wes McKinney wrote: > > > One project I was interested in getting to but haven't had the time > > > was introducing branch-free code into vector_selection.cc and reducing > > > the use of if-statements to try to improve performance. > > > > > > One way to do this is to take code that looks like this: > > > > > > if (BitUtil::GetBit(filter_data_, filter_offset_ + in_position)) { > > > BitUtil::SetBit(out_is_valid_, out_offset_ + out_position_); > > > out_data_[out_position_++] = values_data_[in_position]; > > > } > > > ++in_position; > > > > > > and change it to a branch-free version > > > > > > bool advance = BitUtil::GetBit(filter_data_, filter_offset_ + > > in_position); > > > BitUtil::SetBitTo(out_is_valid_, out_offset_ + out_position_, advance); > > > out_data_[out_position_] = values_data_[in_position]; > > > out_position_ += advance; // may need static_cast<int> here > > > ++in_position; > > > > > > Since more people are working on kernels and computing now, I thought > > > this might be an interesting project for someone to explore and see > > > what improvements are possible (and what the differences between e.g. > > > x86 and ARM architecture are like when it comes to reducing > > > branching). Another thing to look at might be batch-at-a-time > > > bitpacking in the output bitmap versus bit-at-a-time. > > > > > > > > -- > Niranda Perera > https://niranda.dev/ > @n1r44 <https://twitter.com/N1R44> > --