Hi all, I'm looking into this now, and I feel like there's potential memory corruption at the very end of the out_data_ array.
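To show what I mean concretely, here is a minimal standalone sketch of that loop (not the Arrow code: plain std::vector and a bool filter instead of Arrow's buffers and BitUtil, offsets dropped, and the output allocated to exactly the number of selected values, which is the assumption I'm making about the kernel's allocation). The .at() calls make the overrun visible: the third iteration touches index 2 even though only two slots were allocated.

```
// Standalone sketch only -- names mirror the kernel loop, but everything here
// is simplified; out_data/out_is_valid are sized to exactly the number of
// selected values (my assumption).
#include <cstdint>
#include <iostream>
#include <stdexcept>
#include <vector>

int main() {
  const std::vector<int32_t> values = {1, 2, 3, 4};
  const std::vector<bool> filter = {true, true, false, false};  // selects 2 values

  const std::size_t out_length = 2;  // popcount of the filter
  std::vector<int32_t> out_data(out_length);
  std::vector<bool> out_is_valid(out_length);

  std::size_t out_position = 0;
  try {
    for (std::size_t in_position = 0; in_position < values.size(); ++in_position) {
      const bool advance = filter[in_position];
      // Branch-free: both stores happen on every iteration, selected or not.
      out_is_valid.at(out_position) = advance;
      out_data.at(out_position) = values[in_position];
      out_position += advance;
    }
  } catch (const std::out_of_range&) {
    // Trips at in_position == 2: the loop stores to index 2, one slot past
    // the filtered length of 2.
    std::cout << "out-of-range store at out_position = " << out_position << "\n";
  }
  return 0;
}
```

If the output buffer is actually over-allocated or padded past the filtered length, the extra store would just be a harmless scratch write, but I haven't verified that.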
The algo in question:

bool advance = BitUtil::GetBit(filter_data_, filter_offset_ + in_position);
BitUtil::SetBitTo(out_is_valid_, out_offset_ + out_position_, advance);
out_data_[out_position_] = values_data_[in_position];
out_position_ += advance; // may need static_cast<int> here
++in_position;

Say,
in.data = [1, 2, 3, 4] // all valid
filt.data = [1, 1, 0, 0] // all valid

At in_position = 1, out_position_ becomes 2. And until the end of the loop
(i.e. in_position <= 3), out_data_[2] keeps getting overwritten (with
out_is_valid_[2] set to 0). I believe at the end of the loop the output would
be the following:

out.data = [1, 2, 4]
out.valid = [1, 1, 0]

Only two values pass the filter, so that store to index 2 is one slot past the
filtered output length. Wouldn't this be a problem?

On Fri, Jun 25, 2021 at 12:19 AM Nate Bauernfeind <natebauernfe...@deephaven.io>
wrote:

> > Basically, it reset/set the borrow bit in eflag register based on the if
> > condition, and runs `outpos = outpos - (-1) - borrow_bit`.
>
> That's clever, and I clearly didn't see that!
>
> On Thu, Jun 24, 2021 at 8:57 PM Yibo Cai <yibo....@arm.com> wrote:
>
> >
> > On 6/25/21 6:58 AM, Nate Bauernfeind wrote:
> > > FYI, the bench was slightly broken; but the results stand.
> > >
> > >> benchmark::DoNotOptimize(output[rand()]);
> > > Since rand() has a domain of 0 to MAX_INT it blows past the output array
> > > (of length 4k). It segfaults in GCC; I'm not sure why the Clang benchmark
> > > is happy with that.
> > >
> > > I modified [1] it to:
> > >> benchmark::DoNotOptimize(output[rand() % N]);
> > >
> > > The benchmarks run:
> > > The Clang11 -O3 speed up is 2.3x.
> > > The GCC10.2 -O3 speed up is 2.6x.
> > >
> > > Interestingly, I added a second branching benchmark. The original branching
> > > does this:
> > > ```
> > > if (bitmap[i/8] & (1 << (i%8))) {
> > >     output[outpos++] = input[i];
> > > }
> > > ```
> > >
> > > My additional branching benchmark pulls the assignment outside of the
> > > branch:
> > > ```
> > > output[outpos] = input[i];
> > > if (bitmap[i/8] & (1 << (i%8))) {
> > >     ++outpos;
> > > }
> > > ```
> >
> > From the disassembler, compiler optimizes this c code with below
> > branch-less assembly code.
> >
> >   10.06%  sar  %cl,%edx
> >   10.74%  and  $0x1,%edx
> >    9.93%  cmp  $0x1,%edx
> >           sbb  $0xffffffff,%esi
> >
> > Basically, it reset/set the borrow bit in eflag register based on the if
> > condition, and runs `outpos = outpos - (-1) - borrow_bit`.
> >
> > > The benchmarks run:
> > > The GCC10.2 -O3 speed up compared to the original branching code is 2.3x.
> > > (are you as surprised as me?)
> > > The GCC10.2 -O3 speed up compared to the original non-branching code is
> > > 0.9x. (yes; it is slightly slower)
> > >
> > > Point: reducing the footprint of a false branch prediction is as worthwhile
> > > an investment when you can't simply get rid of the conditional.
> > >
> > > Nate
> > > [1] https://quick-bench.com/q/kDFoF2pOuvPo9aVFufkVJMjWf-g
> > >
> > > On Thu, Jun 24, 2021 at 1:01 PM Niranda Perera <niranda.per...@gmail.com>
> > > wrote:
> > >
> > >> I created a JIRA for this. I will do the changes in select kernels and
> > >> report back with benchmark results
> > >> https://issues.apache.org/jira/browse/ARROW-13170
> > >>
> > >> On Thu, Jun 24, 2021 at 12:27 AM Yibo Cai <yibo....@arm.com> wrote:
> > >>
> > >>> Did a quick test. For random bitmaps and my trivial test code, the
> > >>> branch-less code is 3.5x faster than branch one.
> > >>> https://quick-bench.com/q/UD22IIdMgKO9HU1PsPezj05Kkro
> > >>>
> > >>> On 6/23/21 11:21 PM, Wes McKinney wrote:
> > >>>> One project I was interested in getting to but haven't had the time
> > >>>> was introducing branch-free code into vector_selection.cc and reducing
> > >>>> the use of if-statements to try to improve performance.
> > >>>>
> > >>>> One way to do this is to take code that looks like this:
> > >>>>
> > >>>> if (BitUtil::GetBit(filter_data_, filter_offset_ + in_position)) {
> > >>>>     BitUtil::SetBit(out_is_valid_, out_offset_ + out_position_);
> > >>>>     out_data_[out_position_++] = values_data_[in_position];
> > >>>> }
> > >>>> ++in_position;
> > >>>>
> > >>>> and change it to a branch-free version
> > >>>>
> > >>>> bool advance = BitUtil::GetBit(filter_data_, filter_offset_ + in_position);
> > >>>> BitUtil::SetBitTo(out_is_valid_, out_offset_ + out_position_, advance);
> > >>>> out_data_[out_position_] = values_data_[in_position];
> > >>>> out_position_ += advance; // may need static_cast<int> here
> > >>>> ++in_position;
> > >>>>
> > >>>> Since more people are working on kernels and computing now, I thought
> > >>>> this might be an interesting project for someone to explore and see
> > >>>> what improvements are possible (and what the differences between e.g.
> > >>>> x86 and ARM architecture are like when it comes to reducing
> > >>>> branching). Another thing to look at might be batch-at-a-time
> > >>>> bitpacking in the output bitmap versus bit-at-a-time.
> > >>>
> > >>
> > >> --
> > >> Niranda Perera
> > >> https://niranda.dev/
> > >> @n1r44 <https://twitter.com/N1R44>
> > >
> >
> --
>

--
Niranda Perera
https://niranda.dev/
@n1r44 <https://twitter.com/N1R44>