> Basically, it resets/sets the borrow bit in the EFLAGS register based on
> the if condition, and runs `outpos = outpos - (-1) - borrow_bit`.
That's clever, and I clearly didn't see that!

On Thu, Jun 24, 2021 at 8:57 PM Yibo Cai <yibo....@arm.com> wrote:

> On 6/25/21 6:58 AM, Nate Bauernfeind wrote:
> > FYI, the bench was slightly broken; but the results stand.
> >
> >> benchmark::DoNotOptimize(output[rand()]);
> > Since rand() has a domain of 0 to MAX_INT, it blows past the output array
> > (of length 4k). It segfaults in GCC; I'm not sure why the Clang benchmark
> > is happy with that.
> >
> > I modified [1] it to:
> >> benchmark::DoNotOptimize(output[rand() % N]);
> >
> > The benchmarks run:
> > The Clang11 -O3 speed-up is 2.3x.
> > The GCC10.2 -O3 speed-up is 2.6x.
> >
> > Interestingly, I added a second branching benchmark. The original
> > branching does this:
> > ```
> > if (bitmap[i/8] & (1 << (i%8))) {
> >   output[outpos++] = input[i];
> > }
> > ```
> >
> > My additional branching benchmark pulls the assignment outside of the
> > branch:
> > ```
> > output[outpos] = input[i];
> > if (bitmap[i/8] & (1 << (i%8))) {
> >   ++outpos;
> > }
> > ```
>
> From the disassembler, the compiler optimizes this C code into the
> branch-less assembly below:
>
> 10.06%  sar  %cl,%edx
> 10.74%  and  $0x1,%edx
>  9.93%  cmp  $0x1,%edx
>         sbb  $0xffffffff,%esi
>
> Basically, it resets/sets the borrow bit in the EFLAGS register based on
> the if condition, and runs `outpos = outpos - (-1) - borrow_bit`.
>
> > The benchmarks run:
> > The GCC10.2 -O3 speed-up compared to the original branching code is 2.3x.
> > (are you as surprised as me?)
> > The GCC10.2 -O3 speed-up compared to the original non-branching code is
> > 0.9x. (yes; it is slightly slower)
> >
> > Point: reducing the footprint of a false branch prediction is a
> > worthwhile investment when you can't simply get rid of the conditional.
> >
> > Nate
> > [1] https://quick-bench.com/q/kDFoF2pOuvPo9aVFufkVJMjWf-g
> >
> > On Thu, Jun 24, 2021 at 1:01 PM Niranda Perera <niranda.per...@gmail.com>
> > wrote:
> >
> >> I created a JIRA for this. I will do the changes in the select kernels
> >> and report back with benchmark results:
> >> https://issues.apache.org/jira/browse/ARROW-13170
> >>
> >> On Thu, Jun 24, 2021 at 12:27 AM Yibo Cai <yibo....@arm.com> wrote:
> >>
> >>> Did a quick test. For random bitmaps and my trivial test code, the
> >>> branch-less code is 3.5x faster than the branching one:
> >>> https://quick-bench.com/q/UD22IIdMgKO9HU1PsPezj05Kkro
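
To make the comparison concrete without opening the quick-bench links, here is a
minimal standalone sketch of the two loop shapes being measured. The function
names are mine and the bodies are illustrative, not the exact benchmark code
behind the links:

```
#include <cstddef>
#include <cstdint>

// Branching form: copy input[i] only when bit i of the selection bitmap is set.
std::size_t FilterBranching(const std::uint8_t* bitmap, const std::int32_t* input,
                            std::int32_t* output, std::size_t n) {
  std::size_t outpos = 0;
  for (std::size_t i = 0; i < n; ++i) {
    if (bitmap[i / 8] & (1 << (i % 8))) {
      output[outpos++] = input[i];
    }
  }
  return outpos;
}

// Branch-less form: store unconditionally, then advance the output cursor by
// the filter bit (0 or 1). Note that `output` must have room for n elements.
std::size_t FilterBranchless(const std::uint8_t* bitmap, const std::int32_t* input,
                             std::int32_t* output, std::size_t n) {
  std::size_t outpos = 0;
  for (std::size_t i = 0; i < n; ++i) {
    output[outpos] = input[i];
    outpos += (bitmap[i / 8] >> (i % 8)) & 1;
  }
  return outpos;
}
```

The arithmetic increment in the second loop is the `outpos = outpos - (-1) -
borrow_bit` effect described above, written out in C++ instead of relying on
the compiler to discover the cmp/sbb form.
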
> >>> On 6/23/21 11:21 PM, Wes McKinney wrote:
> >>>> One project I was interested in getting to but haven't had the time
> >>>> was introducing branch-free code into vector_selection.cc and reducing
> >>>> the use of if-statements to try to improve performance.
> >>>>
> >>>> One way to do this is to take code that looks like this:
> >>>>
> >>>> if (BitUtil::GetBit(filter_data_, filter_offset_ + in_position)) {
> >>>>   BitUtil::SetBit(out_is_valid_, out_offset_ + out_position_);
> >>>>   out_data_[out_position_++] = values_data_[in_position];
> >>>> }
> >>>> ++in_position;
> >>>>
> >>>> and change it to a branch-free version
> >>>>
> >>>> bool advance = BitUtil::GetBit(filter_data_, filter_offset_ + in_position);
> >>>> BitUtil::SetBitTo(out_is_valid_, out_offset_ + out_position_, advance);
> >>>> out_data_[out_position_] = values_data_[in_position];
> >>>> out_position_ += advance;  // may need static_cast<int> here
> >>>> ++in_position;
> >>>>
> >>>> Since more people are working on kernels and computing now, I thought
> >>>> this might be an interesting project for someone to explore and see
> >>>> what improvements are possible (and what the differences between e.g.
> >>>> x86 and ARM architecture are like when it comes to reducing
> >>>> branching). Another thing to look at might be batch-at-a-time
> >>>> bitpacking in the output bitmap versus bit-at-a-time.
> >>
> >> --
> >> Niranda Perera
> >> https://niranda.dev/
> >> @n1r44 <https://twitter.com/N1R44>
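
For anyone picking up ARROW-13170, here is a self-contained sketch of the
branch-free shape Wes describes above. The `GetBit`/`SetBitTo` helpers are
written inline as stand-ins for `arrow::BitUtil` so the snippet compiles on its
own, and the flat-pointer arguments, `int32_t` element type, and zero offsets
are simplifying assumptions rather than the actual kernel signature in
vector_selection.cc:

```
#include <cstddef>
#include <cstdint>

// Stand-ins for arrow::BitUtil::GetBit / SetBitTo (LSB-first bit order).
inline bool GetBit(const std::uint8_t* bits, std::size_t i) {
  return (bits[i / 8] >> (i % 8)) & 1;
}

inline void SetBitTo(std::uint8_t* bits, std::size_t i, bool value) {
  // Branch-free bit update: clear the target bit, then OR in 0 or 1.
  bits[i / 8] = static_cast<std::uint8_t>(
      (bits[i / 8] & ~(1u << (i % 8))) |
      (static_cast<unsigned>(value) << (i % 8)));
}

// Branch-free filter: every iteration stores a value and writes a validity
// bit, and no control flow depends on the filter bit, so there is nothing for
// the branch predictor to mispredict. Output buffers must have room for
// `length` entries. Returns the number of selected values.
std::size_t FilterBranchFree(const std::int32_t* values_data,
                             const std::uint8_t* filter_data,
                             std::size_t length,
                             std::int32_t* out_data,
                             std::uint8_t* out_is_valid) {
  std::size_t out_position = 0;
  for (std::size_t in_position = 0; in_position < length; ++in_position) {
    const bool advance = GetBit(filter_data, in_position);
    SetBitTo(out_is_valid, out_position, advance);
    out_data[out_position] = values_data[in_position];
    out_position += advance;
  }
  return out_position;
}
```

The unconditional store when `advance` is false is what removes the branch; the
trade-off is that the output buffers must be sized for the worst case of one
slot per input (or the loop must run over chunks where that is guaranteed).
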