vector_selection.cc

Yibo Cai Thu, 24 Jun 2021 19:57:25 -0700



On 6/25/21 6:58 AM, Nate Bauernfeind wrote:

FYI, the bench was slightly broken; but the results stand.

benchmark::DoNotOptimize(output[rand()]);

Since rand() has a domain of 0 to MAX_INT it blows past the output array
(of length 4k). It segfaults in GCC; I'm not sure why the Clang benchmark
is happy with that.

I modified [1] it to:

benchmark::DoNotOptimize(output[rand() % N]);


The benchmarks run:
The Clang11 -O3 speed up is 2.3x.
The GCC10.2 -O3 speed up is 2.6x.

Interestingly, I added a second branching benchmark. The original branching
does this:
```
       if (bitmap[i/8] & (1 << (i%8))) {
         output[outpos++] = input[i];
       }

```


My additional branching benchmark pulls the assignment outside of the
branch:
```
       output[outpos] = input[i];
       if (bitmap[i/8] & (1 << (i%8))) {
         ++outpos;
       }
```


From the disassembler, compiler optimizes this c code with below
branch-less assembly code.

10.06% sar    %cl,%edx
10.74% and    $0x1,%edx
9.93%  cmp    $0x1,%edx
       sbb    $0xffffffff,%esi

Basically, it reset/set the borrow bit in eflag register based on the ifcondition, and runs `outpos = outpos - (-1) - borrow_bit`.


The benchmarks run:
The GCC10.2 -O3 speed up compared to the original branching code is 2.3x.
(are you as surprised as me?)
The GCC10.2 -O3 speed up compared to the original non-branching code is
0.9x. (yes; it is slightly slower)

Point: reducing the footprint of a false branch prediction is as worthwhile
an investment when you can't simply get rid of the conditional.

Nate
[1] https://quick-bench.com/q/kDFoF2pOuvPo9aVFufkVJMjWf-g

On Thu, Jun 24, 2021 at 1:01 PM Niranda Perera <[email protected]>
wrote:

I created a JIRA for this. I will do the changes in select kernels and
report back with benchmark results
https://issues.apache.org/jira/browse/ARROW-13170


On Thu, Jun 24, 2021 at 12:27 AM Yibo Cai <[email protected]> wrote:

Did a quick test. For random bitmaps and my trivial test code, the
branch-less code is 3.5x faster than branch one.
https://quick-bench.com/q/UD22IIdMgKO9HU1PsPezj05Kkro

On 6/23/21 11:21 PM, Wes McKinney wrote:

One project I was interested in getting to but haven't had the time
was introducing branch-free code into vector_selection.cc and reducing
the use of if-statements to try to improve performance.

One way to do this is to take code that looks like this:

if (BitUtil::GetBit(filter_data_, filter_offset_ + in_position)) {
    BitUtil::SetBit(out_is_valid_, out_offset_ + out_position_);
    out_data_[out_position_++] = values_data_[in_position];
}
++in_position;

and change it to a branch-free version

bool advance = BitUtil::GetBit(filter_data_, filter_offset_ +

in_position);

BitUtil::SetBitTo(out_is_valid_, out_offset_ + out_position_, advance);
out_data_[out_position_] = values_data_[in_position];
out_position_ += advance; // may need static_cast<int> here
++in_position;

Since more people are working on kernels and computing now, I thought
this might be an interesting project for someone to explore and see
what improvements are possible (and what the differences between e.g.
x86 and ARM architecture are like when it comes to reducing
branching). Another thing to look at might be batch-at-a-time
bitpacking in the output bitmap versus bit-at-a-time.



--
Niranda Perera
https://niranda.dev/
@n1r44 <https://twitter.com/N1R44>

--

Re: [C++] Reducing branching in compute/kernels/vector_selection.cc

Reply via email to