FYI, the bench was slightly broken; but the results stand.

> benchmark::DoNotOptimize(output[rand()]);
Since rand() has a domain of 0 to MAX_INT it blows past the output array
(of length 4k). It segfaults in GCC; I'm not sure why the Clang benchmark
is happy with that.

I modified [1] it to:
> benchmark::DoNotOptimize(output[rand() % N]);

The benchmarks run:
The Clang11 -O3 speed up is 2.3x.
The GCC10.2 -O3 speed up is 2.6x.

Interestingly, I added a second branching benchmark. The original branching
does this:
```
      if (bitmap[i/8] & (1 << (i%8))) {
        output[outpos++] = input[i];
      }
```

My additional branching benchmark pulls the assignment outside of the
branch:
```
      output[outpos] = input[i];
      if (bitmap[i/8] & (1 << (i%8))) {
        ++outpos;
      }
```

The benchmarks run:
The GCC10.2 -O3 speed up compared to the original branching code is 2.3x.
(are you as surprised as me?)
The GCC10.2 -O3 speed up compared to the original non-branching code is
0.9x. (yes; it is slightly slower)

Point: reducing the footprint of a false branch prediction is as worthwhile
an investment when you can't simply get rid of the conditional.

Nate
[1] https://quick-bench.com/q/kDFoF2pOuvPo9aVFufkVJMjWf-g

On Thu, Jun 24, 2021 at 1:01 PM Niranda Perera <niranda.per...@gmail.com>
wrote:

> I created a JIRA for this. I will do the changes in select kernels and
> report back with benchmark results
> https://issues.apache.org/jira/browse/ARROW-13170
>
>
> On Thu, Jun 24, 2021 at 12:27 AM Yibo Cai <yibo....@arm.com> wrote:
>
> > Did a quick test. For random bitmaps and my trivial test code, the
> > branch-less code is 3.5x faster than branch one.
> > https://quick-bench.com/q/UD22IIdMgKO9HU1PsPezj05Kkro
> >
> > On 6/23/21 11:21 PM, Wes McKinney wrote:
> > > One project I was interested in getting to but haven't had the time
> > > was introducing branch-free code into vector_selection.cc and reducing
> > > the use of if-statements to try to improve performance.
> > >
> > > One way to do this is to take code that looks like this:
> > >
> > > if (BitUtil::GetBit(filter_data_, filter_offset_ + in_position)) {
> > >    BitUtil::SetBit(out_is_valid_, out_offset_ + out_position_);
> > >    out_data_[out_position_++] = values_data_[in_position];
> > > }
> > > ++in_position;
> > >
> > > and change it to a branch-free version
> > >
> > > bool advance = BitUtil::GetBit(filter_data_, filter_offset_ +
> > in_position);
> > > BitUtil::SetBitTo(out_is_valid_, out_offset_ + out_position_, advance);
> > > out_data_[out_position_] = values_data_[in_position];
> > > out_position_ += advance; // may need static_cast<int> here
> > > ++in_position;
> > >
> > > Since more people are working on kernels and computing now, I thought
> > > this might be an interesting project for someone to explore and see
> > > what improvements are possible (and what the differences between e.g.
> > > x86 and ARM architecture are like when it comes to reducing
> > > branching). Another thing to look at might be batch-at-a-time
> > > bitpacking in the output bitmap versus bit-at-a-time.
> > >
> >
>
>
> --
> Niranda Perera
> https://niranda.dev/
> @n1r44 <https://twitter.com/N1R44>
>


--

Reply via email to