Hi all,

I'm looking into this now, and I feel like there's a potential memory
corruption at the very end of the out_data_ array.

algo:
bool advance = BitUtil::GetBit(filter_data_, filter_offset_ + in_position);
BitUtil::SetBitTo(out_is_valid_, out_offset_ + out_position_, advance);
out_data_[out_position_] = values_data_[in_position];
out_position_ += advance; // may need static_cast<int> here
++in_position;
say,
in.data =   [1, 2, 3, 4] // all valid
filt.data = [1, 1, 0, 0] // all valid
At in_position = 1, out_position becomes 2. And till the end of the loop
(i.e. in_position <= 3), out_position[2] will be updated (but with
out_is_valid_[2] set to 0). I belive at the end of the loop, following
would be the output.
out.data =  [1, 2, 4]
out.valid = [1, 1, 0]

Wouldn't this be a problem?


On Fri, Jun 25, 2021 at 12:19 AM Nate Bauernfeind <
natebauernfe...@deephaven.io> wrote:

> > Basically, it reset/set the borrow bit in eflag register based on the if
> condition, and runs `outpos = outpos - (-1) - borrow_bit`.
>
> That's clever, and I clearly didn't see that!
>
> On Thu, Jun 24, 2021 at 8:57 PM Yibo Cai <yibo....@arm.com> wrote:
>
> >
> >
> > On 6/25/21 6:58 AM, Nate Bauernfeind wrote:
> > > FYI, the bench was slightly broken; but the results stand.
> > >
> > >> benchmark::DoNotOptimize(output[rand()]);
> > > Since rand() has a domain of 0 to MAX_INT it blows past the output
> array
> > > (of length 4k). It segfaults in GCC; I'm not sure why the Clang
> benchmark
> > > is happy with that.
> > >
> > > I modified [1] it to:
> > >> benchmark::DoNotOptimize(output[rand() % N]);
> > >
> > > The benchmarks run:
> > > The Clang11 -O3 speed up is 2.3x.
> > > The GCC10.2 -O3 speed up is 2.6x.
> > >
> > > Interestingly, I added a second branching benchmark. The original
> > branching
> > > does this:
> > > ```
> > >        if (bitmap[i/8] & (1 << (i%8))) {
> > >          output[outpos++] = input[i];
> > >        }
> > >
> > ```
> > >
> > > My additional branching benchmark pulls the assignment outside of the
> > > branch:
> > > ```
> > >        output[outpos] = input[i];
> > >        if (bitmap[i/8] & (1 << (i%8))) {
> > >          ++outpos;
> > >        }
> > > ```
> >
> >  From the disassembler, compiler optimizes this c code with below
> > branch-less assembly code.
> >
> > 10.06% sar    %cl,%edx
> > 10.74% and    $0x1,%edx
> > 9.93%  cmp    $0x1,%edx
> >         sbb    $0xffffffff,%esi
> >
> > Basically, it reset/set the borrow bit in eflag register based on the if
> > condition, and runs `outpos = outpos - (-1) - borrow_bit`.
> >
> > >
> > > The benchmarks run:
> > > The GCC10.2 -O3 speed up compared to the original branching code is
> 2.3x.
> > > (are you as surprised as me?)
> > > The GCC10.2 -O3 speed up compared to the original non-branching code is
> > > 0.9x. (yes; it is slightly slower)
> > >
> > > Point: reducing the footprint of a false branch prediction is as
> > worthwhile
> > > an investment when you can't simply get rid of the conditional.
> > >
> > > Nate
> > > [1] https://quick-bench.com/q/kDFoF2pOuvPo9aVFufkVJMjWf-g
> > >
> > > On Thu, Jun 24, 2021 at 1:01 PM Niranda Perera <
> niranda.per...@gmail.com
> > >
> > > wrote:
> > >
> > >> I created a JIRA for this. I will do the changes in select kernels and
> > >> report back with benchmark results
> > >> https://issues.apache.org/jira/browse/ARROW-13170
> > >>
> > >>
> > >> On Thu, Jun 24, 2021 at 12:27 AM Yibo Cai <yibo....@arm.com> wrote:
> > >>
> > >>> Did a quick test. For random bitmaps and my trivial test code, the
> > >>> branch-less code is 3.5x faster than branch one.
> > >>> https://quick-bench.com/q/UD22IIdMgKO9HU1PsPezj05Kkro
> > >>>
> > >>> On 6/23/21 11:21 PM, Wes McKinney wrote:
> > >>>> One project I was interested in getting to but haven't had the time
> > >>>> was introducing branch-free code into vector_selection.cc and
> reducing
> > >>>> the use of if-statements to try to improve performance.
> > >>>>
> > >>>> One way to do this is to take code that looks like this:
> > >>>>
> > >>>> if (BitUtil::GetBit(filter_data_, filter_offset_ + in_position)) {
> > >>>>     BitUtil::SetBit(out_is_valid_, out_offset_ + out_position_);
> > >>>>     out_data_[out_position_++] = values_data_[in_position];
> > >>>> }
> > >>>> ++in_position;
> > >>>>
> > >>>> and change it to a branch-free version
> > >>>>
> > >>>> bool advance = BitUtil::GetBit(filter_data_, filter_offset_ +
> > >>> in_position);
> > >>>> BitUtil::SetBitTo(out_is_valid_, out_offset_ + out_position_,
> > advance);
> > >>>> out_data_[out_position_] = values_data_[in_position];
> > >>>> out_position_ += advance; // may need static_cast<int> here
> > >>>> ++in_position;
> > >>>>
> > >>>> Since more people are working on kernels and computing now, I
> thought
> > >>>> this might be an interesting project for someone to explore and see
> > >>>> what improvements are possible (and what the differences between
> e.g.
> > >>>> x86 and ARM architecture are like when it comes to reducing
> > >>>> branching). Another thing to look at might be batch-at-a-time
> > >>>> bitpacking in the output bitmap versus bit-at-a-time.
> > >>>>
> > >>>
> > >>
> > >>
> > >> --
> > >> Niranda Perera
> > >> https://niranda.dev/
> > >> @n1r44 <https://twitter.com/N1R44>
> > >>
> > >
> > >
> > > --
> > >
> >
>
>
> --
>


-- 
Niranda Perera
https://niranda.dev/
@n1r44 <https://twitter.com/N1R44>

Reply via email to