I created a JIRA for this. I will do the changes in select kernels and report back with benchmark results https://issues.apache.org/jira/browse/ARROW-13170
On Thu, Jun 24, 2021 at 12:27 AM Yibo Cai <yibo....@arm.com> wrote: > Did a quick test. For random bitmaps and my trivial test code, the > branch-less code is 3.5x faster than branch one. > https://quick-bench.com/q/UD22IIdMgKO9HU1PsPezj05Kkro > > On 6/23/21 11:21 PM, Wes McKinney wrote: > > One project I was interested in getting to but haven't had the time > > was introducing branch-free code into vector_selection.cc and reducing > > the use of if-statements to try to improve performance. > > > > One way to do this is to take code that looks like this: > > > > if (BitUtil::GetBit(filter_data_, filter_offset_ + in_position)) { > > BitUtil::SetBit(out_is_valid_, out_offset_ + out_position_); > > out_data_[out_position_++] = values_data_[in_position]; > > } > > ++in_position; > > > > and change it to a branch-free version > > > > bool advance = BitUtil::GetBit(filter_data_, filter_offset_ + > in_position); > > BitUtil::SetBitTo(out_is_valid_, out_offset_ + out_position_, advance); > > out_data_[out_position_] = values_data_[in_position]; > > out_position_ += advance; // may need static_cast<int> here > > ++in_position; > > > > Since more people are working on kernels and computing now, I thought > > this might be an interesting project for someone to explore and see > > what improvements are possible (and what the differences between e.g. > > x86 and ARM architecture are like when it comes to reducing > > branching). Another thing to look at might be batch-at-a-time > > bitpacking in the output bitmap versus bit-at-a-time. > > > -- Niranda Perera https://niranda.dev/ @n1r44 <https://twitter.com/N1R44>