One project I was interested in getting to but haven't had the time was introducing branch-free code into vector_selection.cc and reducing the use of if-statements to try to improve performance.
One way to do this is to take code that looks like this: if (BitUtil::GetBit(filter_data_, filter_offset_ + in_position)) { BitUtil::SetBit(out_is_valid_, out_offset_ + out_position_); out_data_[out_position_++] = values_data_[in_position]; } ++in_position; and change it to a branch-free version bool advance = BitUtil::GetBit(filter_data_, filter_offset_ + in_position); BitUtil::SetBitTo(out_is_valid_, out_offset_ + out_position_, advance); out_data_[out_position_] = values_data_[in_position]; out_position_ += advance; // may need static_cast<int> here ++in_position; Since more people are working on kernels and computing now, I thought this might be an interesting project for someone to explore and see what improvements are possible (and what the differences between e.g. x86 and ARM architecture are like when it comes to reducing branching). Another thing to look at might be batch-at-a-time bitpacking in the output bitmap versus bit-at-a-time.