Hi,

I'm investigating SIMD support for the C++ compute kernels (not Gandiva).

A typical case is the sum kernel [1]. The tight loop below can easily be 
optimized with SIMD:

for (int64_t i = 0; i < length; i++) {
  local.sum += values[i];
}

The compiler already does loop vectorization, but it happens at compile time, 
without knowledge of the target CPU. Binaries compiled with AVX-512 enabled 
cannot run on older CPUs, while binaries compiled with only SSE4 enabled are 
suboptimal on newer hardware.

I have some proposals and would like to hear comments from the community.

- Based on our experience with the ISA-L [2] project (an optimized storage 
acceleration library for x86 and Arm), a runtime dispatcher is a good approach. 
Basically, it links in code paths optimized for different CPU features (SSE4, 
AVX2, NEON, ...) and selects the one that best fits the target CPU at first 
invocation; a minimal sketch follows the list below. This is similar to gcc 
indirect functions [3], but doesn't depend on compiler support.

- Use gcc function multi-versioning (FMV) [4] to generate multiple binaries for 
one function. See the sample source and compiled code at [5]; a second sketch 
also follows the list below.
  Though it looks simple, it has significant limitations: it's a gcc-specific 
feature with no support from clang or msvc, and it only works on x86, with no 
Arm support.
  I think this approach is a no-go.

- Don't do it.
  Gandiva leverages LLVM JIT for runtime code optimization. Is it duplicated 
effort to do this in the C++ kernels? Will these vectorizable computations move 
to Gandiva in the future?
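
To make the runtime dispatcher concrete, here is a minimal sketch. The 
SumScalar/SumAvx2 names are mine, not from ISA-L or Arrow; in a real build the 
AVX2 implementation would live in its own translation unit compiled with 
-mavx2, and the probe below uses the gcc/clang builtin __builtin_cpu_supports, 
so msvc would need a __cpuid-based check instead.

#include <cstdint>

// Baseline implementation, compiled without extra -m flags.
int64_t SumScalar(const int64_t* values, int64_t length) {
  int64_t sum = 0;
  for (int64_t i = 0; i < length; i++) {
    sum += values[i];
  }
  return sum;
}

// Hypothetical AVX2 variant, defined in a separate translation unit
// that is compiled with -mavx2.
int64_t SumAvx2(const int64_t* values, int64_t length);

using SumFn = int64_t (*)(const int64_t*, int64_t);

// Probe the running CPU and pick the best implementation.
static SumFn ResolveSum() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  if (__builtin_cpu_supports("avx2")) return SumAvx2;
#endif
  return SumScalar;
}

// The dispatch cost is paid once, at first invocation; later calls go
// straight through the cached function pointer.
int64_t Sum(const int64_t* values, int64_t length) {
  static const SumFn fn = ResolveSum();
  return fn(values, length);
}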
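For comparison, the FMV approach from [4]/[5] boils down to a single attribute. 
A sketch (the function name is mine; this compiles only with gcc on x86):

#include <cstdint>

// gcc emits one clone per listed target plus a resolver (via the
// ifunc mechanism) that selects a clone at load time.
__attribute__((target_clones("avx2", "sse4.2", "default")))
int64_t SumFmv(const int64_t* values, int64_t length) {
  int64_t sum = 0;
  for (int64_t i = 0; i < length; i++) {
    sum += values[i];
  }
  return sum;
}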

[1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sum_internal.h#L104-L106
[2] https://github.com/intel/isa-l
[3] https://willnewton.name/2013/07/02/using-gnu-indirect-functions/
[4] https://lwn.net/Articles/691932/
[5] https://godbolt.org/z/ajpuq_
