Hi, I'm investigating SIMD support for the C++ compute kernels (not Gandiva).
A typical case is the sum kernel [1]. The tight loop below can easily be optimized with SIMD:

  for (int64_t i = 0; i < length; i++) {
    local.sum += values[i];
  }

The compiler already performs loop vectorization, but it happens at compile time, without knowledge of the target CPU. Binaries compiled with AVX-512 cannot run on older CPUs, while binaries compiled with only SSE4 enabled are suboptimal on new hardware.

I have some proposals and would like to hear comments from the community.

- Based on our experience with the ISA-L project [2] (an optimized storage acceleration library for x86 and Arm), a runtime dispatcher is a good approach. Basically, it links in code paths optimized for different CPU features (SSE4, AVX2, NEON, ...) and selects the one that best fits the target CPU at first invocation. This is similar to the gcc indirect function mechanism [3], but does not depend on compiler support. A minimal sketch follows after the links below.

- Use gcc FMV [4] to generate multiple binaries for one function; see the sample source and compiled code at [5], and the short example below. Though it looks simple, it has many limitations: it is a gcc-specific feature with no support from clang or msvc, and it only works on x86, with no Arm support. I think this approach is a no-go.

- Don't do it. Gandiva leverages LLVM JIT for runtime code optimization. Is it duplicated effort to do the same in the C++ kernels? Will these vectorizable computations move to Gandiva in the future?

[1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sum_internal.h#L104-L106
[2] https://github.com/intel/isa-l
[3] https://willnewton.name/2013/07/02/using-gnu-indirect-functions/
[4] https://lwn.net/Articles/691932/
[5] https://godbolt.org/z/ajpuq_
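To make the dispatcher idea (option 1) concrete, here is a minimal sketch, assuming gcc/clang on x86. All names (Sum, sum_avx2, sum_resolve, ...) are hypothetical, not existing Arrow code; a real implementation would need portable feature detection for msvc and Arm, and would cover more ISA levels.

  // Runtime dispatcher sketch: the first call probes CPU features and
  // rebinds a function pointer, so later calls go straight to the variant
  // selected for this machine. gcc/clang on x86 assumed; a production
  // version would update the pointer atomically.
  #include <cstdint>

  using SumImpl = int64_t (*)(const int64_t* values, int64_t length);

  // Baseline variant, built for the minimum ISA the binary targets.
  static int64_t sum_default(const int64_t* values, int64_t length) {
    int64_t sum = 0;
    for (int64_t i = 0; i < length; i++) sum += values[i];
    return sum;
  }

  // Same loop compiled with AVX2 enabled for this one function, so the
  // compiler auto-vectorizes it with 256-bit instructions. Alternatively,
  // each variant can live in its own file built with different -m flags.
  __attribute__((target("avx2")))
  static int64_t sum_avx2(const int64_t* values, int64_t length) {
    int64_t sum = 0;
    for (int64_t i = 0; i < length; i++) sum += values[i];
    return sum;
  }

  static int64_t sum_resolve(const int64_t* values, int64_t length);

  // The pointer starts at the resolver, so dispatch cost is paid once.
  static SumImpl sum_impl = sum_resolve;

  static int64_t sum_resolve(const int64_t* values, int64_t length) {
    __builtin_cpu_init();  // safe to call eagerly before feature queries
    sum_impl = __builtin_cpu_supports("avx2") ? sum_avx2 : sum_default;
    return sum_impl(values, length);
  }

  int64_t Sum(const int64_t* values, int64_t length) {
    return sum_impl(values, length);
  }

Unlike gcc ifunc, nothing here is compiler-magic: the function-pointer indirection is plain C++, which is why the same pattern works in ISA-L across toolchains.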
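For comparison, a minimal FMV example for option 2, in the spirit of [5]; the function name is again hypothetical:

  // gcc FMV via target_clones: the compiler emits one clone per listed ISA
  // plus a resolver that picks a clone at load time (ifunc mechanism).
  // gcc-specific and x86-only, which is why option 2 looks like a no-go.
  #include <cstdint>

  __attribute__((target_clones("avx512f", "avx2", "sse4.2", "default")))
  int64_t Sum(const int64_t* values, int64_t length) {
    int64_t sum = 0;
    for (int64_t i = 0; i < length; i++) sum += values[i];
    return sum;
  }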