Hi everyone,

I've noticed that we include xsimd as an abstraction over all of the SIMD architectures. I'd like to propose a different approach that would result in fewer lines of code while being more readable.
My thinking is that anything simple enough to abstract with xsimd can be autovectorized by the compiler, and any more interesting SIMD algorithm is usually tailored to the target instruction set and can't be abstracted away with xsimd anyway. With that in mind, I'd like to propose the following strategy:

1. Write a single source file with simple, element-at-a-time for-loop implementations of each function.
2. Compile that same source file several times with different flags for different vectorization targets (e.g. on an x86 machine that supports AVX2 and AVX512, we'd compile once with -mavx2 and once with -mavx512vl).
3. Differentiate the functions compiled for different instruction sets by a namespace that gets defined during the compiler invocation. For example, for AVX2 we'd invoke the compiler with -DNAMESPACE=AVX2, and then for something like elementwise addition of two arrays we'd call arrow::AVX2::VectorAdd.

I believe this would let us remove xsimd as a dependency while also giving us lots of vectorized kernels, at the cost of some extra CMake magic. After that, it would just be a matter of making the function registry point to these new functions. I've put a rough sketch of what I mean in the P.S. below.

Please let me know your thoughts!

Thanks,
Sasha Krassovsky
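P.S. To make this concrete, here's a minimal sketch of what the single kernel source file might look like. The file name, the int32 VectorAdd signature, and the Scalar/AVX2/AVX512 namespace names are just placeholders I made up for illustration:

// vector_kernels.cc -- compiled once per target instruction set.
//
// Example invocations (one object file per target):
//   g++ -c -O3 -DNAMESPACE=Scalar             vector_kernels.cc -o kernels_scalar.o
//   g++ -c -O3 -DNAMESPACE=AVX2   -mavx2      vector_kernels.cc -o kernels_avx2.o
//   g++ -c -O3 -DNAMESPACE=AVX512 -mavx512vl  vector_kernels.cc -o kernels_avx512.o
#include <cstdint>

namespace arrow {
namespace NAMESPACE {  // expands to Scalar, AVX2, AVX512, ... per invocation

// Plain element-at-a-time loop; the compiler autovectorizes it using
// whatever -m flags were passed for this particular compilation.
void VectorAdd(const int32_t* a, const int32_t* b, int32_t* out, int64_t n) {
  for (int64_t i = 0; i < n; ++i) {
    out[i] = a[i] + b[i];
  }
}

}  // namespace NAMESPACE
}  // namespace arrow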
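Picking a variant at runtime could then look something like the following. This is a toy dispatcher, not Arrow's actual function registry API, and the cpu_has_* flags stand in for whatever CPUID-based feature detection we already do:

#include <cstdint>

namespace arrow {
namespace Scalar { void VectorAdd(const int32_t*, const int32_t*, int32_t*, int64_t); }
namespace AVX2   { void VectorAdd(const int32_t*, const int32_t*, int32_t*, int64_t); }
namespace AVX512 { void VectorAdd(const int32_t*, const int32_t*, int32_t*, int64_t); }
}  // namespace arrow

using VectorAddFn = void (*)(const int32_t*, const int32_t*, int32_t*, int64_t);

// Return the most specific variant the current CPU supports;
// the registry would then hold the chosen function pointer.
VectorAddFn ChooseVectorAdd(bool cpu_has_avx512, bool cpu_has_avx2) {
  if (cpu_has_avx512) return arrow::AVX512::VectorAdd;
  if (cpu_has_avx2) return arrow::AVX2::VectorAdd;
  return arrow::Scalar::VectorAdd;
}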