I would lean against adding another library dependency.  My main concerns
are:
1.  Supporting it across all of the build tool-chains (using a GCC specific
option would be my least favorite approach).
2.  Distributed binary size (for wheels at least people seem to care).

I would lean more towards yes if there were some real-world benchmarks
showing a substantial performance gain.

I don't think it is unreasonable to package our binaries targeting a common
instruction set (e.g. AVX 1 or 2).  For those who want to make full use of
their latest hardware, compiling from source doesn't seem unreasonable,
especially given the recent effort to trim dependencies.

Cheers,
Micah



On Fri, Dec 20, 2019 at 2:13 AM Antoine Pitrou <anto...@python.org> wrote:

>
> Hi,
>
> I would recommend against reinventing the wheel.  It would be possible
> to reuse an existing C++ SIMD library.  There are several of them (Vc,
> xsimd, libsimdpp...).  Of course, "just use Gandiva" is another possible
> answer.
>
> Regards
>
> Antoine.
>
>
> On 20/12/2019 at 08:32, Yibo Cai wrote:
> > Hi,
> >
> > I'm investigating SIMD support for the C++ compute kernels (not Gandiva).
> >
> > A typical case is the sum kernel[1]. The tight loop below can be easily
> > optimized with SIMD.
> >
> > for (int64_t i = 0; i < length; i++) {
> >    local.sum += values[i];
> > }
> >
> > The compiler already does loop vectorization, but it's done at compile time
> > without knowledge of the target CPU.
> > Binaries compiled with AVX-512 cannot run on older CPUs, while binaries
> > compiled with only SSE4 enabled are suboptimal on new hardware.
> >
> > I have some proposals and would like to hear comments from the community.
> >
> > - Based on our experience with the ISA-L[2] project (an optimized storage
> > acceleration library for x86 and Arm), a runtime dispatcher is a good
> > approach. Basically, it links in code optimized for different CPU
> > features (SSE4, AVX2, NEON, ...) and selects the best one that fits the
> > target CPU at first invocation. This is similar to the GCC indirect
> > function[3], but doesn't depend on the compiler.
> >
> > - Use GCC FMV [4] to generate multiple binaries for one function. See
> > sample source and compiled code [5].
> >    Though it looks simple, it has many limitations: it's a GCC-specific
> >    feature, with no support from clang or MSVC. It only works on x86,
> >    with no Arm support.
> >    I think this approach is a no-go.
> >
> > - Don't do it.
> >    Gandiva leverages LLVM JIT for runtime code optimization. Is it
> >    duplicated effort to do it in the C++ kernels? Will these vectorizable
> >    computations move to Gandiva in the future?
> >
> > [1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sum_internal.h#L104-L106
> > [2] https://github.com/intel/isa-l
> > [3] https://willnewton.name/2013/07/02/using-gnu-indirect-functions/
> > [4] https://lwn.net/Articles/691932/
> > [5] https://godbolt.org/z/ajpuq_
> >
>
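
For reference, the runtime-dispatch idea from the first proposal in the
original message can be sketched in C++ roughly as follows.  This is only a
minimal sketch under stated assumptions: every name here (SumFn, SumScalar,
SumWide, CpuHasAvx2, ResolveSum, Sum) is hypothetical and not part of Arrow
or ISA-L; SumWide is a plain C++ stand-in for a variant that would really be
compiled in its own translation unit with AVX2 enabled; and the feature check
uses the GCC/Clang builtin __builtin_cpu_supports only for brevity, whereas a
compiler-independent dispatcher would query CPUID or an OS API instead.

```cpp
#include <cassert>
#include <cstdint>

// All names below are hypothetical; this is not Arrow or ISA-L API.
using SumFn = int64_t (*)(const int64_t* values, int64_t length);

// Baseline variant: safe to run on every CPU.
int64_t SumScalar(const int64_t* values, int64_t length) {
  int64_t sum = 0;
  for (int64_t i = 0; i < length; i++) {
    sum += values[i];
  }
  return sum;
}

// Stand-in for a variant that would be built with AVX2 enabled (assumption:
// in a real build it lives in a separate file compiled with -mavx2).
// Four independent accumulators mimic the four 64-bit lanes of an AVX2 register.
int64_t SumWide(const int64_t* values, int64_t length) {
  int64_t acc[4] = {0, 0, 0, 0};
  int64_t i = 0;
  for (; i + 4 <= length; i += 4) {
    for (int lane = 0; lane < 4; lane++) {
      acc[lane] += values[i + lane];
    }
  }
  int64_t sum = acc[0] + acc[1] + acc[2] + acc[3];
  for (; i < length; i++) {  // leftover elements
    sum += values[i];
  }
  return sum;
}

// Feature probe. Falls back to "no" when the builtin is unavailable
// (non-GCC-compatible compilers, non-x86 targets).
bool CpuHasAvx2() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  return __builtin_cpu_supports("avx2") != 0;
#else
  return false;
#endif
}

// Pick the best variant the running CPU can execute.
SumFn ResolveSum() { return CpuHasAvx2() ? SumWide : SumScalar; }

// Public entry point: the variant is resolved once, at first invocation.
// Initialization of a function-local static is thread-safe since C++11.
int64_t Sum(const int64_t* values, int64_t length) {
  assert(length >= 0);
  static const SumFn impl = ResolveSum();
  return impl(values, length);
}
```

Callers only ever see Sum(); the choice of variant happens on the first call,
and afterwards each invocation costs one indirect call on top of the kernel.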
