As far as I can tell, we currently have no SIMD in any of our kernels. The
only source files compiled with e.g. AVX2 support are the ones related to
the hash table used for group-by and join (plus one called
`aggregate_basic_avx2.cc`, whose disassembly for MinMax is filled with
instructions that decidedly aren't AVX2; I can't really tell what it's
doing).
Compilers are already plenty smart about most of the scalar kernels we would
want to add; we just don't take advantage of that. And we don't really use
xsimd for anything currently except "bpacking", which, judging from Godbolt,
gets autovectorized just fine.
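As a concrete illustration (the function name is mine, not something from the Arrow tree), a kernel written as a plain element-at-a-time loop like the one below gets autovectorized by both GCC and Clang at -O3 with the appropriate -m flags, with no intrinsics or wrapper library involved:

```cpp
#include <cstddef>
#include <cstdint>

// A plain scalar kernel. Compiled with e.g. -O3 -mavx2, the compiler emits
// packed AVX2 adds for this loop on its own; the source stays three lines.
void VectorAdd(const int32_t* a, const int32_t* b, int32_t* out, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    out[i] = a[i] + b[i];
  }
}
```

The vectorized body, tail handling, and unrolling all come from the compiler, which is exactly the tedium xsimd-style code makes you write by hand.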

As for code that isn't autovectorizable, I'm going to have to hard-disagree
that an abstraction library is the best way. The whole point of hand-writing
SIMD code is to extract maximum performance from the hardware you're
targeting, and that can only happen if you design your algorithms for the
specific instruction set. Something like xsimd hides the instructions that
are actually being emitted. There may be a native instruction for an
operation on x86 but not on NEON, and vice versa; with xsimd the code would
appear to work well on both architectures, but you'd be shooting yourself in
the foot by emulating one architecture's SIMD with another's. In fact, I
don't see any NEON equivalent in xsimd for AVX's swizzle instructions, which
are heavily used in AVX algorithms. The point is: you design the algorithm
around the instruction set, not the other way around.
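To make the swizzle point concrete, here's a scalar sketch (my own illustration, not Arrow code) of the kind of byte table lookup those instructions implement:

```cpp
#include <cstddef>
#include <cstdint>

// Scalar sketch of a byte "swizzle": out[i] = table[idx[i] & 0x0F].
// On x86 this is roughly one pshufb (_mm_shuffle_epi8) per 16 bytes, while
// NEON has the vqtbl family instead -- with different lane widths and
// out-of-range semantics. A kernel tuned around one of these rarely maps
// cleanly onto the other, and a wrapper library hides that mismatch.
void Swizzle16(const uint8_t table[16], const uint8_t* idx, uint8_t* out,
               size_t n) {
  for (size_t i = 0; i < n; ++i) {
    out[i] = table[idx[i] & 0x0F];
  }
}
```

An algorithm that leans on the exact semantics of one ISA's shuffle (e.g. pshufb's per-128-bit-lane behavior) has to be redesigned, not merely recompiled, for the other.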

So in summary: as far as I can tell, we currently don't rely on
autovectorization anywhere, and we use xsimd in a spot where
autovectorization would do the same job without the added dependency.
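For concreteness, the per-ISA namespace scheme from my original proposal further down the thread could be sketched roughly like this (the macro name and function names are placeholders, not what we'd necessarily ship):

```cpp
#include <cstddef>
#include <cstdint>

// The same translation unit is compiled several times, each invocation with
// a different -DSIMD_NAMESPACE=... and matching -m... flag, e.g.:
//   g++ -O3 -mavx2     -DSIMD_NAMESPACE=AVX2   -c kernels.cc
//   g++ -O3 -mavx512vl -DSIMD_NAMESPACE=AVX512 -c kernels.cc
#ifndef SIMD_NAMESPACE
#define SIMD_NAMESPACE Scalar  // default when no -D flag is passed
#endif

namespace arrow {
namespace SIMD_NAMESPACE {

// One plain loop; each compilation produces a differently-vectorized copy
// under its own namespace (arrow::AVX2::VectorAdd, etc.).
void VectorAdd(const int32_t* a, const int32_t* b, int32_t* out, size_t n) {
  for (size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

}  // namespace SIMD_NAMESPACE
}  // namespace arrow
```

The function registry would then pick arrow::AVX2::VectorAdd, arrow::Scalar::VectorAdd, etc. at runtime based on the detected CPU.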

Sasha

On Tue, Mar 29, 2022 at 6:47 PM Yibo Cai <yibo....@arm.com> wrote:

> Hi Sasha,
>
> Thanks for the advice. I didn't quite catch the point. Would you explain a
> bit the purpose of this proposal?
>
> We do prefer compiler auto-vectorization to explicit SIMD code, even if
> the C++ code is slower than the SIMD one (20% slower is acceptable IMO).
> And we do support runtime dispatch of kernels based on the target machine
> arch.
>
> Then what is left to discuss is how to deal with code that is not
> auto-vectorizable but can be manually optimized with SIMD instructions.
> It looks like your proposal is to do nothing more than add the appropriate
> compiler flags and wait for compilers to become smarter in the future. I
> think this is a reasonable approach, probably in many cases. But if we do
> want to manually tune the code, I believe a SIMD library is the best way.
>
> To me there's no "replacing" between xsimd and auto-vectorization, they
> just do their own jobs.
>
> Yibo
>
> -----Original Message-----
> From: Sasha Krassovsky <krassovskysa...@gmail.com>
> Sent: Wednesday, March 30, 2022 6:58 AM
> To: dev@arrow.apache.org; emkornfi...@gmail.com
> Subject: Re: [C++] Replacing xsimd with compiler autovectorization
>
> xsimd has three problems I can think of right now:
> 1) xsimd code looks like normal SIMD code: you have to explicitly do loads
> and stores, you have to explicitly unroll and stride through your loop, and
> you have to explicitly process the tail of the loop. This makes writing a
> large number of kernels extremely tedious and error-prone. In comparison, a
> single three-line scalar for loop is easier to both read and write.
> 2) xsimd limits the freedom an optimizer has to select instructions and do
> other optimizations, as it's just a thin wrapper over normal intrinsics.
> One concrete example: if we wanted to take advantage of the dynamic
> instruction-set dispatch xsimd offers, the loop strides would no longer be
> compile-time constants, which might prevent the compiler from unrolling the
> loop (how would it know that the stride isn't just 1?).
> 3) Lastly, if we ever want to support a new architecture (like Power9 or
> RISC-V), we'd have to wait for an xsimd backend to become available. On the
> other hand, if SiFive came out with a hot new chip supporting RV64V, all
> we'd have to do to support it is add the appropriate compiler flag to the
> CMakeLists.
>
> As for using an external build system, I'm not sure how much complexity it
> would add, but at the very least I suspect it would work out of the box if
> you only wanted the scalar kernels. Otherwise, I don't think it would add
> much more complexity than we currently have for detecting architectures at
> build time.
>
> Sasha
>
> On Tue, Mar 29, 2022 at 3:26 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > Hi Sasha,
> > Could you elaborate on the problems of the XSIMD dependency?  What you
> > describe sounds a lot like what XSIMD provides in a prepackaged form
> > and without the extra CMake magic.
> >
> > I occasionally have to build Arrow with an external build system, and
> > it sounds like this type of logic could add complexity there.
> >
> > Thanks,
> > Micah
> >
> > On Tue, Mar 29, 2022 at 3:14 PM Sasha Krassovsky <
> > krassovskysa...@gmail.com>
> > wrote:
> >
> > > Hi everyone,
> > > I've noticed that we include xsimd as an abstraction over all of the
> > > SIMD architectures. I'd like to propose a different solution which
> > > would result in fewer lines of code while being more readable.
> > >
> > > My thinking is that anything simple enough to abstract with xsimd
> > > can be autovectorized by the compiler. Any more interesting SIMD
> > > algorithm is usually tailored to the target instruction set and
> > > can't be abstracted away with xsimd anyway.
> > >
> > > With that in mind, I'd like to propose the following strategy:
> > > 1. Write a single source file with simple, element-at-a-time for-loop
> > > implementations of each function.
> > > 2. Compile this same source file several times with different compile
> > > flags for different vectorization targets (e.g. on an x86 machine that
> > > supports AVX2 and AVX512, we'd compile once with -mavx2 and once with
> > > -mavx512vl).
> > > 3. Functions compiled with different instruction sets can be
> > > differentiated by a namespace, which gets defined during the compiler
> > > invocation. For example, for AVX2 we'd invoke the compiler with
> > > -DNAMESPACE=AVX2, and then for something like elementwise addition of
> > > two arrays, we'd call arrow::AVX2::VectorAdd.
> > >
> > > I believe this would let us remove xsimd as a dependency while also
> > > giving us lots of vectorized kernels at the cost of some extra CMake
> > > magic. After that, it would just be a matter of making the function
> > > registry point to these new functions.
> > >
> > > Please let me know your thoughts!
> > >
> > > Thanks,
> > > Sasha Krassovsky
> > >
> >
