>
> So in summary: As far as I can tell, we currently don't do any
> autovectorization, and we use xsimd in a spot where autovectorization would
> do the same job, but without the added dependency.

Is this in general, or specific to the compiler flags being used?  As a
point of reference, the original PR that did optimization for aggregates
[1] only shows performance speedups when AVX2 is required, so it is
possible there has been a regression since its introduction.  If these
differences aren't due to SIMD, I'm kind of curious where they come from.
One of the challenges is that pre-built binaries need to support a wide
variety of ISAs, so acceleration would require dynamic dispatch.  I'm not
sure I understand your proposal exactly, but would it include dynamic
dispatch for all kernels as well?  Maybe an example of your proposal for
one kernel would help the discussion?
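
To make the question concrete, here is roughly the shape I'm picturing for
one kernel (purely illustrative; none of these names exist in Arrow today,
and the feature check is just a stand-in for whatever Arrow's existing CPU
detection provides):

#include <cstdint>

// Hypothetical sketch only.  In the proposal as I understand it, both
// definitions below would come from one source file compiled twice
// (baseline flags vs. -mavx2), not written out by hand like this.
namespace Scalar {
void VectorAdd(const double* a, const double* b, double* out, int64_t n) {
  for (int64_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}
}  // namespace Scalar

namespace AVX2 {
void VectorAdd(const double* a, const double* b, double* out, int64_t n) {
  for (int64_t i = 0; i < n; ++i) out[i] = a[i] + b[i];  // AVX2 codegen here
}
}  // namespace AVX2

// Stand-in for a real CPU feature check (CPUID / Arrow's CpuInfo).
bool CpuHasAvx2() {
#if defined(__GNUC__) && defined(__x86_64__)
  return __builtin_cpu_supports("avx2");
#else
  return false;
#endif
}

using VectorAddFn = void (*)(const double*, const double*, double*, int64_t);

// Resolved once per kernel, e.g. when the function registry is populated;
// every later call goes through the chosen pointer.
VectorAddFn ResolveVectorAdd() {
  return CpuHasAvx2() ? &AVX2::VectorAdd : &Scalar::VectorAdd;
}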

> 2) xsimd limits the freedom an optimizer has to select instructions and do
> other optimizations, as it's just a thin wrapper over normal intrinsics.
> One concrete example would be if we wanted to take advantage of the dynamic
> dispatch instruction set xsimd offers, the loop strides would no longer be
> compile-time constants, which might prevent the compiler from performing
> loop unrolling (how would it know that the stride isn't just 1?)
>
Last time we discussed this I believe we agreed we'd use and build up
Arrow's dynamic dispatch instead of using XSIMD's.

> The whole point of manually writing tuned code is to extract maximum
> performance from the hardware that you're targeting.

I think there are different levels to this. For maximum performance, yes,
you need to target a specific ISA, but that is also the least generalizable
approach.  I believe we've seen examples where using XSIMD improved both
Intel and Arm performance over whatever the compiler was generating. This
could be because that code happened to use a pattern that was
auto-vectorization friendly, and maybe there are better patterns.  However,
this raises the issue of how fragile auto-vectorization can be (and how
much it varies between compilers).
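
For what it's worth, my mental model of an auto-vectorization friendly
pattern, and of how easily it breaks, is roughly the following (just an
illustration, not code from the Arrow tree):

#include <cstdint>

// Friendly: contiguous accesses, no aliasing between inputs and output
// (__restrict is the GCC/Clang/MSVC no-alias hint), and no branches that
// depend on the data.  Compilers vectorize this shape reliably at -O3.
void AddColumns(const double* __restrict a, const double* __restrict b,
                double* __restrict out, int64_t n) {
  for (int64_t i = 0; i < n; ++i) {
    out[i] = a[i] + b[i];
  }
}

// Fragile: the same arithmetic, but the early exit makes the trip count
// depend on the data, which is the kind of small change that can quietly
// stop the vectorizer (and different compilers draw the line differently).
void AddUntilNegative(const double* a, const double* b, double* out,
                      int64_t n) {
  for (int64_t i = 0; i < n; ++i) {
    if (a[i] < 0) break;
    out[i] = a[i] + b[i];
  }
}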

For the record, I don't have a strong opinion on removing or keeping XSIMD,
but if Arrow builds become too difficult to adapt to other build-systems
because of CMake magic, I don't think that is a good outcome.

Cheers,
Micah

[1] https://github.com/apache/arrow/pull/7871

On Tue, Mar 29, 2022 at 8:05 PM Sasha Krassovsky <krassovskysa...@gmail.com>
wrote:

> As far as I can tell, we currently have no simd for any of our kernels.
> The only source files compiled with e.g. AVX2 support are ones having to
> do with the hash table used for group by and join (and one called
> `aggregate_basic_avx2.cc`, which, looking at the disassembly for MinMax,
> is filled with instructions that decidedly don't use AVX2. I can't really
> tell what it's doing).
> Compilers are plenty smart already for most of the scalar kernels we
> would want to add; we just don't take advantage of that. And we don't
> really use xsimd for anything currently except for "bpacking", which,
> looking at godbolt, gets autovectorized just fine.
>
> As for code that isn't autovectorizable, I'm going to have to hard disagree
> that an abstraction library is the best way. The whole point of manually
> writing tuned code is to extract maximum performance from the hardware
> that you're targeting. This can only happen if you develop your algorithms
> for the specific instruction set. Using something like xsimd will hide the
> instructions that are actually being invoked. It may be that there is a
> native instruction for something on x86 but not on NEON, and vice versa. By
> using xsimd, it would appear to work well on both architectures but you'd
> be shooting yourself in the foot by emulating one architecture's simd with
> another's. In fact, I don't see any equivalent in xsimd for AVX's swizzle
> instructions on NEON. These instructions are heavily used by AVX
> algorithms. The point is: you design the algorithm around the instruction
> set, not the other way around.
>
> So in summary: As far as I can tell, we currently don't do any
> autovectorization, and we use xsimd in a spot where autovectorization would
> do the same job, but without the added dependency.
>
> Sasha
>
> On Tue, Mar 29, 2022 at 6:47 PM Yibo Cai <yibo....@arm.com> wrote:
>
> > Hi Sasha,
> >
> > Thanks for the advice. I didn't quite catch the point. Would you explain
> > a bit the purpose of this proposal?
> >
> > We do prefer compiler auto-vectorization to explicit simd code, even if
> > the C++ code is slower than the simd one (20% is acceptable IMO). And we
> > do support runtime dispatch of kernels based on the target machine arch.
> >
> > Then what is left to discuss is how to deal with code that is not
> > auto-vectorizable but can be manually optimized with simd instructions.
> > It looks like your proposal is to do nothing more than add the
> > appropriate compiler flags and wait for compilers to become smarter in
> > the future. I think this is a reasonable approach, probably in many
> > cases. But if we do want to manually tune the code, I believe a simd
> > library is the best way.
> >
> > To me there's no "replacing" between xsimd and auto-vectorization; they
> > just do their own jobs.
> >
> > Yibo
> >
> > -----Original Message-----
> > From: Sasha Krassovsky <krassovskysa...@gmail.com>
> > Sent: Wednesday, March 30, 2022 6:58 AM
> > To: dev@arrow.apache.org; emkornfi...@gmail.com
> > Subject: Re: [C++] Replacing xsimd with compiler autovectorization
> >
> > xsimd has three problems I can think of right now:
> > 1) xsimd code looks like normal simd code: you have to explicitly do
> > loads and stores, you have to explicitly unroll and stride through your
> > loop, and you have to explicitly process the tail of the loop. This
> > makes writing a large number of kernels extremely tedious and
> > error-prone. In comparison, writing a single three-line scalar for loop
> > is easier to both read and write.
> > 2) xsimd limits the freedom an optimizer has to select instructions and
> > do other optimizations, as it's just a thin wrapper over normal
> > intrinsics. One concrete example would be if we wanted to take advantage
> > of the dynamic dispatch instruction set xsimd offers, the loop strides
> > would no longer be compile-time constants, which might prevent the
> > compiler from performing loop unrolling (how would it know that the
> > stride isn't just 1?)
> > 3) Lastly, if we ever want to support a new architecture (like Power9 or
> > RISC-V), we'd have to wait for an xsimd backend to become available. On
> > the other hand, if SiFive came out with a hot new chip supporting RV64V,
> > all we'd have to do to support it is to add the appropriate compiler
> > flag into the CMakeLists.
> >
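> > To make 1) and 2) concrete, a purely illustrative comparison in plain
> > C++ (no xsimd):
> >
> > #include <cstdint>
> >
> > // The scalar form: three lines, and the compiler is free to pick the
> > // vector width, unroll, and handle the tail however it likes.
> > void Add(const float* a, const float* b, float* out, int64_t n) {
> >   for (int64_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
> > }
> >
> > // The explicit-SIMD shape: the code owns the stride and the tail.
> > // Because kStride is a compile-time constant the inner loop can be
> > // fully unrolled; if the width were chosen by runtime dispatch it
> > // would not be a constant, which is the unrolling concern in 2).
> > constexpr int64_t kStride = 8;  // e.g. 8 floats per AVX2 register
> > void AddExplicit(const float* a, const float* b, float* out, int64_t n) {
> >   int64_t i = 0;
> >   for (; i + kStride <= n; i += kStride) {
> >     // stands in for explicit load / add / store intrinsics
> >     for (int64_t j = 0; j < kStride; ++j) out[i + j] = a[i + j] + b[i + j];
> >   }
> >   for (; i < n; ++i) out[i] = a[i] + b[i];  // tail loop
> > }
> >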
> > As for using an external build system, I'm not sure how much complexity
> > it would add, but at the very least I suspect it would work out of the
> > box if you only wanted to support scalar kernels. Otherwise I don't
> > think it would add much more complexity than we currently have detecting
> > architectures at buildtime.
> >
> > Sasha
> >
> > On Tue, Mar 29, 2022 at 3:26 PM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> > > Hi Sasha,
> > > Could you elaborate on the problems of the XSIMD dependency?  What you
> > > describe sounds a lot like what XSIMD provides in a prepackaged form
> > > and without the extra CMake magic.
> > >
> > > I have to occasionally build Arrow with an external build system and
> > > it sounds like this type of logic could add complexity there.
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Tue, Mar 29, 2022 at 3:14 PM Sasha Krassovsky <
> > > krassovskysa...@gmail.com>
> > > wrote:
> > >
> > > > Hi everyone,
> > > > I've noticed that we include xsimd as an abstraction over all of
> > > > the simd architectures. I'd like to propose a different solution
> > > > which would result in fewer lines of code, while being more readable.
> > > >
> > > > My thinking is that anything simple enough to abstract with xsimd
> > > > can be autovectorized by the compiler. Any more interesting SIMD
> > > > algorithm usually is tailored to the target instruction set and
> > > > can't be abstracted away with xsimd anyway.
> > > >
> > > > With that in mind, I'd like to propose the following strategy:
> > > > 1. Write a single source file with simple, element-at-a-time for
> > > > loop implementations of each function.
> > > > 2. Compile this same source file several times with different
> > > > compile flags for different vectorization (e.g. if we're on an x86
> > > > machine that supports AVX2 and AVX512, we'd compile once with -mavx2
> > > > and once with -mavx512vl).
> > > > 3. Functions compiled with different instruction sets can be
> > > > differentiated by a namespace, which gets defined during the
> > > > compiler invocation. For example, for AVX2 we'd invoke the compiler
> > > > with -DNAMESPACE=AVX2 and then for something like elementwise
> > > > addition of two arrays, we'd call arrow::AVX2::VectorAdd.
> > > >
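> > > > A rough sketch of the shared source file, purely illustrative (the
> > > > Scalar and AVX512 names are just for the example):
> > > >
> > > > // vector_add.cc: one scalar implementation, compiled once per
> > > > // enabled instruction set, e.g.
> > > > //   -O3 -DNAMESPACE=Scalar
> > > > //   -O3 -mavx2 -DNAMESPACE=AVX2
> > > > //   -O3 -mavx512vl -DNAMESPACE=AVX512
> > > > #include <cstdint>
> > > >
> > > > namespace arrow {
> > > > namespace NAMESPACE {  // expands to Scalar, AVX2, AVX512, ...
> > > >
> > > > // Plain element-at-a-time loop; each compilation autovectorizes it
> > > > // with its own flags, producing arrow::AVX2::VectorAdd and friends.
> > > > void VectorAdd(const double* a, const double* b, double* out,
> > > >                int64_t n) {
> > > >   for (int64_t i = 0; i < n; ++i) {
> > > >     out[i] = a[i] + b[i];
> > > >   }
> > > > }
> > > >
> > > > }  // namespace NAMESPACE
> > > > }  // namespace arrow
> > > >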
> > > > I believe this would let us remove xsimd as a dependency while also
> > > > giving us lots of vectorized kernels at the cost of some extra cmake
> > > > magic. After that, it would just be a matter of making the function
> > > > registry point to these new functions.
> > > >
> > > > Please let me know your thoughts!
> > > >
> > > > Thanks,
> > > > Sasha Krassovsky
> > > >
> > >
>
