> Is this in general, or specific to compiler flags being used?

In general. As far as I'm aware, autovectorization will only happen if you have some sort of SIMD instruction set enabled, and I don't see that anywhere among our flags except in those specific cases.
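An easy way to double-check what the compiler is actually doing is to ask it for a vectorization report. A minimal sketch, assuming GCC (the -fopt-info-vec* report flags; the file name here is just a placeholder):

g++ -c some_kernel.cpp -O3 -mavx2 -fopt-info-vec-optimized   # prints each loop that was vectorized
g++ -c some_kernel.cpp -O3 -fopt-info-vec-missed             # prints the loops that weren't, and why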
> If these aren't due to SIMD differences, I'm kind of curious what they are from.

My theory is that this is due to loop unrolling, but comparing it to the assembly generated by a simple for loop, I think there's a lot of performance left on the table. I can look more closely at the generated code for the scalar and AVX2 versions tomorrow if anyone would find that interesting (I do :) ).

> would this include dynamic dispatch for all kernels as well?

Yes, it would. As an example, let's take adding two arrays together. We would have a file like so (let's call it vector_add.cpp):

namespace ARCH {
void VectorAdd(size_t n, int64_t *a, int64_t *b, int64_t *out) {
  for (size_t i = 0; i < n; i++) out[i] = a[i] + b[i];
}
}

We would also have a header file vector_add.h:

void VectorAdd(size_t n, int64_t *a, int64_t *b, int64_t *out);

Then we would compile it several times like so, obviously depending on which architectures we'd want to enable (note the -c so each invocation emits an object file, and note that the scalar build also needs its namespace defined):

g++ -c vector_add.cpp -O3 -mavx2 -DARCH=AVX2 -o vector_add_avx2.o
g++ -c vector_add.cpp -O3 -mavx512vl -DARCH=AVX512 -o vector_add_avx512.o
g++ -c vector_add.cpp -O3 -mfpu=neon -DARCH=NEON -o vector_add_neon.o
g++ -c vector_add.cpp -O3 -DARCH=SCALAR -o vector_add_scalar.o

Now our global VectorAdd, which performs the dynamic dispatch, would check the SIMD level and call the appropriate function. Note that we have to forward declare each of these functions, which we can do like so (the #include has to sit on its own line inside each namespace):

namespace AVX2 {
#include "vector_add.h"
}
namespace AVX512 {
#include "vector_add.h"
}
namespace NEON {
#include "vector_add.h"
}
namespace SCALAR {
#include "vector_add.h"
}

void VectorAdd(SimdLevel simd_level, size_t n, int64_t *a, int64_t *b, int64_t *out) {
  switch (simd_level) {
    case SimdLevel::AVX2: return AVX2::VectorAdd(n, a, b, out);
    case SimdLevel::AVX512: return AVX512::VectorAdd(n, a, b, out);
    case SimdLevel::NEON: return NEON::VectorAdd(n, a, b, out);
    default: return SCALAR::VectorAdd(n, a, b, out);
  }
}

The idea is to have the same source file be compiled into several different compilation units. Obviously this can be adapted further to match the current infrastructure and to support more types (we definitely would template all of these simple functions on type). The idea could also be refactored to use fewer lines of code, such as by having each of the implementations sit in an array of function pointers indexed by SIMD level. These details can be left for later; rough sketches of both ideas follow at the end of this message.

> Last time we discussed this I believe we agreed we'd use and build up Arrow's dynamic dispatch instead of using XSIMD's.

If that's the case, then you'd have to do something like this anyway. The compiler will refuse to emit the instructions corresponding to the intrinsics that xsimd generates if you don't enable the proper compiler flags.

> However, this raises the issue about how fragile auto-vectorization can be at times (and also based on compilers).

This is a fair point. I guess in such a scenario xsimd could be useful. I wouldn't mind too much if it were used sparingly (i.e. in cases where we showed the compiler did a bad job). For the kernels I'm thinking of, such as element-wise addition, multiplication, comparison, etc., I can almost guarantee that any compiler worth its salt would autovectorize them given the chance.

> if Arrow builds become too difficult to adapt to other build-systems because of CMake magic, I don't think that is a good outcome.

I think "CMake magic" might've been the wrong choice of words; it definitely sounds much scarier than what I'm actually proposing 😅 It's just invoking the compiler several times on the same source file with different flags.
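For concreteness, here is roughly what that could look like in CMake. This is only a sketch with made-up target names, not something lifted from Arrow's actual build files:

# Build the same kernel source once per instruction set.
add_library(vector_add_scalar OBJECT vector_add.cpp)
target_compile_definitions(vector_add_scalar PRIVATE ARCH=SCALAR)

add_library(vector_add_avx2 OBJECT vector_add.cpp)
target_compile_options(vector_add_avx2 PRIVATE -mavx2)
target_compile_definitions(vector_add_avx2 PRIVATE ARCH=AVX2)

add_library(vector_add_avx512 OBJECT vector_add.cpp)
target_compile_options(vector_add_avx512 PRIVATE -mavx512vl)
target_compile_definitions(vector_add_avx512 PRIVATE ARCH=AVX512)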
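To make the templating remark concrete, the kernel in vector_add.cpp could be written once over the element type. A rough sketch (none of this is existing Arrow code):

#include <cstddef>
#include <cstdint>

namespace ARCH {

// One generic element-at-a-time kernel; each per-ISA compilation of this
// file lets the compiler autovectorize it with whatever instruction set
// that translation unit was built for.
template <typename T>
void VectorAdd(size_t n, const T *a, const T *b, T *out) {
  for (size_t i = 0; i < n; i++) out[i] = a[i] + b[i];
}

// Explicit instantiations so each per-ISA object file actually contains
// the symbols for the types we care about.
template void VectorAdd<int64_t>(size_t, const int64_t *, const int64_t *, int64_t *);
template void VectorAdd<double>(size_t, const double *, const double *, double *);

}  // namespace ARCH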
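And the function-pointer-table refactor could look something like this. Again just a sketch: the SimdLevel enum here is made up for illustration, not an existing Arrow type:

#include <cstddef>
#include <cstdint>

// Per-ISA kernels, forward declared exactly as before.
namespace SCALAR { void VectorAdd(size_t n, int64_t *a, int64_t *b, int64_t *out); }
namespace AVX2 { void VectorAdd(size_t n, int64_t *a, int64_t *b, int64_t *out); }
namespace AVX512 { void VectorAdd(size_t n, int64_t *a, int64_t *b, int64_t *out); }
namespace NEON { void VectorAdd(size_t n, int64_t *a, int64_t *b, int64_t *out); }

// Hypothetical SIMD levels; the enumerator values double as table indices.
enum class SimdLevel : int { SCALAR = 0, AVX2, AVX512, NEON };

using VectorAddFn = void (*)(size_t, int64_t *, int64_t *, int64_t *);

// One entry per SIMD level, in SimdLevel order.
constexpr VectorAddFn kVectorAddTable[] = {
    SCALAR::VectorAdd, AVX2::VectorAdd, AVX512::VectorAdd, NEON::VectorAdd};

void VectorAdd(SimdLevel level, size_t n, int64_t *a, int64_t *b, int64_t *out) {
  kVectorAddTable[static_cast<int>(level)](n, a, b, out);
}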
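As a tiny demonstration of the "compiler will refuse" point above: a one-function file using a raw AVX2 intrinsic fails to compile without -mavx2 (GCC words it as a target-specific option mismatch on the always_inline intrinsic; Clang says the function requires target feature 'avx2'), which is exactly the constraint xsimd's wrappers live under too:

#include <immintrin.h>

// Builds only when this translation unit is compiled with -mavx2;
// otherwise GCC/Clang reject the intrinsic call outright.
__m256i AddVec(__m256i a, __m256i b) { return _mm256_add_epi64(a, b); }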
Sasha

On Tue, Mar 29, 2022 at 9:04 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> > So in summary: As far as I can tell, we currently don't do any autovectorization, and we use xsimd in a spot where autovectorization would do the same job, but without the added dependency.
>
> Is this in general, or specific to compiler flags being used? As a point of reference, the original PR that did optimization for aggregates [1] only shows results when AVX2 is required to see performance speedups; it is possible there has been a regression since its introduction. If these aren't due to SIMD differences, I'm kind of curious what they are from. One of the challenges is that pre-released binaries need to support a wide variety of ISAs, so acceleration would require dynamic dispatch. I'm not sure I understand your proposal exactly, but would this include dynamic dispatch for all kernels as well? Maybe an example of your proposal for one kernel would help the discussion?
>
> > 2) xsimd limits the freedom an optimizer has to select instructions and do other optimizations, as it's just a thin wrapper over normal intrinsics. One concrete example would be if we wanted to take advantage of the dynamic dispatch instruction set xsimd offers: the loop strides would no longer be compile-time constants, which might prevent the compiler from performing loop unrolling (how would it know that the stride isn't just 1?)
>
> Last time we discussed this I believe we agreed we'd use and build up Arrow's dynamic dispatch instead of using XSIMD's.
>
> > The whole point of manually writing tuning code is to extract maximum performance from the hardware that you're targeting.
>
> I think there are different levels to this. For most optimal performance, yes, you need to target an ISA; it is also the least generalizable. I believe we've seen examples where using XSIMD has improved both Intel and Arm performance over whatever the compiler was generating. This could be due to using a pattern that was auto-vectorization friendly, and maybe there are better ones. However, this raises the issue of how fragile auto-vectorization can be at times (and also across compilers).
>
> For the record, I don't have a strong opinion on removing or keeping XSIMD, but if Arrow builds become too difficult to adapt to other build-systems because of CMake magic, I don't think that is a good outcome.
>
> Cheers,
> Micah
>
> [1] https://github.com/apache/arrow/pull/7871
>
> On Tue, Mar 29, 2022 at 8:05 PM Sasha Krassovsky <krassovskysa...@gmail.com> wrote:
>
> > As far as I can tell, we currently have no SIMD for any of our kernels. The only source files compiled with e.g. AVX2 support are ones having to do with the hash table used for group by and join (and one called `aggregate_basic_avx2.cc`, which, looking at the disassembly for MinMax, is filled with instructions that decidedly aren't AVX2. I can't really tell what it's doing).
> >
> > Compilers are plenty smart already for most of the scalar kernels we would want to add; we just don't make use of that. And we don't really use xsimd for anything currently except for "bpacking", which, looking at godbolt, gets autovectorized just fine.
> >
> > As for code that isn't autovectorizable, I'm going to have to hard disagree that an abstraction library is the best way.
> > The whole point of manually writing tuning code is to extract maximum performance from the hardware that you're targeting. This can only happen if you develop your algorithms for the specific instruction set. Using something like xsimd will hide the instructions that are actually being invoked. It may be that there is a native instruction for something on x86 but not on NEON, and vice versa. By using xsimd, it would appear to work well on both architectures, but you'd be shooting yourself in the foot by emulating one architecture's SIMD with another's. In fact, I don't see any equivalent in xsimd for AVX's swizzle instructions on NEON, and those instructions are heavily used by AVX algorithms. The point is: you design the algorithm around the instruction set, not the other way around.
> >
> > So in summary: as far as I can tell, we currently don't do any autovectorization, and we use xsimd in a spot where autovectorization would do the same job, but without the added dependency.
> >
> > Sasha
> >
> > On Tue, Mar 29, 2022 at 6:47 PM Yibo Cai <yibo....@arm.com> wrote:
> >
> > > Hi Sasha,
> > >
> > > Thanks for the advice. I didn't quite catch the point. Would you explain a bit the purpose of this proposal?
> > >
> > > We do prefer compiler auto-vectorization to explicit SIMD code, even if the C++ code is slower than the SIMD one (20% is acceptable IMO). And we do support runtime dispatch of kernels based on the target machine arch.
> > >
> > > Then what is left to discuss is how to deal with code that is not auto-vectorizable but can be manually optimized with SIMD instructions. It looks like your proposal is to do nothing more than add the appropriate compiler flags and wait for compilers to become smarter in the future. I think this is a reasonable approach, probably in many cases. But if we do want to manually tune the code, I believe a SIMD library is the best way.
> > >
> > > To me there's no "replacing" between xsimd and auto-vectorization; they just do their own jobs.
> > >
> > > Yibo
> > >
> > > -----Original Message-----
> > > From: Sasha Krassovsky <krassovskysa...@gmail.com>
> > > Sent: Wednesday, March 30, 2022 6:58 AM
> > > To: dev@arrow.apache.org; emkornfi...@gmail.com
> > > Subject: Re: [C++] Replacing xsimd with compiler autovectorization
> > >
> > > xsimd has three problems I can think of right now:
> > >
> > > 1) xsimd code looks like normal SIMD code: you have to explicitly do loads and stores, you have to explicitly unroll and stride through your loop, and you have to explicitly process the tail of the loop. This makes writing a large number of kernels extremely tedious and error-prone. In comparison, writing a single three-line scalar for loop is easier to both read and write.
> > >
> > > 2) xsimd limits the freedom an optimizer has to select instructions and do other optimizations, as it's just a thin wrapper over normal intrinsics. One concrete example: if we wanted to take advantage of the dynamic dispatch instruction set xsimd offers, the loop strides would no longer be compile-time constants, which might prevent the compiler from performing loop unrolling (how would it know that the stride isn't just 1?)
> > >
> > > 3) Lastly, if we ever want to support a new architecture (like Power9 or RISC-V), we'd have to wait for an xsimd backend to become available.
> > > On the other hand, if SiFive came out with a hot new chip supporting RV64V, all we'd have to do to support it is add the appropriate compiler flag to the CMakeLists.
> > >
> > > As for using an external build system, I'm not sure how much complexity it would add, but at the very least I suspect it would work out of the box if you only wanted to support scalar kernels. Otherwise I don't think it would add much more complexity than we currently have detecting architectures at build time.
> > >
> > > Sasha
> > >
> > > On Tue, Mar 29, 2022 at 3:26 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > >
> > > > Hi Sasha,
> > > >
> > > > Could you elaborate on the problems of the XSIMD dependency? What you describe sounds a lot like what XSIMD provides in a prepackaged form, and without the extra CMake magic.
> > > >
> > > > I have to occasionally build Arrow with an external build system, and it sounds like this type of logic could add complexity there.
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > > On Tue, Mar 29, 2022 at 3:14 PM Sasha Krassovsky <krassovskysa...@gmail.com> wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > I've noticed that we include xsimd as an abstraction over all of the SIMD architectures. I'd like to propose a different solution which would result in fewer lines of code while being more readable.
> > > > >
> > > > > My thinking is that anything simple enough to abstract with xsimd can be autovectorized by the compiler. Any more interesting SIMD algorithm is usually tailored to the target instruction set and can't be abstracted away with xsimd anyway.
> > > > >
> > > > > With that in mind, I'd like to propose the following strategy:
> > > > >
> > > > > 1. Write a single source file with simple, element-at-a-time for loop implementations of each function.
> > > > >
> > > > > 2. Compile this same source file several times with different compile flags for different vectorization (e.g. if we're on an x86 machine that supports AVX2 and AVX512, we'd compile once with -mavx2 and once with -mavx512vl).
> > > > >
> > > > > 3. Functions compiled with different instruction sets can be differentiated by a namespace, which gets defined during the compiler invocation. For example, for AVX2 we'd invoke the compiler with -DNAMESPACE=AVX2, and then for something like elementwise addition of two arrays, we'd call arrow::AVX2::VectorAdd.
> > > > >
> > > > > I believe this would let us remove xsimd as a dependency while also giving us lots of vectorized kernels, at the cost of some extra CMake magic. After that, it would just be a matter of making the function registry point to these new functions.
> > > > >
> > > > > Please let me know your thoughts!
> > > > >
> > > > > Thanks,
> > > > > Sasha Krassovsky