Just want to give some updates on the dispatching.
We now have working runtime functionality, including the dispatch
mechanism[1][2] and the build framework, for both the compute kernels and other
parts of C++. Some statically compiled SIMD code remains in the code base; I
will try to convert it to the runtime path later.
The last issue I see is CI. There is an environment variable,
ARROW_RUNTIME_SIMD_LEVEL[3], that can already be used to run tests at a given
SIMD level, but we lack a CI machine that always supports AVX512. I ran some
ad hoc tests to check which CI machines have AVX512 capability and found that
the 4 tasks below do, but unfortunately not every time: there is roughly a
70%~80% chance that a task is scheduled on an AVX512 machine.
C++ / AMD64 Windows 2019 C++
Python / AMD64 Conda Python 3.6 Pandas latest
Python / AMD64 Conda Python 3.6 Pandas 0.23
C++ / AMD64 Ubuntu 18.04 C++ ASAN UBSAN
I plan to add SIMD test tasks at the AVX512/AVX2/SSE4_2/NONE levels to "C++ /
AMD64 Ubuntu 18.04 C++ ASAN UBSAN" and "C++ / AMD64 Windows 2019 C++", even
though they are not always scheduled on a machine with AVX512. Any ideas or
thoughts?
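To make the concern concrete, here is a minimal sketch of why a task that asks
for AVX512 may end up testing a lower level. It assumes the requested level
acts only as a cap on what the hardware reports. The names SimdLevel,
ParseLevel, EffectiveLevel and ExercisesRequestedLevel are hypothetical and are
not the actual code in cpu_info.cc.

```cpp
#include <algorithm>
#include <string>

// Hypothetical SIMD levels, ordered from least to most capable.
enum class SimdLevel { NONE = 0, SSE4_2 = 1, AVX2 = 2, AVX512 = 3 };

// Parse a level name (hypothetical spelling). Unknown names mean "no cap".
SimdLevel ParseLevel(const std::string& name) {
  if (name == "NONE") return SimdLevel::NONE;
  if (name == "SSE4_2") return SimdLevel::SSE4_2;
  if (name == "AVX2") return SimdLevel::AVX2;
  return SimdLevel::AVX512;
}

// Assumption: the requested level caps the hardware level, never raises it.
SimdLevel EffectiveLevel(SimdLevel hardware, SimdLevel requested) {
  return std::min(hardware, requested);
}

// True only when the task really runs at the level it asked for.
bool ExercisesRequestedLevel(SimdLevel hardware, SimdLevel requested) {
  return EffectiveLevel(hardware, requested) == requested;
}
```

Under that assumption, a test task could call something like
ExercisesRequestedLevel() at startup and report or skip when the machine it
landed on cannot run the requested level, so that a pass on a non-AVX512
machine is not mistaken for AVX512 coverage.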
[1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/dispatch.h
[2] https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernel.h#L561
[3] https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/cpu_info.cc#L451
Thanks,
Frank
-----Original Message-----
From: Wes McKinney <[email protected]>
Sent: Wednesday, May 13, 2020 9:39 PM
To: dev <[email protected]>; Micah Kornfield <[email protected]>
Subject: Re: [C++] Runtime SIMD dispatching for Arrow
On Tue, May 12, 2020 at 11:12 PM Micah Kornfield <[email protected]> wrote:
>
> >
> > Since I develop on an AVX512-capable machine, if we have runtime
> > dispatching then it should be able to test all variants of a
> > function from a single executable / test run rather than having to
> > produce multiple builds and test them separately, right?
>
> Yes, but I think the same is true without runtime dispatching. We
> might have different mental models for runtime dispatching so I'll put
> up a concrete example. If we want optimized code for "some_function"
> it would look like:
>
> #ifdef HAVE_AVX512
> void some_function_512() {
> ...
> }
> #endif
>
> void some_function_base() {
> ...
> }
>
> // static dispatching
> void some_function() {
> #ifdef HAVE_AVX512
> some_function_512();
> #else
> some_function_base();
> #endif
> }
>
> // dynamic dispatch
> void some_function() {
> static void (*chosen_function)() = Choose(cpu_info,
> &some_function_512, &some_function_base);
> chosen_function();
> }
>
> In both cases, we need to have a tests which call into
> some_function_512() and some_function_base(). It is possible with
> runtime dispatching we can write code in tests as something like:
>
> for (CpuInfo info : all_supported_architectures) {
> TEST(Choose(info, &some_function_512, &some_function_base));
> }
>
> But I think there is likely something equivalent that we could do
> with macro magic.
That's one way. Or it could have a default configuration set external to the
binary, similar to things like OMP_NUM_THREADS
ARROW_RUNTIME_SIMD_LEVEL=none ctest -L unittest
ARROW_RUNTIME_SIMD_LEVEL=sse4.2 ctest -L unittest
ARROW_RUNTIME_SIMD_LEVEL=avx2 ctest -L unittest
ARROW_RUNTIME_SIMD_LEVEL=avx512 ctest -L unittest
Either way it seems like a good idea to reduce the number of #ifdef's in the
codebase and reduce the need to recompile
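
As a rough illustration of how a level passed in from outside could select a
variant without any #ifdef at the call site, here is a minimal sketch. All
names (SimdLevel, SumFn, SumBase, SumAvx2, SumAvx512, ChooseSum, Sum) are
hypothetical, and the "SIMD" variants are plain stand-ins. In a real build each
variant would be compiled separately with the matching instruction-set flags.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical levels and kernel signature, for illustration only.
enum class SimdLevel { NONE, AVX2, AVX512 };
using SumFn = std::int64_t (*)(const std::int32_t* values, std::size_t n);

// Baseline variant: the loop over the batch lives inside the kernel.
std::int64_t SumBase(const std::int32_t* v, std::size_t n) {
  std::int64_t s = 0;
  for (std::size_t i = 0; i < n; ++i) s += v[i];
  return s;
}

// Stand-ins for the optimized variants (no real intrinsics here).
std::int64_t SumAvx2(const std::int32_t* v, std::size_t n) { return SumBase(v, n); }
std::int64_t SumAvx512(const std::int32_t* v, std::size_t n) { return SumBase(v, n); }

// Pick the variant registered for a level, falling back to the baseline.
SumFn ChooseSum(SimdLevel level) {
  switch (level) {
    case SimdLevel::AVX512: return &SumAvx512;
    case SimdLevel::AVX2: return &SumAvx2;
    default: return &SumBase;
  }
}

// One indirect call per batch, not one per element.
std::int64_t Sum(const std::int32_t* v, std::size_t n, SimdLevel level) {
  return ChooseSum(level)(v, n);
}
```

Because the level is an ordinary argument here, a test can loop over every
level and compare each variant against SumBase from a single binary, which is
the same effect as rerunning ctest with a different ARROW_RUNTIME_SIMD_LEVEL.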
> Did you have something different in mind?
>
> Micah
>
>
>
>
>
> On Tue, May 12, 2020 at 8:31 PM Wes McKinney <[email protected]> wrote:
>
> > On Tue, May 12, 2020 at 9:47 PM Yibo Cai <[email protected]> wrote:
> > >
> > > Thanks Wes, I'm glad to see this feature coming.
> > >
> > > From past discussions, the main concern is that a runtime dispatcher may
> > > cause performance issues.
> > > Personally, I don't think it's a big problem. If we're using SIMD, it
> > > must be targeting some time-consuming code.
> > >
> > > But we do need to take care of some issues. E.g., I see code like this:
> > > for (int i = 0; i < n; ++i) {
> > > simd_code();
> > > }
> > > With a runtime dispatcher, it becomes an indirect function call in each
> > > iteration.
> > > We should change the code to move the loop inside simd_code().
> >
> > To be clear, I'm referring to SIMD-optimized code that operates on
> > batches of data. The overhead of choosing an implementation based on
> > a global settings object should not be meaningful. If there is
> > performance-sensitive code at inline call sites then I agree that it
> > is an issue. I don't think that characterizes most of the
> > anticipated work in Arrow, though, since functions generally will
> > process a chunk/array of data at a time (see, e.g. Parquet
> > encoding/decoding work recently).
> >
> > > It would be better if you can consider architectures other than x86
> > > (at framework level).
> > > Ignore it if it costs much effort. We can always improve later.
> > >
> > > Yibo
> > >
> > > On 5/13/20 9:46 AM, Wes McKinney wrote:
> > > > hi,
> > > >
> > > > We've started to receive a number of patches providing SIMD
> > > > operations for both x86 and ARM architectures. Most of these
> > > > patches make use of compiler definitions to toggle between code paths
> > > > at compile time.
> > > >
> > > > This is problematic for a few reasons:
> > > >
> > > > * Binaries that are shipped (e.g. in Python) must generally be
> > > > compiled for a broad set of supported processors. That means that
> > > > AVX2 / AVX512 optimizations won't be available in these builds
> > > > for processors that have them
> > > > * Poses a maintainability and testing problem (hard to test
> > > > every combination, and it is not practical for local development
> > > > to compile every combination, which may cause drawn out
> > > > test/CI/fix cycles)
> > > >
> > > > Other projects (e.g. NumPy) have taken the approach of building
> > > > binaries that contain multiple variants of a function with
> > > > different levels of SIMD, and then choosing at runtime which one
> > > > to execute based on what features the CPU supports. This seems
> > > > like what we ultimately need to do in Apache Arrow, and if we
> > > > continue to accept patches that do not do this, it will be much
> > > > more work later when we have to refactor things to runtime dispatching.
> > > >
> > > > We have some PRs in the queue related to SIMD. Without taking a
> > > > heavy handed approach like starting to veto PRs, how would
> > > > everyone like to begin to address the runtime dispatching problem?
> > > >
> > > > Note that the Kernels revamp project I am working on right now
> > > > will also facilitate runtime SIMD kernel dispatching for array
> > > > expression evaluation.
> > > >
> > > > Thanks,
> > > > Wes
> > > >
> >