Yes, best to have a dedicated AVX512 device. Great news that you are working on the machine 😊
Thanks,
Frank

-----Original Message-----
From: Wes McKinney <wesmck...@gmail.com>
Sent: Monday, September 7, 2020 12:41 AM
To: dev <dev@arrow.apache.org>
Subject: Re: [C++] Runtime SIMD dispatching for Arrow

I might be able to contribute an AVX-512 capable machine for testing / benchmarking via Buildkite or similar in the next 6 months. It seems like dedicated hardware would be the best approach to get consistency there. If someone else would be able to contribute a reliable machine, that would also be useful to know.

On Thu, Sep 3, 2020 at 10:29 PM Du, Frank <frank...@intel.com> wrote:
>
> Just want to give some updates on the dispatching.
>
> We now have working runtime functionality, including the dispatch mechanism [1][2] and a build framework, for both the compute kernels and other parts of the C++ code base. Some statically compiled SIMD code remains in the code base; I will work on converting it to the runtime path later.
>
> The last issue I see is CI. An environment variable, ARROW_RUNTIME_SIMD_LEVEL [3], can already be leveraged to run the tests at a chosen SIMD level, but we lack a CI machine that always supports AVX512. I ran some tests to check which CI machines have AVX512 capability and found that the four tasks below are indeed capable, but unluckily not 100% of the time: there is roughly a 70%~80% chance a job is scheduled to an AVX512 device.
>
> C++ / AMD64 Windows 2019 C++
> Python / AMD64 Conda Python 3.6 Pandas latest
> Python / AMD64 Conda Python 3.6 Pandas 0.23
> C++ / AMD64 Ubuntu 18.04 C++ ASAN UBSAN
>
> I plan to add SIMD test tasks with the AVX512/AVX2/SSE4_2/NONE levels on "C++ / AMD64 Ubuntu 18.04 C++ ASAN UBSAN" and "C++ / AMD64 Windows 2019 C++", though they are not always scheduled to a machine with AVX512. Any ideas or thoughts?
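[The ARROW_RUNTIME_SIMD_LEVEL override Frank describes can be sketched as an ordered cap on the dispatch level: the effective level is the minimum of what the hardware supports and what the environment requests. This is an illustrative sketch only; `SimdLevel` and `SimdLevelFromEnv` are hypothetical names, not Arrow's actual API.]

```cpp
#include <cstdlib>
#include <string>

// Hypothetical sketch: SIMD levels form a total order, so an environment
// override can only lower the level chosen from hardware detection.
enum class SimdLevel { NONE = 0, SSE4_2 = 1, AVX2 = 2, AVX512 = 3 };

SimdLevel SimdLevelFromEnv(SimdLevel hardware_level) {
  const char* env = std::getenv("ARROW_RUNTIME_SIMD_LEVEL");
  if (env == nullptr) return hardware_level;  // no override: use detected level
  std::string s(env);
  SimdLevel requested = SimdLevel::NONE;
  if (s == "avx512") {
    requested = SimdLevel::AVX512;
  } else if (s == "avx2") {
    requested = SimdLevel::AVX2;
  } else if (s == "sse4.2") {
    requested = SimdLevel::SSE4_2;
  }
  // Never exceed what the CPU actually supports.
  return requested < hardware_level ? requested : hardware_level;
}
```

[With a cap like this, a CI job scheduled onto an AVX512 machine can still exercise the AVX2, SSE4.2, and scalar paths from the same binary.]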
>
> [1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/dispatch.h
> [2] https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernel.h#L561
> [3] https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/cpu_info.cc#L451
>
> Thanks,
> Frank
>
> -----Original Message-----
> From: Wes McKinney <wesmck...@gmail.com>
> Sent: Wednesday, May 13, 2020 9:39 PM
> To: dev <dev@arrow.apache.org>; Micah Kornfield <emkornfi...@gmail.com>
> Subject: Re: [C++] Runtime SIMD dispatching for Arrow
>
> On Tue, May 12, 2020 at 11:12 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> > > Since I develop on an AVX512-capable machine, if we have runtime
> > > dispatching then it should be able to test all variants of a
> > > function from a single executable / test run rather than having to
> > > produce multiple builds and test them separately, right?
> >
> > Yes, but I think the same is true without runtime dispatching. We
> > might have different mental models for runtime dispatching, so I'll
> > put up a concrete example. If we want optimized code for "some_function",
> > it would look like:
> >
> > #ifdef HAVE_AVX512
> > void some_function_512() {
> >   ...
> > }
> > #endif
> >
> > void some_function_base() {
> >   ...
> > }
> >
> > // static dispatching
> > void some_function() {
> > #ifdef HAVE_AVX512
> >   some_function_512();
> > #else
> >   some_function_base();
> > #endif
> > }
> >
> > // dynamic dispatch
> > void some_function() {
> >   static void (*chosen_function)() =
> >       Choose(cpu_info, &some_function_512, &some_function_base);
> >   chosen_function();
> > }
> >
> > In both cases, we need to have tests which call into
> > some_function_512() and some_function_base().
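[Micah's dynamic-dispatch snippet, fleshed out into a self-contained sketch of the "choose once, call many times" pattern. `sum_base`/`sum_fast` and `ChooseSum` are illustrative stand-ins; in a real build the AVX512 variant would live in its own translation unit compiled with the appropriate `-mavx512f`-style flags.]

```cpp
#include <cstddef>
#include <cstdint>

namespace {

// Baseline scalar kernel.
int64_t sum_base(const int32_t* data, size_t n) {
  int64_t total = 0;
  for (size_t i = 0; i < n; ++i) total += data[i];
  return total;
}

// Placeholder for an AVX512 kernel: same signature, same result.
int64_t sum_fast(const int32_t* data, size_t n) { return sum_base(data, n); }

using SumFn = int64_t (*)(const int32_t*, size_t);

// Pick an implementation from runtime CPU features. __builtin_cpu_supports
// is a GCC/Clang builtin, hence the guard for other compilers/architectures.
SumFn ChooseSum() {
#if defined(__GNUC__) && defined(__x86_64__)
  if (__builtin_cpu_supports("avx512f")) return sum_fast;
#endif
  return sum_base;
}

}  // namespace

// The function pointer is resolved once (thread-safe static initialization),
// so the per-call cost is a single indirect call.
int64_t sum(const int32_t* data, size_t n) {
  static SumFn chosen = ChooseSum();
  return chosen(data, n);
}
```

[Both variants can still be unit-tested directly, independently of which one `ChooseSum()` would pick on the test machine.]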
It is possible that with runtime
> > dispatching we can write test code something like:
> >
> > for (CpuInfo info : all_supported_architectures) {
> >   TEST(Choose(info, &some_function_512, &some_function_base));
> > }
> >
> > But I think there is likely something equivalent that we could do
> > with macro magic.
>
> That's one way. Or it could have a default configuration set external
> to the binary, similar to things like OMP_NUM_THREADS:
>
> ARROW_RUNTIME_SIMD_LEVEL=none ctest -L unittest
> ARROW_RUNTIME_SIMD_LEVEL=sse4.2 ctest -L unittest
> ARROW_RUNTIME_SIMD_LEVEL=avx2 ctest -L unittest
> ARROW_RUNTIME_SIMD_LEVEL=avx512 ctest -L unittest
>
> Either way, it seems like a good idea to reduce the number of #ifdefs
> in the codebase and reduce the need to recompile.
>
> > Did you have something different in mind?
> >
> > Micah
> >
> > On Tue, May 12, 2020 at 8:31 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > On Tue, May 12, 2020 at 9:47 PM Yibo Cai <yibo....@arm.com> wrote:
> > > >
> > > > Thanks Wes, I'm glad to see this feature coming.
> > > >
> > > > From past discussions, the main concern is that a runtime dispatcher
> > > > may cause performance issues. Personally, I don't think it's a big
> > > > problem. If we're using SIMD, it must be targeting some time-consuming code.
> > > >
> > > > But we do need to take care of some issues. E.g., I see code like this:
> > > >
> > > > for (int i = 0; i < n; ++i) {
> > > >   simd_code();
> > > > }
> > > >
> > > > With a runtime dispatcher, this becomes an indirect function call in
> > > > each iteration. We should change the code to move the loop inside simd_code().
> > >
> > > To be clear, I'm referring to SIMD-optimized code that operates on
> > > batches of data. The overhead of choosing an implementation based
> > > on a global settings object should not be meaningful. If there is
> > > performance-sensitive code at inline call sites then I agree that
> > > it is an issue.
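[Yibo's loop-hoisting point can be shown concretely: make the dispatched function own the loop, so the indirect call is paid once per batch rather than once per element. Names here are illustrative, not Arrow code.]

```cpp
#include <cstddef>

using ScaleFn = void (*)(float*, size_t, float);

// The dispatched function processes the whole batch; a SIMD variant with
// the same signature could be swapped in by the dispatcher.
void scale_base(float* data, size_t n, float factor) {
  for (size_t i = 0; i < n; ++i) data[i] *= factor;
}

// Anti-pattern (per-element indirect call inside the caller's loop):
//   for (size_t i = 0; i < n; ++i) dispatched_scale_one(&data[i]);
//
// Preferred: one indirect call for the entire batch.
void scale(float* data, size_t n, float factor) {
  // In real code this would be something like Choose(cpu_info, ...);
  // here we hard-wire the baseline to keep the sketch self-contained.
  static ScaleFn chosen = scale_base;
  chosen(data, n, factor);
}
```

[This matches Wes's point below: when kernels operate on a chunk/array at a time, the dispatch overhead is amortized to the point of being negligible.]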
I don't think that characterizes most of the
> > > anticipated work in Arrow, though, since functions generally
> > > process a chunk/array of data at a time (see, e.g., the recent Parquet
> > > encoding/decoding work).
> > >
> > > > It would be better if you can consider architectures other than
> > > > x86 (at the framework level).
> > > > Ignore it if it costs much effort. We can always improve later.
> > > >
> > > > Yibo
> > > >
> > > > On 5/13/20 9:46 AM, Wes McKinney wrote:
> > > > > hi,
> > > > >
> > > > > We've started to receive a number of patches providing SIMD
> > > > > operations for both x86 and ARM architectures. Most of these
> > > > > patches use compiler definitions to toggle between code paths
> > > > > at compile time.
> > > > >
> > > > > This is problematic for a few reasons:
> > > > >
> > > > > * Binaries that are shipped (e.g. in Python) must generally be
> > > > > compiled for a broad set of supported processors. That means that
> > > > > AVX2 / AVX512 optimizations won't be available in these builds
> > > > > for processors that do support them.
> > > > > * It poses a maintainability and testing problem (it is hard to
> > > > > test every combination, and it is not practical for local
> > > > > development to compile every combination, which may cause
> > > > > drawn-out test/CI/fix cycles).
> > > > >
> > > > > Other projects (e.g. NumPy) have taken the approach of
> > > > > building binaries that contain multiple variants of a function
> > > > > with different levels of SIMD, and then choosing at runtime
> > > > > which one to execute based on what features the CPU supports.
> > > > > This seems like what we ultimately need to do in Apache Arrow,
> > > > > and if we continue to accept patches that do not do this, it
> > > > > will be much more work later when we have to refactor things to
> > > > > runtime dispatching.
> > > > >
> > > > > We have some PRs in the queue related to SIMD.
Without taking
> > > > > a heavy-handed approach like starting to veto PRs, how would
> > > > > everyone like to begin to address the runtime dispatching problem?
> > > > >
> > > > > Note that the kernels revamp project I am working on right now
> > > > > will also facilitate runtime SIMD kernel dispatching for array
> > > > > expression evaluation.
> > > > >
> > > > > Thanks,
> > > > > Wes