On Tue, May 12, 2020 at 11:12 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> > Since I develop on an AVX512-capable machine, if we have runtime
> > dispatching then it should be able to test all variants of a function
> > from a single executable / test run rather than having to produce
> > multiple builds and test them separately, right?
>
> Yes, but I think the same is true without runtime dispatching. We might
> have different mental models for runtime dispatching, so I'll put up a
> concrete example. If we want optimized code for "some_function" it would
> look like:
>
> #ifdef HAVE_AVX512
> void some_function_512() {
>   ...
> }
> #endif
>
> void some_function_base() {
>   ...
> }
>
> // static dispatching
> void some_function() {
> #ifdef HAVE_AVX512
>   some_function_512();
> #else
>   some_function_base();
> #endif
> }
>
> // dynamic dispatch
> void some_function() {
>   static void (*chosen_function)() =
>       Choose(cpu_info, &some_function_512, &some_function_base);
>   chosen_function();
> }
>
> In both cases, we need to have tests which call into some_function_512()
> and some_function_base(). It is possible with runtime dispatching we can
> write the tests as something like:
>
> for (CpuInfo info : all_supported_architectures) {
>   TEST(Choose(info, &some_function_512, &some_function_base));
> }
>
> But I think there is likely something equivalent that we could do with
> macro magic.
That's one way. Or it could have a default configuration set external to
the binary, similar to things like OMP_NUM_THREADS:

ARROW_RUNTIME_SIMD_LEVEL=none ctest -L unittest
ARROW_RUNTIME_SIMD_LEVEL=sse4.2 ctest -L unittest
ARROW_RUNTIME_SIMD_LEVEL=avx2 ctest -L unittest
ARROW_RUNTIME_SIMD_LEVEL=avx512 ctest -L unittest

Either way, it seems like a good idea to reduce the number of #ifdefs in
the codebase and reduce the need to recompile.

> Did you have something different in mind?
>
> Micah
>
> On Tue, May 12, 2020 at 8:31 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > On Tue, May 12, 2020 at 9:47 PM Yibo Cai <yibo....@arm.com> wrote:
> > >
> > > Thanks Wes, I'm glad to see this feature coming.
> > >
> > > From past discussions, the main concern is that a runtime dispatcher
> > > may cause performance issues. Personally, I don't think it's a big
> > > problem. If we're using SIMD, it must be targeting some
> > > time-consuming code.
> > >
> > > But we do need to take care of some issues. E.g., I see code like this:
> > >
> > > for (int i = 0; i < n; ++i) {
> > >   simd_code();
> > > }
> > >
> > > With a runtime dispatcher, this becomes an indirect function call in
> > > each iteration. We should change the code to move the loop inside
> > > simd_code().
> >
> > To be clear, I'm referring to SIMD-optimized code that operates on
> > batches of data. The overhead of choosing an implementation based on a
> > global settings object should not be meaningful. If there is
> > performance-sensitive code at inline call sites then I agree that it
> > is an issue. I don't think that characterizes most of the anticipated
> > work in Arrow, though, since functions generally will process a
> > chunk/array of data at a time (see, e.g., the recent Parquet
> > encoding/decoding work).
> >
> > > It would be better if you can consider architectures other than x86
> > > (at the framework level). Ignore it if it costs much effort. We can
> > > always improve later.
> > >
> > > Yibo
> > >
> > > On 5/13/20 9:46 AM, Wes McKinney wrote:
> > > > hi,
> > > >
> > > > We've started to receive a number of patches providing SIMD operations
> > > > for both x86 and ARM architectures. Most of these patches make use of
> > > > compiler definitions to toggle between code paths at compile time.
> > > >
> > > > This is problematic for a few reasons:
> > > >
> > > > * Binaries that are shipped (e.g. in Python) must generally be
> > > > compiled for a broad set of supported processors. That means that AVX2
> > > > / AVX512 optimizations won't be available in these builds for
> > > > processors that have them.
> > > > * It poses a maintainability and testing problem (it is hard to test
> > > > every combination, and it is not practical for local development to
> > > > compile every combination, which may cause drawn-out test/CI/fix
> > > > cycles).
> > > >
> > > > Other projects (e.g. NumPy) have taken the approach of building
> > > > binaries that contain multiple variants of a function with different
> > > > levels of SIMD, and then choosing at runtime which one to execute
> > > > based on what features the CPU supports. This seems like what we
> > > > ultimately need to do in Apache Arrow, and if we continue to accept
> > > > patches that do not do this, it will be much more work later when we
> > > > have to refactor things to runtime dispatching.
> > > >
> > > > We have some PRs in the queue related to SIMD. Without taking a
> > > > heavy-handed approach like starting to veto PRs, how would everyone
> > > > like to begin to address the runtime dispatching problem?
> > > >
> > > > Note that the Kernels revamp project I am working on right now will
> > > > also facilitate runtime SIMD kernel dispatching for array expression
> > > > evaluation.
> > > >
> > > > Thanks,
> > > > Wes