You are welcome to provide any data showing that the current implementation (intrinsics, AVX512) is not the most efficient, and you are free to issue a Pull Request to suggest a better one.
The op/avx component has pretty much nothing to do with scalability: only one node is required
Gilles Gouaillardet via users writes:
> One motivation is packaging: a single Open MPI implementation has to be
> built, that can run on older x86 processors (supporting only SSE) and the
> latest ones (supporting AVX512).
I take dispatch on micro-architecture for granted, but it doesn't require
One motivation is packaging: a single Open MPI implementation has to be
built, that can run on older x86 processors (supporting only SSE) and the
latest ones (supporting AVX512). The op/avx component will select at
runtime the most efficient implementation for vectorized reductions.
On Mon, Jul 19
I meant to ask a while ago about vectorized reductions, after I saw a paper that I can't now find. I didn't understand what was behind it. Can someone explain why you need to hand-code the AVX implementations of the reduction operations now used on x86_64? As far as I remember, the paper didn't j