One motivation is packaging: a single Open MPI build has to run both on older x86 processors (supporting only SSE) and on the latest ones (supporting AVX512). The op/avx component selects at runtime the most efficient implementation of the vectorized reductions that the processor supports.
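To illustrate the idea of runtime selection, here is a minimal sketch. It is not the op/avx code: the names `sum_fn`, `sum_float_base`, `sum_float_avx` and `select_sum_float` are made up for this example, and it assumes GCC or Clang on x86, where `__builtin_cpu_supports` and the `target` function attribute are available. On any other compiler or architecture it simply falls back to the plain C loop.

```c
#include <stddef.h>

/* Hypothetical signature for an element-wise float sum: inout[i] += in[i]. */
typedef void (*sum_fn)(const float *in, float *inout, size_t n);

/* Baseline version: plain C, runs on any processor. */
static void sum_float_base(const float *in, float *inout, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        inout[i] += in[i];
    }
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>

/* AVX version: only this function is compiled with AVX enabled, so the
 * rest of the binary still runs on processors without AVX. */
__attribute__((target("avx")))
static void sum_float_avx(const float *in, float *inout, size_t n)
{
    size_t i = 0;
    /* Process 8 floats (256 bits) per iteration. */
    for (; i + 8 <= n; i += 8) {
        __m256 a = _mm256_loadu_ps(in + i);
        __m256 b = _mm256_loadu_ps(inout + i);
        _mm256_storeu_ps(inout + i, _mm256_add_ps(a, b));
    }
    /* Scalar loop for the remaining elements. */
    for (; i < n; i++) {
        inout[i] += in[i];
    }
}
#endif

/* Pick an implementation once, based on what the running CPU reports. */
static sum_fn select_sum_float(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx")) {
        return sum_float_avx;
    }
#endif
    return sum_float_base;
}
```

A caller would do something like `sum_fn f = select_sum_float(); f(in, inout, n);` once at startup and reuse `f` afterwards, so the same binary works on an SSE-only machine and on a newer one.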
On Mon, Jul 19, 2021 at 11:11 PM Dave Love via users <users@lists.open-mpi.org> wrote:

> I meant to ask a while ago about vectorized reductions after I saw a
> paper that I can't now find. I didn't understand what was behind it.
>
> Can someone explain why you need to hand-code the avx implementations of
> the reduction operations now used on x86_64? As far as I remember, the
> paper didn't justify the effort past alluding to a compiler being unable
> to vectorize reductions. I wonder which compiler(s); the recent ones
> I'm familiar with certainly can if you allow them (or don't stop them --
> icc, sigh). I've been assured before that GCC can't, but that's
> probably due to using the default correct FP compilation and/or not
> restricting function arguments. So I wonder what's the problem just
> using C and a tolerably recent GCC if necessary -- is there something
> else behind this?
>
> Since only x86 is supported, I had a go on ppc64le and with minimal
> effort saw GCC vectorizing more of the base implementation functions
> than are included in the avx version. Similarly for x86
> micro-architectures. (I'd need convincing that avx512 is worth the
> frequency reduction.) It would doubtless be the same on aarch64, say,
> but I only have the POWER.
>
> Thanks for any info.