Re: [OMPI users] vectorized reductions

Gilles Gouaillardet via users Tue, 20 Jul 2021 17:10:20 -0700

You are welcome to provide any data that evidences the current implementation (intrinsics, AVX512) is not the most efficient, and you are free to issue a Pull Request in order to suggest a better one.


The op/avx component has pretty much nothing to do with scalability: only one node is required to measure the performance, and the test/datatype/reduce_local test can be used as a measurement.

/* several core counts should be used in order to fully evaluate the infamous AVX512 frequency downscaling */

The benefits of op/avx (including AVX512) have been reported, for example at https://github.com/open-mpi/ompi/issues/8334#issuecomment-759864154



FWIW, George added SVE support in https://github.com/bosilca/ompi/pull/14, and I added support for NEON and SVE in https://github.com/ggouaillardet/ompi/tree/topic/op_arm

None of these have been merged, but you are free to evaluate them and report the performance numbers.




On 7/20/2021 11:00 PM, Dave Love via users wrote:

> Gilles Gouaillardet via users <users@lists.open-mpi.org> writes:
>
>> One motivation is packaging: a single Open MPI implementation has to be
>> built, that can run on older x86 processors (supporting only SSE) and the
>> latest ones (supporting AVX512).
>
> I take dispatch on micro-architecture for granted, but it doesn't
> require an assembler/intrinsics implementation.  See the level-1
> routines in recent BLIS, for example (an instance where GCC was supposed
> to fail).  That works for all relevant architectures, though I don't
> think the aarch64 and ppc64le dispatch was ever included.  Presumably
> it's less prone to errors than low-level code.
>
>> The op/avx component will select at
>> runtime the most efficient implementation for vectorized reductions.
>
> It will select the micro-architecture with the most features, which may
> or may not be the most efficient.  Is the avx512 version actually faster
> than avx2?
>
> Anyway, if this is important at scale, which I can't test, please at
> least vectorize op_base_functions.c for aarch64 and ppc64le.  With GCC,
> and probably other compilers -- at least clang, I think -- it doesn't
> even need changes to cc flags.  With GCC and recent glibc, target clones
> cover micro-arches with practically no effort.  Otherwise you probably
> need similar infrastructure to what's there now, but not to devote the
> effort to using intrinsics as far as I can see.
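
For readers unfamiliar with the "target clones" mentioned above, here is a minimal sketch of what that could look like for a plain C reduction loop. It is not code from Open MPI: sum_float and the SUM_CLONES macro are hypothetical names, and the loop only stands in for the kind of element-wise loop found in op_base_functions.c. It assumes GCC on x86-64 with glibc (needed for ifunc-based dispatch); on any other platform the macro expands to nothing and the function is compiled once as ordinary C. Whether the compiler actually vectorizes each clone depends on the optimization level in use.

```c
#include <stddef.h>
#include <stdio.h>   /* pulls in the libc feature macros, so __GLIBC__ is visible */

/*
 * Assumption: GCC + glibc on x86-64. GCC then emits one clone of the
 * function per listed target plus a resolver that picks a clone at load
 * time based on the CPU. Elsewhere the attribute is left out entirely.
 */
#if defined(__x86_64__) && defined(__GNUC__) && !defined(__clang__) && defined(__GLIBC__)
#define SUM_CLONES __attribute__((target_clones("avx512f", "avx2", "default")))
#else
#define SUM_CLONES
#endif

/*
 * Hypothetical stand-in for one reduction loop: inout[i] += in[i].
 * No intrinsics are used; any vectorization is left to the compiler.
 */
SUM_CLONES
void sum_float(const float *restrict in, float *restrict inout, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        inout[i] += in[i];
    }
}
```

Callers simply call sum_float(in, inout, n); the choice of clone is made by the resolver, not by the caller. Note that this picks a clone from the CPU's feature flags, so it does not by itself answer the question of whether the AVX512 clone is actually faster than the AVX2 one; that still has to be measured.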
