You are welcome to provide any data showing that the current implementation
(intrinsics, AVX512) is not the most efficient, and you are free to open a
Pull Request suggesting a better one.


The op/avx component has pretty much nothing to do with scalability:
only one node is required to measure its performance, and the
test/datatype/reduce_local test can be used for that measurement.

/* several core counts should be used in order to fully evaluate the infamous AVX512 frequency downscaling */
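
As a rough illustration of what a single-node measurement looks like, here is a minimal sketch. It is not the test/datatype/reduce_local program: sum_float and time_sum_float are hypothetical names, the kernel is a plain-C stand-in for a local MPI_SUM reduction on floats, and clock() reports CPU time for one process only, so running it on several cores at once is left to the caller.

```c
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical stand-in for a local MPI_SUM reduction: inout[i] += in[i]. */
static void sum_float(const float *in, float *inout, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        inout[i] += in[i];
    }
}

/* Return the average CPU time in seconds of one reduction over `iters`
 * calls, or -1.0 on invalid arguments or allocation failure. */
double time_sum_float(size_t n, int iters)
{
    if (n == 0 || iters <= 0) {
        return -1.0;
    }

    float *in = malloc(n * sizeof *in);
    float *inout = malloc(n * sizeof *inout);
    if (in == NULL || inout == NULL) {
        free(in);
        free(inout);
        return -1.0;
    }
    for (size_t i = 0; i < n; i++) {
        in[i] = 1.0f;
        inout[i] = 0.0f;
    }

    clock_t start = clock();
    for (int it = 0; it < iters; it++) {
        sum_float(in, inout, n);
    }
    clock_t stop = clock();

    /* Read one result so the compiler cannot discard the timed loop. */
    volatile float sink = inout[n - 1];
    (void)sink;

    free(in);
    free(inout);
    return (double)(stop - start) / CLOCKS_PER_SEC / iters;
}
```

Building the same file with different compiler flags, or swapping sum_float for another kernel, gives numbers that can be compared for a range of message sizes.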

The benefits of op/avx (including AVX512) have been reported, for example at https://github.com/open-mpi/ompi/issues/8334#issuecomment-759864154


FWIW, George added SVE support in https://github.com/bosilca/ompi/pull/14,
and I added support for NEON and SVE in https://github.com/ggouaillardet/ompi/tree/topic/op_arm

None of these have been merged, but you are free to evaluate them and report the performance numbers.



On 7/20/2021 11:00 PM, Dave Love via users wrote:
Gilles Gouaillardet via users <users@lists.open-mpi.org> writes:

> One motivation is packaging: a single Open MPI implementation has to be
> built, that can run on older x86 processors (supporting only SSE) and the
> latest ones (supporting AVX512).

I take dispatch on micro-architecture for granted, but it doesn't
require an assembler/intrinsics implementation.  See the level-1
routines in recent BLIS, for example (an instance where GCC was supposed
to fail).  That works for all relevant architectures, though I don't
think the aarch64 and ppc64le dispatch was ever included.  Presumably
it's less prone to errors than low-level code.

> The op/avx component will select at
> runtime the most efficient implementation for vectorized reductions.

It will select the micro-architecture with the most features, which may
or may not be the most efficient.  Is the avx512 version actually faster
than avx2?

Anyway, if this is important at scale, which I can't test, please at
least vectorize op_base_functions.c for aarch64 and ppc64le.  With GCC,
and probably other compilers -- at least clang, I think -- it doesn't
even need changes to cc flags.  With GCC and recent glibc, target clones
cover micro-arches with practically no effort.  Otherwise you probably
need similar infrastructure to what's there now, but not to devote the
effort to using intrinsics as far as I can see.
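
A minimal sketch of the target-clones approach described above, assuming GCC 6 or later with glibc on x86-64. SUM_CLONES and sum_float are hypothetical names and are not taken from op_base_functions.c. Where the attribute is not known to be available, the macro expands to nothing and a single version is built. Whether the loop is actually vectorized depends on the optimization level (for example -O3), which is an assumption about the build flags.

```c
#include <stddef.h>

/* Enable function multi-versioning only where it is known to work
 * (GCC >= 6, glibc, x86-64); elsewhere build a single plain version. */
#if defined(__GNUC__) && !defined(__clang__) && (__GNUC__ >= 6) && \
    defined(__x86_64__) && defined(__gnu_linux__)
#define SUM_CLONES __attribute__((target_clones("default", "avx2", "avx512f")))
#else
#define SUM_CLONES
#endif

/* Hypothetical stand-in for an MPI_SUM kernel on floats: inout[i] += in[i].
 * With SUM_CLONES active, the compiler emits one clone per listed target
 * plus a resolver that picks one when the program is loaded. */
SUM_CLONES
void sum_float(const float *restrict in, float *restrict inout, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        inout[i] += in[i];
    }
}
```

The source stays plain C, so the same loop can be compiled unchanged for aarch64 or ppc64le. Note that the resolver picks the clone with the most features the CPU supports, not the one measured to be fastest.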
