https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116128

anlauf at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P5

--- Comment #2 from anlauf at gcc dot gnu.org ---
Can you provide testcases for discussion?

The library versions have to deal with different situations:

- non-unit strides.  Vectorization on many architectures only works for
  unit stride, or on processors with gather/scatter support.

- minval/maxval need to deal with NaNs etc. for proper IEEE support.
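
To illustrate both points, here is a rough C sketch of a strided,
NaN-aware MAXVAL-style reduction, in the spirit of the library routines
(the name and details are made up for illustration, not the actual
libgfortran code).  The stride parameter and the isnan() test are exactly
what get in the way of straightforward auto-vectorization: only
stride == 1 gives contiguous loads, and the NaN check adds a
data-dependent branch.

```c
#include <math.h>

/* Hypothetical sketch: MAXVAL over n elements taken with a given
   stride, skipping leading NaNs the way an IEEE-minded runtime
   routine has to.  Not the real library code. */
static double maxval_strided(const double *a, long n, long stride)
{
    double m = -INFINITY;
    long i = 0;

    /* Skip leading NaNs; if everything is NaN, return NaN. */
    while (i < n && isnan(a[i * stride]))
        i++;
    if (i == n && n > 0)
        return a[0];

    for (; i < n; i++) {
        double v = a[i * stride];
        if (v > m)
            m = v;
    }
    return m;
}
```

Only the inner loop for the stride == 1 case has a chance of being
vectorized without gather support.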

One could try different paths for better vectorization:

(1) add runtime library versions/code paths for unit stride

(2) generate inline code instead of calling the runtime library

(3) create avx2/... versions of the runtime library code (this has been
    done for matmul so far).

Among these options, (2) is probably the hardest one.
Option (1) would allow auto-vectorization by the compiler,
while (3) looks like a natural but manual solution for x86.
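
Option (1) could look roughly like the following C sketch: a separate
unit-stride code path whose inner loop the compiler can auto-vectorize,
with the general strided version dispatching to it at runtime.  The
function names are invented for illustration; the real library routines
are more general.

```c
/* Fast path: contiguous data, a plain loop the compiler can
   auto-vectorize (for floating point, reassociation may still
   require something like -ffast-math or an omp simd pragma). */
static double sum_unit(const double *a, long n)
{
    double s = 0.0;
    for (long i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* General entry point: dispatch to the unit-stride path when
   possible, otherwise fall back to the strided loop. */
static double sum_strided(const double *a, long n, long stride)
{
    if (stride == 1)
        return sum_unit(a, n);
    double s = 0.0;
    for (long i = 0; i < n; i++)
        s += a[i * stride];
    return s;
}
```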

> makes me think that the optimisations of omp simd reduce(+) would be 
> permitted.

omp simd is something that could be tried for 'sum', but getting full
performance would need a rewrite of the related library code.
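
As a sketch of what that could look like (again illustrative, not the
actual library code): the reduction clause tells the compiler it may
reorder the partial sums, which licenses vectorization of the
floating-point reduction even without -ffast-math (when compiled with
-fopenmp-simd or -fopenmp; otherwise the pragma is simply ignored).

```c
#include <stddef.h>

/* Unit-stride SUM with an explicit simd reduction hint. */
double sum_simd(const double *a, size_t n)
{
    double s = 0.0;
#pragma omp simd reduction(+:s)
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}
```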

> The same comment applies to dot_product, and probably the other intrinsic
> reduction operations.

dot_product is inline-expanded.
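
That is, the front end expands dot_product into a simple loop roughly
like the following C equivalent (sketch), which the middle end can then
vectorize directly instead of going through the runtime library:

```c
/* Roughly what the inline expansion of dot_product amounts to. */
static double dot(const double *a, const double *b, long n)
{
    double s = 0.0;
    for (long i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}
```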
