https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116128
anlauf at gcc dot gnu.org changed:

           What    |Removed |Added
----------------------------------------------------------------------------
           Priority|P3      |P5

--- Comment #2 from anlauf at gcc dot gnu.org ---
Can you provide testcases for discussion?

The library versions have to deal with different situations:

- non-unit strides.  Vectorization on many architectures only works for
  unit stride, or for those processors with support for gather/scatter.

- minval/maxval need to deal with NaNs etc. for proper IEEE support.

One could try different paths for better vectorization:

(1) add runtime library versions/code paths for unit stride
(2) generate inline code instead of calling the runtime library
(3) create avx2/... versions of the runtime library code (this was done
    for matmul so far)

Among these options, (2) is probably the hardest one.  Option (1) would
allow auto-vectorization by the compiler, while (3) looks like a natural
but manual solution for x86.

> makes me think that the optimisations of omp simd reduce(+) would be
> permitted.

omp simd is something that could be tried for 'sum', but getting full
performance needs a rewrite of the related library code.

> The same comment applies to dot_product, and probably the other intrinsic
> reduction operations.

dot_product is inline-expanded.
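To make the stride and NaN points concrete, here is a minimal sketch of what a MINVAL kernel has to handle; the name and signature are hypothetical, not the actual libgfortran entry points, and only positive strides are considered:

```c
#include <math.h>
#include <stddef.h>

/* Illustrative sketch (hypothetical, not libgfortran code): a strided
   MINVAL kernel with a unit-stride fast path.  Leading NaNs are skipped
   so the result matches Fortran MINVAL semantics under IEEE support;
   NaNs after the first non-NaN element lose every '<' comparison and are
   ignored naturally. */
static double minval_r8 (const double *base, ptrdiff_t stride, ptrdiff_t n)
{
  ptrdiff_t i = 0;

  /* Skip leading NaNs: MINVAL must not return NaN while a non-NaN
     element exists.  */
  while (i < n && isnan (base[i * stride]))
    i++;
  if (i == n)
    return NAN;                 /* all elements are NaN */

  double m = base[i * stride];

  if (stride == 1)
    {
      /* Unit-stride path: a plain loop the compiler can auto-vectorize.  */
      for (ptrdiff_t j = i + 1; j < n; j++)
        if (base[j] < m)
          m = base[j];
    }
  else
    {
      /* General strided path (needs gather support to vectorize).  */
      for (ptrdiff_t j = i + 1; j < n; j++)
        if (base[j * stride] < m)
          m = base[j * stride];
    }
  return m;
}
```

The runtime stride check is essentially option (1): the same entry point dispatches to a unit-stride loop that plain auto-vectorization can handle.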
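The omp simd idea for 'sum' could look roughly like this (a hypothetical unit-stride kernel, not the actual library code):

```c
#include <stddef.h>

/* Sketch of the 'omp simd' rewrite suggested above for SUM: the
   reduction clause licenses the reassociation that vectorizing a
   floating-point sum requires.  Compile with -fopenmp-simd (or -fopenmp)
   so the pragma takes effect; without it the pragma is ignored and the
   loop still computes the same sum, only scalar. */
static double sum_r8 (const double *a, size_t n)
{
  double s = 0.0;
#pragma omp simd reduction(+:s)
  for (size_t i = 0; i < n; i++)
    s += a[i];
  return s;
}
```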