I meant to ask a while ago about vectorized reductions after I saw a paper that I can't now find. I didn't understand what was behind it.
Can someone explain why the AVX implementations of the reduction operations now used on x86_64 need to be hand-coded? As far as I remember, the paper didn't justify the effort beyond alluding to a compiler being unable to vectorize reductions. I wonder which compiler(s); the recent ones I'm familiar with certainly can if you allow them to (or don't stop them -- icc, sigh).

I've been assured before that GCC can't, but that's probably down to compiling with the default strict FP semantics (i.e. without something like -ffast-math) and/or not restrict-qualifying the pointer arguments of the functions. So what's the problem with just using C and, if necessary, a tolerably recent GCC? Is there something else behind this?

Since only x86 is supported, I had a go on ppc64le, and with minimal effort saw GCC vectorizing more of the base implementation functions than are included in the AVX version. I saw much the same when targeting x86 micro-architectures. (I'd need convincing that AVX-512 is worth the frequency reduction.) It would doubtless be the same on, say, aarch64, but I only have the POWER.

Thanks for any info.
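For concreteness, here is a minimal sketch of the sort of plain C I have in mind. The function names and signatures are made up for illustration and are not taken from the actual implementation. The first loop is an element-wise combine of two buffers, the second a sum across a single array.

```c
#include <stddef.h>

/* Hypothetical element-wise combine: inout[i] = in[i] + inout[i].
   'restrict' promises the compiler that the two buffers do not overlap,
   so it does not have to assume or check for aliasing. */
void combine_sum_f32(const float *restrict in, float *restrict inout, size_t n)
{
    for (size_t i = 0; i < n; i++)
        inout[i] += in[i];
}

/* Hypothetical sum across one array. Vectorizing this loop changes the
   order of the floating-point additions, so the compiler must be told
   that reassociation is acceptable before it will do so. */
float total_f32(const float *x, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}
```

Building with something like `gcc -O3 -march=native -fopt-info-vec` should report which loops were vectorized. I'd expect the second loop to need -ffast-math (or equivalent) as well, since vectorizing it reorders the FP additions.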