> It would be a very significant contributor, as the inconsistency can manifest 
> under the form of up to 8-fold differences in performance (or perhaps more).

This is on a microbenchmark. In a real user workload, the kernel will account for 
maybe 20% of the runtime, so even if the kernel gets 10x faster, the workload as a 
whole only gets roughly 18% faster (ballpark figure, I haven't done the math 
rigorously). 
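
To make that back-of-the-envelope explicit (assuming the 20% kernel share above 
and a 10x kernel speedup, i.e. plain Amdahl's law):

    new runtime = 0.8 + 0.2 / 10 = 0.82 of the original

so about 18% of the total time goes away, which is roughly a 1.2x end-to-end 
speedup.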

> There is concrete proof that autovectorization produces very flimsy results 
> (even on the same compiler, simply by varying the datatypes).

There is concrete proof of flimsy results for large template monsters hidden 
behind layers of indirection across several source files. As I've shown, the 
Vector-Vector Add kernel is vectorized consistently well across compilers when 
it is written in a simple way (see the sketch below). Until I see a simple 
for-loop scalar kernel that vectorizes poorly, I consider these arguments 
theoretical as well. 
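
For reference, this is roughly the shape of loop I mean; the name and signature 
are only illustrative, not Arrow's actual kernel API. GCC vectorizes it at -O3 
and Clang already does at -O2, given a target instruction set that allows it:

    #include <cstdint>

    // Illustrative only: a "simple" scalar add kernel, no templates or
    // layers of indirection. Compilers auto-vectorize this loop.
    void AddDoubles(const double* a, const double* b, double* out,
                    int64_t n) {
      for (int64_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
      }
    }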

> There is a far cry however, between the proposal of leveraging 
> autovectorization as a first step towards better performance

Yes, I amended my proposal earlier in this thread: leaving xsimd in place and 
using it for the kernels that don't autovectorize well would work (roughly along 
the lines sketched below). 
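
For the sake of discussion, a hand-written version of the same add kernel could 
look roughly like this. This is a minimal sketch assuming xsimd's batch API as of 
version 8 (the exact spelling differs between xsimd releases), and the names are 
mine, not Arrow's:

    #include <cstddef>
    #include "xsimd/xsimd.hpp"

    // Sketch of an explicitly vectorized add kernel using xsimd batches,
    // with a scalar loop for the tail elements.
    void AddDoublesXsimd(const double* a, const double* b, double* out,
                         std::size_t n) {
      using batch = xsimd::batch<double>;
      constexpr std::size_t width = batch::size;
      std::size_t i = 0;
      for (; i + width <= n; i += width) {
        batch va = batch::load_unaligned(a + i);
        batch vb = batch::load_unaligned(b + i);
        (va + vb).store_unaligned(out + i);
      }
      for (; i < n; ++i) {
        out[i] = a[i] + b[i];
      }
    }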

It seems we're in agreement at least on the concrete action for an initial PR: 
make the kernels system more SIMD-amenable and set up compiling the kernel source 
files several times, once per instruction set, so that those instruction sets are 
at least enabled (a rough sketch of what I mean follows below). After that, we can 
evaluate which kernels are worth rewriting in terms of xsimd. Does that sound 
right?
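
To make the "compile several times, dispatch at runtime" idea concrete, something 
along these lines; the names and build setup are purely hypothetical, not what 
Arrow does today, and __builtin_cpu_supports is a GCC/Clang builtin (MSVC would 
need its own CPUID check):

    // add_kernel.cc is compiled once per instruction set (e.g. once with
    // -mavx2, once with -mavx512f, once with baseline flags), each pass
    // producing a differently suffixed symbol: AddDoubles_avx2,
    // AddDoubles_avx512, AddDoubles_scalar. Only the dispatcher, built
    // once, is shown here:

    #include <cstdint>

    void AddDoubles_scalar(const double*, const double*, double*, int64_t);
    void AddDoubles_avx2(const double*, const double*, double*, int64_t);
    void AddDoubles_avx512(const double*, const double*, double*, int64_t);

    using AddDoublesFn = void (*)(const double*, const double*, double*,
                                  int64_t);

    AddDoublesFn SelectAddDoubles() {
      // Pick the widest variant the CPU we are running on supports.
      if (__builtin_cpu_supports("avx512f")) return AddDoubles_avx512;
      if (__builtin_cpu_supports("avx2")) return AddDoubles_avx2;
      return AddDoubles_scalar;
    }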

Sasha


