> It would be a very significant contributor, as the inconsistency can
> manifest in the form of up to 8-fold differences in performance (or
> perhaps more).

This is on a micro-benchmark. For a user workload, the kernel will account for maybe 20% of the runtime, so even if the kernel gets 10x faster, the user workload will only be about 18% faster (or in that ballpark; I didn't do the math rigorously).

> There is concrete proof that autovectorization produces very flimsy
> results (even on the same compiler, simply by varying the datatypes).

There is concrete proof of flimsy results for large template monsters hidden behind layers of indirection across several source files. As I've shown, the Vector-Vector Add kernel example is consistently vectorized well across compilers when it is written in a simple way. Until I've seen a poorly vectorized scalar kernel written as a simple for loop, I consider these arguments theoretical as well.

> There is a far cry, however, between the proposal of leveraging
> autovectorization as a first step towards better performance

Yes, I did amend my proposal earlier in this thread, saying that leaving xsimd in and using it for kernels that don't autovectorize well would work.

It seems we're in agreement, at least in terms of concrete action for an initial PR: make the kernels system more SIMD-amenable, and enable compiling source files several times to at least enable the instruction sets. Next, we can evaluate which kernels are worth rewriting in terms of xsimd.

Does that sound right?

Sasha

> On 3 Apr 2022, at 11:47, Antoine Pitrou <anto...@python.org> wrote:
>
> It would be a very significant contributor, as the inconsistency can
> manifest in the form of up to 8-fold differences in performance (or
> perhaps more).
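P.S. The ~18% figure is just Amdahl's law with the numbers above (kernel is 20% of runtime, sped up 10x); the constants below are the illustrative ones from this email, not measurements:

```cpp
#include <cstdio>

int main() {
  // Amdahl's law: a fraction p of the runtime is sped up by factor s.
  const double p = 0.20;  // kernel share of the user workload's runtime
  const double s = 10.0;  // hypothetical kernel speedup
  // New runtime as a fraction of the old: untouched part + accelerated part.
  const double new_time = (1.0 - p) + p / s;  // 0.80 + 0.02 = 0.82
  std::printf("runtime drops to %.0f%% of the original (%.0f%% less time)\n",
              new_time * 100.0, (1.0 - new_time) * 100.0);
  // Overall speedup is 1 / 0.82, roughly 1.22x.
  return 0;
}
```

So the "18%" is the reduction in total runtime; expressed as a speedup it is about 1.22x, which is the same ballpark.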
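P.P.S. To make "written in a simple way" concrete, this is the shape of loop I have in mind; the function name and signature are illustrative, not Arrow's actual kernel API:

```cpp
#include <cstddef>
#include <cstdint>

// A simple, autovectorization-friendly element-wise add: contiguous
// buffers, a plain counted loop, no templates behind layers of
// indirection. __restrict (a common GCC/Clang/MSVC extension) tells
// the compiler the buffers don't alias, which helps vectorization.
void AddInt32(const std::int32_t* __restrict a,
              const std::int32_t* __restrict b,
              std::int32_t* __restrict out,
              std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = a[i] + b[i];
  }
}
```

Compilers reliably turn this kind of loop into SIMD code at -O2/-O3; it's the indirection-heavy variants where results get flimsy.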