On 03/04/2022 at 21:38, Sasha Krassovsky wrote:
>> There is concrete proof that autovectorization produces very flimsy
>> results (even on the same compiler, simply by varying the datatypes).
>
> As I’ve shown, the Vector-Vector Add kernel example is consistently
> vectorized well across compilers if written in a simple way.
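
For reference, the "simple way" presumably means a loop of this shape (a
minimal sketch with a hypothetical function name, not the actual Arrow
kernel):

#include <cstddef>
#include <cstdint>

// Branch-free, unit-stride element-wise add: the kind of loop compilers
// autovectorize most reliably.
void AddInt32(const int32_t* a, const int32_t* b, int32_t* out,
              std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = a[i] + b[i];
  }
}
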
Does it handle a validity bitmap efficiently? Does it handle an entire
range of datatypes? Does it handle both array and scalar inputs? If not,
how would you propose to handle all these? Chances are, you'll end up
rewriting another array of template abstractions.
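
To make that concrete, here is a rough sketch (hypothetical names, not
Arrow's actual kernel machinery) of one common branch-free way to handle a
validity bitmap: compute values for every slot, valid or not, and AND the
input bitmaps together:

#include <cstddef>
#include <cstdint>

// Values are computed unconditionally; whatever lands in null slots is
// never observed. The output validity bitmap is the bitwise AND of the
// input bitmaps (8 slots per byte), so the value loop stays branch-free.
void AddInt32WithValidity(const int32_t* a, const uint8_t* a_valid,
                          const int32_t* b, const uint8_t* b_valid,
                          int32_t* out, uint8_t* out_valid,
                          std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = a[i] + b[i];
  }
  for (std::size_t i = 0; i < (n + 7) / 8; ++i) {
    out_valid[i] = a_valid[i] & b_valid[i];
  }
}

Multiply this across every datatype and every array/scalar combination and
you get the template abstractions mentioned above.
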
> Until I’ve seen a poorly-vectorized scalar kernel written as a simple
> for loop, I consider these arguments theoretical as well.
This makes little sense. The Arrow C++ codebase is not "theoretical",
it's what you are presently working on.
> It seems that we’re in agreement at least on the concrete action for an
> initial PR: make the kernels system more SIMD-amenable and enable
> compiling source files several times, once per target instruction set.
> Next, we can evaluate which kernels are worth rewriting in terms of
> xsimd. Does that sound right?
Indeed you can have an initial stab at that.
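
For the xsimd part, such a kernel might look roughly like the sketch below
(assuming xsimd's batch API with the default architecture; the names are
illustrative, and compiling the same source once per instruction set would
instantiate it for each target):

#include <cstddef>
#include "xsimd/xsimd.hpp"

// Element-wise add using xsimd. The batch width is whatever the flags for
// this translation unit allow (SSE, AVX2, AVX-512, ...).
void AddFloat(const float* a, const float* b, float* out, std::size_t n) {
  using batch = xsimd::batch<float>;
  constexpr std::size_t lanes = batch::size;
  std::size_t i = 0;
  for (; i + lanes <= n; i += lanes) {
    batch va = batch::load_unaligned(a + i);
    batch vb = batch::load_unaligned(b + i);
    (va + vb).store_unaligned(out + i);
  }
  for (; i < n; ++i) {
    out[i] = a[i] + b[i];  // scalar tail
  }
}
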
Regards
Antoine.
> Sasha
>
>> On 3 Apr 2022, at 11:47, Antoine Pitrou <anto...@python.org> wrote:
>>
>> It would be a very significant contributor, as the inconsistency can
>> manifest as up to 8-fold differences in performance (or perhaps more).