Don't trust that the compiler will auto-vectorize; tiny changes can have drastic impacts.
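For context, the two patterns being compared can be sketched roughly like this (a minimal reconstruction from the benchmark names and the annotated source lines quoted below; the actual code in the fork may differ):

```cpp
#include <cstddef>
#include <cstdint>

struct SumState {
  int32_t total = 0;
};

// Accumulating through the state pointer: the compiler cannot prove that
// state->total does not alias the input, so it stores the running total
// back to memory on every iteration (the add/mov pairs in the annotation)
// and does not vectorize the loop.
void SumScalar(const int32_t* values, size_t length, SumState* state) {
  for (size_t i = 0; i < length; ++i) {
    state->total += *values++;
  }
}

// Accumulating in a local variable: the sum lives in a register, the
// aliasing concern disappears, and the compiler is free to auto-vectorize
// (the vpaddd/%zmm loop in the benchmark-native annotation).
void SumScalarRegister(const int32_t* values, size_t length, SumState* state) {
  int32_t sum = 0;
  for (size_t i = 0; i < length; ++i) {
    sum += *values++;
  }
  state->total += sum;
}
```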
SumScalar (benchmark-native):

        │      state->total += *values++;
     42 │      add    (%rcx),%esi
     68 │      mov    %esi,(%rsp)
      3 │      add    0x4(%rcx),%esi
     36 │      mov    %esi,(%rsp)
        │      add    0x8(%rcx),%esi
     86 │      mov    %esi,(%rsp)
        │      add    0xc(%rcx),%esi
     51 │      mov    %esi,(%rsp)
     27 │      add    0x10(%rcx),%esi
     05 │      mov    %esi,(%rsp)
     45 │      add    0x14(%rcx),%esi
     22 │      mov    %esi,(%rsp)
      2 │      add    0x18(%rcx),%esi
     68 │      mov    %esi,(%rsp)
      1 │      add    0x1c(%rcx),%esi
      1 │      add    $0x20,%rcx
     54 │      mov    %esi,(%rsp)

SumScalarRegister (benchmark-native):

        │      sum += *values++;
    228 │      vpaddd (%rcx,%rdi,4),%zmm0,%zmm0
    164 │      vpaddd 0x40(%rcx,%rdi,4),%zmm1,%zmm1
    163 │      vpaddd 0x80(%rcx,%rdi,4),%zmm2,%zmm2
    438 │      vpaddd 0xc0(%rcx,%rdi,4),%zmm3,%zmm3
    207 │      vpaddd 0x100(%rcx,%rdi,4),%zmm0,%zmm0
    146 │      vpaddd 0x140(%rcx,%rdi,4),%zmm1,%zmm1
    176 │      vpaddd 0x180(%rcx,%rdi,4),%zmm2,%zmm2
    447 │      vpaddd 0x1c0(%rcx,%rdi,4),%zmm3,%zmm3
    210 │      vpaddd 0x200(%rcx,%rdi,4),%zmm0,%zmm0
    161 │      vpaddd 0x240(%rcx,%rdi,4),%zmm1,%zmm1
    173 │      vpaddd 0x280(%rcx,%rdi,4),%zmm2,%zmm2
    392 │      vpaddd 0x2c0(%rcx,%rdi,4),%zmm3,%zmm3
    217 │      vpaddd 0x300(%rcx,%rdi,4),%zmm0,%zmm0
    153 │      vpaddd 0x340(%rcx,%rdi,4),%zmm1,%zmm1
    156 │      vpaddd 0x380(%rcx,%rdi,4),%zmm2,%zmm2
    654 │      vpaddd 0x3c0(%rcx,%rdi,4),%zmm3,%zmm3

perf strongly indicates that this is highly memory bound. I would
expect (guesstimate) the bitmap version to be slower when memory
bandwidth is at full usage (e.g. multiple threads computing sums),
which might not show up in a single-threaded benchmark.

On Wed, Oct 17, 2018 at 9:11 AM Wes McKinney <wesmck...@gmail.com> wrote:
> I'm surprised that using a stack variable has an impact, but I should
> probably update the benchmarks to do that (and merge with the
> SumState<T> at the end of the function) for thoroughness. Thanks!
>
> On Wed, Oct 17, 2018 at 9:07 AM Francois Saint-Jacques
> <fsaintjacq...@networkdump.com> wrote:
> >
> > It seems the code for the naive Scalar example is not friendly with the
> > compiler's auto-vectorization component.
> > If you accumulate in a local state
> > (instead of the SumState pointer), you'll get different results, at
> > least with clang++ 6.0.
> >
> > benchmark-noavx (only SSE):
> > BM_SumInt32Scalar                  42390 us  42389 us  16  bytes_per_second=8.78824G/s  items_per_second=2.35908G/s
> > BM_SumInt32ScalarRegister          28036 us  28035 us  25  bytes_per_second=13.2878G/s  items_per_second=3.56691G/s
> > BM_SumInt32ScalarUnrolled          31660 us  31659 us  22  bytes_per_second=11.7668G/s  items_per_second=3.15863G/s
> > BM_SumInt32ScalarSentinel          49541 us  49540 us  14  bytes_per_second=7.51971G/s  items_per_second=2.01856G/s
> > BM_SumInt32ScalarSentinelRegister  55655 us  55654 us  12  bytes_per_second=6.69369G/s  items_per_second=1.79683G/s
> > BM_SumInt64Scalar                  65030 us  65030 us  11  bytes_per_second=11.4572G/s  items_per_second=1.53776G/s
> > BM_SumInt64ScalarRegister          55370 us  55369 us  13  bytes_per_second=13.4563G/s  items_per_second=1.80607G/s
> > BM_SumInt64ScalarUnrolled          61287 us  61286 us  11  bytes_per_second=12.1571G/s  items_per_second=1.6317G/s
> > BM_SumInt64ScalarSentinel          69792 us  69792 us  10  bytes_per_second=10.6755G/s  items_per_second=1.43284G/s
> > BM_SumInt64ScalarSentinelRegister  68418 us  68416 us  10  bytes_per_second=10.89G/s    items_per_second=1.46164G/s
> >
> > benchmark-native (AVX512):
> > BM_SumInt32Scalar                  43981 us  43980 us  16  bytes_per_second=8.47046G/s  items_per_second=2.27377G/s
> > BM_SumInt32ScalarRegister          28381 us  28380 us  24  bytes_per_second=13.1263G/s  items_per_second=3.52355G/s
> > BM_SumInt32ScalarUnrolled          79035 us  79033 us   9  bytes_per_second=4.71361G/s  items_per_second=1.2653G/s
> > BM_SumInt32ScalarSentinel          53430 us  53429 us  13  bytes_per_second=6.97235G/s  items_per_second=1.87163G/s
> > BM_SumInt32ScalarSentinelRegister  32450 us  32450 us  22  bytes_per_second=11.4802G/s  items_per_second=3.0817G/s
> > BM_SumInt64Scalar                  68112 us  68111 us  10  bytes_per_second=10.9389G/s  items_per_second=1.46819G/s
> > BM_SumInt64ScalarRegister          56188 us  56187 us  12  bytes_per_second=13.2603G/s  items_per_second=1.77977G/s
> > BM_SumInt64ScalarUnrolled          65649 us  65648 us  11  bytes_per_second=11.3492G/s  items_per_second=1.52327G/s
> > BM_SumInt64ScalarSentinel          69674 us  69673 us  10  bytes_per_second=10.6937G/s  items_per_second=1.43528G/s
> > BM_SumInt64ScalarSentinelRegister  59200 us  59199 us  12  bytes_per_second=12.5857G/s  items_per_second=1.68922G/s
> >
> > The interesting part here is that the ScalarSentinelRegister version is
> > very close to the fully vectorized ScalarRegister version on AVX512.
> > Antoine, the compiler is doing exactly what you're suggesting:
> >
> >  374 │ vmovdqu 0x20(%rcx,%rdi,8),%ymm9
> >   48 │ vmovdqu 0x40(%rcx,%rdi,8),%ymm10
> >  230 │ vmovdqu 0x60(%rcx,%rdi,8),%ymm11
> >      │ static inline bool NotNull(int64_t val) { return val != null_sentinel; }
> >      │ vpcmpneqq %ymm13,%ymm8,%k2
> >   25 │ vpcmpneqq %ymm13,%ymm9,%k3
> >   20 │ vpcmpneqq %ymm13,%ymm10,%k4
> >   72 │ vpcmpneqq %ymm13,%ymm11,%k1
> >      │ if (Traits<T>::NotNull(*values)) {
> >      │ vpbroadcastq %rbx,%ymm12{%k2}{z}
> >   15 │ vpaddq %ymm12,%ymm3,%ymm3
> >    9 │ vpbroadcastq %rbx,%ymm12{%k3}{z}
> >   57 │ vpaddq %ymm12,%ymm5,%ymm5
> >      │ vpbroadcastq %rbx,%ymm12{%k4}{z}
> >   39 │ vpaddq %ymm12,%ymm6,%ymm6
> >   22 │ vpbroadcastq %rbx,%ymm12{%k1}{z}
> >   88 │ vpaddq %ymm12,%ymm7,%ymm7
> >
> > See my fork, https://github.com/fsaintjacques/bitmaps-vs-sentinels . I
> > didn't have time to look at the other Sum (bitmap) implementation and
> > didn't have time to clean up the code. I'll try to improve this over the
> > weekend.
> >
> > On Tue, Oct 16, 2018 at 8:05 AM Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > > hi folks,
> > >
> > > I explored a bit the performance implications of using validity
> > > bitmaps (like the Arrow columnar format) vs.
> > > sentinel values (like NaN, INT32_MIN) for nulls:
> > >
> > > http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
> > >
> > > The vectorization results may be of interest to those implementing
> > > analytic functions targeting the Arrow memory format. There are
> > > probably some other optimizations that can be employed, too.
> > >
> > > Caveat: it's entirely possible I made some mistakes in my code. I
> > > checked the various implementations for correctness only, and did not
> > > dig too deeply beyond that.
> > >
> > > - Wes
> >
> > --
> > Sent from my jetpack.
>
--
Sent from my jetpack.
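For reference, the sentinel-skipping pattern discussed in this thread (the ScalarSentinelRegister benchmark) can be sketched as follows. This is a hypothetical reconstruction from the quoted NotNull snippet and the benchmark labels, not the actual code from the fork; with AVX-512 enabled, clang turns the branch into the masked vpcmpneqq/vpaddq sequence shown in the annotation above.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed sentinel encoding: INT64_MIN marks a null slot.
constexpr int64_t null_sentinel = INT64_MIN;

static inline bool NotNull(int64_t val) { return val != null_sentinel; }

// Skip sentinel-encoded nulls while accumulating in a local register.
// Because the condition is a simple data-parallel comparison, clang can
// vectorize the loop with masked compares and masked adds under AVX-512.
int64_t SumSentinelRegister(const int64_t* values, size_t length) {
  int64_t sum = 0;
  for (size_t i = 0; i < length; ++i) {
    if (NotNull(values[i])) {
      sum += values[i];
    }
  }
  return sum;
}
```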