I'm surprised that using a stack variable has an impact, but I should
probably update the benchmarks to do that (and merge with the
SumState<T> at the end of the function) for thoroughness. Thanks!
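The two accumulation strategies under discussion can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: `SumState<T>` and the function names are simplified stand-ins.

```cpp
#include <cstdint>

// Hypothetical stand-in for the benchmark's SumState<T>.
template <typename T>
struct SumState {
  int64_t valid_count = 0;
  T total = 0;
};

// Accumulating through the state pointer: every iteration stores to
// memory, adding a loop-carried dependence that can inhibit the
// compiler's auto-vectorizer.
template <typename T>
void SumNoNulls(const T* values, int64_t length, SumState<T>* state) {
  for (int64_t i = 0; i < length; ++i) {
    state->total += values[i];
  }
  state->valid_count += length;
}

// Accumulating in a stack variable and merging with SumState<T> at the
// end of the function: the hot loop touches only registers, so the
// compiler is free to vectorize it.
template <typename T>
void SumNoNullsRegister(const T* values, int64_t length,
                        SumState<T>* state) {
  T total = 0;
  for (int64_t i = 0; i < length; ++i) {
    total += values[i];
  }
  state->total += total;
  state->valid_count += length;
}
```

Both versions compute the same result; only the placement of the accumulator differs.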
On Wed, Oct 17, 2018 at 9:07 AM Francois Saint-Jacques
<fsaintjacq...@networkdump.com> wrote:
>
> It seems the code for the naive Scalar example is not friendly to the
> compiler's auto-vectorization pass. If you accumulate in a local state
> (instead of through the SumState pointer), you get different results, at
> least with clang++ 6.0.
>
> benchmark-noavx (only SSE):
> BM_SumInt32Scalar                  42390 us  42389 us  16  bytes_per_second=8.78824G/s  items_per_second=2.35908G/s
> BM_SumInt32ScalarRegister          28036 us  28035 us  25  bytes_per_second=13.2878G/s  items_per_second=3.56691G/s
> BM_SumInt32ScalarUnrolled          31660 us  31659 us  22  bytes_per_second=11.7668G/s  items_per_second=3.15863G/s
> BM_SumInt32ScalarSentinel          49541 us  49540 us  14  bytes_per_second=7.51971G/s  items_per_second=2.01856G/s
> BM_SumInt32ScalarSentinelRegister  55655 us  55654 us  12  bytes_per_second=6.69369G/s  items_per_second=1.79683G/s
> BM_SumInt64Scalar                  65030 us  65030 us  11  bytes_per_second=11.4572G/s  items_per_second=1.53776G/s
> BM_SumInt64ScalarRegister          55370 us  55369 us  13  bytes_per_second=13.4563G/s  items_per_second=1.80607G/s
> BM_SumInt64ScalarUnrolled          61287 us  61286 us  11  bytes_per_second=12.1571G/s  items_per_second=1.6317G/s
> BM_SumInt64ScalarSentinel          69792 us  69792 us  10  bytes_per_second=10.6755G/s  items_per_second=1.43284G/s
> BM_SumInt64ScalarSentinelRegister  68418 us  68416 us  10  bytes_per_second=10.89G/s    items_per_second=1.46164G/s
>
> benchmark-native (AVX512):
> BM_SumInt32Scalar                  43981 us  43980 us  16  bytes_per_second=8.47046G/s  items_per_second=2.27377G/s
> BM_SumInt32ScalarRegister          28381 us  28380 us  24  bytes_per_second=13.1263G/s  items_per_second=3.52355G/s
> BM_SumInt32ScalarUnrolled          79035 us  79033 us   9  bytes_per_second=4.71361G/s  items_per_second=1.2653G/s
> BM_SumInt32ScalarSentinel          53430 us  53429 us  13  bytes_per_second=6.97235G/s  items_per_second=1.87163G/s
> BM_SumInt32ScalarSentinelRegister  32450 us  32450 us  22  bytes_per_second=11.4802G/s  items_per_second=3.0817G/s
> BM_SumInt64Scalar                  68112 us  68111 us  10  bytes_per_second=10.9389G/s  items_per_second=1.46819G/s
> BM_SumInt64ScalarRegister          56188 us  56187 us  12  bytes_per_second=13.2603G/s  items_per_second=1.77977G/s
> BM_SumInt64ScalarUnrolled          65649 us  65648 us  11  bytes_per_second=11.3492G/s  items_per_second=1.52327G/s
> BM_SumInt64ScalarSentinel          69674 us  69673 us  10  bytes_per_second=10.6937G/s  items_per_second=1.43528G/s
> BM_SumInt64ScalarSentinelRegister  59200 us  59199 us  12  bytes_per_second=12.5857G/s  items_per_second=1.68922G/s
>
> The interesting part here is that the ScalarSentinelRegister version is
> very close to the fully vectorized ScalarRegister version on AVX512.
> Antoine, the compiler is doing exactly what you're suggesting:
>
>  374 │ vmovdqu 0x20(%rcx,%rdi,8),%ymm9
>   48 │ vmovdqu 0x40(%rcx,%rdi,8),%ymm10
>  230 │ vmovdqu 0x60(%rcx,%rdi,8),%ymm11
>      │ static inline bool NotNull(int64_t val) { return val != null_sentinel; }
>      │ vpcmpneqq %ymm13,%ymm8,%k2
>   25 │ vpcmpneqq %ymm13,%ymm9,%k3
>   20 │ vpcmpneqq %ymm13,%ymm10,%k4
>   72 │ vpcmpneqq %ymm13,%ymm11,%k1
>      │ if (Traits<T>::NotNull(*values)) {
>      │ vpbroadcastq %rbx,%ymm12{%k2}{z}
>   15 │ vpaddq %ymm12,%ymm3,%ymm3
>    9 │ vpbroadcastq %rbx,%ymm12{%k3}{z}
>   57 │ vpaddq %ymm12,%ymm5,%ymm5
>      │ vpbroadcastq %rbx,%ymm12{%k4}{z}
>   39 │ vpaddq %ymm12,%ymm6,%ymm6
>   22 │ vpbroadcastq %rbx,%ymm12{%k1}{z}
>   88 │ vpaddq %ymm12,%ymm7,%ymm7
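The scalar source that clang turns into the masked sequence above looks roughly like the sketch below. This is an illustration only: `null_sentinel`, `NotNull`, and `SumState64` approximate the benchmark code rather than reproduce it.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical sentinel encoding for int64 nulls.
constexpr int64_t null_sentinel = std::numeric_limits<int64_t>::min();

static inline bool NotNull(int64_t val) { return val != null_sentinel; }

// Simplified stand-in for the benchmark's sum state.
struct SumState64 {
  int64_t valid_count = 0;
  int64_t total = 0;
};

// Sentinel sum with local accumulators. On AVX-512, clang compares lanes
// against the sentinel into a mask register (vpcmpneqq), zero-masks the
// null lanes, and accumulates with vpaddq, matching the annotation above.
void SumInt64SentinelRegister(const int64_t* values, int64_t length,
                              SumState64* state) {
  int64_t valid_count = 0;
  int64_t total = 0;
  for (int64_t i = 0; i < length; ++i) {
    if (NotNull(values[i])) {
      total += values[i];
      ++valid_count;
    }
  }
  state->total += total;
  state->valid_count += valid_count;
}
```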
>
> See my fork: https://github.com/fsaintjacques/bitmaps-vs-sentinels . I
> didn't have time to look at the other Sum (bitmap) implementation or to
> clean up the code. I'll try to improve this over the weekend.
>
> On Tue, Oct 16, 2018 at 8:05 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hi folks,
> >
> > I explored a bit the performance implications of using validity
> > bitmaps (like the Arrow columnar format) vs. sentinel values (like
> > NaN, INT32_MIN) for nulls:
> >
> > http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
> >
> > The vectorization results may be of interest to those implementing
> > analytic functions targeting the Arrow memory format. There are probably
> > some other optimizations that can be employed, too.
> >
> > Caveat: it's entirely possible I made some mistakes in my code. I
> > checked the various implementations for correctness only, and did not
> > dig too deeply beyond that.
> >
> > - Wes
> >
>
>
> --
> Sent from my jetpack.