I'm surprised that using a stack variable has an impact, but I should
probably update the benchmarks to do that (and merge into the
SumState<T> at the end of the function) for thoroughness. Thanks!
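For reference, the local-accumulator pattern under discussion might look like the sketch below. The fields of SumState<T> are assumptions here, not taken from the actual benchmark code:

```cpp
#include <cstdint>

// Assumed shape of the SumState<T> mentioned above (hypothetical).
template <typename T>
struct SumState {
  T sum = 0;
  int64_t valid_count = 0;
};

// Accumulate in a stack/register variable and merge into the state once at
// the end, instead of writing through the state pointer on every iteration
// (which can block auto-vectorization because of potential aliasing).
template <typename T>
void SumNoNulls(const T* values, int64_t length, SumState<T>* state) {
  T local_sum = 0;
  for (int64_t i = 0; i < length; ++i) {
    local_sum += values[i];
  }
  state->sum += local_sum;
  state->valid_count += length;
}
```
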
On Wed, Oct 17, 2018 at 9:07 AM Francois Saint-Jacques
<fsaintjacq...@networkdump.com> wrote:
>
> It seems the code for the naive Scalar example is not friendly to the
> compiler's auto-vectorizer. If you accumulate in a local state (instead
> of through the SumState pointer), you get different results, at least
> with clang++ 6.0.
>
> benchmark-noavx (only SSE):
> BM_SumInt32Scalar                      42390 us      42389 us         16 bytes_per_second=8.78824G/s items_per_second=2.35908G/s
> BM_SumInt32ScalarRegister              28036 us      28035 us         25 bytes_per_second=13.2878G/s items_per_second=3.56691G/s
> BM_SumInt32ScalarUnrolled              31660 us      31659 us         22 bytes_per_second=11.7668G/s items_per_second=3.15863G/s
> BM_SumInt32ScalarSentinel              49541 us      49540 us         14 bytes_per_second=7.51971G/s items_per_second=2.01856G/s
> BM_SumInt32ScalarSentinelRegister      55655 us      55654 us         12 bytes_per_second=6.69369G/s items_per_second=1.79683G/s
> BM_SumInt64Scalar                      65030 us      65030 us         11 bytes_per_second=11.4572G/s items_per_second=1.53776G/s
> BM_SumInt64ScalarRegister              55370 us      55369 us         13 bytes_per_second=13.4563G/s items_per_second=1.80607G/s
> BM_SumInt64ScalarUnrolled              61287 us      61286 us         11 bytes_per_second=12.1571G/s items_per_second=1.6317G/s
> BM_SumInt64ScalarSentinel              69792 us      69792 us         10 bytes_per_second=10.6755G/s items_per_second=1.43284G/s
> BM_SumInt64ScalarSentinelRegister      68418 us      68416 us         10 bytes_per_second=10.89G/s items_per_second=1.46164G/s
>
> benchmark-native (AVX512):
> BM_SumInt32Scalar                      43981 us      43980 us         16 bytes_per_second=8.47046G/s items_per_second=2.27377G/s
> BM_SumInt32ScalarRegister              28381 us      28380 us         24 bytes_per_second=13.1263G/s items_per_second=3.52355G/s
> BM_SumInt32ScalarUnrolled              79035 us      79033 us          9 bytes_per_second=4.71361G/s items_per_second=1.2653G/s
> BM_SumInt32ScalarSentinel              53430 us      53429 us         13 bytes_per_second=6.97235G/s items_per_second=1.87163G/s
> BM_SumInt32ScalarSentinelRegister      32450 us      32450 us         22 bytes_per_second=11.4802G/s items_per_second=3.0817G/s
> BM_SumInt64Scalar                      68112 us      68111 us         10 bytes_per_second=10.9389G/s items_per_second=1.46819G/s
> BM_SumInt64ScalarRegister              56188 us      56187 us         12 bytes_per_second=13.2603G/s items_per_second=1.77977G/s
> BM_SumInt64ScalarUnrolled              65649 us      65648 us         11 bytes_per_second=11.3492G/s items_per_second=1.52327G/s
> BM_SumInt64ScalarSentinel              69674 us      69673 us         10 bytes_per_second=10.6937G/s items_per_second=1.43528G/s
> BM_SumInt64ScalarSentinelRegister      59200 us      59199 us         12 bytes_per_second=12.5857G/s items_per_second=1.68922G/s
>
> The interesting part here is that the ScalarSentinelRegister version is
> very close to the fully vectorized ScalarRegister version on AVX512.
> Antoine, the compiler is doing exactly what you're suggesting:
>
> 374 │       vmovdqu 0x20(%rcx,%rdi,8),%ymm9
>  48 │       vmovdqu 0x40(%rcx,%rdi,8),%ymm10
> 230 │       vmovdqu 0x60(%rcx,%rdi,8),%ymm11
>     │       static inline bool NotNull(int64_t val) { return val != null_sentinel; }
>     │       vpcmpneqq %ymm13,%ymm8,%k2
>  25 │       vpcmpneqq %ymm13,%ymm9,%k3
>  20 │       vpcmpneqq %ymm13,%ymm10,%k4
>  72 │       vpcmpneqq %ymm13,%ymm11,%k1
>     │           if (Traits<T>::NotNull(*values)) {
>     │       vpbroadcastq %rbx,%ymm12{%k2}{z}
>  15 │       vpaddq %ymm12,%ymm3,%ymm3
>   9 │       vpbroadcastq %rbx,%ymm12{%k3}{z}
>  57 │       vpaddq %ymm12,%ymm5,%ymm5
>     │       vpbroadcastq %rbx,%ymm12{%k4}{z}
>  39 │       vpaddq %ymm12,%ymm6,%ymm6
>  22 │       vpbroadcastq %rbx,%ymm12{%k1}{z}
>  88 │       vpaddq %ymm12,%ymm7,%ymm7
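The loop clang is compiling there is essentially the following sketch; the sentinel value and function names are assumptions for illustration, not copied from the fork:

```cpp
#include <cstdint>
#include <limits>

// Assumed sentinel encoding: INT64_MIN marks a null slot (hypothetical).
constexpr int64_t null_sentinel = std::numeric_limits<int64_t>::min();

static inline bool NotNull(int64_t val) { return val != null_sentinel; }

// Branch-free select-and-add into a local accumulator: the compare maps to
// vpcmpneqq and the conditional add to a masked vpaddq, as in the annotated
// listing above.
int64_t SumInt64SentinelRegister(const int64_t* values, int64_t length) {
  int64_t sum = 0;
  for (int64_t i = 0; i < length; ++i) {
    sum += NotNull(values[i]) ? values[i] : 0;  // select, not a branch
  }
  return sum;
}
```
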
>
> See my fork, https://github.com/fsaintjacques/bitmaps-vs-sentinels . I
> didn't have time to look at the other Sum (bitmap) implementation or to
> clean up the code; I'll try to improve this over the weekend.
>
> On Tue, Oct 16, 2018 at 8:05 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hi folks,
> >
> > I explored a bit the performance implications of using validity
> > bitmaps (like the Arrow columnar format) vs. sentinel values (like
> > NaN, INT32_MIN) for nulls:
> >
> > http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
> >
> > The vectorization results may be of interest to those implementing
> > analytic functions targeting the Arrow memory format. There are
> > probably other optimizations that can be employed, too.
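For comparison with the sentinel variants above, a validity-bitmap sum over Arrow-style LSB-first bits might be sketched as follows; the signature and names are illustrative, not the blog post's exact code:

```cpp
#include <cstdint>

// Bitmap variant: bit i of valid_bits (LSB-first, Arrow convention) says
// whether values[i] is non-null; null slots contribute zero to the sum.
int64_t SumInt64Bitmap(const int64_t* values, const uint8_t* valid_bits,
                       int64_t length) {
  int64_t sum = 0;
  for (int64_t i = 0; i < length; ++i) {
    int64_t is_valid = (valid_bits[i >> 3] >> (i & 7)) & 1;
    sum += is_valid ? values[i] : 0;  // select, not a branch
  }
  return sum;
}
```
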
> >
> > Caveat: it's entirely possible I made some mistakes in my code. I
> > checked the various implementations for correctness only, and did not
> > dig too deeply beyond that.
> >
> > - Wes
> >
>
>
> --
> Sent from my jetpack.