Don't trust that the compiler will auto-vectorize; tiny changes can have
drastic impacts.
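
For context, the two kernels below differ only in where the accumulator
lives. Here is a minimal sketch; only the annotated source lines
(state->total += *values++ vs sum += *values++) are taken from the actual
code, the struct layout and signatures are my assumptions:

#include <cstdint>

template <typename T>
struct SumState {
  T total = 0;
  int64_t valid_count = 0;  // assumed field, not shown in this thread
};

// SumScalar: accumulates through the state pointer. The store to
// state->total is observable on every iteration and the compiler cannot
// prove it doesn't alias the input, so the loop stays scalar
// (one add + one store per element, as in the annotation below).
template <typename T>
void SumScalar(const T* values, int64_t length, SumState<T>* state) {
  for (int64_t i = 0; i < length; ++i) {
    state->total += *values++;
  }
}

// SumScalarRegister: accumulates in a local variable (a register) and
// writes back once, which clang unrolls and vectorizes.
template <typename T>
void SumScalarRegister(const T* values, int64_t length, SumState<T>* state) {
  T sum = 0;
  for (int64_t i = 0; i < length; ++i) {
    sum += *values++;
  }
  state->total += sum;
}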

SumScalar (benchmark-native):
   │           state->total += *values++;
42 │       add    (%rcx),%esi
68 │       mov    %esi,(%rsp)
 3 │       add    0x4(%rcx),%esi
36 │       mov    %esi,(%rsp)
   │       add    0x8(%rcx),%esi
86 │       mov    %esi,(%rsp)
   │       add    0xc(%rcx),%esi
51 │       mov    %esi,(%rsp)
27 │       add    0x10(%rcx),%esi
05 │       mov    %esi,(%rsp)
45 │       add    0x14(%rcx),%esi
22 │       mov    %esi,(%rsp)
 2 │       add    0x18(%rcx),%esi
68 │       mov    %esi,(%rsp)
 1 │       add    0x1c(%rcx),%esi
 1 │       add    $0x20,%rcx
54 │       mov    %esi,(%rsp)

SumScalarRegister (benchmark-native):
       │           sum += *values++;
   228 │       vpaddd (%rcx,%rdi,4),%zmm0,%zmm0
   164 │       vpaddd 0x40(%rcx,%rdi,4),%zmm1,%zmm1
   163 │       vpaddd 0x80(%rcx,%rdi,4),%zmm2,%zmm2
   438 │       vpaddd 0xc0(%rcx,%rdi,4),%zmm3,%zmm3
   207 │       vpaddd 0x100(%rcx,%rdi,4),%zmm0,%zmm0
   146 │       vpaddd 0x140(%rcx,%rdi,4),%zmm1,%zmm1
   176 │       vpaddd 0x180(%rcx,%rdi,4),%zmm2,%zmm2
   447 │       vpaddd 0x1c0(%rcx,%rdi,4),%zmm3,%zmm3
   210 │       vpaddd 0x200(%rcx,%rdi,4),%zmm0,%zmm0
   161 │       vpaddd 0x240(%rcx,%rdi,4),%zmm1,%zmm1
   173 │       vpaddd 0x280(%rcx,%rdi,4),%zmm2,%zmm2
   392 │       vpaddd 0x2c0(%rcx,%rdi,4),%zmm3,%zmm3
   217 │       vpaddd 0x300(%rcx,%rdi,4),%zmm0,%zmm0
   153 │       vpaddd 0x340(%rcx,%rdi,4),%zmm1,%zmm1
   156 │       vpaddd 0x380(%rcx,%rdi,4),%zmm2,%zmm2
   654 │       vpaddd 0x3c0(%rcx,%rdi,4),%zmm3,%zmm3
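
That loop is roughly equivalent to the following intrinsics sketch (my
approximation of what clang generated, not code from the benchmark repo):
four independent zmm accumulators, unrolled, with a single horizontal
reduction after the loop.

#include <immintrin.h>
#include <cstdint>

// Approximation of the AVX-512 loop above: four 512-bit accumulators
// (16 x int32 each), unrolled 4x, reduced once at the end.
int32_t SumInt32Avx512(const int32_t* values, int64_t length) {
  __m512i acc0 = _mm512_setzero_si512();
  __m512i acc1 = _mm512_setzero_si512();
  __m512i acc2 = _mm512_setzero_si512();
  __m512i acc3 = _mm512_setzero_si512();
  int64_t i = 0;
  for (; i + 64 <= length; i += 64) {
    acc0 = _mm512_add_epi32(acc0, _mm512_loadu_si512(values + i));
    acc1 = _mm512_add_epi32(acc1, _mm512_loadu_si512(values + i + 16));
    acc2 = _mm512_add_epi32(acc2, _mm512_loadu_si512(values + i + 32));
    acc3 = _mm512_add_epi32(acc3, _mm512_loadu_si512(values + i + 48));
  }
  __m512i acc = _mm512_add_epi32(_mm512_add_epi32(acc0, acc1),
                                 _mm512_add_epi32(acc2, acc3));
  int32_t sum = _mm512_reduce_add_epi32(acc);  // horizontal reduction
  for (; i < length; ++i) sum += values[i];    // scalar tail
  return sum;
}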

perf strongly indicates that this is highly memory bound. I would
expect/guesstimate the bitmap version to be slower once memory bandwidth
is fully used (e.g. multiple threads computing sums), which might not show
up in a single-threaded benchmark.
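
One quick way to test that hypothesis (a hypothetical harness, not part of
the benchmark repo) would be to run the same sum from several threads over
private buffers and see whether aggregate throughput plateaus:

#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

// Sum private arrays from N threads and report aggregate throughput; if it
// stops scaling with the thread count, the kernel is bandwidth bound.
int main(int argc, char** argv) {
  const int num_threads = argc > 1 ? std::atoi(argv[1]) : 4;
  const int64_t length = 16 * 1024 * 1024;  // 64 MiB of int32 per thread
  std::vector<std::vector<int32_t>> data(
      num_threads, std::vector<int32_t>(length, 1));
  std::vector<int64_t> sums(num_threads, 0);

  auto start = std::chrono::steady_clock::now();
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&, t] {
      sums[t] = std::accumulate(data[t].begin(), data[t].end(), int64_t{0});
    });
  }
  for (auto& w : workers) w.join();
  double secs = std::chrono::duration<double>(
                    std::chrono::steady_clock::now() - start).count();

  double gigabytes = num_threads * length * sizeof(int32_t) / 1e9;
  std::cout << "aggregate " << gigabytes / secs << " GB/s" << std::endl;
  return 0;
}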

On Wed, Oct 17, 2018 at 9:11 AM Wes McKinney <wesmck...@gmail.com> wrote:

> I'm surprised that using a stack variable has an impact, but I should
> probably update the benchmarks to do that (and merge with the
> SumState<T> at the end of the function) for thoroughness. Thanks!
> On Wed, Oct 17, 2018 at 9:07 AM Francois Saint-Jacques
> <fsaintjacq...@networkdump.com> wrote:
> >
> > It seems the code for the naive Scalar example is not friendly to the
> > compiler's auto-vectorization pass. If you accumulate in a local variable
> > (instead of going through the SumState pointer), you'll get different
> > results, at least with clang++ 6.0.
> >
> > benchmark-noavx (only SSE):
> > BM_SumInt32Scalar                      42390 us      42389 us         16
> > bytes_per_second=8.78824G/s items_per_second=2.35908G/s
> > BM_SumInt32ScalarRegister              28036 us      28035 us         25
> > bytes_per_second=13.2878G/s items_per_second=3.56691G/s
> > BM_SumInt32ScalarUnrolled              31660 us      31659 us         22
> > bytes_per_second=11.7668G/s items_per_second=3.15863G/s
> > BM_SumInt32ScalarSentinel              49541 us      49540 us         14
> > bytes_per_second=7.51971G/s items_per_second=2.01856G/s
> > BM_SumInt32ScalarSentinelRegister      55655 us      55654 us         12
> > bytes_per_second=6.69369G/s items_per_second=1.79683G/s
> > BM_SumInt64Scalar                      65030 us      65030 us         11
> > bytes_per_second=11.4572G/s items_per_second=1.53776G/s
> > BM_SumInt64ScalarRegister              55370 us      55369 us         13
> > bytes_per_second=13.4563G/s items_per_second=1.80607G/s
> > BM_SumInt64ScalarUnrolled              61287 us      61286 us         11
> > bytes_per_second=12.1571G/s items_per_second=1.6317G/s
> > BM_SumInt64ScalarSentinel              69792 us      69792 us         10
> > bytes_per_second=10.6755G/s items_per_second=1.43284G/s
> > BM_SumInt64ScalarSentinelRegister      68418 us      68416 us         10
> > bytes_per_second=10.89G/s items_per_second=1.46164G/
> >
> > benchmark-native (AVX512):
> > BM_SumInt32Scalar                      43981 us      43980 us         16
> > bytes_per_second=8.47046G/s items_per_second=2.27377G/s
> > BM_SumInt32ScalarRegister              28381 us      28380 us         24
> > bytes_per_second=13.1263G/s items_per_second=3.52355G/s
> > BM_SumInt32ScalarUnrolled              79035 us      79033 us          9
> > bytes_per_second=4.71361G/s items_per_second=1.2653G/s
> > BM_SumInt32ScalarSentinel              53430 us      53429 us         13
> > bytes_per_second=6.97235G/s items_per_second=1.87163G/s
> > BM_SumInt32ScalarSentinelRegister      32450 us      32450 us         22
> > bytes_per_second=11.4802G/s items_per_second=3.0817G/s
> > BM_SumInt64Scalar                      68112 us      68111 us         10
> > bytes_per_second=10.9389G/s items_per_second=1.46819G/s
> > BM_SumInt64ScalarRegister              56188 us      56187 us         12
> > bytes_per_second=13.2603G/s items_per_second=1.77977G/s
> > BM_SumInt64ScalarUnrolled              65649 us      65648 us         11
> > bytes_per_second=11.3492G/s items_per_second=1.52327G/s
> > BM_SumInt64ScalarSentinel              69674 us      69673 us         10
> > bytes_per_second=10.6937G/s items_per_second=1.43528G/s
> > BM_SumInt64ScalarSentinelRegister      59200 us      59199 us         12
> > bytes_per_second=12.5857G/s items_per_second=1.68922G/s
> >
> > The interesting part here is that the ScalarSentinelRegister version is
> > very close to the fully vectorized ScalarRegister version on AVX512.
> > Antoine, the compiler is doing exactly what you're suggesting:
> >
> > 374 │       vmovdqu 0x20(%rcx,%rdi,8),%ymm9
> >  48 │       vmovdqu 0x40(%rcx,%rdi,8),%ymm10
> > 230 │       vmovdqu 0x60(%rcx,%rdi,8),%ymm11
> >     │       static inline bool NotNull(int64_t val) { return val != null_sentinel; }
> >     │       vpcmpneqq %ymm13,%ymm8,%k2
> >  25 │       vpcmpneqq %ymm13,%ymm9,%k3
> >  20 │       vpcmpneqq %ymm13,%ymm10,%k4
> >  72 │       vpcmpneqq %ymm13,%ymm11,%k1
> >     │           if (Traits<T>::NotNull(*values)) {
> >     │       vpbroadcastq %rbx,%ymm12{%k2}{z}
> >  15 │       vpaddq %ymm12,%ymm3,%ymm3
> >   9 │       vpbroadcastq %rbx,%ymm12{%k3}{z}
> >  57 │       vpaddq %ymm12,%ymm5,%ymm5
> >     │       vpbroadcastq %rbx,%ymm12{%k4}{z}
> >  39 │       vpaddq %ymm12,%ymm6,%ymm6
> >  22 │       vpbroadcastq %rbx,%ymm12{%k1}{z}
> >  88 │       vpaddq %ymm12,%ymm7,%ymm7
> >
> > See my fork, https://github.com/fsaintjacques/bitmaps-vs-sentinels . I
> > didn't have time to look at the other Sum (bitmap) implementation, nor to
> > clean up the code. I'll try to improve this over the weekend.
> >
> > On Tue, Oct 16, 2018 at 8:05 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > hi folks,
> > >
> > > I explored a bit the performance implications of using validity
> > > bitmaps (like the Arrow columnar format) vs. sentinel values (like
> > > NaN, INT32_MIN) for nulls:
> > >
> > > http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
> > >
> > > The vectorization results may be of interest to those implementing
> > > analytic functions targeting the Arrow memory format. There are probably
> > > some other optimizations that can be employed, too.
> > >
> > > Caveat: it's entirely possible I made some mistakes in my code. I
> > > checked the various implementations for correctness only, and did not
> > > dig too deeply beyond that.
> > >
> > > - Wes
> > >
> >
> >
> > --
> > Sent from my jetpack.
>


-- 
Sent from my jetpack.
