Memory binding can be viewed as an opportunity for melding multiple
aggregators.

For instance, any additional aggregation comes nearly for free. Sum and
count (non-zero) will cost about the same as either alone, as will sum and
sum of squares.
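
A minimal sketch of what I mean, with made-up names (not code from the
benchmark repo): one pass over the data feeding several accumulators at
once, so the extra arithmetic hides behind the same memory traffic.

#include <cstdint>
#include <cstddef>

// Hypothetical fused aggregator: sum, count of non-zero values, and sum of
// squares computed in a single pass. When the loop is memory bound, the
// additional ALU work is essentially free.
struct FusedAgg {
  int64_t sum = 0;
  int64_t count_nonzero = 0;
  int64_t sum_squares = 0;
};

FusedAgg AggregateFused(const int32_t* values, size_t length) {
  FusedAgg agg;
  for (size_t i = 0; i < length; ++i) {
    const int64_t v = values[i];
    agg.sum += v;
    agg.count_nonzero += (v != 0);
    agg.sum_squares += v * v;
  }
  return agg;
}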



On Wed, Oct 17, 2018, 06:21 Francois Saint-Jacques <
fsaintjacq...@networkdump.com> wrote:

> Don't trust that the compiler will auto-vectorize; tiny changes can have
> drastic impacts.
>
> SumScalar (benchmark-native):
>    │           state->total += *values++;
> 42 │       add    (%rcx),%esi
> 68 │       mov    %esi,(%rsp)
>  3 │       add    0x4(%rcx),%esi
> 36 │       mov    %esi,(%rsp)
>    │       add    0x8(%rcx),%esi
> 86 │       mov    %esi,(%rsp)
>    │       add    0xc(%rcx),%esi
> 51 │       mov    %esi,(%rsp)
> 27 │       add    0x10(%rcx),%esi
> 05 │       mov    %esi,(%rsp)
> 45 │       add    0x14(%rcx),%esi
> 22 │       mov    %esi,(%rsp)
>  2 │       add    0x18(%rcx),%esi
> 68 │       mov    %esi,(%rsp)
>  1 │       add    0x1c(%rcx),%esi
>  1 │       add    $0x20,%rcx
> 54 │       mov    %esi,(%rsp)
>
> SumScalarRegister (benchmark-native):
>        │           sum += *values++;
>    228 │       vpaddd (%rcx,%rdi,4),%zmm0,%zmm0
>    164 │       vpaddd 0x40(%rcx,%rdi,4),%zmm1,%zmm1
>    163 │       vpaddd 0x80(%rcx,%rdi,4),%zmm2,%zmm2
>    438 │       vpaddd 0xc0(%rcx,%rdi,4),%zmm3,%zmm3
>    207 │       vpaddd 0x100(%rcx,%rdi,4),%zmm0,%zmm0
>    146 │       vpaddd 0x140(%rcx,%rdi,4),%zmm1,%zmm1
>    176 │       vpaddd 0x180(%rcx,%rdi,4),%zmm2,%zmm2
>    447 │       vpaddd 0x1c0(%rcx,%rdi,4),%zmm3,%zmm3
>    210 │       vpaddd 0x200(%rcx,%rdi,4),%zmm0,%zmm0
>    161 │       vpaddd 0x240(%rcx,%rdi,4),%zmm1,%zmm1
>    173 │       vpaddd 0x280(%rcx,%rdi,4),%zmm2,%zmm2
>    392 │       vpaddd 0x2c0(%rcx,%rdi,4),%zmm3,%zmm3
>    217 │       vpaddd 0x300(%rcx,%rdi,4),%zmm0,%zmm0
>    153 │       vpaddd 0x340(%rcx,%rdi,4),%zmm1,%zmm1
>    156 │       vpaddd 0x380(%rcx,%rdi,4),%zmm2,%zmm2
>    654 │       vpaddd 0x3c0(%rcx,%rdi,4),%zmm3,%zmm3
>
> perf strongly indicates that this is highly memory bound. I would
> expect/guesstimate the bitmap version to be slower when memory bandwidth
> is fully used (multiple threads computing vectors?), which might not show
> in a single-thread benchmark.
>
> On Wed, Oct 17, 2018 at 9:11 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > I'm surprised that using a stack variable has an impact, but I should
> > probably update the benchmarks to do that (and merge with the
> > SumState<T> at the end of the function) for thoroughness. Thanks!
> > On Wed, Oct 17, 2018 at 9:07 AM Francois Saint-Jacques
> > <fsaintjacq...@networkdump.com> wrote:
> > >
> > > It seems the code for the naive Scalar example is not friendly with the
> > > compiler's auto-vectorization. If you accumulate in a local variable
> > > (instead of through the SumState pointer), you'll get different results,
> > > at least with clang++ 6.0.
> > >
> > > benchmark-noavx (only SSE):
> > > BM_SumInt32Scalar                      42390 us      42389 us    16 bytes_per_second=8.78824G/s items_per_second=2.35908G/s
> > > BM_SumInt32ScalarRegister              28036 us      28035 us    25 bytes_per_second=13.2878G/s items_per_second=3.56691G/s
> > > BM_SumInt32ScalarUnrolled              31660 us      31659 us    22 bytes_per_second=11.7668G/s items_per_second=3.15863G/s
> > > BM_SumInt32ScalarSentinel              49541 us      49540 us    14 bytes_per_second=7.51971G/s items_per_second=2.01856G/s
> > > BM_SumInt32ScalarSentinelRegister      55655 us      55654 us    12 bytes_per_second=6.69369G/s items_per_second=1.79683G/s
> > > BM_SumInt64Scalar                      65030 us      65030 us    11 bytes_per_second=11.4572G/s items_per_second=1.53776G/s
> > > BM_SumInt64ScalarRegister              55370 us      55369 us    13 bytes_per_second=13.4563G/s items_per_second=1.80607G/s
> > > BM_SumInt64ScalarUnrolled              61287 us      61286 us    11 bytes_per_second=12.1571G/s items_per_second=1.6317G/s
> > > BM_SumInt64ScalarSentinel              69792 us      69792 us    10 bytes_per_second=10.6755G/s items_per_second=1.43284G/s
> > > BM_SumInt64ScalarSentinelRegister      68418 us      68416 us    10 bytes_per_second=10.89G/s items_per_second=1.46164G/s
> > >
> > > benchmark-native (AVX512):
> > > BM_SumInt32Scalar                      43981 us      43980 us    16 bytes_per_second=8.47046G/s items_per_second=2.27377G/s
> > > BM_SumInt32ScalarRegister              28381 us      28380 us    24 bytes_per_second=13.1263G/s items_per_second=3.52355G/s
> > > BM_SumInt32ScalarUnrolled              79035 us      79033 us     9 bytes_per_second=4.71361G/s items_per_second=1.2653G/s
> > > BM_SumInt32ScalarSentinel              53430 us      53429 us    13 bytes_per_second=6.97235G/s items_per_second=1.87163G/s
> > > BM_SumInt32ScalarSentinelRegister      32450 us      32450 us    22 bytes_per_second=11.4802G/s items_per_second=3.0817G/s
> > > BM_SumInt64Scalar                      68112 us      68111 us    10 bytes_per_second=10.9389G/s items_per_second=1.46819G/s
> > > BM_SumInt64ScalarRegister              56188 us      56187 us    12 bytes_per_second=13.2603G/s items_per_second=1.77977G/s
> > > BM_SumInt64ScalarUnrolled              65649 us      65648 us    11 bytes_per_second=11.3492G/s items_per_second=1.52327G/s
> > > BM_SumInt64ScalarSentinel              69674 us      69673 us    10 bytes_per_second=10.6937G/s items_per_second=1.43528G/s
> > > BM_SumInt64ScalarSentinelRegister      59200 us      59199 us    12 bytes_per_second=12.5857G/s items_per_second=1.68922G/s
> > >
> > > The interesting part here is that the ScalarSentinelRegister version is
> > > very close to the fully vectorized ScalarRegister version on AVX512.
> > > Antoine, the compiler is doing exactly what you're suggesting:
> > >
> > > 374 │       vmovdqu 0x20(%rcx,%rdi,8),%ymm9
> > >  48 │       vmovdqu 0x40(%rcx,%rdi,8),%ymm10
> > > 230 │       vmovdqu 0x60(%rcx,%rdi,8),%ymm11
> > >     │       static inline bool NotNull(int64_t val) { return val != null_sentinel; }
> > >     │       vpcmpneqq %ymm13,%ymm8,%k2
> > >  25 │       vpcmpneqq %ymm13,%ymm9,%k3
> > >  20 │       vpcmpneqq %ymm13,%ymm10,%k4
> > >  72 │       vpcmpneqq %ymm13,%ymm11,%k1
> > >     │           if (Traits<T>::NotNull(*values)) {
> > >     │       vpbroadcastq %rbx,%ymm12{%k2}{z}
> > >  15 │       vpaddq %ymm12,%ymm3,%ymm3
> > >   9 │       vpbroadcastq %rbx,%ymm12{%k3}{z}
> > >  57 │       vpaddq %ymm12,%ymm5,%ymm5
> > >     │       vpbroadcastq %rbx,%ymm12{%k4}{z}
> > >  39 │       vpaddq %ymm12,%ymm6,%ymm6
> > >  22 │       vpbroadcastq %rbx,%ymm12{%k1}{z}
> > >  88 │       vpaddq %ymm12,%ymm7,%ymm7
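> > >
> > > For context, the sentinel loop that produces this masked code looks
> > > roughly like the following, reusing the SumState<T> sketch from above
> > > (names and the sentinel choice are guesses; the real version lives in
> > > the fork linked below and uses a Traits<T> helper):
> > >
> > > #include <cstdint>
> > > #include <limits>
> > >
> > > // Sketch only: the sentinel marks nulls in-band, and the loop skips
> > > // them. Clang turns the branch into vpcmpneqq compare masks plus
> > > // masked broadcast/add, as in the disassembly above.
> > > constexpr int64_t null_sentinel = std::numeric_limits<int64_t>::min();
> > >
> > > static inline bool NotNull(int64_t val) { return val != null_sentinel; }
> > >
> > > void SumScalarSentinelRegister(const int64_t* values, int64_t length,
> > >                                SumState<int64_t>* state) {
> > >   int64_t sum = 0;
> > >   int64_t valid_count = 0;
> > >   for (int64_t i = 0; i < length; ++i) {
> > >     if (NotNull(*values)) {
> > >       sum += *values;
> > >       ++valid_count;
> > >     }
> > >     ++values;
> > >   }
> > >   state->total += sum;
> > >   state->valid_count += valid_count;
> > > }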
> > >
> > > See my fork, https://github.com/fsaintjacques/bitmaps-vs-sentinels . I
> > > didn't have time to look at the other Sum (bitmap) implementation and
> > > didn't have time to clean up the code. I'll try to improve this over
> > > the weekend.
> > >
> > > On Tue, Oct 16, 2018 at 8:05 AM Wes McKinney <wesmck...@gmail.com>
> > wrote:
> > >
> > > > hi folks,
> > > >
> > > > I explored a bit the performance implications of using validity
> > > > bitmaps (like the Arrow columnar format) vs. sentinel values (like
> > > > NaN, INT32_MIN) for nulls:
> > > >
> > > > http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
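> > > >
> > > > For anyone skimming, the difference between the two null
> > > > representations boils down to something like this (illustrative
> > > > only, not the code from the post or repo):
> > > >
> > > > #include <cstdint>
> > > >
> > > > // Validity-bitmap representation (Arrow): nulls are tracked out of
> > > > // band, one bit per value, LSB-first; bit i set means values[i] is
> > > > // valid.
> > > > int64_t SumWithBitmap(const int32_t* values, const uint8_t* valid_bits,
> > > >                       int64_t length) {
> > > >   int64_t sum = 0;
> > > >   for (int64_t i = 0; i < length; ++i) {
> > > >     if (valid_bits[i / 8] & (1 << (i % 8))) {
> > > >       sum += values[i];
> > > >     }
> > > >   }
> > > >   return sum;
> > > > }
> > > >
> > > > // Sentinel representation: a reserved in-band value (here INT32_MIN)
> > > > // stands for null.
> > > > int64_t SumWithSentinel(const int32_t* values, int64_t length) {
> > > >   int64_t sum = 0;
> > > >   for (int64_t i = 0; i < length; ++i) {
> > > >     if (values[i] != INT32_MIN) {
> > > >       sum += values[i];
> > > >     }
> > > >   }
> > > >   return sum;
> > > > }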
> > > >
> > > > The vectorization results may be of interest to those implementing
> > > > analytic functions targeting the Arrow memory format. There are
> > > > probably some other optimizations that can be employed, too.
> > > >
> > > > Caveat: it's entirely possible I made some mistakes in my code. I
> > > > checked the various implementations for correctness only, and did not
> > > > dig too deeply beyond that.
> > > >
> > > > - Wes
> > > >
> > >
> > >
> > > --
> > > Sent from my jetpack.
> >
>
>
> --
> Sent from my jetpack.
>
