Don't trust that the compiler will auto-vectorize; tiny changes can have
drastic impacts.
SumScalar (benchmark-native):
│ state->total += *values++;
42 │ add (%rcx),%esi
68 │ mov %esi,(%rsp)
3 │ add 0x4(%rcx),%esi
36 │ mov %esi,(%rsp)
│ add 0x8(%rcx),%esi
86 │ mov %esi,(%rsp)
│ add 0xc(%rcx),%esi
51 │ mov %esi,(%rsp)
27 │ add 0x10(%rcx),%esi
05 │ mov %esi,(%rsp)
45 │ add 0x14(%rcx),%esi
22 │ mov %esi,(%rsp)
2 │ add 0x18(%rcx),%esi
68 │ mov %esi,(%rsp)
1 │ add 0x1c(%rcx),%esi
1 │ add $0x20,%rcx
54 │ mov %esi,(%rsp)
SumScalarRegister (benchmark-native):
│ sum += *values++;
228 │ vpaddd (%rcx,%rdi,4),%zmm0,%zmm0
164 │ vpaddd 0x40(%rcx,%rdi,4),%zmm1,%zmm1
163 │ vpaddd 0x80(%rcx,%rdi,4),%zmm2,%zmm2
438 │ vpaddd 0xc0(%rcx,%rdi,4),%zmm3,%zmm3
207 │ vpaddd 0x100(%rcx,%rdi,4),%zmm0,%zmm0
146 │ vpaddd 0x140(%rcx,%rdi,4),%zmm1,%zmm1
176 │ vpaddd 0x180(%rcx,%rdi,4),%zmm2,%zmm2
447 │ vpaddd 0x1c0(%rcx,%rdi,4),%zmm3,%zmm3
210 │ vpaddd 0x200(%rcx,%rdi,4),%zmm0,%zmm0
161 │ vpaddd 0x240(%rcx,%rdi,4),%zmm1,%zmm1
173 │ vpaddd 0x280(%rcx,%rdi,4),%zmm2,%zmm2
392 │ vpaddd 0x2c0(%rcx,%rdi,4),%zmm3,%zmm3
217 │ vpaddd 0x300(%rcx,%rdi,4),%zmm0,%zmm0
153 │ vpaddd 0x340(%rcx,%rdi,4),%zmm1,%zmm1
156 │ vpaddd 0x380(%rcx,%rdi,4),%zmm2,%zmm2
654 │ vpaddd 0x3c0(%rcx,%rdi,4),%zmm3,%zmm3
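The difference comes from where the accumulator lives. Below is a minimal sketch of the two loop shapes; the struct and function bodies are my reconstruction from the snippets above, not the benchmark's exact code:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct SumState {
  int64_t total = 0;
};

// Accumulates through the pointer on every iteration: the compiler must
// assume state->total may alias the input, so the store stays inside the
// loop (the scalar add/mov pairs above) and vectorization is blocked.
void SumScalar(const int32_t* values, int64_t length, SumState* state) {
  for (int64_t i = 0; i < length; ++i) {
    state->total += *values++;
  }
}

// Accumulates in a local variable and writes back once at the end: with no
// aliasing concern, the compiler is free to keep the sum in registers and
// vectorize (the vpaddd sequence above).
void SumScalarRegister(const int32_t* values, int64_t length, SumState* state) {
  int64_t sum = 0;
  for (int64_t i = 0; i < length; ++i) {
    sum += *values++;
  }
  state->total = sum;
}
```

Note the compiler also unrolls the register version across four independent zmm accumulators, which hides the latency of the dependent adds.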
perf strongly indicates that this is highly memory bound. I would
expect/guesstimate the bitmap version to be slower once memory bandwidth
is fully used (multiple threads computing over vectors?), which might not
show up in a single-threaded benchmark.
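One way to probe the bandwidth hypothesis is to run the same sum kernel from several threads over disjoint slices and watch whether aggregate throughput stops scaling. A rough sketch (function names are placeholders, not the benchmark's API):

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Each thread sums its own slice; with enough threads the combined streams
// should saturate memory bandwidth, at which point per-thread throughput
// drops and the compute-bound vs memory-bound kernels converge.
int64_t SumSlice(const int32_t* values, int64_t length) {
  int64_t sum = 0;
  for (int64_t i = 0; i < length; ++i) sum += values[i];
  return sum;
}

int64_t SumParallel(const std::vector<int32_t>& values, int num_threads) {
  std::vector<int64_t> partial(num_threads, 0);
  std::vector<std::thread> threads;
  const int64_t chunk = static_cast<int64_t>(values.size()) / num_threads;
  for (int t = 0; t < num_threads; ++t) {
    const int32_t* base = values.data() + t * chunk;
    // The last thread picks up the remainder.
    const int64_t len = (t == num_threads - 1)
                            ? static_cast<int64_t>(values.size()) - t * chunk
                            : chunk;
    threads.emplace_back([=, &partial] { partial[t] = SumSlice(base, len); });
  }
  for (auto& th : threads) th.join();
  return std::accumulate(partial.begin(), partial.end(), int64_t{0});
}
```

Timing this at 1, 2, 4, ... threads for both the bitmap and sentinel kernels would show whether the single-thread advantage survives under contention.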
On Wed, Oct 17, 2018 at 9:11 AM Wes McKinney <[email protected]> wrote:
> I'm surprised that using a stack variable has an impact, but I should
> probably update the benchmarks to do that (and merge with the
> SumState<T> at the end of the function) for thoroughness. Thanks!
> On Wed, Oct 17, 2018 at 9:07 AM Francois Saint-Jacques
> <[email protected]> wrote:
> >
> > It seems the code for the naive Scalar example is not friendly to the
> > compiler's auto-vectorization. If you accumulate in a local variable
> > (instead of through the SumState pointer), you'll get different results,
> > at least with clang++ 6.0.
> >
> > benchmark-noavx (only SSE):
> > BM_SumInt32Scalar 42390 us 42389 us 16
> > bytes_per_second=8.78824G/s items_per_second=2.35908G/s
> > BM_SumInt32ScalarRegister 28036 us 28035 us 25
> > bytes_per_second=13.2878G/s items_per_second=3.56691G/s
> > BM_SumInt32ScalarUnrolled 31660 us 31659 us 22
> > bytes_per_second=11.7668G/s items_per_second=3.15863G/s
> > BM_SumInt32ScalarSentinel 49541 us 49540 us 14
> > bytes_per_second=7.51971G/s items_per_second=2.01856G/s
> > BM_SumInt32ScalarSentinelRegister 55655 us 55654 us 12
> > bytes_per_second=6.69369G/s items_per_second=1.79683G/s
> > BM_SumInt64Scalar 65030 us 65030 us 11
> > bytes_per_second=11.4572G/s items_per_second=1.53776G/s
> > BM_SumInt64ScalarRegister 55370 us 55369 us 13
> > bytes_per_second=13.4563G/s items_per_second=1.80607G/s
> > BM_SumInt64ScalarUnrolled 61287 us 61286 us 11
> > bytes_per_second=12.1571G/s items_per_second=1.6317G/s
> > BM_SumInt64ScalarSentinel 69792 us 69792 us 10
> > bytes_per_second=10.6755G/s items_per_second=1.43284G/s
> > BM_SumInt64ScalarSentinelRegister 68418 us 68416 us 10
> > bytes_per_second=10.89G/s items_per_second=1.46164G/s
> >
> > benchmark-native (AVX512):
> > BM_SumInt32Scalar 43981 us 43980 us 16
> > bytes_per_second=8.47046G/s items_per_second=2.27377G/s
> > BM_SumInt32ScalarRegister 28381 us 28380 us 24
> > bytes_per_second=13.1263G/s items_per_second=3.52355G/s
> > BM_SumInt32ScalarUnrolled 79035 us 79033 us 9
> > bytes_per_second=4.71361G/s items_per_second=1.2653G/s
> > BM_SumInt32ScalarSentinel 53430 us 53429 us 13
> > bytes_per_second=6.97235G/s items_per_second=1.87163G/s
> > BM_SumInt32ScalarSentinelRegister 32450 us 32450 us 22
> > bytes_per_second=11.4802G/s items_per_second=3.0817G/s
> > BM_SumInt64Scalar 68112 us 68111 us 10
> > bytes_per_second=10.9389G/s items_per_second=1.46819G/s
> > BM_SumInt64ScalarRegister 56188 us 56187 us 12
> > bytes_per_second=13.2603G/s items_per_second=1.77977G/s
> > BM_SumInt64ScalarUnrolled 65649 us 65648 us 11
> > bytes_per_second=11.3492G/s items_per_second=1.52327G/s
> > BM_SumInt64ScalarSentinel 69674 us 69673 us 10
> > bytes_per_second=10.6937G/s items_per_second=1.43528G/s
> > BM_SumInt64ScalarSentinelRegister 59200 us 59199 us 12
> > bytes_per_second=12.5857G/s items_per_second=1.68922G/s
> >
> > The interesting part here is that the ScalarSentinelRegister version is
> > very close to the fully vectorized ScalarRegister version on AVX512.
> > Antoine, the compiler is doing exactly what you're suggesting:
> >
> > 374 │ vmovdqu 0x20(%rcx,%rdi,8),%ymm9
> >  48 │ vmovdqu 0x40(%rcx,%rdi,8),%ymm10
> > 230 │ vmovdqu 0x60(%rcx,%rdi,8),%ymm11
> >     │ static inline bool NotNull(int64_t val) { return val != null_sentinel; }
> >     │ vpcmpneqq %ymm13,%ymm8,%k2
> >  25 │ vpcmpneqq %ymm13,%ymm9,%k3
> >  20 │ vpcmpneqq %ymm13,%ymm10,%k4
> >  72 │ vpcmpneqq %ymm13,%ymm11,%k1
> >     │ if (Traits<T>::NotNull(*values)) {
> >     │ vpbroadcastq %rbx,%ymm12{%k2}{z}
> >  15 │ vpaddq %ymm12,%ymm3,%ymm3
> >   9 │ vpbroadcastq %rbx,%ymm12{%k3}{z}
> >  57 │ vpaddq %ymm12,%ymm5,%ymm5
> >     │ vpbroadcastq %rbx,%ymm12{%k4}{z}
> >  39 │ vpaddq %ymm12,%ymm6,%ymm6
> >  22 │ vpbroadcastq %rbx,%ymm12{%k1}{z}
> >  88 │ vpaddq %ymm12,%ymm7,%ymm7
> >
> > See my fork, https://github.com/fsaintjacques/bitmaps-vs-sentinels . I
> > didn't have time to look at the other Sum (bitmap) implementation or to
> > clean up the code. I'll try to improve this over the weekend.
> >
> > On Tue, Oct 16, 2018 at 8:05 AM Wes McKinney <[email protected]>
> wrote:
> >
> > > hi folks,
> > >
> > > I explored a bit the performance implications of using validity
> > > bitmaps (like the Arrow columnar format) vs. sentinel values (like
> > > NaN, INT32_MIN) for nulls:
> > >
> > > http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
> > >
> > > The vectorization results may be of interest to those implementing
> > > analytic functions targeting the Arrow memory format. There are
> > > probably some other optimizations that can be employed, too.
> > >
> > > Caveat: it's entirely possible I made some mistakes in my code. I
> > > checked the various implementations for correctness only, and did not
> > > dig too deeply beyond that.
> > >
> > > - Wes
> > >
> >
> >
> > --
> > Sent from my jetpack.
>
--
Sent from my jetpack.