Being memory bound can be viewed as an opportunity for melding multiple
aggregators: since the loads dominate, any additional aggregation comes
nearly for free. Sum plus count (non-zero) will cost about the same as
either alone, and likewise sum plus sum of squares.
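A minimal sketch of what such a fused pass could look like (illustrative
only; the struct and function names here are made up, not taken from the
benchmark repo):

#include <cstddef>
#include <cstdint>

// Hypothetical fused aggregation state: one pass over the input feeds all
// three aggregates at once.
struct FusedSumState {
  int64_t sum = 0;
  int64_t count_nonzero = 0;
  int64_t sum_of_squares = 0;
};

// Since the plain sum is already limited by memory bandwidth, the extra
// ALU work per element mostly hides behind the loads it pays for anyway.
FusedSumState SumCountSquares(const int32_t* values, size_t length) {
  FusedSumState state;
  for (size_t i = 0; i < length; ++i) {
    const int64_t v = values[i];
    state.sum += v;
    state.count_nonzero += (v != 0);
    state.sum_of_squares += v * v;
  }
  return state;
}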
On Wed, Oct 17, 2018, 06:21 Francois Saint-Jacques
<fsaintjacq...@networkdump.com> wrote:

> Don't trust that the compiler will auto-vectorize; tiny changes can have
> drastic impacts.
>
> SumScalar (benchmark-native):
>       │ state->total += *values++;
>   42  │ add    (%rcx),%esi
>   68  │ mov    %esi,(%rsp)
>    3  │ add    0x4(%rcx),%esi
>   36  │ mov    %esi,(%rsp)
>       │ add    0x8(%rcx),%esi
>   86  │ mov    %esi,(%rsp)
>       │ add    0xc(%rcx),%esi
>   51  │ mov    %esi,(%rsp)
>   27  │ add    0x10(%rcx),%esi
>   05  │ mov    %esi,(%rsp)
>   45  │ add    0x14(%rcx),%esi
>   22  │ mov    %esi,(%rsp)
>    2  │ add    0x18(%rcx),%esi
>   68  │ mov    %esi,(%rsp)
>    1  │ add    0x1c(%rcx),%esi
>    1  │ add    $0x20,%rcx
>   54  │ mov    %esi,(%rsp)
>
> SumScalarRegister (benchmark-native):
>       │ sum += *values++;
>  228  │ vpaddd (%rcx,%rdi,4),%zmm0,%zmm0
>  164  │ vpaddd 0x40(%rcx,%rdi,4),%zmm1,%zmm1
>  163  │ vpaddd 0x80(%rcx,%rdi,4),%zmm2,%zmm2
>  438  │ vpaddd 0xc0(%rcx,%rdi,4),%zmm3,%zmm3
>  207  │ vpaddd 0x100(%rcx,%rdi,4),%zmm0,%zmm0
>  146  │ vpaddd 0x140(%rcx,%rdi,4),%zmm1,%zmm1
>  176  │ vpaddd 0x180(%rcx,%rdi,4),%zmm2,%zmm2
>  447  │ vpaddd 0x1c0(%rcx,%rdi,4),%zmm3,%zmm3
>  210  │ vpaddd 0x200(%rcx,%rdi,4),%zmm0,%zmm0
>  161  │ vpaddd 0x240(%rcx,%rdi,4),%zmm1,%zmm1
>  173  │ vpaddd 0x280(%rcx,%rdi,4),%zmm2,%zmm2
>  392  │ vpaddd 0x2c0(%rcx,%rdi,4),%zmm3,%zmm3
>  217  │ vpaddd 0x300(%rcx,%rdi,4),%zmm0,%zmm0
>  153  │ vpaddd 0x340(%rcx,%rdi,4),%zmm1,%zmm1
>  156  │ vpaddd 0x380(%rcx,%rdi,4),%zmm2,%zmm2
>  654  │ vpaddd 0x3c0(%rcx,%rdi,4),%zmm3,%zmm3
>
> perf strongly indicates that this is highly memory bound. I would
> expect/guesstimate the bitmap version to be slower when memory bandwidth
> is at full usage (multiple threads computing vectors?), which might not
> show in a single-thread benchmark.
>
> On Wed, Oct 17, 2018 at 9:11 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > I'm surprised that using a stack variable has an impact, but I should
> > probably update the benchmarks to do that (and merge with the
> > SumState<T> at the end of the function) for thoroughness. Thanks!
> >
> > On Wed, Oct 17, 2018 at 9:07 AM Francois Saint-Jacques
> > <fsaintjacq...@networkdump.com> wrote:
> > >
> > > It seems the code for the naive Scalar example is not friendly to the
> > > compiler's auto-vectorization pass. If you accumulate in a local state
> > > (instead of through the SumState pointer), you'll get different
> > > results, at least with clang++ 6.0.
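To spell out the two loop shapes being compared, here is a rough paraphrase;
only the quoted statements (state->total += *values++; and sum += *values++;)
are verbatim, and the SumState layout and signatures below may differ from
the benchmark repo:

#include <cstdint>

// Rough paraphrase; the real SumState<T> and function signatures may differ.
struct SumState {
  int32_t total = 0;
};

// "SumScalar" shape: accumulate through the state pointer. In the listing
// above, clang 6 writes the running total back to memory on every iteration
// (the add/mov pairs) and never vectorizes the loop.
void SumScalar(const int32_t* values, int64_t length, SumState* state) {
  for (int64_t i = 0; i < length; ++i) {
    state->total += *values++;
  }
}

// "SumScalarRegister" shape: accumulate in a local. The partial sum stays in
// registers (zmm0-zmm3 in the listing above) and the state is updated once
// at the end, which is what lets the loop auto-vectorize.
void SumScalarRegister(const int32_t* values, int64_t length, SumState* state) {
  int32_t sum = 0;
  for (int64_t i = 0; i < length; ++i) {
    sum += *values++;
  }
  state->total += sum;
}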
> > > benchmark-noavx (only SSE):
> > > BM_SumInt32Scalar                   42390 us  42389 us  16  bytes_per_second=8.78824G/s items_per_second=2.35908G/s
> > > BM_SumInt32ScalarRegister           28036 us  28035 us  25  bytes_per_second=13.2878G/s items_per_second=3.56691G/s
> > > BM_SumInt32ScalarUnrolled           31660 us  31659 us  22  bytes_per_second=11.7668G/s items_per_second=3.15863G/s
> > > BM_SumInt32ScalarSentinel           49541 us  49540 us  14  bytes_per_second=7.51971G/s items_per_second=2.01856G/s
> > > BM_SumInt32ScalarSentinelRegister   55655 us  55654 us  12  bytes_per_second=6.69369G/s items_per_second=1.79683G/s
> > > BM_SumInt64Scalar                   65030 us  65030 us  11  bytes_per_second=11.4572G/s items_per_second=1.53776G/s
> > > BM_SumInt64ScalarRegister           55370 us  55369 us  13  bytes_per_second=13.4563G/s items_per_second=1.80607G/s
> > > BM_SumInt64ScalarUnrolled           61287 us  61286 us  11  bytes_per_second=12.1571G/s items_per_second=1.6317G/s
> > > BM_SumInt64ScalarSentinel           69792 us  69792 us  10  bytes_per_second=10.6755G/s items_per_second=1.43284G/s
> > > BM_SumInt64ScalarSentinelRegister   68418 us  68416 us  10  bytes_per_second=10.89G/s items_per_second=1.46164G/s
> > >
> > > benchmark-native (AVX512):
> > > BM_SumInt32Scalar                   43981 us  43980 us  16  bytes_per_second=8.47046G/s items_per_second=2.27377G/s
> > > BM_SumInt32ScalarRegister           28381 us  28380 us  24  bytes_per_second=13.1263G/s items_per_second=3.52355G/s
> > > BM_SumInt32ScalarUnrolled           79035 us  79033 us   9  bytes_per_second=4.71361G/s items_per_second=1.2653G/s
> > > BM_SumInt32ScalarSentinel           53430 us  53429 us  13  bytes_per_second=6.97235G/s items_per_second=1.87163G/s
> > > BM_SumInt32ScalarSentinelRegister   32450 us  32450 us  22  bytes_per_second=11.4802G/s items_per_second=3.0817G/s
> > > BM_SumInt64Scalar                   68112 us  68111 us  10  bytes_per_second=10.9389G/s items_per_second=1.46819G/s
> > > BM_SumInt64ScalarRegister           56188 us  56187 us  12  bytes_per_second=13.2603G/s items_per_second=1.77977G/s
> > > BM_SumInt64ScalarUnrolled           65649 us  65648 us  11  bytes_per_second=11.3492G/s items_per_second=1.52327G/s
> > > BM_SumInt64ScalarSentinel           69674 us  69673 us  10  bytes_per_second=10.6937G/s items_per_second=1.43528G/s
> > > BM_SumInt64ScalarSentinelRegister   59200 us  59199 us  12  bytes_per_second=12.5857G/s items_per_second=1.68922G/s
> > >
> > > The interesting part here is that the ScalarSentinelRegister version
> > > is very close to the fully vectorized ScalarRegister version on
> > > AVX512. Antoine, the compiler is doing exactly what you're suggesting:
> > >
> > >  374 │ vmovdqu 0x20(%rcx,%rdi,8),%ymm9
> > >   48 │ vmovdqu 0x40(%rcx,%rdi,8),%ymm10
> > >  230 │ vmovdqu 0x60(%rcx,%rdi,8),%ymm11
> > >      │ static inline bool NotNull(int64_t val) { return val != null_sentinel; }
> > >      │ vpcmpneqq %ymm13,%ymm8,%k2
> > >   25 │ vpcmpneqq %ymm13,%ymm9,%k3
> > >   20 │ vpcmpneqq %ymm13,%ymm10,%k4
> > >   72 │ vpcmpneqq %ymm13,%ymm11,%k1
> > >      │ if (Traits<T>::NotNull(*values)) {
> > >      │ vpbroadcastq %rbx,%ymm12{%k2}{z}
> > >   15 │ vpaddq %ymm12,%ymm3,%ymm3
> > >    9 │ vpbroadcastq %rbx,%ymm12{%k3}{z}
> > >   57 │ vpaddq %ymm12,%ymm5,%ymm5
> > >      │ vpbroadcastq %rbx,%ymm12{%k4}{z}
> > >   39 │ vpaddq %ymm12,%ymm6,%ymm6
> > >   22 │ vpbroadcastq %rbx,%ymm12{%k1}{z}
> > >   88 │ vpaddq %ymm12,%ymm7,%ymm7
> > >
> > > See my fork: https://github.com/fsaintjacques/bitmaps-vs-sentinels
> > > I didn't have time to look at the other Sum (bitmap) implementation
> > > and didn't have time to clean up the code. I'll try to improve this
> > > over the weekend.
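The scalar source behind that masked sequence is roughly the following; the
NotNull line is the one quoted in the listing, while the Traits scaffolding
and the loop here are my reconstruction and may differ from the repo:

#include <cstdint>
#include <limits>

// Reconstruction of the sentinel-null scheme (e.g. INT32_MIN/INT64_MIN as
// the "null" value); the actual Traits<T> in the repo may differ.
template <typename T>
struct Traits {
  static constexpr T null_sentinel = std::numeric_limits<T>::min();
  static inline bool NotNull(T val) { return val != null_sentinel; }
};

// With the accumulator in a local, clang 6 targeting AVX-512 replaces the
// NotNull branch with vector compares against the broadcast sentinel
// (vpcmpneqq producing k masks) plus zero-masked operations feeding vpaddq,
// so the loop body becomes branch-free, as in the listing quoted above.
template <typename T>
T SumSentinelRegister(const T* values, int64_t length) {
  T sum = 0;
  for (int64_t i = 0; i < length; ++i) {
    if (Traits<T>::NotNull(values[i])) {
      sum += values[i];
    }
  }
  return sum;
}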
> > > On Tue, Oct 16, 2018 at 8:05 AM Wes McKinney <wesmck...@gmail.com>
> > > wrote:
> > >
> > > > hi folks,
> > > >
> > > > I explored a bit the performance implications of using validity
> > > > bitmaps (like the Arrow columnar format) vs. sentinel values (like
> > > > NaN, INT32_MIN) for nulls:
> > > >
> > > > http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
> > > >
> > > > The vectorization results may be of interest to those implementing
> > > > analytic functions targeting the Arrow memory format. There are
> > > > probably some other optimizations that could be employed, too.
> > > >
> > > > Caveat: it's entirely possible I made some mistakes in my code. I
> > > > checked the various implementations for correctness only, and did
> > > > not dig too deeply beyond that.
> > > >
> > > > - Wes
> > >
> > > --
> > > Sent from my jetpack.
>
> --
> Sent from my jetpack.