On Mon, 6 Sep 2021 18:09:31 +0100
Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
> Hi,
>
> We have a whole section related to byte alignment (
> https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding)
> recommending 64 byte alignment and referring to intel's manual.
>
> Do we have evidence that this alignment helps (besides intel claims)?

I don't know if there is strong evidence for it. Modern CPUs are much
better than they used to be at accesses that are not aligned to a cache
line. That doesn't necessarily mean such accesses are always free,
however. The cost will certainly vary with the CPU model, and also with
the workload (a compute-bound workload will of course suffer much less
from any hypothetical alignment issue).

Basically, depending on the CPU, an unaligned access *may* require more
resources than an aligned access. For example, an unaligned 64-byte
AVX512 access always straddles two 64-byte cache lines, and therefore
issues two cache reads instead of one. But perhaps your CPU is capable
of two cache reads per clock anyway? In that case, the problem would
only show up if you try to issue two AVX512 reads at once, which would
require four cache reads in the unaligned case (say, you're adding two
vectors instead of reduce-summing a single one).

Generally, it should not hurt to align allocations to 64 bytes anyway,
since you are usually dealing with data large enough that the (small)
memory overhead doesn't matter.

> Specifically, I performed two types of tests, a "random sum" where we
> compute the sum of the values taken at random indices, and "sum", where we
> sum all values of the array (buffer[1] of the primitive array), both for
> array ranging from 2^10 to 2^25 elements. I was expecting that, at least in
> the latter, prefetching would help, but I do not observe any difference.

By prefetching, do you mean explicit prefetching using intrinsics?
Modern CPUs are very good at implicit prefetching: they detect memory
access patterns and optimize for them. Explicit prefetching would only
possibly help if your access pattern is complicated (for example, you're
walking a chain of pointers). If your access is sequential, there is
zero reason to prefetch explicitly nowadays, AFAIK.

Regards

Antoine.
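
P.S. To make the cache-line arithmetic in this reply concrete, here is a
minimal Python sketch. It is only an illustration, not Arrow code: the
helper names `cache_lines_touched` and `padded_size` are made up for this
example, and the 64-byte line size is an assumption that holds on common
x86-64 CPUs but not everywhere.

```python
CACHE_LINE = 64  # bytes; assumed line size, typical on current x86-64


def cache_lines_touched(address: int, width: int, line: int = CACHE_LINE) -> int:
    """Count the cache lines covered by a `width`-byte access at `address`."""
    first = address // line               # line holding the first byte
    last = (address + width - 1) // line  # line holding the last byte
    return last - first + 1


def padded_size(nbytes: int, alignment: int = CACHE_LINE) -> int:
    """Round `nbytes` up to the next multiple of `alignment`."""
    return (nbytes + alignment - 1) // alignment * alignment


# A 64-byte (AVX512-sized) load: one line if aligned, two otherwise.
aligned = cache_lines_touched(0, 64)
unaligned = cache_lines_touched(8, 64)

# Two unaligned 64-byte loads (e.g. adding two vectors) cover four lines.
two_unaligned = cache_lines_touched(8, 64) + cache_lines_touched(4104, 64)

# Padding overhead for a 1000-byte buffer rounded up to 64 bytes.
overhead = padded_size(1000) - 1000
```

Narrower loads only pay this cost when they happen to cross a line
boundary (a 32-byte load at offset 8 stays within one line, at offset 48
it does not), and the padding overhead is at most 63 bytes per buffer,
which is why it is negligible for large arrays.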