On Mon, 6 Sep 2021 18:09:31 +0100
Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
> Hi,
> 
> We have a whole section related to byte alignment (
> https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding)
> recommending 64-byte alignment and referring to Intel's manual.
> 
> Do we have evidence that this alignment helps (besides Intel's claims)?

I don't know if there is strong evidence for it. Modern CPUs are much
better at cache-unaligned accesses than they used to be. It doesn't
necessarily mean that such accesses are always free, however. It will
certainly vary depending on the CPU model, but also depending on the
workload (a compute-bound workload will of course suffer much less from
any hypothetical alignment issue).

Basically, depending on the CPU, an unaligned access *may* require
more resources than an aligned access. For example, an unaligned AVX512
access would always straddle two 64-byte cache lines, and therefore
issue two cache reads instead of one. But perhaps your CPU is capable
of two cache reads per clock anyway? In this case, the problem would
only show if you try to issue two AVX512 reads at once, which would
require four cache reads in the unaligned case (say, you're adding two
vectors instead of reduce-summing a single one).
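
To make the arithmetic concrete, here is a minimal sketch (not a
benchmark) that counts how many cache lines a single access covers.
It assumes 64-byte cache lines, and the helper name
cache_lines_touched is made up for illustration:

```python
CACHE_LINE = 64  # bytes; assumed cache line size


def cache_lines_touched(address, size, line=CACHE_LINE):
    """Return how many cache lines the bytes [address, address + size) cover."""
    first = address // line               # index of the line holding the first byte
    last = (address + size - 1) // line   # index of the line holding the last byte
    return last - first + 1


# A 64-byte (AVX512-sized) load at an aligned and at an unaligned address.
aligned_load = cache_lines_touched(0, 64)
unaligned_load = cache_lines_touched(8, 64)
print(aligned_load, unaligned_load)
```

Any offset from 1 to 63 gives two lines for a 64-byte access, which is
the "always straddles" case described above. A 32-byte access, by
contrast, only straddles when its offset within the line exceeds 32.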

In any case, it should not hurt to align allocations to 64 bytes,
since you are usually dealing with data large enough that the
(small) memory overhead doesn't matter.
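
As a rough sketch of how small that overhead is, the following rounds
a buffer size up to a multiple of 64 bytes and bounds the extra bytes
needed to reach an aligned start address. The function names are
hypothetical and the 4-byte element width is only an assumption for
the example:

```python
ALIGNMENT = 64  # bytes; the alignment under discussion


def padded_size(nbytes, alignment=ALIGNMENT):
    """Round nbytes up to the next multiple of alignment."""
    return (nbytes + alignment - 1) // alignment * alignment


def worst_case_overhead(nbytes, alignment=ALIGNMENT):
    """Upper bound on the relative cost of aligning the start address.

    At most alignment - 1 bytes are skipped before the first aligned byte.
    """
    return (alignment - 1) / nbytes


# Sizes from the quoted test range, assuming 4-byte elements.
for n_elements in (2**10, 2**25):
    nbytes = 4 * n_elements
    print(n_elements, padded_size(nbytes), worst_case_overhead(nbytes))
```

Even for the smallest array in that range the bound stays below 2%,
and it shrinks proportionally as the buffer grows.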

> Specifically, I performed two types of tests, a "random sum" where we
> compute the sum of the values taken at random indices, and "sum", where we
> sum all values of the array (buffer[1] of the primitive array), both for
> array ranging from 2^10 to 2^25 elements. I was expecting that, at least in
> the latter, prefetching would help, but I do not observe any difference.

By prefetching, do you mean explicit prefetching using intrinsics?
Modern CPUs are very good at implicit prefetching: they are able to
detect memory access patterns and optimize for them. Explicit
prefetching would only possibly help if your access pattern is
complicated (for example, if you're walking a chain of pointers). If
your access is sequential, there is zero reason to prefetch explicitly
nowadays, AFAIK.

Regards

Antoine.

