To add to Antoine's points: besides data alignment being beneficial for reducing cache line reads/writes and using the cache more effectively overall, another key point is the use of vector (SIMD) registers. Although recent CPUs can load unaligned data into vector registers at speeds similar to aligned data, it is still recommended to keep your data aligned so that reads/writes to vector registers go through aligned instructions. There are also cases where alignment is required by a specific library or API, so you are forced to abide by its alignment rules. In general, data that aligns well with memory and CPU hardware is handled more efficiently than data that does not. That is why C structs are padded, why some memory allocators allocate in multiples of the cache line/page size, etc. I am glad that Arrow was designed with memory alignment in mind, because this will make adding more vectorization functionality easier.
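As a concrete illustration of the kind of alignment being discussed (my own sketch, not Arrow code): in Rust you can request a 64-byte (cache-line) aligned buffer directly from the allocator via `std::alloc::Layout`, which is roughly what a custom aligned container has to do under the hood:

```rust
use std::alloc::{alloc, dealloc, Layout};

fn main() {
    // Request 1024 bytes aligned to a 64-byte (cache line) boundary.
    let layout = Layout::from_size_align(1024, 64).expect("invalid layout");
    let ptr = unsafe { alloc(layout) };
    assert!(!ptr.is_null(), "allocation failed");

    // The returned address is guaranteed to be a multiple of 64.
    assert_eq!(ptr as usize % 64, 0);

    unsafe { dealloc(ptr, layout) };
}
```

The catch, as Jorge notes below, is that a buffer allocated this way must also be *deallocated* with the same `Layout`, which is exactly why it cannot be handed over to a plain `std::Vec`.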
~Eduardo

On Mon, Sep 6, 2021 at 5:21 PM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Thanks a lot Antoine for the pointers. Much appreciated!
>
> > Generally, it should not hurt to align allocations to 64 bytes anyway,
> > since you are generally dealing with large enough data that the
> > (small) memory overhead doesn't matter.
>
> Not for performance. However, 64-byte alignment in Rust requires
> maintaining a custom container and a custom allocator, and it means
> losing interoperability with `std::Vec` and the ecosystem built on it,
> since `std::Vec` allocates with the alignment of T (e.g. int32), not
> 64 bytes. For anyone interested, the background for this is this old
> PR [1] and this one in arrow2 [2].
>
> Neither I in micro benchmarks nor Ritchie from polars (query engine) in
> large-scale benchmarks observe any difference on the architectures we
> have available. This is not consistent with the emphasis we put on the
> memory alignment discussion [3], and I am trying to understand the root
> cause of this inconsistency.
>
> By prefetching I mean implicit; no intrinsics involved.
>
> Best,
> Jorge
>
> [1] https://github.com/apache/arrow/pull/8796
> [2] https://github.com/jorgecarleitao/arrow2/pull/385
> [3]
> https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding
>
> On Mon, Sep 6, 2021 at 6:51 PM Antoine Pitrou <anto...@python.org> wrote:
>
> > On 06/09/2021 19:45, Antoine Pitrou wrote:
> >
> > >> Specifically, I performed two types of tests, a "random sum" where we
> > >> compute the sum of the values taken at random indices, and "sum",
> > >> where we sum all values of the array (buffer[1] of the primitive
> > >> array), both for arrays ranging from 2^10 to 2^25 elements. I was
> > >> expecting that, at least in the latter, prefetching would help, but I
> > >> do not observe any difference.
> >
> > By prefetching, you mean explicit prefetching using intrinsics?
> > > Modern CPUs are very good at implicit prefetching; they are able to
> > > detect memory access patterns and optimize for them. Implicit
> > > prefetching would only possibly help if your access pattern is
> > > complicated (for example, you're walking a chain of pointers).
> >
> > Oops: *explicit* prefetching would only possibly help.... sorry.
> >
> > Regards
> >
> > Antoine.
> >
> > > If your
> > > access is sequential, there is zero reason to prefetch explicitly
> > > nowadays, AFAIK.
> > >
> > > Regards
> > >
> > > Antoine.
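To make Jorge's `std::Vec` point above concrete, here is a small sketch (my own, not from the thread): a `Vec<i32>` only guarantees the alignment of its element type, so 64-byte alignment can hold by accident on one run and fail on the next, which is why enforcing it requires a custom container/allocator:

```rust
fn main() {
    // std::Vec<i32> allocates with the alignment of i32 (4 bytes),
    // not with any cache-line alignment.
    assert_eq!(std::mem::align_of::<i32>(), 4);

    let v: Vec<i32> = (0..1024).collect();
    let addr = v.as_ptr() as usize;

    // Guaranteed: the buffer is at least 4-byte aligned.
    assert_eq!(addr % 4, 0);

    // NOT guaranteed: 64-byte alignment. It may or may not hold on a
    // given allocator/run, so no expected value is asserted here.
    println!("64-byte aligned by chance? {}", addr % 64 == 0);
}
```

This is also why interoperating with the `Vec`-based ecosystem and guaranteeing 64-byte alignment are in tension, as discussed in the PRs linked above.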