Hi Kai,

I think we are in agreement. For cross-machine transfers, alignment doesn't make a difference but width does. We could optimize transfers by sending only non-padded buffers and having clients re-pad the data (but before we optimize, we should probably have a working implementation :)
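To make the re-padding idea concrete, here is a minimal sketch, assuming a fixed 64-byte padding width. `kPaddingBytes`, `PaddedLength` and `Repad` are hypothetical names for illustration, not existing Arrow APIs, and the sketch deals only with width: the receiver is still free to pick whatever allocation alignment suits it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumed padding width in bytes (the 64-byte default discussed in this thread).
constexpr std::size_t kPaddingBytes = 64;

// Round a byte count up to the next multiple of kPaddingBytes.
constexpr std::size_t PaddedLength(std::size_t nbytes) {
  return (nbytes + kPaddingBytes - 1) / kPaddingBytes * kPaddingBytes;
}

// Hypothetical receiver-side helper: copy an unpadded payload into a
// zero-filled buffer whose size is a multiple of kPaddingBytes.
// Note: std::vector does not guarantee 64-byte alignment of its storage;
// a real implementation would use an aligned allocator here.
std::vector<std::uint8_t> Repad(const std::uint8_t* data, std::size_t nbytes) {
  assert(data != nullptr || nbytes == 0);
  std::vector<std::uint8_t> out(PaddedLength(nbytes), 0);
  if (nbytes > 0) {
    std::memcpy(out.data(), data, nbytes);
  }
  return out;
}
```

With this shape the sender only ships `nbytes` bytes, and the padding bytes are recreated locally as zeros.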

The use-case I was thinking of for transferring alignment as part of the metadata was for the shared-memory IPC. I just double checked the current IPC metadata (https://github.com/apache/arrow/blob/master/format/Message.fbs), and all the necessary information is already in there to check for alignment and width. So passing along the alignment/width requirements shouldn't be necessary. We should be able to check offset and lengths when reading the metadata. All of this might be getting into a little bit of premature optimization though. I will wait a little bit longer for comments from others and then update the spec.

Thanks,
Micah

On Sat, Apr 9, 2016 at 4:55 PM, Zheng, Kai <kai.zh...@intel.com> wrote:
> Hi Micah,
>
> Thanks for your thorough thoughts. The general consideration makes sense to
> me building Arrow with SIMD support in mind meanwhile not complicating codes
> too much, to use a fixed value 64 byte by default. We can always improve and
> optimize this accordingly when have concrete solid algorithms and workloads
> to benchmark with and collect real performance data, as you said.
>
> One thing I'm not sure about is, whether alignment requirement should be
> included in IPC metadata, because in my understanding, no buffer address is
> needed to be passed across machines, so it's up to destination machine to
> decide how to reallocate the data buffer for receiving the transferred data
> with whatever alignment address. An alignment address that's good to the
> source machine but may be not good to the destination.
>
> So in summary, it's good to mention this alignment consideration in the spec,
> also saying the fixed 64 byte alignment address is used by default; and
> hard-code the fixed value in source codes (for example, when allocating
> buffers for primitive arrays, chunk array buffers, and null bitmap buffers).
>
> Please help clarify if I'm not getting you right. Thanks.
>
> Regards,
> Kai
>
> -----Original Message-----
> From: Micah Kornfield [mailto:emkornfi...@gmail.com]
> Sent: Saturday, April 09, 2016 12:56 PM
> To: dev@arrow.apache.org
> Subject: Re: Some questions/proposals for the spec (Layout.md)
>
> Hi Kai,
> Are you proposing making alignment and width part of the RPC metadata?
> I think this is a good longer term idea, but for simplicity's sake, I think
> starting with one fixed value is a good idea.
>
> I agree that in the general case guaranteeing alignment is difficult when we
> have variable width data (e.g. strings) or sliced data
> (https://issues.apache.org/jira/browse/ARROW-33). However, I think a fairly
> common use-case for Arrow will be dealing with fixed width non-nested types
> (e.g. float, doubles, int32_t) where alignment can be guaranteed. In these
> cases being able to make use of the optimal CPU instruction set is important.
>
> In this regard, one concern with 8 bytes as the default width is that it will
> cause suboptimal use of current CPUs. For instance, the Intel Optimization
> Guide
> (http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf)
> states "An access to data unaligned on 64-byte boundary leads to two memory
> accesses and requires several μops to be executed (instead of one)." and "A
> 64-byte or greater data structure or array should be aligned so that its base
> address is a multiple of 64."
>
> It would be interesting to know the exact performance difference for compiler
> generated code knowing about different degrees of alignment/width as well as
> the performance difference using assembly/intrinsics. In the absence of the
> performance data, I think defaulting to 64 byte alignment (when the
> programming language allows for it) based on recommendation from the guide
> makes sense. In addition given the existence of 512-bit SIMD, using 64 byte
> padding for width also makes sense.
>
> Do you have concerns if we make 64 bytes the default instead of 8?
>
> Thanks,
> Micah
>
> On Fri, Apr 8, 2016 at 1:44 PM, Zheng, Kai <kai.zh...@intel.com> wrote:
>> I'm from Intel but not any hardware folks, just would provide my thoughts.
>> Yes the width and alignment requirement can be very different according to
>> what version of SIMD is used. And also, sometimes it's hard to keep the
>> alignment to access specific fields or parts in the even aligned memory
>> region. It's complex, I thought it's good to mention this aspect of
>> consideration in the spec but come to the data structures or format, it can
>> leave to platform specific optimizations regarding to concrete computing
>> operators and algorithms to use alignment awareness buffer allocators
>> considering this potential performance impact. A default value of 8 as
>> mentioned may be used but other values can also be passed.
>>
>> Regards,
>> Kai
>>
>> -----Original Message-----
>> From: Wes McKinney [mailto:w...@cloudera.com]
>> Sent: Friday, April 08, 2016 11:40 PM
>> To: dev@arrow.apache.org
>> Subject: Re: Some questions/proposals for the spec (Layout.md)
>>
>> On the SIMD question, it seems AVX is going to 512 bits, so one could even
>> argue for 64-byte alignment as a matter of future-proofing. AVX2 / 256-bit
>> seems fairly widely available nowadays, but it would be great if Todd or any
>> of the hardware folks (e.g. from Intel) on the list could weigh in with
>> guidance.
>>
>> https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
>>
>> On Fri, Apr 8, 2016 at 8:33 AM, Wes McKinney <w...@cloudera.com> wrote:
>>> On Fri, Apr 8, 2016 at 8:07 AM, Jacques Nadeau <jacq...@apache.org> wrote:
>>>>>
>>>>>
>>>>> > I believe this choice was primarily about simplifying the code
>>>>> > (similar to why we have a n+1 offsets instead of just n in the
>>>>> > list/varchar representations (even though n=0 is always 0)).
>>>>> > In both situations, you don't have to worry about writing special code
>>>>> > (and a condition) for the boundary condition inside tight loops (e.g.
>>>>> > the last few bytes need to be handled differently since they aren't
>>>>> > word width).
>>>>>
>>>>> Sounds reasonable. It might be worth illustrating this with a
>>>>> concrete example. One scenario that this scheme seems useful for
>>>>> is a creating a new bitmap based on evaluating a predicate (i.e.
>>>>> all elements >X). In this case would it make sense to make it a
>>>>> multiple of 16, so we can consistently use SIMD instructions for
>>>>> the logical "and" operation?
>>>>>
>>>>
>>>> Hmm... interesting thought. I'd have to look but I also recall some
>>>> of the newer stuff supporting even wider widths. What do others think?
>>>>
>>>>
>>>>> I think the spec is slightly inconsistent. It says there is 6
>>>>> bytes of overhead per entry but then follows: "with the smallest
>>>>> byte width capable of representing the number of types in the union."
>>>>> I'm perfectly happy to say it is always 1, always 2, or always
>>>>> capped at 2. I agree 32K/64K+ types is a very unlikely scenario.
>>>>> We just need to clear up the ambiguity.
>>>>>
>>>>
>>>> Agreed. Do you want to propose an approach & patch to clarify?
>>>
>>> I can also take responsibility for the ambiguity here. My preference
>>> is to use int16_t for the types array (memory suitably aligned), but
>>> as 1 byte will be sufficient nearly all of the time, it's a slight
>>> trade-off in memory use vs. code complexity, e.g.
>>>
>>> if (children_.size() < 128) {
>>>   // types is only 1 byte
>>> } else {
>>>   // types is 2 bytes
>>> }
>>>
>>> Realistically there won't be that many affected code paths, so I'm
>>> comfortable with either choice (2-bytes always, or 1 or 2 bytes
>>> depending on the size of the union).
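
For reference, the "1 or 2 bytes depending on the size of the union" option quoted above could look roughly like the sketch below. This is only an illustration of the trade-off, not what the spec or the C++ implementation does: `UnionTypeIdWidth` and `ReadTypeId` are hypothetical names, and the 128 threshold is taken from the quoted snippet.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: byte width of one entry in the union's types array.
// Fewer than 128 children fit in a signed 1-byte id; otherwise use 2 bytes.
constexpr std::size_t UnionTypeIdWidth(std::size_t num_children) {
  return num_children < 128 ? sizeof(std::int8_t) : sizeof(std::int16_t);
}

// Hypothetical reader: every access to the types array has to branch on
// the width, which is the code-complexity cost of the variable-width option.
inline std::int16_t ReadTypeId(const std::uint8_t* types, std::size_t width,
                               std::size_t i) {
  assert(width == 1 || width == 2);
  if (width == 1) {
    return static_cast<std::int8_t>(types[i]);
  }
  std::int16_t id;
  // memcpy avoids an unaligned or type-punned int16_t load.
  std::memcpy(&id, types + i * sizeof(id), sizeof(id));
  return id;
}
```

Always using int16_t would remove the branch in `ReadTypeId` at the cost of one extra byte per entry.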