Hi Kai,
I think we are in agreement.  For cross-machine transfers, alignment
doesn't make a difference but width does.  We could optimize transfers
by only sending non-padded buffers and then having clients re-pad the
data (but before we optimize we should probably have a working
implementation :)
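
Purely as an illustration of the re-padding idea (the helper name and the
64-byte constant are my own sketch, not anything from the spec):

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper on the receiving side: copy a stripped (unpadded)
// buffer into a new allocation rounded up to a multiple of 64 bytes,
// zero-filling the padding bytes.
std::vector<uint8_t> RepadTo64(const uint8_t* data, size_t length) {
  const size_t padded = (length + 63) & ~static_cast<size_t>(63);
  std::vector<uint8_t> out(padded, 0);
  std::memcpy(out.data(), data, length);
  return out;
}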

The use-case I was thinking of for transferring alignment as part of
the metadata was the shared-memory IPC.  I just double-checked the
current IPC metadata
(https://github.com/apache/arrow/blob/master/format/Message.fbs), and
all the necessary information is already in there to check for
alignment and width.  So passing along the alignment/width
requirements shouldn't be necessary.  We should be able to check
offsets and lengths when reading the metadata.
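
Something along these lines is all I have in mind -- just a sketch; the
(offset, length) pairs stand in for the Buffer entries in Message.fbs and
the helper name is made up:

#include <cstdint>
#include <utility>
#include <vector>

// Verify every buffer's offset and length is a multiple of 64 bytes.
bool CheckAlignmentAndWidth(
    const std::vector<std::pair<int64_t, int64_t>>& buffers) {
  for (const auto& buf : buffers) {
    // buf.first is the offset, buf.second the length
    if (buf.first % 64 != 0 || buf.second % 64 != 0) {
      return false;
    }
  }
  return true;
}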

All of this might be getting into premature optimization territory,
though.  I will wait a bit longer for comments from others and then
update the spec.

Thanks,
Micah

On Sat, Apr 9, 2016 at 4:55 PM, Zheng, Kai <kai.zh...@intel.com> wrote:
> Hi Micah,
>
> Thanks for your thorough thoughts. The general approach makes sense to
> me: build Arrow with SIMD support in mind while not complicating the code
> too much, by using a fixed value of 64 bytes by default. We can always
> improve and optimize this later, once we have concrete, solid algorithms
> and workloads to benchmark with and can collect real performance data, as
> you said.
>
> One thing I'm not sure about is whether the alignment requirement should
> be included in the IPC metadata. In my understanding, no buffer address
> needs to be passed across machines, so it's up to the destination machine
> to decide how to allocate the buffer that receives the transferred data,
> with whatever alignment it chooses. An alignment that's good for the
> source machine may not be good for the destination.
>
> So in summary: it's good to mention this alignment consideration in the
> spec, stating that a fixed 64-byte alignment is used by default, and to
> hard-code that fixed value in the source code (for example, when
> allocating buffers for primitive arrays, chunk array buffers, and null
> bitmap buffers).
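>
> (A rough sketch of what I mean by hard-coding the value -- the constant
> and function name are made up, not actual Arrow code, and it assumes a
> POSIX platform:)
>
> #include <stdlib.h>
> #include <cstdint>
>
> // Allocate 'size' bytes whose base address is a multiple of the fixed
> // 64-byte alignment baked into the allocator.
> constexpr size_t kAlignment = 64;
>
> uint8_t* AllocateAligned(size_t size) {
>   void* ptr = nullptr;
>   if (posix_memalign(&ptr, kAlignment, size) != 0) {
>     return nullptr;  // allocation failed
>   }
>   return static_cast<uint8_t*>(ptr);
> }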
>
> Please clarify if I'm not understanding you correctly. Thanks.
>
> Regards,
> Kai
>
> -----Original Message-----
> From: Micah Kornfield [mailto:emkornfi...@gmail.com]
> Sent: Saturday, April 09, 2016 12:56 PM
> To: dev@arrow.apache.org
> Subject: Re: Some questions/proposals for the spec (Layout.md)
>
> Hi Kai,
> Are you proposing making alignment and width part of the RPC metadata?
> I think this is a good longer-term idea, but for simplicity's sake I
> think starting with one fixed value is the way to go.
>
> I agree that in the general case guaranteeing alignment is difficult when
> we have variable-width data (e.g. strings) or sliced data
> (https://issues.apache.org/jira/browse/ARROW-33).  However, I think a
> fairly common use-case for Arrow will be dealing with fixed-width,
> non-nested types (e.g. floats, doubles, int32_t) where alignment can be
> guaranteed.  In these cases, being able to use the optimal CPU
> instruction set is important.
>
> In this regard, one concern with 8 bytes as the default width is that it will 
> cause suboptimal use of current CPUs.  For instance, the Intel Optimization 
> Guide
> (http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf)
> states "An access to data unaligned on 64-byte boundary leads to two memory 
> accesses and requires several μops to be executed (instead of one)." and "A 
> 64-byte or greater data structure or array should be aligned so that its base 
> address is a multiple of 64."
>
> It would be interesting to know the exact performance difference for
> compiler-generated code that knows about different degrees of
> alignment/width, as well as the performance difference when using
> assembly/intrinsics.  In the absence of performance data, I think
> defaulting to 64-byte alignment (when the programming language allows for
> it), based on the recommendation from the guide, makes sense.  In
> addition, given the existence of 512-bit SIMD, using 64-byte padding for
> width also makes sense.
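>
> (To make the padding point concrete -- an illustrative sketch only,
> nothing from the spec: if a buffer's length is padded to a multiple of
> 64 bytes, i.e. 16 int32 values, a loop over it never needs a scalar
> tail:)
>
> #include <cstddef>
> #include <cstdint>
>
> // 'padded_bytes' is assumed to be a multiple of 64, so the whole buffer
> // is consumed in full 64-byte strides that a compiler can vectorize.
> void AddOne(int32_t* values, size_t padded_bytes) {
>   const size_t n = padded_bytes / sizeof(int32_t);
>   for (size_t i = 0; i < n; i += 16) {   // one 64-byte chunk per step
>     for (size_t j = 0; j < 16; ++j) {
>       values[i + j] += 1;
>     }
>   }
> }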
>
> Do you have concerns if we make 64 bytes the default instead of 8?
>
> Thanks,
> Micah
>
> On Fri, Apr 8, 2016 at 1:44 PM, Zheng, Kai <kai.zh...@intel.com> wrote:
>> I'm from Intel, though not one of the hardware folks; I'll just offer my
>> thoughts. Yes, the width and alignment requirements can be very different
>> depending on which version of SIMD is used. Also, it can sometimes be
>> hard to keep accesses to specific fields or parts aligned, even within an
>> aligned memory region. It's complex. I think it's good to mention this
>> consideration in the spec, but when it comes to the data structures or
>> format, it can be left to platform-specific optimizations: concrete
>> compute operators and algorithms can use alignment-aware buffer
>> allocators to account for the potential performance impact. A default
>> value of 8, as mentioned, may be used, but other values can also be
>> passed.
>>
>> Regards,
>> Kai
>>
>> -----Original Message-----
>> From: Wes McKinney [mailto:w...@cloudera.com]
>> Sent: Friday, April 08, 2016 11:40 PM
>> To: dev@arrow.apache.org
>> Subject: Re: Some questions/proposals for the spec (Layout.md)
>>
>> On the SIMD question, it seems AVX is going to 512 bits, so one could even 
>> argue for 64-byte alignment as a matter of future-proofing.  AVX2 / 256-bit 
>> seems fairly widely available nowadays, but it would be great if Todd or any 
>> of the hardware folks (e.g. from Intel) on the list could weigh in with 
>> guidance.
>>
>> https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
>>
>> On Fri, Apr 8, 2016 at 8:33 AM, Wes McKinney <w...@cloudera.com> wrote:
>>> On Fri, Apr 8, 2016 at 8:07 AM, Jacques Nadeau <jacq...@apache.org> wrote:
>>>>>
>>>>>
>>>>> > I believe this choice was primarily about simplifying the code
>>>>> > (similar to why we have n+1 offsets instead of just n in the
>>>>> > list/varchar representations, even though the first offset is
>>>>> > always 0). In both situations, you don't have to worry about
>>>>> > writing special code (and a condition) for the boundary case
>>>>> > inside tight loops (e.g. the last few bytes needing to be handled
>>>>> > differently since they aren't word width).
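>>>>> >
>>>>> > (To make that concrete -- made-up data, purely an illustration:
>>>>> > with values ["a", "bc", "def"] and offsets [0, 1, 3, 6], every
>>>>> > slot i is read the same way,
>>>>> >
>>>>> >   int32_t start = offsets[i];
>>>>> >   int32_t length = offsets[i + 1] - offsets[i];
>>>>> >
>>>>> > with no special case for the first or last element.)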
>>>>>
>>>>> Sounds reasonable.  It might be worth illustrating this with a
>>>>> concrete example.  One scenario where this scheme seems useful is
>>>>> creating a new bitmap by evaluating a predicate (i.e. all elements >X).
>>>>> In this case, would it make sense to make it a multiple of 16, so we
>>>>> can consistently use SIMD instructions for the logical "and" operation?
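>>>>>
>>>>> (Roughly the kind of loop I'm imagining -- a sketch with made-up
>>>>> names; it assumes both bitmaps are padded to the same
>>>>> multiple-of-16-byte length, so there is no tail case and the loop
>>>>> vectorizes cleanly:)
>>>>>
>>>>> #include <cstddef>
>>>>> #include <cstdint>
>>>>>
>>>>> // AND two validity bitmaps word by word into 'out'.
>>>>> void AndBitmaps(const uint64_t* left, const uint64_t* right,
>>>>>                 uint64_t* out, size_t padded_bytes) {
>>>>>   const size_t n = padded_bytes / sizeof(uint64_t);
>>>>>   for (size_t i = 0; i < n; ++i) {
>>>>>     out[i] = left[i] & right[i];
>>>>>   }
>>>>> }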
>>>>>
>>>>
>>>> Hmm... interesting thought. I'd have to look but I also recall some
>>>> of the newer stuff supporting even wider widths. What do others think?
>>>>
>>>>
>>>>> I think the spec is slightly inconsistent.  It says there are 6
>>>>> bytes of overhead per entry but then says: "with the smallest
>>>>> byte width capable of representing the number of types in the union."
>>>>> I'm perfectly happy to say it is always 1, always 2, or always
>>>>> capped at 2.  I agree that 32K/64K+ types is a very unlikely scenario.
>>>>> We just need to clear up the ambiguity.
>>>>>
>>>>
>>>> Agreed. Do you want to propose an approach & patch to clarify?
>>>
>>> I can also take responsibility for the ambiguity here. My preference
>>> is to use int16_t for the types array (memory suitably aligned), but
>>> as 1 byte will be sufficient nearly all of the time, it's a slight
>>> trade-off in memory use vs. code complexity, e.g.
>>>
>>> if (children_.size() < 128) {
>>>   // types is only 1 byte
>>> } else {
>>>   // types is 2 bytes
>>> }
>>>
>>> Realistically there won't be that many affected code paths, so I'm
>>> comfortable with either choice (2 bytes always, or 1 or 2 bytes
>>> depending on the size of the union).
