Hi Jacques,
The paragraph you wrote doesn't quite address my concern about system
compatibility.  Would the following paragraph be acceptable?  If it
isn't, could you expand on your concerns with it?

"Existing Arrow implementations are focused on exposing and operating
between systems running on hardware with identical byte ordering. The
IPC/RPC metadata expresses this orientation as a property so that
incompatibility between different systems can be detected. In the
future, systems may expect to receive data in a byte order that is not
their native ordering.  Our continuous integration tests only run on
little-endian systems."
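
To make the intended behavior concrete, here is roughly what I'd expect
an IPC/RPC reader to do when it receives data whose byte order doesn't
match the host (just a sketch in Java; the flag and the exception type
are placeholders, not anything in the current spec or code):

    import java.nio.ByteOrder;

    final class ByteOrderCheck {
      // Hypothetical check a reader could perform; the boolean stands in
      // for whatever endianness property the metadata ends up carrying.
      static void checkStreamByteOrder(boolean streamIsLittleEndian) {
        boolean hostIsLittleEndian =
            ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN;
        if (streamIsLittleEndian != hostIsLittleEndian) {
          throw new UnsupportedOperationException(
              "Received data in a non-native byte order; byte swapping "
                  + "is not implemented yet");
        }
      }
    }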

The biggest concern I have with leaving endianness open is being able
to run continuous integration tests on it (although it looks like it
might be possible to emulate big-endian hardware in Travis CI).  If
anyone from the Spark community is on this ML, could you chime in on
how you do it (and whether big-endian systems are something you really
do support as a first-class citizen)?
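
In the meantime, one low-tech option (only a sketch, not something that
exists in the code base) would be to guard endianness-sensitive tests
with a JUnit assumption, so they are skipped rather than failed on
big-endian hosts until we have real CI coverage:

    import static org.junit.Assume.assumeTrue;

    import java.nio.ByteOrder;
    import org.junit.Test;

    public class LittleEndianOnlyTest {
      @Test
      public void layoutAssumesLittleEndian() {
        // Skip (rather than fail) when the host isn't little-endian.
        assumeTrue(ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN);
        // ... exercise code paths that currently assume little-endian
        // layout ...
      }
    }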

As far as I know the C++ code base doesn't place any requirements on
endianness at the moment.  Looking at the Java code base, the one
place that stood out as problematic is [1].  A quick look at [1]
doesn't show anything obvious that breaks on big-endian systems,
except for the assertion that the code has to be running on a
little-endian system [2] (a rough sketch of both checks follows the
links below).  Did you run into problems on big-endian systems with
the Drill code base, or were you just being overly conservative?  Are
there other places in the code that you know of that rely on
endianness?


[1] 
https://github.com/apache/arrow/blob/master/java/memory/src/main/java/io/netty/buffer/UnsafeDirectLittleEndian.java
[2] Curiously, it looks like we also assert that the wrapped buffer is
big-endian?
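
For reference, here is roughly what those two checks amount to
(paraphrased as a standalone sketch, not the actual Netty/Arrow
source):

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    final class LittleEndianWrapperSketch {
      static {
        // The assertion referenced in [2]: refuse to run at all on a
        // big-endian host.
        if (ByteOrder.nativeOrder() != ByteOrder.LITTLE_ENDIAN) {
          throw new IllegalStateException(
              "This wrapper assumes a little-endian platform");
        }
      }

      private final ByteBuffer wrapped;

      LittleEndianWrapperSketch(ByteBuffer wrapped) {
        // The "curious" part: the wrapped buffer is expected to report
        // big-endian (Java's default order) even though values are read
        // and written little-endian.
        assert wrapped.order() == ByteOrder.BIG_ENDIAN;
        this.wrapped = wrapped;
      }
    }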

On Sat, Apr 23, 2016 at 7:24 PM, Jacques Nadeau <jacq...@apache.org> wrote:
> I'm okay with a flag but I think we should be clear about where we think
> most of the work will be (until such time as someone actually does work in
> big-endian).  Such as:
>
> "Existing Arrow implementations are focused on exposing and operating on
> little-endian data and expect that format. The IPC/RPC metadata expresses
> this orientation as a property for future expansion. In the future, systems
> may generate or expect big-endian data and will need to set the endian
> orientation as such."
>
>
>
>
>
> On Sat, Apr 23, 2016 at 4:01 PM, Zheng, Kai <kai.zh...@intel.com> wrote:
>
>> > My assumption is that most deployments for the systems we are
>> > targeting  are going to be homogenous in terms of byte ordering.  I
>> > think this can allow initial implementations to ignore support for
>> > non-native byte ordering (i.e. raise an exception if detected).
>> > Has this been other's experience?
>>
>> The assumption sounds good in the big data domain where servers are very
>> likely to be homogenous in most cases (as far as I learned), though clients
>> may be a little complex. I guess the assumption will boost Arrow much
>> easier achieving much better performance.
>>
>> > I don't see a problem adding endianness as a flag in the IPC metadata,
>> and raise exceptions if big-endian data is ever encountered for the time
>> being.
>>
>> Yeah an endianness flag would be needed in IPC to let the other side to
>> know the endianness in the wire packets since there is a potential need to
>> tweak in some cases.
>>
>> Regards,
>> Kai
>>
>> -----Original Message-----
>> From: Wes McKinney [mailto:w...@cloudera.com]
>> Sent: Saturday, April 23, 2016 11:07 PM
>> To: dev@arrow.apache.org; Micah Kornfield <emkornfi...@gmail.com>
>> Subject: Re: Byte ordering/Endianness revisited
>>
>> I don't see a problem adding endianness as a flag in the IPC metadata, and
>> raise exceptions if big-endian data is ever encountered for the time being.
>> Since big-endian hardware is so exotic nowadays, I don't think it's
>> unreasonable to expect IBM or other hardware vendors requiring big-endian
>> support to contribute the byte-swapping logic when the time comes. I
>> suppose this just means we'll have to be careful in code reviews should any
>> algorithms get written that assume a particular endianness. Will defer to
>> others' judgment on this ultimately, though.
>>
>> On Fri, Apr 22, 2016 at 11:59 PM, Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>> > This was discussed on a previous thread
>> > (https://mail-archives.apache.org/mod_mbox/arrow-dev/201604.mbox/%3CCA
>> > Ka9qDkppFrJQCHsSN7CmkJCzOTAhGPERMd_u2CMZANNQGtNyw%40mail.gmail.com%3E
>> > the relevant snippet is pasted below).  But I'd like to reopen this
>> > because it appears Spark supports big endian systems (high end IBM
>> > hardware).    Right now the spec says:
>> >
>> > "The Arrow format is little endian."
>> >
>> > I'd like to change this to something like:
>> >
>> > "Algorithms written against Arrow Arrays should assume native
>> > byte-ordering. Endianness is communicated via IPC/RPC metadata and
>> > conversion to native byte-ordering is handled via IPC/RPC
>> > implementations".
>> >
>> > What do other people think?
>> >
>> > My assumption is that most deployments for the systems we are
>> > targeting  are going to be homogenous in terms of byte ordering.  I
>> > think this can allow initial implementations to ignore support for
>> > non-native byte ordering (i.e. raise an exception if detected).
>> > Has this been other's experience?
>> >
>> > Thanks,
>> > Micah
>> >
>> > Snippet from the original thread:
>> >>>
>> >>> 1.  For completeness it might be useful to add a statement that the
>> >>> byte order (endianness) is platform native.
>> >
>> >
>> >> Actually, Arrow is little-endian. It is an oversight that we haven't
>> >>documented it as such. One of the key capabilities is to push it
>> >>across the wire between separate systems without serialization (not
>> >>just IPC). As such, we have to pick an endianness. If there is a huge
>> >>need for a second big-endian encoding, we'll need to extend the spec to
>> support that as a property.
>>
