At least on the read-side we can make this detectable by using something
like <0xffffffff><int32_t size> instead of int64_t.  On the write side we
would need some sort of default mode that we could flip on/off if we wanted
to maintain compatibility.

I should say I think we should fix it.  Undefined behavior is unpaid debt
that might never be collected or might cause things to fail in difficult to
diagnose ways. And pre-1.0.0 is definitely the time.

-Micah

On Sun, Jun 30, 2019 at 3:17 PM Wes McKinney <wesmck...@gmail.com> wrote:

> On Sun, Jun 30, 2019 at 5:14 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > hi Micah,
> >
> > This is definitely unfortunate, I wish we had realized the potential
> > implications of having the Flatbuffer message start on a 4-byte
> > (rather than 8-byte) boundary. The cost of making such a change now
> > would be pretty high since all readers and writers in all languages
> > would have to be changed. That being said, the 0.14.0 -> 1.0.0 version
> > bump is the last opportunity we have to make a change like this, so we
> > might as well discuss it now. Note that particular implementations
> > could implement compatibility functions to handle the 4 to 8 byte
> > change so that old clients can still be understood. We'd probably want
> > to do this in C++, for example, since users would pretty quickly
> > acquire a new pyarrow version in Spark applications while they are
> > stuck on an old version of the Java libraries.
>
> NB such a backwards compatibility fix would not be forward-compatible,
> so the PySpark users would need to use a pinned version of pyarrow
> until Spark upgraded to Arrow 1.0.0. Maybe that's OK
>
> >
> > - Wes
> >
> > On Sun, Jun 30, 2019 at 3:01 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> > >
> > > While working on trying to fix undefined behavior for unaligned memory
> > > accesses [1], I ran into an issue with the IPC specification [2] which
> > > prevents us from ever achieving zero-copy memory mapping and having
> aligned
> > > accesses (i.e. clean UBSan runs).
> > >
> > > Flatbuffer metadata needs 8-byte alignment to guarantee aligned
> accesses.
> > >
> > > In the IPC format we align each message to 8-byte boundaries.  We then
> > > write a int32_t integer to to denote the size of flat buffer metadata,
> > > followed immediately  by the flatbuffer metadata.  This means the
> > > flatbuffer metadata will never be 8 byte aligned.
> > >
> > > Do people care?  A simple fix  would be to use int64_t instead of
> int32_t
> > > for length.  However, any fix essentially breaks all previous client
> > > library versions or incurs a memory copy.
> > >
> > > [1] https://github.com/apache/arrow/pull/4757
> > > [2] https://arrow.apache.org/docs/ipc.html
>

Reply via email to