hi John,

This is off topic, so please feel free to start a discussion thread
and present a detailed proposal. I believe your use case can be
addressed by pre-allocating record batches and maintaining application
level metadata about what portion of the record batches has been
"filled" (so the unfilled records can be dropped by slicing). I don't
think any change to the binary protocol is warranted

Thanks,
Wes

On Sun, Jun 30, 2019 at 9:03 PM John Muehlhausen <j...@jgm.org> wrote:
>
> If there is going to be a breaking change to the IPC format, I'd appreciate
> some discussion about an idea I had for RecordBatch metadata.  I previously
> promised to create a discussion thread with an initial write-up but have
> not yet done so.  I will try to do this tomorrow.  (The basic idea is to
> have readers read only N records of a RecordBatch rather than all of them,
> where in the usual case these row counts are the same.  This opens the door
> to reading partially populated RecordBatches which will be extremely useful
> in low-latency streaming applications.)
>
> On Sun, Jun 30, 2019 at 6:36 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > At least on the read-side we can make this detectable by using something
> > like <0xffffffff><int32_t size> instead of int64_t.  On the write side we
> > would need some sort of default mode that we could flip on/off if we wanted
> > to maintain compatibility.
> >
> > I should say I think we should fix it.  Undefined behavior is unpaid debt
> > that might never be collected or might cause things to fail in difficult to
> > diagnose ways. And pre-1.0.0 is definitely the time.
> >
> > -Micah
> >
> > On Sun, Jun 30, 2019 at 3:17 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > On Sun, Jun 30, 2019 at 5:14 PM Wes McKinney <wesmck...@gmail.com>
> > wrote:
> > > >
> > > > hi Micah,
> > > >
> > > > This is definitely unfortunate, I wish we had realized the potential
> > > > implications of having the Flatbuffer message start on a 4-byte
> > > > (rather than 8-byte) boundary. The cost of making such a change now
> > > > would be pretty high since all readers and writers in all languages
> > > > would have to be changed. That being said, the 0.14.0 -> 1.0.0 version
> > > > bump is the last opportunity we have to make a change like this, so we
> > > > might as well discuss it now. Note that particular implementations
> > > > could implement compatibility functions to handle the 4 to 8 byte
> > > > change so that old clients can still be understood. We'd probably want
> > > > to do this in C++, for example, since users would pretty quickly
> > > > acquire a new pyarrow version in Spark applications while they are
> > > > stuck on an old version of the Java libraries.
> > >
> > > NB such a backwards compatibility fix would not be forward-compatible,
> > > so the PySpark users would need to use a pinned version of pyarrow
> > > until Spark upgraded to Arrow 1.0.0. Maybe that's OK
> > >
> > > >
> > > > - Wes
> > > >
> > > > On Sun, Jun 30, 2019 at 3:01 AM Micah Kornfield <emkornfi...@gmail.com
> > >
> > > wrote:
> > > > >
> > > > > While working on trying to fix undefined behavior for unaligned
> > memory
> > > > > accesses [1], I ran into an issue with the IPC specification [2]
> > which
> > > > > prevents us from ever achieving zero-copy memory mapping and having
> > > aligned
> > > > > accesses (i.e. clean UBSan runs).
> > > > >
> > > > > Flatbuffer metadata needs 8-byte alignment to guarantee aligned
> > > accesses.
> > > > >
> > > > > In the IPC format we align each message to 8-byte boundaries.  We
> > then
> > > > > write a int32_t integer to to denote the size of flat buffer
> > metadata,
> > > > > followed immediately  by the flatbuffer metadata.  This means the
> > > > > flatbuffer metadata will never be 8 byte aligned.
> > > > >
> > > > > Do people care?  A simple fix  would be to use int64_t instead of
> > > int32_t
> > > > > for length.  However, any fix essentially breaks all previous client
> > > > > library versions or incurs a memory copy.
> > > > >
> > > > > [1] https://github.com/apache/arrow/pull/4757
> > > > > [2] https://arrow.apache.org/docs/ipc.html
> > >
> >

Reply via email to