hi John, This is off topic, so please feel free to start a discussion thread and present a detailed proposal. I believe your use case can be addressed by pre-allocating record batches and maintaining application level metadata about what portion of the record batches has been "filled" (so the unfilled records can be dropped by slicing). I don't think any change to the binary protocol is warranted
Thanks, Wes On Sun, Jun 30, 2019 at 9:03 PM John Muehlhausen <j...@jgm.org> wrote: > > If there is going to be a breaking change to the IPC format, I'd appreciate > some discussion about an idea I had for RecordBatch metadata. I previously > promised to create a discussion thread with an initial write-up but have > not yet done so. I will try to do this tomorrow. (The basic idea is to > have readers read only N records of a RecordBatch rather than all of them, > where in the usual case these row counts are the same. This opens the door > to reading partially populated RecordBatches which will be extremely useful > in low-latency streaming applications.) > > On Sun, Jun 30, 2019 at 6:36 PM Micah Kornfield <emkornfi...@gmail.com> > wrote: > > > At least on the read-side we can make this detectable by using something > > like <0xffffffff><int32_t size> instead of int64_t. On the write side we > > would need some sort of default mode that we could flip on/off if we wanted > > to maintain compatibility. > > > > I should say I think we should fix it. Undefined behavior is unpaid debt > > that might never be collected or might cause things to fail in difficult to > > diagnose ways. And pre-1.0.0 is definitely the time. > > > > -Micah > > > > On Sun, Jun 30, 2019 at 3:17 PM Wes McKinney <wesmck...@gmail.com> wrote: > > > > > On Sun, Jun 30, 2019 at 5:14 PM Wes McKinney <wesmck...@gmail.com> > > wrote: > > > > > > > > hi Micah, > > > > > > > > This is definitely unfortunate, I wish we had realized the potential > > > > implications of having the Flatbuffer message start on a 4-byte > > > > (rather than 8-byte) boundary. The cost of making such a change now > > > > would be pretty high since all readers and writers in all languages > > > > would have to be changed. That being said, the 0.14.0 -> 1.0.0 version > > > > bump is the last opportunity we have to make a change like this, so we > > > > might as well discuss it now. Note that particular implementations > > > > could implement compatibility functions to handle the 4 to 8 byte > > > > change so that old clients can still be understood. We'd probably want > > > > to do this in C++, for example, since users would pretty quickly > > > > acquire a new pyarrow version in Spark applications while they are > > > > stuck on an old version of the Java libraries. > > > > > > NB such a backwards compatibility fix would not be forward-compatible, > > > so the PySpark users would need to use a pinned version of pyarrow > > > until Spark upgraded to Arrow 1.0.0. Maybe that's OK > > > > > > > > > > > - Wes > > > > > > > > On Sun, Jun 30, 2019 at 3:01 AM Micah Kornfield <emkornfi...@gmail.com > > > > > > wrote: > > > > > > > > > > While working on trying to fix undefined behavior for unaligned > > memory > > > > > accesses [1], I ran into an issue with the IPC specification [2] > > which > > > > > prevents us from ever achieving zero-copy memory mapping and having > > > aligned > > > > > accesses (i.e. clean UBSan runs). > > > > > > > > > > Flatbuffer metadata needs 8-byte alignment to guarantee aligned > > > accesses. > > > > > > > > > > In the IPC format we align each message to 8-byte boundaries. We > > then > > > > > write a int32_t integer to to denote the size of flat buffer > > metadata, > > > > > followed immediately by the flatbuffer metadata. This means the > > > > > flatbuffer metadata will never be 8 byte aligned. > > > > > > > > > > Do people care? A simple fix would be to use int64_t instead of > > > int32_t > > > > > for length. However, any fix essentially breaks all previous client > > > > > library versions or incurs a memory copy. > > > > > > > > > > [1] https://github.com/apache/arrow/pull/4757 > > > > > [2] https://arrow.apache.org/docs/ipc.html > > > > >