Deephaven and I are very supportive of "upgrading" the value half of the kv
pair to a byte vector. What is the best way to find out if there is
sufficient interest?


I've been stewing on the ideas here around schema evolution, and I realize
the specific feature I am missing is the ability to encode that a field
(i.e. its FieldNode and accompanying Buffers in the RecordBatch) is
empty/has-no-data in O(0) cost (yes; for free).

Might there be interest in adding a "field_id" to the FieldNode (which is
encoded on the RecordBatch flatbuffer)? I see a simple forward-compatible
upgrade (by either keying off of 0, or explicitly set the field default to
-1) which would allow the sender to "skip" fields that have 1) FieldNode
length of zero, and 2) all Buffer's associated at that level (and further
nested) are also equally empty (i.e. Buffer length is zero).

I understand this concept slightly interferes with RecordBatch's `length`
field, and that many implementations use that length to resize the
root-level FieldNodes. The use-case I have in mind has different logical
lengths per field node; current implementations require sending a
RecordBatch length of the max length across all root level field nodes. I
believe this requires a copy of data whenever a field node is too short; I
don't know if there is a decent solution to this slight inefficiency. I am
bringing it up because if "skipping a field node when it is empty" is a
feature, then we may not want to allocate space for those nodes given that
the record batch length will likely be greater than zero.

On Wed, Jul 7, 2021 at 8:12 AM Wes McKinney <wesmck...@gmail.com> wrote:

> On Wed, Jul 7, 2021 at 2:53 PM David Li <apa...@lidavidm.me> wrote:
> >
> > From the Flatbuffers internals doc[1] it appears they are the same:
> "Strings are simply a vector of bytes, and are always null-terminated."
>
> I see. I took a look at flatbuffers.h, and it appears that changing
> this field from string to [byte] would be backward-compatible and
> forward-compatible except with code that expects a null terminator.
> This is something we could discuss separately if there were enough
> interest.
>
> > [1]: https://google.github.io/flatbuffers/flatbuffers_internals.html
> >
> > -David
> >
> > On Wed, Jul 7, 2021, at 05:08, Wes McKinney wrote:
> > > On Tue, Jul 6, 2021 at 6:33 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> > > >
> > > > >
> > > > > Right, I had wanted to focus the discussion on Flight as I think
> schema
> > > > > evolution or multiplexing streams (more so the latter) is a
> property of the
> > > > > transport and not the stream format itself. If we are leaning
> towards just
> > > > > schema evolution then maybe it makes sense to discuss it for the
> IPC stream
> > > > > format and leverage that in Flight. I'd be interested in what
> others think.
> > > >
> > > > I tend to agree, I think stream multiplexing is likely a transport
> level
> > > > issue.  IMO I think schema evolution should be consistent with the
> IPC
> > > > stream format  and flight.
> > > >
> > > >
> > > > > Nate: it may be worth starting a separate discussion about more
> general
> > > > > metadata in the IPC message. I'm not aware of why key-value
> metadata was
> > > > > chosen/if opaque bytes were considered in the past.
> > > >
> > > >
> > > > I think  this was an unfortunate design of the key value metadata in
> > > > Schema.fbs, but I don't think I was around when this decision was
> made.
> > >
> > > I agree that it's unfortunate that we did not use [ byte ] instead of
> > > string for the value in the KeyValue metadata — I think this was more
> > > of an oversight than a deliberate choice (e.g. it was not our intent
> > > to require binary data to be base64-encoded — this is something that
> > > we have to do when encoding binary data in Thrift KeyValue metadata
> > > for Parquet, for example). Is the binary representation of [byte]
> > > different from string?
> > >
> > >
> > >
> > > > Side Question: Why isn't the IPC stream format a series of the flight
> > > > > protobufs?
> > > >
> > > > In addition to what David said, protobufs can't be read directly
> from a
> > > > memory-mapped file (they need decoding).  This was one of the design
> > > > considerations of using flatbuffers and IPC Stream/File format.
> > > >
> > > > I was thinking Micah's comment is more that whatever we do, it
> should be
> > > > > clearly specified and edge cases should be considered, especially
> if we
> > > > > might want to 'backport' this into the stream format later.
> > > >
> > > >
> > > > Yes, for dictionaries we just need to be careful to define semantics
> and
> > > > ensure implementations are validating them with regards to
> dictionaries.
> > > > There likely isn't any need to change current implementations though.
> > > >
> > > > On Mon, Jun 28, 2021 at 1:25 PM David Li <lidav...@apache.org>
> wrote:
> > > >
> > > > > Right, I had wanted to focus the discussion on Flight as I think
> schema
> > > > > evolution or multiplexing streams (more so the latter) is a
> property of the
> > > > > transport and not the stream format itself. If we are leaning
> towards just
> > > > > schema evolution then maybe it makes sense to discuss it for the
> IPC stream
> > > > > format and leverage that in Flight. I'd be interested in what
> others think.
> > > > >
> > > > > Especially if we are looking at multiplexing streams - I would
> wonder if
> > > > > that's actually better served by making it easier to implement
> using the
> > > > > Flight implementation as it stands (by managing concurrent RPC
> calls and/or
> > > > > performing the union-of-structs encoding trick for you), instead
> of having
> > > > > to change the protocol.
> > > > >
> > > > > Nate: it may be worth starting a separate discussion about more
> general
> > > > > metadata in the IPC message. I'm not aware of why key-value
> metadata was
> > > > > chosen/if opaque bytes were considered in the past. Off the top of
> my head
> > > > > if it's for on-disk storage and fully application-defined it may
> make sense
> > > > > to store as a separate file alongside the Arrow file (indexed by
> record
> > > > > batch index) where you can take advantage of whatever format is
> most
> > > > > suitable.
> > > > >
> > > > > -David
> > > > >
> > > > > On Sun, Jun 27, 2021, at 07:50, Gosh Arzumanyan wrote:
> > > > > > Hi guys,
> > > > > >
> > > > > > 1. Regarding IPC vs Flight: in fact my initial suggestion was to
> add this
> > > > > > feature starting from the IPC(I moved initial write up steps to
> the
> > > > > bottom
> > > > > > of the doc). Afterwards David suggested focusing on Flight and
> that's how
> > > > > > we ended up with the protobufs change in the proposal. This
> being said I
> > > > > do
> > > > > > think that the place where this should be impemented is a good
> question
> > > > > on
> > > > > > its own. Maybe it makes sense to have this kind of a feature in
> IPC and
> > > > > > somehow use it in Flight, maybe not.
> > > > > > 2. The point about dictionaries deserves a dedicated section in
> the
> > > > > > proposal. Nate and David brought it up and shared some insights.
> I'll try
> > > > > > to aggregate them and we can continue the discussion form there.
> > > > > >
> > > > > > Cheers,
> > > > > > Gosh
> > > > > >
> > > > > > On Sat., 26 Jun. 2021, 17:26 Nate Bauernfeind, <
> > > > > natebauernfe...@deephaven.io>
> > > > > > wrote:
> > > > > >
> > > > > > > >
> > > > > > > > > > makes it more difficult to bring schema evolution back
> into the
> > > > > > > > > > IPC Stream format (i.e. it would live only in flight)
> > > > > > > > >
> > > > > > > > > Gosh's proposal extends the flatbuffer structures not the
> > > > > protobufs.
> > > > > > > Can
> > > > > > > > > you help me understand how difficult it would be to bring
> the
> > > > > > > `schema_id`
> > > > > > > > > approach to the IPC stream format?
> > > > > > > >
> > > > > > > > I thought we were talking solely about the Flight Protobuf
> > > > > definitions -
> > > > > > > > not the Flatbuffers (and the Google doc at least only talks
> about the
> > > > > > > > Protobufs).
> > > > > > > >
> > > > > > >
> > > > > > > I somehow missed that schema_id is being added to protobuf in
> the
> > > > > document.
> > > > > > > It feels to me that the schema_id is a property that would
> ideally only
> > > > > > > apply to the RecordBatch. I better understand Micah's
> dictionary
> > > > > concerns,
> > > > > > > now, too.
> > > > > > >
> > > > > > > > Side Question: Why isn't the IPC stream format a series of
> the flight
> > > > > > > > > protobufs? It's a real shame that there is no standard way
> to
> > > > > > > > > capture/replay a stream with app_metadata. (Obviously
> ignoring the
> > > > > > > > > annoyances around protobuf wrapping flatbuffers.)
> > > > > > > >
> > > > > > > > The IPC format was defined long before Flight, and Flight's
> > > > > app_metadata
> > > > > > > > was added after Flight's initial definition. Note an IPC
> message does
> > > > > > > have
> > > > > > > > a provision for key-value metadata, though I think APIs for
> that are
> > > > > not
> > > > > > > > fully exposed. (See ARROW-6940:
> > > > > > > > https://issues.apache.org/jira/browse/ARROW-6940 and
> despite my
> > > > > comments
> > > > > > > > there perhaps we need to unify or at least consider how
> Flight's
> > > > > > > > app_metadata relates to the IPC message custom_metadata. Also
> > > > > perhaps see
> > > > > > > > ARROW-1059.)
> > > > > > > >
> > > > > > >
> > > > > > > KeyValue unfortunately is string to string. In flatbuffer
> strings are
> > > > > only
> > > > > > > UTF-8 or 7-bit ASCII. The app_metadata on the other hand is
> opaque
> > > > > bytes.
> > > > > > > The latter is a bit more useful.
> > > > > > >
> > > > > > > --
> > > > > > >
> > > > > >
> > > > >
> > >
>


--

Reply via email to