Re: [DISCUSS] IPC MessageType - OpaqueBytes

Dewey Dunnington Tue, 03 Feb 2026 11:58:45 -0800

Just a note that I think a dedicated ApplicationData message is a great
idea. Many spatial file formats embed a spatial index, which is too large
for schema metadata. In many cases the index already exists in data I'd
like to stream over IPC, but there's no good place to put it, so it gets
dropped by the producer and recomputed by the consumer. At one point I
considered prototyping an Arrow-based spatial format that placed this type
of data in an Arrow file, with the extra spatial information after the EOS
and before the file footer; however, ApplicationData would be a much
cleaner approach. There are many custom file formats built on top of
SQLite, and I wonder if ApplicationData would open up something similar
for Arrow IPC (beyond just my spatial concept).


Cheers,

-dewey

On Tue, Feb 3, 2026 at 1:28 PM Rusty Conover <[email protected]> wrote:

> Hi Antoine,
>
> It is nice to hear from you!
>
> > (I would perhaps also call it "application data" or something)
>
> I’m happy with ApplicationData as the name.
>
> > On the face of it, this looks like a reasonable idea, though I wonder if
> > it should be a separate message type *or* an optional field carried
> > together in RecordBatches.
>
> The main issue with carrying this in RecordBatch metadata is ordering.
> While IPC already supports `custom_metadata` via `write_batch` (which I’ve
> been using), that approach assumes the application data can be attached to
> a specific batch.
>
> In some cases, the application data and record batches are produced
> independently and cannot be cleanly associated. A concrete example is
> interleaving stderr output (arbitrary log messages) with record batches
> written to stdout, while preserving a single ordered IPC stream.
>
> I experimented with using zero-row record batches as a workaround, but
> this is inefficient: even with no rows, the serialized message size grows
> with schema complexity. I’ve measured this across several schemas; details
> and code are here:
>
> https://gist.github.com/rustyconover/6ff8cbd93369735287d80ae60436379e
>
> In short, zero-row batches can cost anywhere from ~120 bytes for simple
> schemas to ~450+ bytes for more complex ones, which makes this approach
> unattractive when trying to minimize bytes on the wire.
>
> For these reasons, a distinct IPC message type for application data seems
> like the cleanest solution. I’d be very interested in whether others have
> run into the need for this as well.
>
> Rusty
>
>
> On Tue, Feb 3, 2026, at 5:58 PM, Antoine Pitrou wrote:
> > Hi Rusty,
> >
> >
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 03/02/2026 à 17:31, Rusty Conover a écrit :
> >> Hi Arrow Friends,
> >>
> >> I’ve really appreciated Arrow Flight’s ability to carry custom metadata
> >> messages alongside record batches. In some of my current work, however,
> >> I’m dealing with Arrow IPC streams that are *not* sent via Flight, and
> >> I’d like to have a comparable capability there as well.
> >>
> >> To support this, I’d like to propose adding a new IPC message type,
> >> tentatively named `OpaqueBytes`, that would allow arbitrary bytes to be
> >> embedded directly within IPC streams. IPC readers that do not understand
> >> this message type could safely ignore it, preserving compatibility.
> >>
> >> My motivation is to enable multiplexing of auxiliary messages within a
> >> stream that otherwise consists of schemas, dictionaries, and record
> >> batches. A concrete example would be interleaving logging or signaling
> >> messages with record batches. Today, I’m approximating this by emitting
> >> zero-row record batches with binary metadata, but this approach is
> >> awkward and incurs unnecessary overhead due to schema complexity.
> >>
> >> An `OpaqueBytes` IPC message type could enable a range of use cases,
> >> including (but not limited to) logging, flow control, signaling, and
> >> other auxiliary communication needs that don’t naturally map to record
> >> batches.
> >>
> >> I briefly discussed this idea a few weeks ago on the Apache Arrow call,
> >> but wanted to share it here to reach a broader audience and gather more
> >> feedback.
> >>
> >> In addition to the message type itself, I’d also be interested in
> >> hearing thoughts on how PyArrow’s interfaces might be extended to allow
> >> users to read and write these arbitrary messages as part of existing IPC
> >> stream readers and writers.
> >>
> >> Looking forward to your thoughts and discussion.
> >>
> >> Kind regards,
> >> Rusty
>
