Hi Dewey,

While thinking about this in a café in Amsterdam, another idea came to mind: 
most ApplicationData does have structure.

It could be interesting to support multiplexing multiple IPC streams over the 
same socket. One way to do this would be to tag IPC messages with a destination 
stream ID, and have the IPC reader/writer emit messages annotated with that ID. 
On the receiving side, read_next() could yield messages from any active stream, 
leaving it to the user to interpret both the stream ID and the message itself.

This is more of a thought experiment than a concrete proposal, since it would 
likely add significant complexity—but it felt related enough to mention.
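
To make the tagging idea above a bit more concrete, here is a minimal sketch. It is not a proposed wire format and is not part of the Arrow IPC format: the 8-byte header layout and the names `write_frame` and `read_frames` are made up for illustration, and the payloads are placeholder bytes standing in for already-encoded IPC messages.

```python
import io
import struct

# Hypothetical frame header (an assumption, not Arrow IPC):
# 4-byte stream ID followed by 4-byte payload length, both little-endian.
HEADER = struct.Struct("<II")


def write_frame(sink, stream_id, payload):
    """Prefix one already-encoded message with its destination stream ID."""
    sink.write(HEADER.pack(stream_id, len(payload)))
    sink.write(payload)


def read_frames(source):
    """Yield (stream_id, payload) pairs in arrival order, from any stream."""
    while True:
        header = source.read(HEADER.size)
        if not header:
            return  # clean end of input
        if len(header) < HEADER.size:
            raise EOFError("truncated frame header")
        stream_id, length = HEADER.unpack(header)
        payload = source.read(length)
        if len(payload) < length:
            raise EOFError("truncated frame payload")
        yield stream_id, payload


# Two logical streams interleaved over one byte pipe.
# BytesIO stands in for a socket; a real socket would need a read loop
# that handles short reads.
pipe = io.BytesIO()
write_frame(pipe, 1, b"batch-a")
write_frame(pipe, 2, b"log line")
write_frame(pipe, 1, b"batch-b")
pipe.seek(0)
frames = list(read_frames(pipe))
```

The reader hands back frames in wire order and leaves routing by stream ID to the caller, which matches the "leave it to the user" behavior described above.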

Best,

Rusty

On Tue, Feb 3, 2026, at 8:58 PM, Dewey Dunnington wrote:
> Just a note that I think a dedicated ApplicationData message is a great
> idea. Many spatial file formats embed a spatial index, which is too large
> for schema metadata...in many cases it already exists in data I'd like to
> stream over IPC and there's no good place to put it so it just gets dropped
> by the producer and recomputed by the consumer. I had considered at one
> point prototyping an arrow-based spatial format that placed this type of
> data in an Arrow file with the extra spatial information after the EOS and
> before the file footer; however, ApplicationData would be a much cleaner
> approach. There are many instances of custom file formats built on top of
> SQLite and I wonder if ApplicationData would open up something like that
> for Arrow IPC (beyond just my spatial concept).
>
> Cheers,
>
> -dewey
>
> On Tue, Feb 3, 2026 at 1:28 PM Rusty Conover wrote:
>
>> Hi Antoine,
>>
>> It is nice to hear from you!
>>
>> > (I would perhaps also call it "application data" or something)
>>
>> I’m happy with ApplicationData as the name.
>>
>> > On the face of it, this looks like a reasonable idea, though I wonder if
>> > it should be a separate message type *or* an optional field carried
>> > together in RecordBatches.
>>
>> The main issue with carrying this in RecordBatch metadata is ordering.
>> While IPC already supports `custom_metadata` via `write_batch` (which I’ve
>> been using), that approach assumes the application data can be attached to
>> a specific batch.
>>
>> In some cases, the application data and record batches are produced
>> independently and cannot be cleanly associated. A concrete example is
>> interleaving stderr output (arbitrary log messages) with record batches
>> written to stdout, while preserving a single ordered IPC stream.
>>
>> I experimented with using zero-row record batches as a workaround, but
>> this is inefficient: even with no rows, the serialized message size grows
>> with schema complexity. I’ve measured this across several schemas; details
>> and code are here:
>>
>> https://gist.github.com/rustyconover/6ff8cbd93369735287d80ae60436379e
>>
>> In short, zero-row batches can cost anywhere from ~120 bytes for simple
>> schemas to ~450+ bytes for more complex ones, which makes this approach
>> unattractive when trying to minimize bytes on the wire.
>>
>> For these reasons, a distinct IPC message type for application data seems
>> like the cleanest solution. I’d be very interested in whether others have
>> run into the need for this as well.
>>
>> Rusty
>>
>>
>> On Tue, Feb 3, 2026, at 5:58 PM, Antoine Pitrou wrote:
>> > Hi Rusty,
>> >
>> >
>> >
>> > Regards
>> >
>> > Antoine.
>> >
>> >
>> > On 03/02/2026 at 17:31, Rusty Conover wrote:
>> >> Hi Arrow Friends,
>> >>
>> >> I’ve really appreciated Arrow Flight’s ability to carry custom metadata
>> >> messages alongside record batches. In some of my current work, however, I’m
>> >> dealing with Arrow IPC streams that are *not* sent via Flight, and I’d like
>> >> to have a comparable capability there as well.
>> >>
>> >> To support this, I’d like to propose adding a new IPC message
>> >> type, tentatively named `OpaqueBytes`, that would allow arbitrary bytes to
>> >> be embedded directly within IPC streams. IPC readers that do not understand
>> >> this message type could safely ignore it, preserving compatibility.
>> >>
>> >> My motivation is to enable multiplexing of auxiliary messages within a
>> >> stream that otherwise consists of schemas, dictionaries, and record
>> >> batches. A concrete example would be interleaving logging or signaling
>> >> messages with record batches. Today, I’m approximating this by emitting
>> >> zero-row record batches with binary metadata, but this approach is awkward
>> >> and incurs unnecessary overhead due to schema complexity.
>> >>
>> >> An `OpaqueBytes` IPC message type could enable a range of use cases,
>> >> including (but not limited to) logging, flow control, signaling, and other
>> >> auxiliary communication needs that don’t naturally map to record batches.
>> >>
>> >> I briefly discussed this idea a few weeks ago on the Apache Arrow call,
>> >> but wanted to share it here to reach a broader audience and gather more
>> >> feedback.
>> >>
>> >> In addition to the message type itself, I’d also be interested in
>> >> hearing thoughts on how PyArrow’s interfaces might be extended to allow
>> >> users to read and write these arbitrary messages as part of existing IPC
>> >> stream readers and writers.
>> >>
>> >> Looking forward to your thoughts and discussion.
>> >>
>> >> Kind regards,
>> >> Rusty
>>
