Hey all,

I have drafted a PR to update the format docs with clarifications around
equivalence and deviations between IPC files and IPC streams:
https://github.com/apache/arrow/pull/49947

Can people take a look and make comments/suggestions?

Once there is enough consensus I will launch a formal vote to get these
changes approved.

Regards

Antoine.


On 2026/02/17 18:19:32 Antoine Pitrou wrote:
> 
> Hello,
> 
> The IPC file format is defined as the IPC stream format, preceded by a 
> header (the Arrow magic bytes) and followed by a footer (a catalog of 
> record batches, and the Arrow magic bytes). Thus, reading and writing 
> IPC files can reuse the same basic building blocks as for IPC streams 
> (this is almost trivial for writing, which is usually done sequentially).
> 
> As a consequence, in practice the body of an IPC file (ignoring the 8
> header bytes) is itself a valid IPC stream that reads as the same
> logical contents.
> 
> However, there is no theoretical guarantee that this is always the case. 
> Consider an IPC file writer that would write record batches in reverse 
> order in the footer, compared to their sequential order in the 
> underlying stream. Or, more generally, an IPC file footer that would 
> repeat or skip some batches in the stream.
> 
> So theoretically, we cannot assume that reading an IPC file as an IPC 
> stream (after skipping the 8 header bytes) returns the intended contents.
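
For reference, here is a minimal sketch (Python, standard library only) of what "reading an IPC file as an IPC stream" means at the byte level. It assumes the layout described in the format docs: 8 header bytes (magic plus padding), the stream, the footer, a little-endian int32 footer size, and the closing magic. `embedded_stream` is a hypothetical helper written for illustration, not an API of any Arrow implementation.

```python
import struct

MAGIC = b"ARROW1"              # Arrow magic bytes
HEADER_LEN = 8                 # magic plus 2 padding bytes
TRAILER_LEN = 4 + len(MAGIC)   # int32 footer size plus closing magic


def embedded_stream(data: bytes) -> bytes:
    """Return the bytes located between the file header and the footer.

    Hypothetical helper: it only checks the outer framing and does not
    parse the footer or the stream messages.
    """
    if len(data) < HEADER_LEN + TRAILER_LEN:
        raise ValueError("too short to be an IPC file")
    if not data.startswith(MAGIC) or not data.endswith(MAGIC):
        raise ValueError("missing Arrow magic bytes")
    # The footer size sits just before the closing magic
    # (assumed little-endian int32).
    (footer_len,) = struct.unpack_from("<i", data, len(data) - TRAILER_LEN)
    stream_end = len(data) - TRAILER_LEN - footer_len
    if footer_len <= 0 or stream_end < HEADER_LEN:
        raise ValueError("invalid footer size")
    return data[HEADER_LEN:stream_end]
```

The returned bytes could then be handed to a stream reader. Nothing in this slicing checks that the footer's catalog lists the same batches, in the same order, as the stream, which is exactly the assumption in question.
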
> 
> However, it seems that it could be useful to be able to make such an 
> assumption. Hence these questions:
> 1. Do all current IPC file writers uphold this assumption?
> 2. Do we want to make it a more explicit requirement of the IPC file format?
> 
> 
> Context: I've submitted a PR 
> (https://github.com/apache/arrow/pull/49312) to enable differential 
> fuzzing in the C++ IPC file fuzzer, where I'm comparing the results of 
> the IPC file and stream readers on the fuzzing payload.
> 
> Regards
> 
> Antoine.
> 
> 
