Seems reasonable. I was among those who originally argued for this field,
but given that we haven't used it yet, I think your proposal makes sense.

+1

On Wed, Oct 18, 2017 at 5:40 PM, Wes McKinney <wesmck...@gmail.com> wrote:

> When we originally drafted the metadata for record batches, we
> included a "page id" in the Buffer struct:
>
> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L295
>
> The idea at the time was that record batches might not be colocated in
> a particular shared memory page. This might still happen in the
> future, but to this point we have not used this feature in any
> implementation.
>
> The cost of these extra 4 bytes is that the size of the Buffer struct
> with padding is 24 bytes instead of 16 bytes. In large record batches,
> this makes the record batch data header about 50% larger than it needs
> to be.
>
> I would argue that the ability to spread a record batch across
> multiple memory regions is a useful feature, but we should be solving
> that particular problem a different way, like having a separate
> "non-colocated buffer" type and record batch message type that has the
> extra page id. So when we want to use this feature, we are OK with
> paying the extra cost. But for most self-contained message use cases
> those 8 bytes in each buffer go unused.
>
> I am loath to break the Arrow metadata at this stage, but if we agree
> about removing this field we should do it sooner rather than later. It
> may be possible to do the change in a forward compatible way if we
> were worried about breaking existing applications, but on the other
> hand I do not think we have yet made any contract about
> forward/backwards compatibility of metadata with our end users.
>
> Thanks,
> Wes
>
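For readers following the size arithmetic in the quoted message, here is a minimal sketch of where the 24-versus-16-byte figure comes from. It assumes the Buffer struct is a 4-byte page id followed by two 8-byte fields (offset and length), and that each scalar is aligned to its own size with the struct padded to its largest member. The helper name padded_struct_size is hypothetical and is not part of Arrow or FlatBuffers.

```python
def padded_struct_size(field_sizes):
    """Return the size in bytes of a struct whose scalar fields are each
    aligned to their own size, with the whole struct padded to the
    alignment of its largest field (an assumed layout rule)."""
    offset = 0
    for size in field_sizes:
        offset += -offset % size  # pad up to this field's alignment
        offset += size            # then place the field itself
    alignment = max(field_sizes)
    return offset + (-offset % alignment)  # tail padding


# Assumed layout: page id (4 bytes), offset (8 bytes), length (8 bytes).
with_page = padded_struct_size([4, 8, 8])
# Same struct with the page id removed.
without_page = padded_struct_size([8, 8])

# Relative growth of the per-buffer metadata caused by the page id.
overhead = with_page / without_page - 1

print(with_page, without_page, f"{overhead:.0%}")
```

Under these assumptions the 4-byte field costs 8 bytes per buffer once padding is counted, which matches both the "extra 4 bytes" and the "8 bytes in each buffer" mentioned in the quoted message.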
