I should note that the overhead isn't necessarily just the extra metadata.
For single-row values, there is also overhead from buffer padding
requirements.  You should be able to measure this by looking at the size of
the buffer you are using before writing any batches to the stream (I believe
the schema is written eagerly) and subtracting that from the final size.
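
To make that concrete, here is a rough sketch, not a definitive
implementation. The first part is plain arithmetic and assumes every buffer
is padded to a multiple of 8 bytes and that each column carries a validity
bitmap (it may be omitted when there are no nulls). The second part assumes
pyarrow is installed and compares a stream holding only the schema against a
stream holding the schema plus one batch, so it does not depend on whether
the schema is written eagerly. The helper names (padded_size,
single_row_body_bytes, measure_stream_overhead) are made up for
illustration.

```python
def padded_size(nbytes: int, alignment: int = 8) -> int:
    """Round nbytes up to the next multiple of alignment."""
    return (nbytes + alignment - 1) // alignment * alignment


def single_row_body_bytes(value_widths):
    """Estimate body bytes for ONE row of fixed-width, nullable columns.

    Assumption: each column contributes a validity bitmap (1 bit for a
    single row, so 1 byte before padding) and a data buffer, and each
    buffer is padded to 8 bytes.
    """
    return sum(padded_size(1) + padded_size(width) for width in value_widths)


def measure_stream_overhead(schema, batch):
    """Return (schema_only_bytes, schema_plus_one_batch_bytes).

    Both numbers are sizes of a complete IPC stream, so their difference
    is what a single batch adds on top of the schema.
    """
    import pyarrow as pa  # lazy import: the arithmetic above needs no pyarrow

    def stream_size(batches):
        sink = pa.BufferOutputStream()
        writer = pa.ipc.new_stream(sink, schema)
        for b in batches:
            writer.write_batch(b)
        writer.close()
        return sink.getvalue().size

    return stream_size([]), stream_size([batch])


# An int8, an int32 and a float64 column hold 13 bytes of raw values in one
# row, but the padded buffers add up to 48 bytes under the assumptions above.
print(single_row_body_bytes([1, 4, 8]))
```

Calling measure_stream_overhead(schema, batch) with your own schema and a
one-row RecordBatch should show how much of the 1344 bytes is schema and how
much is the batch itself.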

Looking at the Python documentation I don't see it exposed, but the
underlying function does exist in C++ [1]. People more familiar with the
Python bindings may be able to offer more details.

I think for this type of use-case it probably makes sense to expose it.
Want to try to create a patch for it?

Thanks,
Micah


[1]
https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.h#L215

On Fri, Feb 14, 2020 at 3:09 PM Tewfik Zeghmi <zeg...@gmail.com> wrote:

> Hi Micah,
>
> The primary language is Python.  I'm hoping that the overhead of
> metadata is small compared to the schema information.
>
> thank you!
>
> On Fri, Feb 14, 2020 at 3:07 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Hi Tewfik,
>> What language?  It is possible to serialize them separately, but the
>> right hooks might not be exposed in all languages.
>>
>> There is still going to be a higher overhead for single-row values in
>> Arrow compared to Avro due to metadata requirements.
>>
>> Thanks,
>> Micah
>>
>> On Fri, Feb 14, 2020 at 1:33 PM Tewfik Zeghmi <zeg...@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > I have a use case of creating a feature store to serve low-latency
>> > traffic. Given a key, we need the ability to save and read a feature
>> > vector in a low-latency key-value store. Serializing an Arrow table
>> > with one row takes 1344 bytes, while the same single row serialized
>> > with Avro without the schema uses 236 bytes.
>> >
>> > Is it possible to serialize an Arrow table/RecordBatch independently
>> > of the schema? Ideally, we'd like to serialize the schema once, not
>> > along with every feature key, and then be able to read the RecordBatch
>> > with the schema.
>> >
>> > thank you!
>> >
>>
>
>
> --
> Taleb Tewfik Zeghmi
>
