Re: Schemaless serialization

Wes McKinney Mon, 17 Feb 2020 05:15:51 -0800

hi Micah and Tewfik,

The functionality is exposed in Python, see e.g.


https://github.com/apache/arrow/blob/apache-arrow-0.16.0/python/pyarrow/tests/test_ipc.py#L685

As Micah said, very small batches aren't necessarily optimized for
compactness (for example, buffers are padded to multiples of 8 bytes).
Give this a try, though, and see how it works.

Thanks
Wes

On Sun, Feb 16, 2020 at 9:26 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> I should note, it isn't necessarily just the extra metadata.  For single
> row values, there is also an overhead for padding requirements.  You should
> be able to measure this by looking at the size of the buffer you are using
> before writing any batches to the stream (I believe the schema is written
> eagerly), and subtracting that from the final size.
>
> Looking at the Python documentation, I don't see it exposed, but the
> underlying function does exist in C++ [1]. People more familiar with the
> Python bindings may be able to offer more details.
>
> I think for this type of use-case it probably makes sense to expose it.
> Want to try to create a patch for it?
>
> Thanks,
> Micah
>
>
> [1]
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.h#L215
>
> On Fri, Feb 14, 2020 at 3:09 PM Tewfik Zeghmi <zeg...@gmail.com> wrote:
>
> > Hi Micah,
> >
> > The primary language is Python.  I'm hoping that the overhead of
> > metadata is small compared to the schema information.
> >
> > thank you!
> >
> > On Fri, Feb 14, 2020 at 3:07 PM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> >> Hi Tewfik,
> >> What language?  It is possible to serialize them separately, but the right
> >> hooks might not be exposed in all languages.
> >>
> >> There is still going to be a higher overhead for single row values in Arrow
> >> compared to Avro due to metadata requirements.
> >>
> >> Thanks,
> >> Micah
> >>
> >> On Fri, Feb 14, 2020 at 1:33 PM Tewfik Zeghmi <zeg...@gmail.com> wrote:
> >>
> >> > Hi,
> >> >
> >> > I have a use case of creating a feature store to serve low-latency
> >> > traffic. Given a key, we need the ability to save and read a feature
> >> > vector in a low-latency key-value store. Serializing an Arrow table with
> >> > one row takes 1344 bytes, while the same single row serialized with Avro
> >> > without the schema uses 236 bytes.
> >> >
> >> > Is it possible to serialize an Arrow table/RecordBatch independently
> >> > of the schema? Ideally, we'd like to serialize the schema once, rather
> >> > than along with every feature key, and then be able to read the
> >> > RecordBatch with the schema.
> >> >
> >> > thank you!
> >> >
> >>
> >
> >
> > --
> > Taleb Tewfik Zeghmi
> >
