Hi Bogdan,

> Alright, so speaking of serialization of pyarrow.Table vs Feather, if they
> are pretty much the same, but Arrow alone shouldn't be used for long-term
> storage, is this also the case for Feather, or can it be a valid option
> for my case?
Per Wes's e-mail on a similar thread [1], once we reach 1.0.0 on the format
specification both should be usable for longer-term storage (I would guess
Feather would likely be preferred). Until then I think they are both in the
same boat. The current tentative timeline that I'm aware of is to have a new
release, 0.14.0, towards the end of the month and then target 1.0.0 for the
following release (generally we release every 3-4 months).

Thanks,
Micah

[1] https://lists.apache.org/thread.html/96e595ce6c3cffa37bad23181abc9372c30457210a65d72d92593ce5@%3Cdev.arrow.apache.org%3E

On Sun, Jun 16, 2019 at 10:01 PM Bogdan Klichuk <klich...@gmail.com> wrote:

> Hello. Thanks for the reply!
>
> On Sun, Jun 16, 2019 at 8:40 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> Hi Micah,
>>
>> On Sun, Jun 16, 2019 at 12:16 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >
>> > Hi Bogdan,
>> > I'm not an expert here, but answers based on my understanding are below:
>> >
>> > > 1) Is there something I'm missing in understanding the difference between
>> > > serializing a dataframe directly using PyArrow and serializing a
>> > > `pyarrow.Table`? Table shines when dataframes mostly consist of
>> > > strings, which is frequent in our cases.
>> >
>> > Since you have mixed-type columns, the underlying data is ultimately
>> > pickled when serializing the dataframe with your code snippet:
>> > https://github.com/apache/arrow/blob/27daba047533bf4e9e1cf4485cc9d4bc5c416ec9/python/pyarrow/pandas_compat.py#L515
>> > I think this explains the performance difference.
>
> That totally explains it. I did debug it and yes, it pickles the
> dtype=object columns.
>
>> > > 2) Is `pyarrow.Table` a valid option for long-term storage of dataframes?
>> > > It seems to "just work", but mostly people just stick to Parquet or
>> > > something else.
>> >
>> > The Arrow format, in general, is NOT currently recommended for long-term
>> > storage.
>>
>> I think after the 1.0.0 protocol version is released, we can begin to
>> recommend Arrow for cold storage of data (as in "you'll be able to
>> read these files in a year or two"), but design-wise it isn't intended
>> as a data warehousing format like Parquet or ORC.
>>
>> > > 3) Parquet/Feather are as good as pyarrow.Table in terms of memory /
>> > > storage size, but quite a bit slower on half-text dataframes (2-3x slower).
>> > > Could I be doing something wrong?
>> >
>> > Parquet might be trying to do some sort of encoding. I'm not sure why
>> > Feather would be slower than pyarrow.Table (but I'm not an expert in Feather).
>> >
>> > > In case of mixed-type dataframes, JSON still seems like an option
>> > > according to our benchmarks.
>> >
>> > If you wanted to use Arrow as a format, probably the right approach here
>> > would be to make a new Union column for mixed-type columns. This would
>> > potentially slow down the write side, but make reading much quicker.
>> >
>> > > 4) Feather seems to be REALLY close and similar to pyarrow.Table in all
>> > > benchmarks. Is Feather using pyarrow.Table under the hood?
>> >
>> > My understanding is that the formats are nearly identical (mostly just a
>> > difference in metadata), so the performance similarity isn't surprising.
>
> Alright, so speaking of serialization of pyarrow.Table vs Feather, if they
> are pretty much the same, but Arrow alone shouldn't be used for long-term
> storage, is this also the case for Feather, or can it be a valid option
> for my case?
>> > On Wed, Jun 12, 2019 at 9:12 AM Bogdan Klichuk <klich...@gmail.com> wrote:
>> >
>> > > Trying to come up with a solution for quick Pandas dataframe serialization
>> > > and long-term storage. Dataframe content is tabular but provided by the
>> > > user, so it can be arbitrary: it might have completely text columns as
>> > > well as completely numeric/boolean columns.
>> > >
>> > > ## Main goals are:
>> > >
>> > > * Serialize the dataframe as quickly as possible in order to dump it to disk.
>> > >
>> > > * Use a format that I'll be able to load from disk later back into a
>> > > dataframe.
>> > >
>> > > * The least possible memory footprint during serialization and a compact
>> > > output file.
>> > >
>> > > We ran benchmarks comparing different serialization methods, including:
>> > >
>> > > * Parquet: `df.to_parquet()`
>> > > * Feather: `df.to_feather()`
>> > > * JSON: `df.to_json()`
>> > > * CSV: `df.to_csv()`
>> > > * PyArrow: `pyarrow.default_serialization_context().serialize(df)`
>> > > * PyArrow.Table:
>> > > `pyarrow.default_serialization_context().serialize(pyarrow.Table.from_pandas(df))`
>> > >
>> > > Speed of serialization and memory footprint during it are probably the
>> > > biggest factors (read: get rid of the data, dump it to disk ASAP).
>> > >
>> > > Strangely, in our benchmarks serializing `pyarrow.Table` seems the most
>> > > balanced and quite fast.
>> > >
>> > > ## Questions:
>> > >
>> > > 1) Is there something I'm missing in understanding the difference between
>> > > serializing a dataframe directly using PyArrow and serializing a
>> > > `pyarrow.Table`? Table shines when dataframes mostly consist of
>> > > strings, which is frequent in our cases.
>> > >
>> > > 2) Is `pyarrow.Table` a valid option for long-term storage of dataframes?
>> > > It seems to "just work", but mostly people just stick to Parquet or
>> > > something else.
>> > >
>> > > 3) Parquet/Feather are as good as pyarrow.Table in terms of memory /
>> > > storage size, but quite a bit slower on half-text dataframes (2-3x slower).
>> > > Could I be doing something wrong?
>> > >
>> > > In case of mixed-type dataframes, JSON still seems like an option
>> > > according to our benchmarks.
>> > >
>> > > 4) Feather seems to be REALLY close and similar to pyarrow.Table in all
>> > > benchmarks. Is Feather using pyarrow.Table under the hood?
>> > >
>> > > ----------------------------------------------------
>> > > ## Benchmarks:
>> > >
>> > > https://docs.google.com/spreadsheets/d/1O81AEZrfGMTJAB-ozZ4YZmVzriKTDrm34u-gENgyiWo/edit#gid=0
>> > >
>> > > Since we have mixed-type columns, for the following methods we do
>> > > astype(str) for all dtype=object columns before serialization:
>> > > * pyarrow.Table
>> > > * feather
>> > > * parquet
>> > >
>> > > This is also expensive, but it needs to be done since mixed-type columns
>> > > are not supported for serialization in the specified formats. The time to
>> > > perform this IS INCLUDED in the benchmarks.
>> > >
>> > > --
>> > > Best wishes,
>> > > Bogdan Klichuk
>
> --
> Best wishes,
> Bogdan Klichuk
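For reference, a minimal sketch of the two PyArrow paths benchmarked in the
thread, written against the SerializationContext API mentioned above (the toy
frame and file name are made up for illustration; newer pyarrow releases
deprecate this serialization API):

    import pandas as pd
    import pyarrow as pa

    # Toy frame; the "text" column is dtype=object, as in the half-text case.
    df = pd.DataFrame({"num": [1, 2, 3], "text": ["a", "b", "c"]})

    context = pa.default_serialization_context()

    # Path 1: serialize the DataFrame directly. Per the pandas_compat link
    # above, dtype=object columns end up pickled.
    buf_df = context.serialize(df).to_buffer()

    # Path 2: convert to a pyarrow.Table first, so string data is stored as
    # Arrow columns rather than pickled.
    table = pa.Table.from_pandas(df)
    buf_table = context.serialize(table).to_buffer()

    # Dump to disk and load back into a DataFrame.
    with open("frame.arrow.bin", "wb") as f:
        f.write(buf_table)
    with open("frame.arrow.bin", "rb") as f:
        restored = context.deserialize(f.read()).to_pandas()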
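And a sketch of the astype(str) preprocessing described for the benchmarks:
every dtype=object column is cast to str before handing the frame to
Feather/Parquet (column names and output paths are illustrative only):

    import pandas as pd

    df = pd.DataFrame({"mixed": [1, "two", 3.0], "num": [1, 2, 3]})

    # Mixed-type object columns are not supported by the Feather/Parquet
    # writers, so cast them to strings first (this conversion cost was
    # included in the benchmark timings).
    obj_cols = df.select_dtypes(include="object").columns
    df[obj_cols] = df[obj_cols].astype(str)

    df.to_feather("frame.feather")   # requires a default RangeIndex
    df.to_parquet("frame.parquet")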
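Micah's suggestion of a Union column for mixed-type data could look roughly
like the following with pyarrow's UnionArray (a sparse-union sketch only; how
well such a column round-trips through the serialization paths above would
need to be verified separately):

    import pyarrow as pa

    # A mixed column [1, "two", 3] split into per-type child arrays, plus a
    # type-id array saying which child supplies each slot.
    type_ids = pa.array([0, 1, 0], type=pa.int8())
    ints = pa.array([1, None, 3], type=pa.int64())
    strs = pa.array([None, "two", None], type=pa.string())

    mixed = pa.UnionArray.from_sparse(type_ids, [ints, strs])
    table = pa.Table.from_arrays([mixed], names=["mixed"])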