Nice, thank you for the approximate timeline!

On Mon, Jun 17, 2019 at 1:15 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> Hi Bogdan,
>
>> Alright, so speaking of serialization of pyarrow.Table vs Feather: if
>> they are pretty much the same, but Arrow alone shouldn't be used for
>> long-term storage, is this also the case for Feather, or can it be a
>> valid option for my case?
>
> Per Wes's e-mail on a similar thread [1], once we reach 1.0.0 on the
> format specification, both should be usable for longer-term storage (I
> would guess Feather would likely be preferred). Until then, I think they
> are both in the same boat. The current tentative timeline that I'm aware
> of is to have a new release, 0.14.0, towards the end of the month and
> then target 1.0.0 for the following release (generally we release every
> 3-4 months).
>
> Thanks,
> Micah
>
> [1]
> https://lists.apache.org/thread.html/96e595ce6c3cffa37bad23181abc9372c30457210a65d72d92593ce5@%3Cdev.arrow.apache.org%3E
>
> On Sun, Jun 16, 2019 at 10:01 PM Bogdan Klichuk <klich...@gmail.com> wrote:
>
>> Hello. Thanks for the reply!
>>
>> On Sun, Jun 16, 2019 at 8:40 AM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>>> hi Micah,
>>>
>>> On Sun, Jun 16, 2019 at 12:16 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>> >
>>> > Hi Bogdan,
>>> > I'm not an expert here, but answers based on my understanding are below:
>>> >
>>> > > 1) Is there something I'm missing in understanding the difference
>>> > > between serializing a dataframe directly using PyArrow and
>>> > > serializing a `pyarrow.Table`? Table shines when dataframes mostly
>>> > > consist of strings, which is frequent in our cases.
>>> >
>>> > Since you have mixed-type columns, the underlying data is ultimately
>>> > pickled when serializing the dataframe with your code snippet:
>>> > https://github.com/apache/arrow/blob/27daba047533bf4e9e1cf4485cc9d4bc5c416ec9/python/pyarrow/pandas_compat.py#L515
>>> > I think this explains the performance difference.
>>
>> That totally explains it. I debugged it, and yes, it pickles the
>> dtype=object column.
>>
>>> > > 2) Is `pyarrow.Table` a valid option for long-term storage of
>>> > > dataframes? It seems to "just work", but mostly people just stick
>>> > > to Parquet or something else.
>>> >
>>> > The Arrow format, in general, is NOT currently recommended for
>>> > long-term storage.
>>>
>>> I think after the 1.0.0 protocol version is released, we can begin to
>>> recommend Arrow for cold storage of data (as in "you'll be able to
>>> read these files in a year or two"), but design-wise it isn't intended
>>> as a data warehousing format like Parquet or ORC.
>>>
>>> > > 3) Parquet/Feather are as good as pyarrow.Table in terms of memory
>>> > > / storage size, but quite a bit slower on half-text dataframes
>>> > > (2-3x slower). Could I be doing something wrong?
>>> >
>>> > Parquet might be trying to do some sort of encoding. I'm not sure why
>>> > Feather would be slower than pyarrow.Table (but I'm not an expert on
>>> > Feather).
>>> >
>>> > > In case of mixed-type dataframes, JSON still seems like an option
>>> > > according to our benchmarks.
>>> >
>>> > If you wanted to use Arrow as a format, probably the right approach
>>> > here would be to make a new Union column for the mixed-type columns.
>>> > This would potentially slow down the write side, but make reading
>>> > much quicker.
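A minimal sketch of the dense Union idea above, using `pyarrow.UnionArray.from_dense`; the sample values and type layout are illustrative, and converting an arbitrary pandas object column into this form would still need custom code:

```python
import pyarrow as pa

# Represent the mixed-type column [1, "two", 3] as a dense union of
# int64 and string children, instead of casting everything to str.
type_ids = pa.array([0, 1, 0], type=pa.int8())   # which child each row uses
offsets = pa.array([0, 0, 1], type=pa.int32())   # row's index within that child
ints = pa.array([1, 3], type=pa.int64())
strs = pa.array(["two"], type=pa.string())

union = pa.UnionArray.from_dense(type_ids, offsets, [ints, strs])
print(union.type)  # -> a dense union type combining int64 and string
```

The trade-off mentioned above follows from this layout: the writer has to split each mixed column into per-type child arrays, but readers get typed arrays back without pickling or string parsing.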
>>> > > 4) Feather seems REALLY close to pyarrow.Table in all benchmarks.
>>> > > Is Feather using pyarrow.Table under the hood?
>>> >
>>> > My understanding is that the formats are nearly identical (mostly
>>> > just a difference in metadata), so the performance similarity isn't
>>> > surprising.
>>
>> Alright, so speaking of serialization of pyarrow.Table vs Feather: if
>> they are pretty much the same, but Arrow alone shouldn't be used for
>> long-term storage, is this also the case for Feather, or can it be a
>> valid option for my case?
>>
>>> > On Wed, Jun 12, 2019 at 9:12 AM Bogdan Klichuk <klich...@gmail.com> wrote:
>>> >
>>> > > Trying to come up with a solution for quick Pandas dataframe
>>> > > serialization and long-term storage. Dataframe content is tabular
>>> > > but provided by the user, so it can be arbitrary: it might be both
>>> > > completely text columns and completely numeric/boolean columns.
>>> > >
>>> > > ## Main goals are:
>>> > >
>>> > > * Serialize the dataframe as quickly as possible in order to dump
>>> > > it to disk.
>>> > > * Use a format that I'll be able to load from disk later, back
>>> > > into a dataframe.
>>> > > * The smallest memory footprint during serialization and a compact
>>> > > output file.
>>> > >
>>> > > I have run benchmarks comparing different serialization methods,
>>> > > including:
>>> > >
>>> > > * Parquet: `df.to_parquet()`
>>> > > * Feather: `df.to_feather()`
>>> > > * JSON: `df.to_json()`
>>> > > * CSV: `df.to_csv()`
>>> > > * PyArrow: `pyarrow.default_serialization_context().serialize(df)`
>>> > > * PyArrow.Table:
>>> > > `pyarrow.default_serialization_context().serialize(pyarrow.Table.from_pandas(df))`
>>> > >
>>> > > Speed of serialization and memory footprint during it are probably
>>> > > the biggest factors (read: get rid of the data, dump it to disk
>>> > > ASAP).
>>> > >
>>> > > Strangely, in our benchmarks serializing `pyarrow.Table` seems the
>>> > > most balanced and quite fast.
>>> > >
>>> > > ## Questions:
>>> > >
>>> > > 1) Is there something I'm missing in understanding the difference
>>> > > between serializing a dataframe directly using PyArrow and
>>> > > serializing a `pyarrow.Table`? Table shines when dataframes mostly
>>> > > consist of strings, which is frequent in our cases.
>>> > >
>>> > > 2) Is `pyarrow.Table` a valid option for long-term storage of
>>> > > dataframes? It seems to "just work", but mostly people just stick
>>> > > to Parquet or something else.
>>> > >
>>> > > 3) Parquet/Feather are as good as pyarrow.Table in terms of memory
>>> > > / storage size, but quite a bit slower on half-text dataframes
>>> > > (2-3x slower). Could I be doing something wrong?
>>> > >
>>> > > In case of mixed-type dataframes, JSON still seems like an option
>>> > > according to our benchmarks.
>>> > >
>>> > > 4) Feather seems REALLY close to pyarrow.Table in all benchmarks.
>>> > > Is Feather using pyarrow.Table under the hood?
>>> > >
>>> > > ----------------------------------------------------
>>> > > ## Benchmarks:
>>> > >
>>> > > https://docs.google.com/spreadsheets/d/1O81AEZrfGMTJAB-ozZ4YZmVzriKTDrm34u-gENgyiWo/edit#gid=0
>>> > >
>>> > > Since we have mixed-type columns, for the following methods we do
>>> > > astype(str) for all dtype=object columns before serialization
>>> > > (sketched below):
>>> > > * pyarrow.Table
>>> > > * feather
>>> > > * parquet
>>> > >
>>> > > It's also expensive, but it needs to be done, since mixed-type
>>> > > columns are not supported for serialization in those formats. The
>>> > > time to perform this IS INCLUDED in the benchmarks.
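A minimal sketch of the astype(str) preprocessing step described above; the helper name and sample data are illustrative:

```python
import pandas as pd

def stringify_object_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Cast every dtype=object column to str so that mixed-type columns
    can be serialized via pyarrow.Table / Feather / Parquet."""
    out = df.copy()
    for col in out.select_dtypes(include="object").columns:
        out[col] = out[col].astype(str)
    return out

# Example: a column mixing ints, strings, and floats becomes all-string.
df = pd.DataFrame({"mixed": [1, "two", 3.5], "num": [1, 2, 3]})
stringify_object_columns(df).to_parquet("/tmp/df.parquet")
```

Note that the cast is lossy: the original int/float values in a mixed column come back as strings on read.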
>>> > >
>>> > > --
>>> > > Best wishes,
>>> > > Bogdan Klichuk
>>
>> --
>> Best wishes,
>> Bogdan Klichuk

--
Best wishes,
Bogdan Klichuk
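For completeness, a self-contained sketch of the serialization paths compared in the thread, using the pyarrow 0.13-era `default_serialization_context()` API the benchmarks relied on (it was deprecated in later pyarrow releases); the sample data, sizes, and output paths are illustrative, and the real benchmarks also measured memory and file size:

```python
import time
import pandas as pd
import pyarrow as pa

# Half-text dataframe, roughly like the "mixed content" benchmark case.
df = pd.DataFrame({
    "text": ["some user-provided value"] * 100_000,
    "num": range(100_000),
})

context = pa.default_serialization_context()

def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label:>14}: {time.perf_counter() - start:.3f}s")

timed("parquet", lambda: df.to_parquet("/tmp/bench.parquet"))
timed("feather", lambda: df.to_feather("/tmp/bench.feather"))
timed("json", lambda: df.to_json("/tmp/bench.json"))
timed("csv", lambda: df.to_csv("/tmp/bench.csv"))
timed("pyarrow", lambda: context.serialize(df).to_buffer())
timed("pyarrow.Table",
      lambda: context.serialize(pa.Table.from_pandas(df)).to_buffer())
```

One caveat when reading such numbers: the two `serialize(...).to_buffer()` paths produce an in-memory buffer rather than writing a file, so they are not perfectly apples-to-apples with the `to_*` writers unless the buffer is also flushed to disk.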