Nice, thank you for the approximate timeline!

On Mon, Jun 17, 2019 at 1:15 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> Hi Bogdan,
>
>> Alright, so speaking of serialization of pyarrow.Table vs Feather: if
>> they are pretty much the same, but Arrow alone shouldn't be used for
>> long-term storage, is this also the case for Feather, or can it be a
>> valid option for my case?
>
> Per Wes's e-mail on a similar thread [1], once we reach 1.0.0 on the
> format specification, both should be usable for longer-term storage (I
> would guess Feather would likely be preferred). Until then, I think they
> are both in the same boat. The current tentative timeline that I'm aware
> of is to have a new release, 0.14.0, towards the end of the month and
> then target 1.0.0 for the following release (generally we release every
> 3-4 months).
>
> Thanks,
> Micah
>
> [1]
> https://lists.apache.org/thread.html/96e595ce6c3cffa37bad23181abc9372c30457210a65d72d92593ce5@%3Cdev.arrow.apache.org%3E
>
> On Sun, Jun 16, 2019 at 10:01 PM Bogdan Klichuk <klich...@gmail.com> wrote:
>
>> Hello. Thanks for the reply!
>>
>> On Sun, Jun 16, 2019 at 8:40 AM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>>> hi Micah,
>>>
>>> On Sun, Jun 16, 2019 at 12:16 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>> >
>>> > Hi Bogdan,
>>> > I'm not an expert here, but answers based on my understanding are below:
>>> >
>>> > > 1) Is there something I'm missing in understanding the difference
>>> > > between serializing a dataframe directly using PyArrow and
>>> > > serializing a `pyarrow.Table`? Table shines when dataframes mostly
>>> > > consist of strings, which is frequent in our cases.
>>> >
>>> > Since you have mixed-type columns, the underlying data is ultimately
>>> > pickled when serializing the dataframe with your code snippet:
>>> > https://github.com/apache/arrow/blob/27daba047533bf4e9e1cf4485cc9d4bc5c416ec9/python/pyarrow/pandas_compat.py#L515
>>> > I think this explains the performance difference.
>>
>> That totally explains it. I debugged it, and yes, it pickles the
>> dtype=object column.
>>
>>> > > 2) Is `pyarrow.Table` a valid option for long-term storage of
>>> > > dataframes? It seems to "just work", but mostly people just stick
>>> > > to Parquet or something else.
>>> >
>>> > The Arrow format, in general, is NOT currently recommended for
>>> > long-term storage.
>>>
>>> I think after the 1.0.0 protocol version is released, we can begin to
>>> recommend Arrow for cold storage of data (as in "you'll be able to
>>> read these files in a year or two"), but design-wise it isn't intended
>>> as a data warehousing format like Parquet or ORC.
>>>
>>> > > 3) Parquet/Feather are as good as pyarrow.Table in terms of memory
>>> > > / storage size, but quite a bit slower on half-text dataframes
>>> > > (2-3x slower). Could I be doing something wrong?
>>> >
>>> > Parquet might be trying to do some sort of encoding. I'm not sure why
>>> > Feather would be slower than pyarrow.Table (but I'm not an expert on
>>> > Feather).
>>> >
>>> > > In case of mixed-type dataframes, JSON still seems like an option
>>> > > according to our benchmarks.
>>> >
>>> > If you wanted to use Arrow as a format, probably the right approach
>>> > here would be to make a new Union column for the mixed-type columns.
>>> > This would potentially slow down the write side, but make reading
>>> > much quicker.
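A minimal sketch of the dense Union idea above, using `pyarrow.UnionArray.from_dense`; the sample values and type layout are illustrative, and converting an arbitrary pandas object column into this form would still need custom code:

```python
import pyarrow as pa

# Represent the mixed-type column [1, "two", 3] as a dense union of
# int64 and string children, instead of casting everything to str.
type_ids = pa.array([0, 1, 0], type=pa.int8())   # which child each row uses
offsets = pa.array([0, 0, 1], type=pa.int32())   # row's index within that child
ints = pa.array([1, 3], type=pa.int64())
strs = pa.array(["two"], type=pa.string())

union = pa.UnionArray.from_dense(type_ids, offsets, [ints, strs])
print(union.type)  # -> a dense union type combining int64 and string
```

The trade-off mentioned above follows from this layout: the writer has to split each mixed column into per-type child arrays, but readers get typed arrays back without pickling or string parsing.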
>>> > > 4) Feather seems REALLY close to pyarrow.Table in all benchmarks.
>>> > > Is Feather using pyarrow.Table under the hood?
>>> >
>>> > My understanding is that the formats are nearly identical (mostly
>>> > just a difference in metadata), so the performance similarity isn't
>>> > surprising.
>>
>> Alright, so speaking of serialization of pyarrow.Table vs Feather: if
>> they are pretty much the same, but Arrow alone shouldn't be used for
>> long-term storage, is this also the case for Feather, or can it be a
>> valid option for my case?
>>
>>> > On Wed, Jun 12, 2019 at 9:12 AM Bogdan Klichuk <klich...@gmail.com> wrote:
>>> >
>>> > > Trying to come up with a solution for quick Pandas dataframe
>>> > > serialization and long-term storage. Dataframe content is tabular
>>> > > but provided by the user, so it can be arbitrary: it might be both
>>> > > completely text columns and completely numeric/boolean columns.
>>> > >
>>> > > ## Main goals are:
>>> > >
>>> > > * Serialize the dataframe as quickly as possible in order to dump
>>> > > it to disk.
>>> > > * Use a format that I'll be able to load from disk later, back
>>> > > into a dataframe.
>>> > > * The smallest memory footprint during serialization and a compact
>>> > > output file.
>>> > >
>>> > > I have run benchmarks comparing different serialization methods,
>>> > > including:
>>> > >
>>> > > * Parquet: `df.to_parquet()`
>>> > > * Feather: `df.to_feather()`
>>> > > * JSON: `df.to_json()`
>>> > > * CSV: `df.to_csv()`
>>> > > * PyArrow: `pyarrow.default_serialization_context().serialize(df)`
>>> > > * PyArrow.Table:
>>> > > `pyarrow.default_serialization_context().serialize(pyarrow.Table.from_pandas(df))`
>>> > >
>>> > > Speed of serialization and memory footprint during it are probably
>>> > > the biggest factors (read: get rid of the data, dump it to disk
>>> > > ASAP).
>>> > >
>>> > > Strangely, in our benchmarks serializing `pyarrow.Table` seems the
>>> > > most balanced and quite fast.
>>> > >
>>> > > ## Questions:
>>> > >
>>> > > 1) Is there something I'm missing in understanding the difference
>>> > > between serializing a dataframe directly using PyArrow and
>>> > > serializing a `pyarrow.Table`? Table shines when dataframes mostly
>>> > > consist of strings, which is frequent in our cases.
>>> > >
>>> > > 2) Is `pyarrow.Table` a valid option for long-term storage of
>>> > > dataframes? It seems to "just work", but mostly people just stick
>>> > > to Parquet or something else.
>>> > >
>>> > > 3) Parquet/Feather are as good as pyarrow.Table in terms of memory
>>> > > / storage size, but quite a bit slower on half-text dataframes
>>> > > (2-3x slower). Could I be doing something wrong?
>>> > >
>>> > > In case of mixed-type dataframes, JSON still seems like an option
>>> > > according to our benchmarks.
>>> > >
>>> > > 4) Feather seems REALLY close to pyarrow.Table in all benchmarks.
>>> > > Is Feather using pyarrow.Table under the hood?
>>> > >
>>> > > ----------------------------------------------------
>>> > > ## Benchmarks:
>>> > >
>>> > > https://docs.google.com/spreadsheets/d/1O81AEZrfGMTJAB-ozZ4YZmVzriKTDrm34u-gENgyiWo/edit#gid=0
>>> > >
>>> > > Since we have mixed-type columns, for the following methods we do
>>> > > astype(str) for all dtype=object columns before serialization
>>> > > (sketched below):
>>> > > * pyarrow.Table
>>> > > * feather
>>> > > * parquet
>>> > >
>>> > > It's also expensive, but it needs to be done, since mixed-type
>>> > > columns are not supported for serialization in those formats. The
>>> > > time to perform this IS INCLUDED in the benchmarks.
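A minimal sketch of the astype(str) preprocessing step described above; the helper name and sample data are illustrative:

```python
import pandas as pd

def stringify_object_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Cast every dtype=object column to str so that mixed-type columns
    can be serialized via pyarrow.Table / Feather / Parquet."""
    out = df.copy()
    for col in out.select_dtypes(include="object").columns:
        out[col] = out[col].astype(str)
    return out

# Example: a column mixing ints, strings, and floats becomes all-string.
df = pd.DataFrame({"mixed": [1, "two", 3.5], "num": [1, 2, 3]})
stringify_object_columns(df).to_parquet("/tmp/df.parquet")
```

Note that the cast is lossy: the original int/float values in a mixed column come back as strings on read.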
>>> > >
>>> > > --
>>> > > Best wishes,
>>> > > Bogdan Klichuk
>>
>> --
>> Best wishes,
>> Bogdan Klichuk

--
Best wishes,
Bogdan Klichuk
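For completeness, a self-contained sketch of the serialization paths compared in the thread, using the pyarrow 0.13-era `default_serialization_context()` API the benchmarks relied on (it was deprecated in later pyarrow releases); the sample data, sizes, and output paths are illustrative, and the real benchmarks also measured memory and file size:

```python
import time
import pandas as pd
import pyarrow as pa

# Half-text dataframe, roughly like the "mixed content" benchmark case.
df = pd.DataFrame({
    "text": ["some user-provided value"] * 100_000,
    "num": range(100_000),
})

context = pa.default_serialization_context()

def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label:>14}: {time.perf_counter() - start:.3f}s")

timed("parquet", lambda: df.to_parquet("/tmp/bench.parquet"))
timed("feather", lambda: df.to_feather("/tmp/bench.feather"))
timed("json", lambda: df.to_json("/tmp/bench.json"))
timed("csv", lambda: df.to_csv("/tmp/bench.csv"))
timed("pyarrow", lambda: context.serialize(df).to_buffer())
timed("pyarrow.Table",
      lambda: context.serialize(pa.Table.from_pandas(df)).to_buffer())
```

One caveat when reading such numbers: the two `serialize(...).to_buffer()` paths produce an in-memory buffer rather than writing a file, so they are not perfectly apples-to-apples with the `to_*` writers unless the buffer is also flushed to disk.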