Trying to come up with a solution for quick Pandas dataframe serialization and long-term storage. The dataframe content is tabular but user-provided, so it can be arbitrary: it might contain entirely text columns as well as entirely numeric/boolean columns.
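For concreteness, a hypothetical example of the kind of frame involved (column names invented for illustration): one all-text column, one all-numeric column, and one mixed-type column that pandas stores as `dtype=object`:

```python
import pandas as pd

# Hypothetical user-provided frame: arbitrary column types.
df = pd.DataFrame({
    "name": ["alice", "bob", "carol"],   # entirely text
    "score": [1.5, 2.0, 3.25],           # entirely numeric
    "value": [42, "n/a", True],          # mixed types -> dtype=object
})
print(df.dtypes)
```

The mixed `value` column is what causes trouble for the Arrow-based formats discussed below.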
## Main goals are:

* Serialize the dataframe as quickly as possible in order to dump it to disk.
* Use a format that I'll be able to load from disk back into a dataframe later.
* Keep the memory footprint of serialization low and the output file compact.

I have run benchmarks comparing different serialization methods, including:

* Parquet: `df.to_parquet()`
* Feather: `df.to_feather()`
* JSON: `df.to_json()`
* CSV: `df.to_csv()`
* PyArrow: `pyarrow.default_serialization_context().serialize(df)`
* PyArrow.Table: `pyarrow.default_serialization_context().serialize(pyarrow.Table.from_pandas(df))`

Serialization speed and memory footprint during serialization are probably the biggest factors (read: get rid of the data, dump it to disk ASAP). Strangely, in our benchmarks serializing a `pyarrow.Table` seems the most balanced option and is quite fast.

## Questions:

1) Is there something I'm missing in understanding the difference between serializing a dataframe directly with PyArrow and serializing a `pyarrow.Table`? Table shines when dataframes mostly consist of strings, which is frequent in our cases.
2) Is `pyarrow.Table` a valid option for long-term storage of dataframes? It seems to "just work", but mostly people stick to Parquet or something else.
3) Parquet/Feather are as good as `pyarrow.Table` in terms of memory / storage size, but quite a bit slower (2-3x) on half-text dataframes. Could I be doing something wrong? For mixed-type dataframes, JSON still seems like an option according to our benchmarks.
4) Feather seems REALLY close to `pyarrow.Table` in all benchmarks. Is Feather using `pyarrow.Table` under the hood?
----------------------------------------------------

## Benchmarks:

https://docs.google.com/spreadsheets/d/1O81AEZrfGMTJAB-ozZ4YZmVzriKTDrm34u-gENgyiWo/edit#gid=0

Since we have mixed-type columns, for the following methods we do `astype(str)` on all `dtype=object` columns before serialization:

* pyarrow.Table
* feather
* parquet

This is also expensive, but it has to be done, since mixed-type columns are not supported for serialization in these formats. The time to perform this IS INCLUDED in the benchmarks.

--

Best wishes, Bogdan Klichuk