Note that I was asked to post here after making a similar comment on GitHub (https://github.com/apache/arrow/pull/4236)…
I am hoping to help improve the use of pyarrow.parquet within dask (https://github.com/dask/dask). To that end, I put together a simple notebook that explores how pyarrow.parquet can be used to read/write a partitioned dataset without dask (see: https://github.com/rjzamora/notebooks/blob/master/pandas_pyarrow_simple.ipynb). If you search for "Assuming that a single-file metadata solution is currently missing" in that notebook, you will see where I am unsure of the best way to write/read metadata to/from a centralized location using pyarrow.parquet.

I believe it would be best for dask to have a way to read/write a single metadata file for a partitioned dataset using pyarrow (perhaps a '_metadata' file?). Am I correct to assume that (1) this functionality is currently missing in pyarrow, and (2) a single shared metadata file is the best way to process a partitioned dataset in parallel? A rough sketch of what I have in mind is included below my signature.

Best,
Rick

--
Richard J. Zamora
NVIDIA
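P.S. For concreteness, here is a rough sketch of the write/read flow I have in mind. The pq.read_metadata call exists today, but the merge and write steps (set_file_path, append_row_groups, write_metadata_file) are assumptions about what such an API might look like; they are exactly the part I could not find, and the file paths here are hypothetical.

import pyarrow.parquet as pq

# The pieces of a partitioned dataset, written separately
# (e.g. one file per dask partition). Paths are hypothetical.
pieces = ["mydata/part.0.parquet", "mydata/part.1.parquet"]

# Collect the footer metadata from each piece and merge it into a
# single FileMetaData object. The set_file_path/append_row_groups/
# write_metadata_file steps below are the assumed API.
merged = None
for path in pieces:
    md = pq.read_metadata(path)
    # Record each piece's path relative to the dataset root so a
    # reader of '_metadata' can locate the underlying row groups.
    md.set_file_path(path.split("/", 1)[1])
    if merged is None:
        merged = md
    else:
        merged.append_row_groups(md)

# Persist the combined footer as a dataset-level '_metadata' file.
merged.write_metadata_file("mydata/_metadata")

# A parallel reader (e.g. dask) could then plan its work from this
# single file instead of touching every piece's footer:
dataset_meta = pq.read_metadata("mydata/_metadata")
print(dataset_meta.num_row_groups)

With something like this, dask could write '_metadata' once at the end of a to_parquet call and read it back later to assign row groups to workers without opening every file.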