Note that I was asked to post here after making a similar comment on GitHub (https://github.com/apache/arrow/pull/4236)…
I am hoping to help improve the use of pyarrow.parquet within dask (https://github.com/dask/dask). To that end, I put together a simple notebook that explores how pyarrow.parquet can be used to read/write a partitioned dataset without dask (see: https://github.com/rjzamora/notebooks/blob/master/pandas_pyarrow_simple.ipynb). If you search for "Assuming that a single-file metadata solution is currently missing" in that notebook, you will see where I am unsure of the best way to write/read metadata to/from a centralized location using pyarrow.parquet.

I believe it would be best for dask to have a way to read/write a single metadata file for a partitioned dataset using pyarrow (perhaps a '_metadata' file?). Am I correct to assume that (1) this functionality is currently missing in pyarrow, and (2) a single shared metadata file is the best way to process a partitioned dataset in parallel? A rough sketch of what I have in mind is included below my signature.

Best,
Rick

--
Richard J. Zamora
NVIDIA
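P.S. For concreteness, here is a rough sketch of the write/read flow I have in mind. The pq.read_metadata call exists today, but the merge and write steps (set_file_path, append_row_groups, write_metadata_file) are assumptions about what such an API might look like; they are exactly the part I could not find, and the file paths here are hypothetical.

import pyarrow.parquet as pq

# The pieces of a partitioned dataset, written separately
# (e.g. one file per dask partition). Paths are hypothetical.
pieces = ["mydata/part.0.parquet", "mydata/part.1.parquet"]

# Collect the footer metadata from each piece and merge it into a
# single FileMetaData object. The set_file_path/append_row_groups/
# write_metadata_file steps below are the assumed API.
merged = None
for path in pieces:
    md = pq.read_metadata(path)
    # Record each piece's path relative to the dataset root so a
    # reader of '_metadata' can locate the underlying row groups.
    md.set_file_path(path.split("/", 1)[1])
    if merged is None:
        merged = md
    else:
        merged.append_row_groups(md)

# Persist the combined footer as a dataset-level '_metadata' file.
merged.write_metadata_file("mydata/_metadata")

# A parallel reader (e.g. dask) could then plan its work from this
# single file instead of touching every piece's footer:
dataset_meta = pq.read_metadata("mydata/_metadata")
print(dataset_meta.num_row_groups)

With something like this, dask could write '_metadata' once at the end of a to_parquet call and read it back later to assign row groups to workers without opening every file.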