Suvayu Ali created ARROW-1956:
---------------------------------

             Summary: Support reading specific partitions from a partitioned parquet dataset
                 Key: ARROW-1956
                 URL: https://issues.apache.org/jira/browse/ARROW-1956
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Format
    Affects Versions: 0.8.0
         Environment: Kernel: 4.14.8-300.fc27.x86_64 Python: 3.6.3
            Reporter: Suvayu Ali
            Priority: Minor
         Attachments: so-example.py
I want to read specific partitions from a partitioned parquet dataset. This is very useful for large datasets. I have attached a small script that creates a dataset and shows what is expected when reading (salient points quoted below).

# There is no way to read specific partitions in Pandas.
# In pyarrow I tried to achieve this by providing a list of files/directories to ParquetDataset, but it didn't work.
# In PySpark it works if I simply do:
{code:none}
spark.read.option('basePath', 'datadir').parquet(*list_of_partitions)
{code}

I also couldn't find a way to easily write partitioned parquet files. In the end I did it by hand, creating the directory hierarchy and writing the individual files myself (similar to the implementation in the attached script). Again, in PySpark I can do
{code:none}
df.write.partitionBy(*list_of_partitions).parquet(output)
{code}
to achieve that.

-- 
This message was sent by Atlassian JIRA (v6.4.14#64029)
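[Editor's note: later pyarrow releases grew both halves of this request, via {{pyarrow.parquet.write_to_dataset}} with {{partition_cols}} and {{ParquetDataset}} with {{filters}}. A minimal sketch of the desired workflow; the column names and values here are made up for illustration:]

{code:none}
import tempfile

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small dataset partitioned by the "year" column; this creates
# a year=.../ directory hierarchy under root automatically.
df = pd.DataFrame({
    "year": [2016, 2016, 2017, 2017],
    "value": [1.0, 2.0, 3.0, 4.0],
})
root = tempfile.mkdtemp()
pq.write_to_dataset(pa.Table.from_pandas(df), root, partition_cols=["year"])

# Read back only the year=2017 partition by filtering on the partition key.
table = pq.ParquetDataset(root, filters=[("year", "=", 2017)]).read()
print(table.num_rows)  # only the two 2017 rows are read
{code}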