Suvayu Ali created ARROW-1956:
---------------------------------

             Summary: Support reading specific partitions from a partitioned 
parquet dataset
                 Key: ARROW-1956
                 URL: https://issues.apache.org/jira/browse/ARROW-1956
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Format
    Affects Versions: 0.8.0
         Environment: Kernel: 4.14.8-300.fc27.x86_64
Python: 3.6.3
            Reporter: Suvayu Ali
            Priority: Minor
         Attachments: so-example.py

I want to read specific partitions from a partitioned parquet dataset.  This is 
very useful for large datasets.  I have attached a small script that creates a 
dataset and shows what is expected when reading (quoting the salient points 
below).

# There is no way to read specific partitions in Pandas.
# In pyarrow I tried to achieve the goal by providing a list of 
files/directories to ParquetDataset, but it didn't work.
# In PySpark it works if I simply do:
{code:none}
spark.read.option('basePath', 'datadir').parquet(*list_of_partitions)
{code}
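
To make the request concrete, here is a minimal sketch of the selection step that currently has to be done by hand. It assumes the Hive-style key=value directory layout that Spark's partitionBy produces. The helper names (parse_partition, select_partitions) and the column names in the usage note are hypothetical, not part of pyarrow or the attached script, and only the standard library is used.

```python
from pathlib import Path


def parse_partition(path, base):
    """Return the key=value pairs encoded in a partition directory path."""
    parts = Path(path).relative_to(base).parts
    return dict(p.split("=", 1) for p in parts if "=" in p)


def select_partitions(base, **wanted):
    """List leaf partition directories under base whose keys match wanted.

    Values are compared as strings, since directory names carry no types.
    """
    base = Path(base)
    selected = []
    for d in sorted(p for p in base.rglob("*") if p.is_dir()):
        # Skip intermediate levels: only leaf directories hold data files.
        if any(c.is_dir() for c in d.iterdir()):
            continue
        keys = parse_partition(d, base)
        if all(keys.get(k) == str(v) for k, v in wanted.items()):
            selected.append(str(d))
    return selected
```

For example, select_partitions('datadir', year=2017) would return every leaf directory under datadir/year=2017, which is the kind of list I would like to be able to hand to a reader.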

I also couldn't find a way to easily write partitioned parquet files.  In the 
end I did it by hand, creating the directory hierarchies and writing the 
individual files myself (similar to the implementation in the attached script).  
Again, in PySpark I can do 
{code:none}
df.write.partitionBy(*partition_columns).parquet(output)
{code}
to achieve that.
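
The by-hand approach can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the attached script: rows are plain dicts, the layout is the Hive-style key=value hierarchy, and the helper names (partition_dir, group_by_partition) are hypothetical. The actual parquet write for each group is deliberately left out.

```python
from collections import defaultdict
from pathlib import Path


def partition_dir(base, row, partition_cols):
    """Build the Hive-style directory for one row, e.g. base/year=2017/month=1."""
    return Path(base).joinpath(*(f"{c}={row[c]}" for c in partition_cols))


def group_by_partition(rows, base, partition_cols):
    """Map each partition directory to its rows, minus the partition columns."""
    groups = defaultdict(list)
    for row in rows:
        target = partition_dir(base, row, partition_cols)
        # Partition values live in the path, so drop them from the payload.
        payload = {k: v for k, v in row.items() if k not in partition_cols}
        groups[target].append(payload)
    return dict(groups)
```

Each directory returned by group_by_partition would then be created (for instance with Path.mkdir(parents=True, exist_ok=True)) and its rows written out as a single parquet file inside it.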

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
