Re: PyArrow + GCSFS not loading data when using filters... and also performance

2022-01-24 Thread Kelton Halbert
Thank you very much for the helpful response, Alenka. This brings much more clarity to the partitioning system and how I should be interacting with it. I’m in the process of re-processing my dataset to use integers for the date partition fields while keeping strings for the site identifiers. I do…
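
Along the lines of that re-processing step, a minimal sketch of writing a dataset with integer date partition fields and a string site field might look like the following. The column names, example values, hive flavor, and output path are assumptions for illustration, not the poster's actual pipeline.

    import pyarrow as pa
    import pyarrow.dataset as ds

    # Toy table standing in for the radiosonde data; values are made up.
    table = pa.table({
        "year": pa.array([2022, 2022], type=pa.int16()),
        "month": pa.array([1, 1], type=pa.int8()),
        "day": pa.array([9, 9], type=pa.int8()),
        "hour": pa.array([0, 12], type=pa.int8()),
        "site": pa.array(["KOUN", "KOUN"], type=pa.string()),
        "temperature": pa.array([10.5, 12.3], type=pa.float32()),
    })

    # Integer types for the date fields, a string for the site identifier.
    part = ds.partitioning(
        pa.schema([
            ("year", pa.int16()),
            ("month", pa.int8()),
            ("day", pa.int8()),
            ("hour", pa.int8()),
            ("site", pa.string()),
        ]),
        flavor="hive",
    )

    ds.write_dataset(table, "hires_sonde_repartitioned",
                     format="parquet", partitioning=part)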

Re: PyArrow + GCSFS not loading data when using filters...

2022-01-12 Thread Alenka Frim
Hello Kelton, after playing around with the files you referenced and the code you added, the following can be observed and improved to make the code work: *1) Defining the partitioning of a dataset* Running *data.files* on your dataset shows that the files are partitioned according to the *hi…
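
A sketch of the kind of check and fix this points at: list the dataset's files to see how the directories are actually laid out, then declare the partitioning explicitly with the types the filter expressions will use. The bucket path is taken from the other messages in the thread; the field types and the hive flavor are assumptions.

    import pyarrow as pa
    import pyarrow.dataset as ds
    import gcsfs

    # GCS access; pass credentials (or token="anon" for a public bucket).
    fs = gcsfs.GCSFileSystem()

    # Declare the partitioning explicitly so the partition fields get known
    # types instead of whatever is inferred from the directory names.
    part = ds.partitioning(
        pa.schema([("year", pa.int16()), ("month", pa.int8()),
                   ("day", pa.int8()), ("hour", pa.int8()),
                   ("site", pa.string())]),
        flavor="hive",
    )

    data = ds.dataset("global-radiosondes/hires_sonde",
                      filesystem=fs, format="parquet", partitioning=part)

    # Inspect the on-bucket layout; the directory names reveal the scheme
    # (key=value segments for hive, bare values for directory partitioning).
    print(data.files[:5])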

Re: PyArrow + GCSFS not loading data when using filters...

2022-01-09 Thread Kelton Halbert
An example using the pyarrow.dataset API…

    data = ds.dataset("global-radiosondes/hires_sonde",
                      filesystem=fs,
                      format="parquet",
                      partitioning=["year", "month", "day", "hour", "site"])
    subset = (ds.field("year") == "2022") & (ds.field("month") == "01") \
             & (ds.field(…
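
The preview cuts the snippet off; a complete, runnable version along the lines of the visible fragment might look like the sketch below. The trailing filter term, the to_table() call, and the gcsfs setup are assumptions, since they are not visible in the preview.

    import pyarrow.dataset as ds
    import gcsfs

    fs = gcsfs.GCSFileSystem()

    # Directory partitioning given as a list of field names: the types of the
    # partition fields are inferred from the path segments.
    data = ds.dataset("global-radiosondes/hires_sonde",
                      filesystem=fs, format="parquet",
                      partitioning=["year", "month", "day", "hour", "site"])

    # String comparisons like "2022" only match if the partition fields really
    # are strings; if the files are laid out hive-style (key=value directories)
    # or the fields are inferred as integers, this filter can match nothing.
    subset = (ds.field("year") == "2022") & (ds.field("month") == "01") \
             & (ds.field("day") == "09")  # remaining term assumed

    table = data.to_table(filter=subset)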

PyArrow + GCSFS not loading data when using filters...

2022-01-09 Thread Kelton Halbert
Hello - I’m not sure if this is a bug or if I’m not using the API correctly, but I have a partitioned Parquet dataset stored in a Google Cloud bucket that I am attempting to load for analysis. However, when applying filters to the dataset (using both the pyarrow.dataset and pyarrow.parquet.Parq…
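
For reference, a sketch of the two filtering routes the message mentions, with a hypothetical hive layout and integer partition values; the bucket path comes from the follow-up message, everything else is assumed.

    import pyarrow.dataset as ds
    import pyarrow.parquet as pq
    import gcsfs

    fs = gcsfs.GCSFileSystem()
    path = "global-radiosondes/hires_sonde"

    # Route 1: pyarrow.dataset with an expression filter. Note the integer
    # values; with hive partitioning the partition field types are inferred,
    # so string values like "2022" may not match.
    dataset = ds.dataset(path, filesystem=fs, format="parquet",
                         partitioning="hive")
    table = dataset.to_table(
        filter=(ds.field("year") == 2022) & (ds.field("month") == 1))

    # Route 2: pyarrow.parquet.ParquetDataset with list-of-tuples filters.
    pq_dataset = pq.ParquetDataset(path, filesystem=fs,
                                   filters=[("year", "=", 2022),
                                            ("month", "=", 1)])
    table2 = pq_dataset.read()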