I've got things working like this:
from pyarrow.parquet import ParquetDataset

# Test ticker
ticker = 'AAPL'
stocks_close_ds = ParquetDataset(
    'data/v4.parquet',
    filters=[('Ticker', '=', ticker)]
)
table = stocks_close_ds.read()       # pyarrow.Table with the filter applied
stocks_close_df = table.to_pandas()  # convert the pyarrow Table to pandas
stocks_close_df.head()  # prints the filtered pandas.DataFrame
I'll
Hi Russell -- yes, you can use ParquetDataset directly and read to pandas.
We have been discussing a more extensive Datasets framework in C++
that will also support multiple file formats and pluggable partition
schemes; read more at
https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB
Thanks, Joris. It looks like filters isn't a valid argument for
pandas.read_parquet. Is it possible to instantiate a
pyarrow.parquet.ParquetDataset and then convert it to a pandas.DataFrame
and have the same effect?
I filed an issue here: https://github.com/pandas-dev/pandas/issues/26551
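For reference, here's a minimal sketch of the call that fails (assuming the pyarrow engine; as far as I can tell, pandas forwards extra keywords to pyarrow.parquet.read_table, which has no filters parameter):

import pandas as pd

# Fails on current pandas/pyarrow: read_parquet forwards unknown
# keywords to the engine, and pyarrow.parquet.read_table() does not
# accept 'filters', so this raises a TypeError instead of pushing
# the filter down.
df = pd.read_parquet(
    'data/v4.parquet',
    filters=[('Ticker', '=', 'AAPL')]
)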
Thanks,
Hi Russell,
Yes and no. When specifying a column selection with read_parquet, indeed
only the relevant columns will be loaded (since Parquet is a columnar
storage format, this is possible).
But the filtering you show is done on the returned pandas DataFrame. And
currently, pandas does not support any lazy evaluation, so all rows are
read into memory before the filter is applied.
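A minimal sketch of the difference (the file path and column names are just placeholders):

import pandas as pd

# Column selection is pushed down to the Parquet reader: only the
# requested columns are deserialized from disk.
df = pd.read_parquet('data/v4.parquet', columns=['Ticker', 'Close'])

# Row filtering, by contrast, runs on the already-materialized
# DataFrame: every row was loaded before this mask is evaluated.
aapl = df[df['Ticker'] == 'AAPL']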