Re: Column/Partition Pruning

2019-05-29 Thread Russell Jurney
I've got things working like this:

    # Test ticker
    ticker = 'AAPL'
    stocks_close_ds = ParquetDataset(
        'data/v4.parquet',
        filters=[('Ticker', '=', ticker)]
    )
    table = stocks_close_ds.read()
    stocks_close_df = table.to_pandas()
    stocks_close_df.head()  # prints the filtered pandas.DataFrame

I'll

Re: Column/Partition Pruning

2019-05-28 Thread Wes McKinney
Hi Russell -- yes, you can use ParquetDataset directly and read to pandas. We have been discussing a more extensive Datasets framework in C++ that will also support multiple file formats and pluggable partition schemes; read more at https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB

Re: Column/Partition Pruning

2019-05-28 Thread Russell Jurney
Thanks, Joris. It looks like filters isn't a valid argument for pandas.read_parquet. Is it possible to instantiate a pyarrow.parquet.ParquetDataset and then convert it to a pandas.DataFrame and have the same effect? I filed an issue here: https://github.com/pandas-dev/pandas/issues/26551 Thanks,

Re: Column/Partition Pruning

2019-05-27 Thread Joris Van den Bossche
Hi Russell, Yes and no. When specifying a column selection with read_parquet, indeed only the relevant columns will be loaded (since Parquet is a columnar storage format, this is possible). But the filtering you show is done on the returned pandas DataFrame, and currently pandas does not support any lazy