Thanks, Joris. It looks like filters isn't a valid argument for pandas.read_parquet. Is it possible to instantiate a pyarrow.parquet.ParquetDataset with filters and then convert it to a pandas DataFrame, to get the same effect?
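Something like this seems to do it (a rough sketch, assuming the dataset at 'data/v4.parquet' is Hive-partitioned on Ticker as described below, so the filter can only prune partition directories, not row groups):

import pyarrow.parquet as pq

# Assumes a Hive-partitioned directory layout with a Ticker
# partition column (hypothetical, per the discussion below)
dataset = pq.ParquetDataset(
    'data/v4.parquet',
    filters=[('Ticker', '=', 'AAPL')],  # prunes whole directories only
)

# Read only the needed columns, then materialize as a pandas DataFrame
stocks_close_df = dataset.read(columns=['DateTime', 'Close']).to_pandas()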
I filed an issue here: https://github.com/pandas-dev/pandas/issues/26551

Thanks,
Russell Jurney @rjurney <http://twitter.com/rjurney>
russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB <http://facebook.com/jurney> datasyndrome.com

On Mon, May 27, 2019 at 11:06 PM Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote:

> Hi Russell,
>
> Yes and no. When specifying a column selection with read_parquet, indeed
> only the relevant columns will be loaded (since Parquet is a columnar
> storage format, this is possible).
> But the filtering you show is done on the returned pandas DataFrame, and
> pandas currently does not support lazy operations, so the dataframe
> returned by read_parquet (stocks_close_df) is the full, materialized
> dataframe, on which you then filter out a subset.
>
> But filtering could also be done *when* reading the parquet file(s), to
> actually prevent reading everything into memory. However, this is only
> partly implemented in pyarrow at the moment. If you have a dataset
> consisting of partitioned files in nested directories (Hive-like), pyarrow
> can filter which files to read. See the "filters" keyword of
> ParquetDataset (
> https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html
> ).
> I am just not fully sure you can already use this through the pandas
> interface; it may be that you need to use the pyarrow interface directly (in
> which case, feel free to open an issue on the pandas issue tracker).
> Filtering row groups within files is not yet implemented; there is an
> open issue for it: https://issues.apache.org/jira/browse/ARROW-1796
>
> Best,
> Joris
>
> On Tue, May 28, 2019 at 03:26 Russell Jurney <russell.jur...@gmail.com> wrote:
>
> > Hello, I am wondering: does pandas.read_parquet(engine='pyarrow') take
> > advantage of Parquet by loading only the relevant columns, and by using
> > the partition column(s) sub-directories if a partition column is included
> > in the load and then filtered on later? Looking at the code for
> > pandas.read_parquet, it is not clear.
> >
> > For example, something like:
> >
> > stocks_close_df = pd.read_parquet(
> >     'data/v4.parquet',
> >     columns=['DateTime', 'Close', 'Ticker'],
> >     engine='pyarrow'
> > )
> >
> > # Filter the data to just this ticker
> > stocks_close_df = stocks_close_df[stocks_close_df.Ticker == ticker][['DateTime', 'Close']]
> >
> > Thanks,
> > Russell Jurney @rjurney <http://twitter.com/rjurney>
> > russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney>
> > FB <http://facebook.com/jurney> datasyndrome.com
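P.S. For reference, the "Hive-like" nested directory layout that makes this directory-level filtering possible is what pyarrow's write_to_dataset produces. A rough sketch (data, path, and columns hypothetical):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical example data
df = pd.DataFrame({
    'DateTime': list(pd.date_range('2019-01-01', periods=4, freq='D')) * 2,
    'Ticker': ['AAPL'] * 4 + ['MSFT'] * 4,
    'Close': [150.0, 151.2, 149.8, 152.3, 104.1, 105.0, 103.7, 106.2],
})

# Writes data/v4.parquet/Ticker=AAPL/... and data/v4.parquet/Ticker=MSFT/...,
# so ParquetDataset(..., filters=...) can skip whole partition directories.
pq.write_to_dataset(
    pa.Table.from_pandas(df),
    root_path='data/v4.parquet',
    partition_cols=['Ticker'],
)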