Thanks, Joris. It looks like filters isn't a valid argument for
pandas.read_parquet. Is it possible to instantiate a
pyarrow.parquet.ParquetDataset and then convert it to a pandas.DataFrame
and have the same effect?
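
Something like this is what I have in mind (an untested sketch; 'AAPL'
stands in for the ticker value, and it assumes data/v4.parquet is a
directory partitioned by Ticker):

import pyarrow.parquet as pq

# With a Hive-style partitioned directory, filters= should prune whole
# partition directories before anything is read into memory.
dataset = pq.ParquetDataset(
    'data/v4.parquet',
    filters=[('Ticker', '=', 'AAPL')],
)
stocks_close_df = dataset.read(columns=['DateTime', 'Close']).to_pandas()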

I filed an issue here: https://github.com/pandas-dev/pandas/issues/26551

Thanks,
Russell Jurney @rjurney <http://twitter.com/rjurney>
russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
<http://facebook.com/jurney> datasyndrome.com


On Mon, May 27, 2019 at 11:06 PM Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote:

> Hi Russell,
>
> Yes and no. When you specify a column selection with read_parquet, only
> the relevant columns will indeed be loaded (since Parquet is a columnar
> storage format, this is possible).
> But the filtering you show is done on the returned pandas DataFrame.
> Currently, pandas does not support any lazy operations, so the dataframe
> returned by read_parquet (stocks_close_df) is the full, materialized
> dataframe, on which you then filter a subset.
>
> But filtering could also be done *when* reading the parquet file(s), to
> actually prevent reading everything into memory. However, this is only
> partly implemented in pyarrow at the moment. If you have a dataset
> consisting of partitioned files in nested directories (Hive-like),
> pyarrow can filter which files to read. See the "filters" keyword of
> ParquetDataset (
> https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html
> ).
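>
> For example, a minimal sketch (the dataset path and the 'year'
> partition column here are made-up placeholders):
>
> import pyarrow.parquet as pq
>
> # Only the partition directories matching the filter are read from disk.
> dataset = pq.ParquetDataset(
>     'path/to/dataset_root',
>     filters=[('year', '=', 2019)],
> )
> df = dataset.read().to_pandas()
>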
> I am just not fully sure you can already use this through the pandas
> interface; it may be that you need to use the pyarrow interface directly
> (in which case, feel free to open an issue on the pandas issue tracker).
> Filtering row groups within files is not yet implemented; there is an
> open issue: https://issues.apache.org/jira/browse/ARROW-1796.
>
> Best,
> Joris
>
> On Tue, May 28, 2019 at 03:26 Russell Jurney <russell.jur...@gmail.com> wrote:
>
> > Hello, I am wondering whether pandas.read_parquet(engine='pyarrow')
> > takes advantage of Parquet by loading only the relevant columns, and
> > by using the partition column(s) sub-directories when a partition
> > column is included in the load and then filtered on later. Looking at
> > the code for pandas.read_parquet, it is not clear.
> >
> > For example, something like:
> >
> > import pandas as pd
> >
> > stocks_close_df = pd.read_parquet(
> >     'data/v4.parquet',
> >     columns=['DateTime', 'Close', 'Ticker'],
> >     engine='pyarrow'
> > )
> >
> > # Filter the data to just this ticker
> > stocks_close_df = stocks_close_df[stocks_close_df.Ticker == ticker][
> >     ['DateTime', 'Close']]
> >
> > Thanks,
> > Russell Jurney @rjurney <http://twitter.com/rjurney>
> > russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
> > <http://facebook.com/jurney> datasyndrome.com
> >
>
