Hi Russell,

Yes and no. When specifying a column selection with read_parquet, indeed
only the relevant columns will be loaded (since Parquet is a columnar
storage format, this is possible).
But the filtering you show is done on the returned pandas DataFrame, and
pandas currently does not support lazy operations, so the dataframe
returned by read_parquet (stocks_close_df) is the full, materialized
dataframe on which you then filter a subset.

But filtering could also be done *while* reading the parquet file(s), to
actually prevent reading everything into memory. However, this is only
partly implemented in pyarrow at the moment. If you have a dataset
consisting of partitioned files in nested directories (Hive-like), pyarrow
can filter which files to read at all. See the "filters" keyword of
ParquetDataset (
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html).
I am not fully sure, though, whether you can already use this through the
pandas interface; it might be that you need to use the pyarrow interface
directly (in which case, feel free to open an issue on the pandas issue
tracker).
Filtering row groups *within* files is not yet implemented; there is an
open issue for it: https://issues.apache.org/jira/browse/ARROW-1796.

Best,
Joris

On Tue, 28 May 2019 at 03:26, Russell Jurney <[email protected]> wrote:

> Hello, I am wondering if pandas.read_parquet(engine='pyarrow') takes
> advantage of Parquet by only loading the relevant columns and by using the
> partition column(s) sub-directories if a partition column is included in
> the load and then filtered on later? Looking at the code for
> pandas.read_parquet it is not clear.
>
> For example something like:
>
> stocks_close_df = pd.read_parquet(
>     'data/v4.parquet',
>     columns=['DateTime', 'Close', 'Ticker'],
>     engine='pyarrow'
> )
>
> # Filter the data to just this ticker
> stocks_close_df = stocks_close_df[stocks_close_df.Ticker == ticker][
>     ['DateTime', 'Close']]
>
> Thanks,
> Russell Jurney @rjurney <http://twitter.com/rjurney>
> [email protected] LI <http://linkedin.com/in/russelljurney> FB
> <http://facebook.com/jurney> datasyndrome.com
>
