I've got things working like this:

from pyarrow.parquet import ParquetDataset

# Test ticker
ticker = 'AAPL'

# Filter at read time so only rows for this ticker are loaded
stocks_close_ds = ParquetDataset(
    'data/v4.parquet',
    filters=[('Ticker', '=', ticker)]
)
table = stocks_close_ds.read()
stocks_close_df = table.to_pandas()

stocks_close_df.head() # prints the filtered pandas.DataFrame
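
The column selection can be applied in the same step, too (assuming the
'DateTime' and 'Close' columns from my earlier example):

table = stocks_close_ds.read(columns=['DateTime', 'Close'])
stocks_close_df = table.to_pandas()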


I'll look at getting this working in pandas.
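
If pandas forwards extra keyword arguments through to the pyarrow engine,
something like this one-step version might also work (an untested sketch; the
filters passthrough is an assumption on my part, not documented pandas
behavior):

import pandas as pd

stocks_close_df = pd.read_parquet(
    'data/v4.parquet',
    engine='pyarrow',
    columns=['DateTime', 'Close'],
    filters=[('Ticker', '=', ticker)]
)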

Thanks,
Russell Jurney @rjurney <http://twitter.com/rjurney>
russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
<http://facebook.com/jurney> datasyndrome.com


On Tue, May 28, 2019 at 6:56 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Russell -- yes, you can use ParquetDataset directly and read to pandas.
>
> We have been discussing a more extensive Datasets framework in C++
> that will also support multiple file formats and pluggable partition
> schemes, read more at
>
>
> https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit
>
> On Tue, May 28, 2019 at 8:21 PM Russell Jurney <russell.jur...@gmail.com>
> wrote:
> >
> > Thanks, Joris. It looks like filters isn't a valid argument for
> > pandas.read_parquet. Is it possible to instantiate a
> > pyarrow.parquet.ParquetDataset and then convert it to a pandas.DataFrame
> > and have the same effect?
> >
> > I filed an issue here: https://github.com/pandas-dev/pandas/issues/26551
> >
> > Thanks,
> > Russell Jurney @rjurney <http://twitter.com/rjurney>
> > russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
> > <http://facebook.com/jurney> datasyndrome.com
> >
> >
> > On Mon, May 27, 2019 at 11:06 PM Joris Van den Bossche <
> > jorisvandenboss...@gmail.com> wrote:
> >
> > > Hi Russell,
> > >
> > > Yes and no. When specifying a column selection with read_parquet, indeed
> > > only the relevant columns will be loaded (since Parquet is a columnar
> > > storage format, this is possible).
> > > But the filtering you show is done on the returned pandas DataFrame. And
> > > currently, pandas does not support any lazy operations, so the dataframe
> > > returned by read_parquet (stocks_close_df) is the full, materialized
> > > dataframe on which you then filter a subset.
> > >
> > > But, filtering could also be done *when* reading the parquet file(s), to
> > > actually prevent reading everything into memory. However, this is only
> > > partly implemented in pyarrow at this moment. If you have a dataset
> > > consisting of partitioned files in nested directories (Hive-like),
> > > pyarrow can filter on which files to read. See the "filters" keyword of
> > > ParquetDataset (
> > > https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html
> > > ).
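> > > For example, with a Hive-style layout such as
> > > data/Ticker=AAPL/part-0.parquet (the directory names and column here are
> > > just an illustration), only the matching directories are read:
> > >
> > > import pyarrow.parquet as pq
> > >
> > > dataset = pq.ParquetDataset(
> > >     'data/',
> > >     filters=[('Ticker', '=', 'AAPL')]
> > > )
> > > df = dataset.read().to_pandas()
> > >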
> > > I am just not fully sure you can already use this through the pandas
> > > interface; it might be that you need to use the pyarrow interface
> > > directly (in which case, feel free to open an issue on the pandas issue
> > > tracker).
> > > For filtering row groups within files, this is not yet implemented; there
> > > is an open issue: https://issues.apache.org/jira/browse/ARROW-1796.
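> > > In the meantime, you can skip row groups by hand using the file metadata
> > > (a rough sketch; the 'Close' column position and the presence of
> > > statistics depend on how the file was written):
> > >
> > > import pyarrow as pa
> > > import pyarrow.parquet as pq
> > >
> > > pf = pq.ParquetFile('data/v4.parquet')
> > > tables = []
> > > for i in range(pf.num_row_groups):
> > >     # Statistics for the 'Close' column, assuming it is column 1
> > >     stats = pf.metadata.row_group(i).column(1).statistics
> > >     # Read the row group only if its min/max range can match the filter
> > >     if stats is not None and stats.has_min_max and stats.max >= 100:
> > >         tables.append(pf.read_row_group(i))
> > > table = pa.concat_tables(tables)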
> > >
> > > Best,
> > > Joris
> > >
> > > On Tue, May 28, 2019 at 03:26, Russell Jurney <russell.jur...@gmail.com>
> > > wrote:
> > >
> > > > Hello, I am wondering whether pandas.read_parquet(engine='pyarrow')
> > > > takes advantage of Parquet by only loading the relevant columns, and
> > > > by using the partition column(s) sub-directories when a partition
> > > > column is included in the load and then filtered on later. Looking at
> > > > the code for pandas.read_parquet, it is not clear.
> > > >
> > > > For example something like:
> > > >
> > > > import pandas as pd
> > > >
> > > > stocks_close_df = pd.read_parquet(
> > > >     'data/v4.parquet',
> > > >     columns=['DateTime', 'Close', 'Ticker'],
> > > >     engine='pyarrow'
> > > > )
> > > >
> > > > # Filter the data to just this ticker
> > > > stocks_close_df = stocks_close_df[stocks_close_df.Ticker == ticker][
> > > >     ['DateTime', 'Close']]
> > > >
> > > > Thanks,
> > > > Russell Jurney @rjurney <http://twitter.com/rjurney>
> > > > russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
> > > > <http://facebook.com/jurney> datasyndrome.com
> > > >
> > >
>
