hi Russell -- yes, you can use ParquetDataset directly and then read the result into pandas.
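
For example, something like this (an untested sketch; the path and ticker
value are hypothetical, and the filters keyword assumes a Hive-partitioned
directory with a Ticker partition column, per Joris's notes below):

import pyarrow.parquet as pq

# hypothetical partitioned dataset path; filters prunes whole partition
# directories, so non-matching files are never read from disk
dataset = pq.ParquetDataset(
    'data/v4/',
    filters=[('Ticker', '=', 'AAPL')]
)

# read only the needed columns into an Arrow Table, then convert to pandas
stocks_close_df = dataset.read(columns=['DateTime', 'Close']).to_pandas()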

We have been discussing a more extensive Datasets framework in C++
that will also support multiple file formats and pluggable partition
schemes, read more at

https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit

On Tue, May 28, 2019 at 8:21 PM Russell Jurney <russell.jur...@gmail.com> wrote:
>
> Thanks, Joris. It looks like filters isn't a valid argument for
> pandas.read_parquet. Is it possible to instantiate a
> pyarrow.parquet.ParquetDataset and then convert it to a pandas.DataFrame
> and have the same effect?
>
> I filed an issue here: https://github.com/pandas-dev/pandas/issues/26551
>
> Thanks,
> Russell Jurney @rjurney <http://twitter.com/rjurney>
> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
> <http://facebook.com/jurney> datasyndrome.com
>
>
> On Mon, May 27, 2019 at 11:06 PM Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote:
>
> > Hi Russell,
> >
> > Yes and no. When specifying a column selection with read_parquet, only
> > the relevant columns will indeed be loaded (this is possible because
> > Parquet is a columnar storage format).
> > But the filtering you show is done on the returned pandas DataFrame, and
> > pandas currently does not support lazy operations, so the dataframe
> > returned by read_parquet (stocks_close_df) is the full, materialized
> > dataframe, on which you then filter a subset.
> >
> > But filtering could also be done *when* reading the parquet file(s), to
> > actually prevent reading everything into memory. However, this is only
> > partly implemented in pyarrow at the moment. If you have a dataset
> > consisting of partitioned files in nested directories (Hive-like),
> > pyarrow can filter which files to read; see the "filters" keyword of
> > ParquetDataset (
> > https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html
> > ).
> > I am just not fully sure you can already use this through the pandas
> > interface; you might need to use the pyarrow interface directly (in
> > which case, feel free to open an issue on the pandas issue tracker).
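> >
> > A small sketch of what that file-level filtering looks like (the paths
> > and ticker values are hypothetical):
> >
> > import pyarrow.parquet as pq
> >
> > # Hypothetical Hive-like layout:
> > #   data/v4/Ticker=AAPL/part-0.parquet
> > #   data/v4/Ticker=MSFT/part-0.parquet
> > dataset = pq.ParquetDataset('data/v4/', filters=[('Ticker', '=', 'AAPL')])
> >
> > # only the pieces under Ticker=AAPL remain; the other partition
> > # directories are never opened
> > print([piece.path for piece in dataset.pieces])
> >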
> > Filtering row groups within files is not yet implemented; there is an
> > open issue for it: https://issues.apache.org/jira/browse/ARROW-1796.
> >
> > Best,
> > Joris
> >
> > On Tue, May 28, 2019 at 03:26, Russell Jurney <russell.jur...@gmail.com> wrote:
> >
> > > Hello, I am wondering: does pandas.read_parquet(engine='pyarrow') take
> > > advantage of Parquet by only loading the relevant columns, and by using
> > > the partition column(s) sub-directories when a partition column is
> > > included in the load and then filtered on later? Looking at the code for
> > > pandas.read_parquet it is not clear.
> > >
> > > For example something like:
> > >
> > > stocks_close_df = pd.read_parquet(
> > >     'data/v4.parquet',
> > >     columns=['DateTime', 'Close', 'Ticker'],
> > >     engine='pyarrow'
> > > )
> > >
> > > # Filter the data to just this ticker
> > > stocks_close_df = stocks_close_df[stocks_close_df.Ticker == ticker][
> > >     ['DateTime', 'Close']]
> > >
> > > Thanks,
> > > Russell Jurney @rjurney <http://twitter.com/rjurney>
> > > russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
> > > <http://facebook.com/jurney> datasyndrome.com
> > >
> >
