I've got things working like this:

# Test ticker
ticker = 'AAPL'

stocks_close_ds = ParquetDataset(
    'data/v4.parquet',
    filters=[('Ticker', '=', ticker)]
)
table = stocks_close_ds.read()
stocks_close_df = table.to_pandas()
stocks_close_df.head()  # prints the filtered pandas.DataFrame

I'll look at getting this working in pandas.

Thanks,
Russell Jurney @rjurney <http://twitter.com/rjurney>
russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney>
FB <http://facebook.com/jurney> datasyndrome.com
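A minimal sketch of the partitioned-dataset case Joris describes further down the thread, where the "filters" keyword can prune whole directories rather than reading every file. The root path, the sample values, and the rewrite via write_to_dataset are illustrative assumptions, not from the thread:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({
        'DateTime': pd.date_range('2019-01-01', periods=4, freq='D').repeat(2),
        'Ticker': ['AAPL', 'GOOG'] * 4,
        'Close': [157.9, 1045.9, 160.1, 1050.3, 158.5, 1048.0, 161.2, 1052.7],
    })

    # Write a Hive-style tree: data/v4_by_ticker/Ticker=AAPL/...,
    # data/v4_by_ticker/Ticker=GOOG/... (hypothetical path)
    pq.write_to_dataset(
        pa.Table.from_pandas(df),
        root_path='data/v4_by_ticker',
        partition_cols=['Ticker'],
    )

    # Only the Ticker=AAPL directory is read; other partitions are skipped.
    dataset = pq.ParquetDataset(
        'data/v4_by_ticker',
        filters=[('Ticker', '=', 'AAPL')],
    )
    aapl_df = dataset.read().to_pandas()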
On Tue, May 28, 2019 at 6:56 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Russell -- yes, you can use ParquetDataset directly and read to pandas.
>
> We have been discussing a more extensive Datasets framework in C++
> that will also support multiple file formats and pluggable partition
> schemes; read more at
>
> https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit
>
> On Tue, May 28, 2019 at 8:21 PM Russell Jurney <russell.jur...@gmail.com> wrote:
> >
> > Thanks, Joris. It looks like filters isn't a valid argument for
> > pandas.read_parquet. Is it possible to instantiate a
> > pyarrow.parquet.ParquetDataset and then convert it to a pandas.DataFrame
> > and have the same effect?
> >
> > I filed an issue here: https://github.com/pandas-dev/pandas/issues/26551
> >
> > Thanks,
> > Russell Jurney @rjurney <http://twitter.com/rjurney>
> > russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney>
> > FB <http://facebook.com/jurney> datasyndrome.com
> >
> > On Mon, May 27, 2019 at 11:06 PM Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote:
> >
> > > Hi Russell,
> > >
> > > Yes and no. When specifying a column selection with read_parquet,
> > > only the relevant columns will indeed be loaded (Parquet is a
> > > columnar storage format, which makes this possible).
> > > But the filtering you show is done on the returned pandas DataFrame.
> > > Currently, pandas does not support any lazy operations, so the
> > > dataframe returned by read_parquet (stocks_close_df) is the full,
> > > materialized dataframe, on which you then filter a subset.
> > >
> > > Filtering could, however, also be done *while* reading the parquet
> > > file(s), to avoid reading everything into memory. This is only
> > > partly implemented in pyarrow at the moment. If you have a dataset
> > > consisting of partitioned files in nested directories (Hive-like),
> > > pyarrow can filter which files to read. See the "filters" keyword of
> > > ParquetDataset (
> > > https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html
> > > ).
> > > I am just not fully sure you can already use this through the pandas
> > > interface; you may need to use the pyarrow interface directly (in
> > > which case, feel free to open an issue on the pandas issue tracker).
> > > Filtering row groups within files is not yet implemented; there is
> > > an open issue: https://issues.apache.org/jira/browse/ARROW-1796.
> > >
> > > Best,
> > > Joris
> > >
> > > On Tue, May 28, 2019 at 03:26 Russell Jurney <russell.jur...@gmail.com> wrote:
> > >
> > > > Hello, I am wondering whether pandas.read_parquet(engine='pyarrow')
> > > > takes advantage of Parquet by loading only the relevant columns,
> > > > and by using the partition column(s) sub-directories when a
> > > > partition column is included in the load and then filtered on
> > > > later. Looking at the code for pandas.read_parquet, it is not clear.
> > > >
> > > > For example, something like:
> > > >
> > > > stocks_close_df = pd.read_parquet(
> > > >     'data/v4.parquet',
> > > >     columns=['DateTime', 'Close', 'Ticker'],
> > > >     engine='pyarrow'
> > > > )
> > > >
> > > > # Filter the data to just this ticker
> > > > stocks_close_df = stocks_close_df[stocks_close_df.Ticker == ticker][[
> > > >     'DateTime', 'Close']]
> > > >
> > > > Thanks,
> > > > Russell Jurney @rjurney <http://twitter.com/rjurney>
> > > > russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney>
> > > > FB <http://facebook.com/jurney> datasyndrome.com
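To make Joris's distinction concrete: in the snippet above, the column selection is pushed down to the Parquet reader, but the Ticker mask runs only after all rows of those columns are materialized in memory. A short sketch, assuming the same data/v4.parquet file, of inspecting the file footer without loading any row data; the per-row-group statistics it prints are what ARROW-1796 proposes to use for filtering within files:

    import pyarrow.parquet as pq

    # Footer-only read: no row data is loaded.
    meta = pq.ParquetFile('data/v4.parquet').metadata
    print(meta.num_rows, 'rows in', meta.num_row_groups, 'row group(s)')

    # Per-column-chunk statistics (min/max, null count) for the first row
    # group; a row-group-level filter could check these before reading data.
    rg = meta.row_group(0)
    for i in range(rg.num_columns):
        col = rg.column(i)
        print(col.path_in_schema, col.statistics)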