I guess that doing it at the Parquet reader level might allow the implementation to better
leverage row groups, without the need to keep the whole Table in memory while you are
iterating over the data, whereas the current JIRA issue seems to suggest implementing it
for a Table that is already fully materialized.
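Something along the lines of the sketch below is what I had in mind (only a rough
illustration using pyarrow's ParquetFile.iter_batches; the iter_rows name and its
parameters are mine, not anything that exists in the library):

import pyarrow.parquet as pq

def iter_rows(path, columns=None, batch_size=1_000):
    # iter_batches reads the file incrementally (roughly a row group at a
    # time) rather than materializing the whole Table up front.
    pf = pq.ParquetFile(path)
    for batch in pf.iter_batches(batch_size=batch_size, columns=columns):
        # transpose the batch's columns into native Python tuples (rows)
        yield from zip(*(col.to_pylist() for col in batch.columns))

That way the consumer sees plain Python rows while the reader only ever decodes the
row groups it is currently iterating over.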
On Tue, Jul 6, 2021 at 8:48 AM Joris Van den Bossche <
jorisvandenboss...@gmail.com> wrote:

> There is a recent JIRA where a row-wise iterator was discussed:
> https://issues.apache.org/jira/browse/ARROW-12970.
>
> This should not be too hard to add (although there is a linked JIRA about
> improving the performance of the pyarrow -> python objects conversion,
> which might require some more engineering work), but of course what's
> proposed in the JIRA starts from a materialized record batch (so similar
> to the gist here, but I think this is good enough?).
>
> On Tue, 6 Jul 2021 at 05:03, Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> I think this type of thing does make sense; at some point people like
>> to be able to see their data in rows.
>>
>> It probably pays to have this conversation on dev@. Doing this in a
>> performant way might take some engineering work, but having a quick
>> solution like the one described above might make sense.
>>
>> -Micah
>>
>> On Sun, Jun 27, 2021 at 6:23 AM Grant Williams <gr...@grantwilliams.dev>
>> wrote:
>>
>>> Hello,
>>>
>>> I've found myself wondering if there is a use case for using the
>>> iter_batches method in Python as an iterator in a similar style to a
>>> server-side cursor in Postgres. Right now you can use an iterator of
>>> record batches, but I wondered if having some sort of Python-native
>>> iterator might be worth it? Maybe a .to_pyiter() method that converts
>>> it to a lazy & batched iterator of native Python objects?
>>>
>>> Here is some example code that shows a similar result.
>>>
>>> from itertools import chain
>>> from typing import Any, Iterator, Tuple
>>>
>>> def iter_parquet(parquet_file, columns=None, batch_size=1_000) -> Iterator[Tuple[Any, ...]]:
>>>
>>>     record_batches = parquet_file.iter_batches(batch_size=batch_size, columns=columns)
>>>
>>>     # convert from the columnar format of pyarrow arrays to a row format
>>>     # of python objects (yields tuples)
>>>     yield from chain.from_iterable(
>>>         zip(*map(lambda col: col.to_pylist(), batch.columns))
>>>         for batch in record_batches
>>>     )
>>>
>>> (or a gist if you prefer:
>>> https://gist.github.com/grantmwilliams/143fd60b3891959a733d0ce5e195f71d)
>>>
>>> I realize arrow is a columnar format, but I wonder if having the
>>> buffered row reading as a lazy iterator is a common enough use case,
>>> with parquet + object storage being so common as a database alternative.
>>>
>>> Thanks,
>>> Grant
>>>
>>> --
>>> Grant Williams
>>> Machine Learning Engineer
>>> https://github.com/grantmwilliams/