Note that the "iter_batches" method on ParquetFile already gives you a way to consume the Parquet file progressively with a stream of RecordBatches without creating a single Table for the full Parquet file (which will already leverage the row groups of the Parquet file). The example in the JIRA used Table, but ther is no reason to not expose such an iteration method on RecordBatch as well (and I had updated the title of the JIRA to reflect that).
On Tue, 6 Jul 2021 at 15:08, Alessandro Molina <alessan...@ursacomputing.com> wrote:
>
> I guess that doing it at the Parquet reader level might allow the
> implementation to better leverage row groups, without the need to keep in
> memory the whole Table when you are iterating over data. The current
> JIRA issue, by contrast, seems to suggest implementing it for Table once
> the data is already fully available.
>
> On Tue, Jul 6, 2021 at 8:48 AM Joris Van den Bossche
> <jorisvandenboss...@gmail.com> wrote:
>>
>> There is a recent JIRA where a row-wise iterator was discussed:
>> https://issues.apache.org/jira/browse/ARROW-12970.
>>
>> This should not be too hard to add (although there is a linked JIRA about
>> improving the performance of the pyarrow -> python objects conversion,
>> which might require some more engineering work), but of course what's
>> proposed in the JIRA starts from a materialized record batch (similar to
>> the gist here, but I think that is good enough?).
>>
>> On Tue, 6 Jul 2021 at 05:03, Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>
>>> I think this type of thing does make sense; at some point people like
>>> to be able to see their data in rows.
>>>
>>> It probably pays to have this conversation on dev@. Doing this in a
>>> performant way might take some engineering work, but having a quick
>>> solution like the one described above might make sense.
>>>
>>> -Micah
>>>
>>> On Sun, Jun 27, 2021 at 6:23 AM Grant Williams <gr...@grantwilliams.dev>
>>> wrote:
>>>>
>>>> Hello,
>>>>
>>>> I've found myself wondering if there is a use case for using the
>>>> iter_batches method in Python as an iterator, in a similar style to a
>>>> server-side cursor in Postgres. Right now you can use an iterator of
>>>> record batches, but I wondered if having some sort of Python-native
>>>> iterator might be worth it? Maybe a .to_pyiter() method that converts
>>>> it to a lazy & batched iterator of native Python objects?
>>>>
>>>> Here is some example code that shows a similar result.
>>>>
>>>> from itertools import chain
>>>> from typing import Any, Iterator, Tuple
>>>>
>>>> def iter_parquet(parquet_file, columns=None, batch_size=1_000) -> Iterator[Tuple[Any, ...]]:
>>>>
>>>>     record_batches = parquet_file.iter_batches(batch_size=batch_size, columns=columns)
>>>>
>>>>     # convert from the columnar format of pyarrow arrays to a row
>>>>     # format of python objects (yields tuples)
>>>>     yield from chain.from_iterable(
>>>>         zip(*map(lambda col: col.to_pylist(), batch.columns))
>>>>         for batch in record_batches
>>>>     )
>>>>
>>>> (or a gist if you prefer:
>>>> https://gist.github.com/grantmwilliams/143fd60b3891959a733d0ce5e195f71d)
>>>>
>>>> I realize Arrow is a columnar format, but I wonder if having the
>>>> buffered row reading as a lazy iterator is a common enough use case,
>>>> with parquet + object storage being so common as a database
>>>> alternative.
>>>>
>>>> Thanks,
>>>> Grant
>>>>
>>>> --
>>>> Grant Williams
>>>> Machine Learning Engineer
>>>> https://github.com/grantmwilliams/
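For reference, a per-batch variant of the same idea, sketched using only existing RecordBatch APIs (the to_pylist calls on the individual arrays); the function name iter_rows is hypothetical, and this is an illustration rather than the method proposed in ARROW-12970:

import pyarrow as pa

def iter_rows(batch: pa.RecordBatch):
    # Pull each column out as a Python list once, then zip the columns
    # back together into per-row dicts keyed by field name.
    names = batch.schema.names
    columns = [column.to_pylist() for column in batch.columns]
    for values in zip(*columns):
        yield dict(zip(names, values))

Combined with iter_batches, such a helper keeps at most one batch's worth of Python objects alive at a time, which is the server-side-cursor behavior described above.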