There is a recent JIRA where a row-wise iterator was discussed: https://issues.apache.org/jira/browse/ARROW-12970.
This should not be too hard to add (although there is a linked JIRA about
improving the performance of the pyarrow -> python objects conversion, which
might require some more engineering work), but of course what's proposed in
the JIRA starts from a materialized record batch (so similar to the gist
here, but I think this is good enough?).

On Tue, 6 Jul 2021 at 05:03, Micah Kornfield <emkornfi...@gmail.com> wrote:

> I think this type of thing does make sense; at some point people like to
> be able to see their data in rows.
>
> It probably pays to have this conversation on dev@. Doing this in a
> performant way might take some engineering work, but having a quick
> solution like the one described above might make sense.
>
> -Micah
>
> On Sun, Jun 27, 2021 at 6:23 AM Grant Williams <gr...@grantwilliams.dev>
> wrote:
>
>> Hello,
>>
>> I've found myself wondering if there is a use case for using the
>> iter_batches method in python as an iterator in a similar style to a
>> server-side cursor in Postgres. Right now you can use an iterator of
>> record batches, but I wondered if having some sort of python-native
>> iterator might be worth it? Maybe a .to_pyiter() method that converts it
>> to a lazy & batched iterator of native python objects?
>>
>> Here is some example code that shows a similar result:
>>
>> from itertools import chain
>> from typing import Any, Iterator, Tuple
>>
>> def iter_parquet(parquet_file, columns=None, batch_size=1_000) -> Iterator[Tuple[Any, ...]]:
>>
>>     record_batches = parquet_file.iter_batches(batch_size=batch_size, columns=columns)
>>
>>     # convert from the columnar format of pyarrow arrays to a row format
>>     # of python objects (yields tuples)
>>     yield from chain.from_iterable(
>>         zip(*map(lambda col: col.to_pylist(), batch.columns))
>>         for batch in record_batches
>>     )
>>
>> (or a gist if you prefer:
>> https://gist.github.com/grantmwilliams/143fd60b3891959a733d0ce5e195f71d)
>>
>> I realize arrow is a columnar format, but I wonder if having the buffered
>> row reading as a lazy iterator is a common enough use case, with parquet +
>> object storage being so common as a database alternative.
>>
>> Thanks,
>> Grant
>>
>> --
>> Grant Williams
>> Machine Learning Engineer
>> https://github.com/grantmwilliams/
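
For reference, a minimal sketch of the kind of row-wise iteration discussed
above, starting from materialized record batches as in the gist and the JIRA
proposal. The function name iter_rows, the dict-per-row output, and the
file-path argument are illustrative assumptions, not an existing pyarrow API;
only ParquetFile.iter_batches and Array.to_pylist are used from pyarrow
itself.

from typing import Any, Dict, Iterator

import pyarrow.parquet as pq


def iter_rows(path: str, columns=None, batch_size: int = 1_000) -> Iterator[Dict[str, Any]]:
    # Hypothetical helper (not part of pyarrow): stream rows as plain dicts,
    # materializing only one record batch at a time.
    parquet_file = pq.ParquetFile(path)
    for batch in parquet_file.iter_batches(batch_size=batch_size, columns=columns):
        names = batch.schema.names
        # Convert each column of the batch to a Python list, then re-assemble rows.
        pylists = [col.to_pylist() for col in batch.columns]
        for values in zip(*pylists):
            yield dict(zip(names, values))


# Usage sketch: iterate lazily without loading the whole file into memory.
# for row in iter_rows("data.parquet", columns=["id", "value"], batch_size=10_000):
#     process(row)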