I left a comment in JIRA, but I agree that having a faster method to "box" Arrow array values as Python objects would be useful in a lot of places. These common C++ code paths could then be used to "tupleize" record batches reasonably efficiently.
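For context, a minimal Python-level sketch of what "tupleizing" a record batch means today; the tupleize helper and the example batch are illustrative, not an existing pyarrow API, and a dedicated C++ path would replace the per-value boxing that to_pydict() does here:

    import pyarrow as pa

    def tupleize(batch: pa.RecordBatch):
        # to_pydict() boxes every value as a Python object, column by
        # column; a C++ fast path could do this conversion in one pass.
        columns = batch.to_pydict().values()
        # Transpose the columns into one tuple per row.
        return list(zip(*columns))

    batch = pa.record_batch(
        [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])],
        names=["x", "y"])
    assert tupleize(batch) == [(1, "a"), (2, "b"), (3, "c")]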
On Tue, Jul 6, 2021 at 3:08 PM Alessandro Molina <alessan...@ursacomputing.com> wrote:
>
> I guess that doing it at the Parquet reader level might allow the
> implementation to better leverage row groups, without the need to keep
> the whole Table in memory while you are iterating over the data. The
> current JIRA issue, meanwhile, seems to suggest implementing it for a
> Table once it's already fully available.
>
> On Tue, Jul 6, 2021 at 8:48 AM Joris Van den Bossche
> <jorisvandenboss...@gmail.com> wrote:
>>
>> There is a recent JIRA where a row-wise iterator was discussed:
>> https://issues.apache.org/jira/browse/ARROW-12970.
>>
>> This should not be too hard to add (although there is a linked JIRA
>> about improving the performance of the pyarrow -> Python objects
>> conversion, which might require some more engineering work), but of
>> course what's proposed in the JIRA starts from a materialized record
>> batch (similar to the gist here, but I think this is good enough?).
>>
>> On Tue, 6 Jul 2021 at 05:03, Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>
>>> I think this type of thing does make sense; at some point people like
>>> to be able to see their data in rows.
>>>
>>> It probably pays to have this conversation on dev@. Doing this in a
>>> performant way might take some engineering work, but having a quick
>>> solution like the one described above might make sense.
>>>
>>> -Micah
>>>
>>> On Sun, Jun 27, 2021 at 6:23 AM Grant Williams <gr...@grantwilliams.dev>
>>> wrote:
>>>>
>>>> Hello,
>>>>
>>>> I've found myself wondering if there is a use case for using the
>>>> iter_batches method in Python as an iterator, in a similar style to
>>>> a server-side cursor in Postgres. Right now you can use an iterator
>>>> of record batches, but I wondered if having some sort of
>>>> Python-native iterator might be worth it? Maybe a .to_pyiter()
>>>> method that converts it to a lazy & batched iterator of native
>>>> Python objects?
>>>>
>>>> Here is some example code that shows a similar result:
>>>>
>>>>     from itertools import chain
>>>>     from typing import Any, Iterator, Tuple
>>>>
>>>>     def iter_parquet(parquet_file, columns=None,
>>>>                      batch_size=1_000) -> Iterator[Tuple[Any, ...]]:
>>>>         record_batches = parquet_file.iter_batches(
>>>>             batch_size=batch_size, columns=columns)
>>>>
>>>>         # Convert from the columnar format of pyarrow arrays to a
>>>>         # row format of Python objects (yields one tuple per row).
>>>>         yield from chain.from_iterable(
>>>>             zip(*(col.to_pylist() for col in batch.columns))
>>>>             for batch in record_batches)
>>>>
>>>> (or a gist if you prefer:
>>>> https://gist.github.com/grantmwilliams/143fd60b3891959a733d0ce5e195f71d)
>>>>
>>>> I realize Arrow is a columnar format, but I wonder if buffered row
>>>> reading as a lazy iterator is a common enough use case, with Parquet
>>>> + object storage being so common as a database alternative.
>>>>
>>>> Thanks,
>>>> Grant
>>>>
>>>> --
>>>> Grant Williams
>>>> Machine Learning Engineer
>>>> https://github.com/grantmwilliams/
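For reference, a usage sketch of the iter_parquet helper quoted above, showing the server-side-cursor style it enables; "example.parquet" and the per-row process() function are hypothetical, and iter_parquet is the function from Grant's message, not a pyarrow API:

    import pyarrow.parquet as pq

    # ParquetFile opens the file lazily; iter_batches then reads
    # batch_size rows at a time, so only one batch of rows is ever
    # materialized as Python objects at once.
    parquet_file = pq.ParquetFile("example.parquet")

    for row in iter_parquet(parquet_file, columns=["x", "y"],
                            batch_size=10_000):
        process(row)  # each row is a tuple of native Python objects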