There is a recent JIRA where a row-wise iterator was discussed:
https://issues.apache.org/jira/browse/ARROW-12970.

This should not be too hard to add (although there is a linked JIRA about
improving the performance of the pyarrow -> Python objects conversion,
which might require some more engineering work). Of course, what's proposed
in that JIRA starts from a materialized record batch, so it is similar to
the gist here, but I think that is good enough?
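
For illustration, a rough sketch of what such a row-wise view over an
already-materialized record batch could look like (this is not an existing
pyarrow API, just one possible shape, yielding one dict per row via the
per-column to_pylist() conversion):

import pyarrow as pa

def iter_rows(batch: pa.RecordBatch):
    # Convert each column to a Python list once, then zip the columns
    # back together into rows keyed by the schema's field names.
    columns = [col.to_pylist() for col in batch.columns]
    for values in zip(*columns):
        yield dict(zip(batch.schema.names, values))

The batch still has to be materialized first, which is the same limitation
as in the JIRA and in the gist below.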

On Tue, 6 Jul 2021 at 05:03, Micah Kornfield <emkornfi...@gmail.com> wrote:

> I think this type of thing does make sense; at some point people like to
> be able to see their data in rows.
>
> It probably pays to have this conversation on dev@.  Doing this in a
> performant way might take some engineering work, but having a quick
> solution like the one described above might make sense.
>
> -Micah
>
> On Sun, Jun 27, 2021 at 6:23 AM Grant Williams <gr...@grantwilliams.dev>
> wrote:
>
>> Hello,
>>
>> I've found myself wondering if there is a use case for using the
>> iter_batches method in Python as an iterator, in a similar style to a
>> server-side cursor in Postgres. Right now you can use an iterator of record
>> batches, but I wondered if having some sort of Python-native iterator might
>> be worth it? Maybe a .to_pyiter() method that converts it to a lazy &
>> batched iterator of native Python objects?
>>
>> Here is some example code that shows a similar result.
>>
>> from itertools import chain
>> from typing import Any, Iterator, Tuple
>>
>> def iter_parquet(parquet_file, columns=None, batch_size=1_000) -> Iterator[Tuple[Any, ...]]:
>>     record_batches = parquet_file.iter_batches(batch_size=batch_size, columns=columns)
>>
>>     # Convert from the columnar format of pyarrow arrays to a row format of
>>     # Python objects (yields one tuple per row).
>>     yield from chain.from_iterable(
>>         zip(*(col.to_pylist() for col in batch.columns))
>>         for batch in record_batches
>>     )
>>
>> (or a gist if you prefer:
>> https://gist.github.com/grantmwilliams/143fd60b3891959a733d0ce5e195f71d)
>>
>> I realize Arrow is a columnar format, but I wonder whether buffered row
>> reading as a lazy iterator is a common enough use case, given how common
>> Parquet + object storage has become as a database alternative.
>>
>> Thanks,
>> Grant
>>
>> --
>> Grant Williams
>> Machine Learning Engineer
>> https://github.com/grantmwilliams/
>>
>
