I left a comment in JIRA, but I agree that having a faster method to "box" Arrow array values as Python objects would be useful in a lot of places. These common C++ code paths could then be used to "tupleize" record batches reasonably efficiently.
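For context, a minimal Python-level sketch of what "tupleizing" a record batch means today; the tupleize helper and the example batch are illustrative, not an existing pyarrow API, and a dedicated C++ path would replace the per-value boxing that to_pydict() does here:

    import pyarrow as pa

    def tupleize(batch: pa.RecordBatch):
        # to_pydict() boxes every value as a Python object, column by
        # column; a C++ fast path could do this conversion in one pass.
        columns = batch.to_pydict().values()
        # Transpose the columns into one tuple per row.
        return list(zip(*columns))

    batch = pa.record_batch(
        [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])],
        names=["x", "y"])
    assert tupleize(batch) == [(1, "a"), (2, "b"), (3, "c")]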
On Tue, Jul 6, 2021 at 3:08 PM Alessandro Molina <alessan...@ursacomputing.com> wrote:
>
> I guess that doing it at the Parquet reader level might allow the
> implementation to better leverage row groups, without the need to keep
> the whole Table in memory while you are iterating over the data. The
> current JIRA issue, meanwhile, seems to suggest implementing it for a
> Table once it's already fully available.
>
> On Tue, Jul 6, 2021 at 8:48 AM Joris Van den Bossche
> <jorisvandenboss...@gmail.com> wrote:
>>
>> There is a recent JIRA where a row-wise iterator was discussed:
>> https://issues.apache.org/jira/browse/ARROW-12970.
>>
>> This should not be too hard to add (although there is a linked JIRA
>> about improving the performance of the pyarrow -> Python objects
>> conversion, which might require some more engineering work), but of
>> course what's proposed in the JIRA starts from a materialized record
>> batch (similar to the gist here, but I think this is good enough?).
>>
>> On Tue, 6 Jul 2021 at 05:03, Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>
>>> I think this type of thing does make sense; at some point people like
>>> to be able to see their data in rows.
>>>
>>> It probably pays to have this conversation on dev@. Doing this in a
>>> performant way might take some engineering work, but having a quick
>>> solution like the one described above might make sense.
>>>
>>> -Micah
>>>
>>> On Sun, Jun 27, 2021 at 6:23 AM Grant Williams <gr...@grantwilliams.dev>
>>> wrote:
>>>>
>>>> Hello,
>>>>
>>>> I've found myself wondering if there is a use case for using the
>>>> iter_batches method in Python as an iterator, in a similar style to
>>>> a server-side cursor in Postgres. Right now you can use an iterator
>>>> of record batches, but I wondered if having some sort of
>>>> Python-native iterator might be worth it? Maybe a .to_pyiter()
>>>> method that converts it to a lazy & batched iterator of native
>>>> Python objects?
>>>>
>>>> Here is some example code that shows a similar result:
>>>>
>>>>     from itertools import chain
>>>>     from typing import Any, Iterator, Tuple
>>>>
>>>>     def iter_parquet(parquet_file, columns=None,
>>>>                      batch_size=1_000) -> Iterator[Tuple[Any, ...]]:
>>>>         record_batches = parquet_file.iter_batches(
>>>>             batch_size=batch_size, columns=columns)
>>>>
>>>>         # Convert from the columnar format of pyarrow arrays to a
>>>>         # row format of Python objects (yields one tuple per row).
>>>>         yield from chain.from_iterable(
>>>>             zip(*(col.to_pylist() for col in batch.columns))
>>>>             for batch in record_batches)
>>>>
>>>> (or a gist if you prefer:
>>>> https://gist.github.com/grantmwilliams/143fd60b3891959a733d0ce5e195f71d)
>>>>
>>>> I realize Arrow is a columnar format, but I wonder if buffered row
>>>> reading as a lazy iterator is a common enough use case, with Parquet
>>>> + object storage being so common as a database alternative.
>>>>
>>>> Thanks,
>>>> Grant
>>>>
>>>> --
>>>> Grant Williams
>>>> Machine Learning Engineer
>>>> https://github.com/grantmwilliams/
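For reference, a usage sketch of the iter_parquet helper quoted above, showing the server-side-cursor style it enables; "example.parquet" and the per-row process() function are hypothetical, and iter_parquet is the function from Grant's message, not a pyarrow API:

    import pyarrow.parquet as pq

    # ParquetFile opens the file lazily; iter_batches then reads
    # batch_size rows at a time, so only one batch of rows is ever
    # materialized as Python objects at once.
    parquet_file = pq.ParquetFile("example.parquet")

    for row in iter_parquet(parquet_file, columns=["x", "y"],
                            batch_size=10_000):
        process(row)  # each row is a tuple of native Python objects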