I think your original code roundtripping through RecordBatch
(`pa.RecordBatch.from_pandas(df).to_struct_array()`) is the best
option at the moment. The RecordBatch<->StructArray part is a cheap
(zero-copy) conversion, and by using RecordBatch.from_pandas, you can
rely on all pandas<->arrow conversion logic that is implemented in
pyarrow (and which keeps the data columnar, in contrast to
`df.itertuples()`, which converts the data into rows of Python objects
as an intermediate step).

Given that the conversion through RecordBatch works nicely, I am not
sure it is worth it to add new APIs to directly convert between
StructArray and pandas DataFrames.

Joris

On Mon, 12 Jun 2023 at 20:32, Spencer Nelson <swnel...@uw.edu> wrote:
>
> Here's a one-liner that does it, but I expect it's moderately slower than
> the RecordBatch version:
>
> pa.array(df.itertuples(index=False), type=pa.struct([pa.field(col,
> pa.from_numpy_dtype(df.dtypes[col])) for col in df.columns]))
>
> Most of the complexity is in the 'type'. It's less scary than it looks, and
> if you can afford multiple lines I think it's almost readable:
>
> fields = [pa.field(col, pa.from_numpy_dtype(df.dtypes[col])) for col in
> df.columns]
> pa_type = pa.struct(fields)
> pa.array(df.itertuples(index=False), type=pa_type)
>
> But this seems like a classic XY problem. What is the root issue you're
> trying to solve? Why avoid RecordBatch?
>
> On Mon, Jun 12, 2023 at 11:14 AM Li Jin <ice.xell...@gmail.com> wrote:
>
> > Gentle bump.
> >
> > Not a big deal if I need to use the API above to do so, but bump in case
> > someone has a better way.
> >
> > On Fri, Jun 9, 2023 at 4:34 PM Li Jin <ice.xell...@gmail.com> wrote:
> >
> > > Hello,
> > >
> > > I am looking for the best ways for converting Pandas DataFrame <-> Struct
> > > Array.
> > >
> > > Currently I have:
> > >
> > > pa.RecordBatch.from_pandas(df).to_struct_array()
> > >
> > > and
> > >
> > > pa.RecordBatch.from_struct_array(s_array).to_pandas()
> > >
> > > - I wonder if there is a direct way to go from DataFrame <-> Struct Array
> > > without going through RecordBatch?
> > >
> > > Thanks,
> > > Li
> > >
> >
