Ah I see - thanks for the explanation. self_destruct probably won't benefit in my case then. (The pa.Array here is a slice from another batch, so there will be other references to the data backing this array.)
On Thu, Aug 31, 2023 at 11:24 AM David Li <lidav...@apache.org> wrote:

> Not sure about the conversion, but regarding self_destruct: the problem is
> that it only provides memory savings in limited situations that are hard to
> figure out from the outside. When enabled, PyArrow will always discard the
> reference to the array after conversion, and if there are no other
> references, that would free the array. But different arrays may be backed
> by the same underlying memory buffer (this is generally true for IPC and
> Flight, for example), so freeing the array won't actually free any memory
> since the buffer is still alive. It would only save memory if you ensure
> each array is actually backed by its own memory allocations (which right
> now would generally mean copying data up front!).
>
> On Thu, Aug 31, 2023, at 11:11, Li Jin wrote:
> > Hi,
> >
> > I am working on some code where I have a list of pa.Arrays and I am
> > creating a pandas.DataFrame from it. I also want to set the index of the
> > pd.DataFrame to be the first Array in the list.
> >
> > Currently I am doing something like:
> > "
> > df = pa.Table.from_arrays(arrs, names=input_names).to_pandas()
> > df.set_index(input_names[0], inplace=True)
> > "
> >
> > I am curious if this is the best I can do? Also I wonder if it is still
> > worthwhile to use the "self_destruct=True" option here (I noticed it has
> > been EXPERIMENTAL for a long time)
> >
> > Thanks!
> > Li