Re: [Discuss] [Python] protocol for conversion to pyarrow Array

Joris Van den Bossche Thu, 09 May 2019 09:07:12 -0700

The signature I had in mind is something like:

def __arrow_array__(self, type: pyarrow.DataType = None) -> pyarrow.Array: ...


where the function returns a pyarrow.Array and takes an optional data type
(for cases where there are multiple ways to convert to a pyarrow Array; this
is the type the user passes via the type argument of pyarrow.array(..) or
via a specified schema).
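
To make the dispatch concrete, here is a minimal sketch of how such a hook
could be consulted. It deliberately uses plain-Python stand-ins so it runs
without pyarrow: FakeArrowArray, NullableInts and to_arrow are hypothetical
names invented for illustration (to_arrow plays the role of
pyarrow.array(..)), and the string type names replace real pyarrow.DataType
objects.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class FakeArrowArray:
    """Hypothetical stand-in for pyarrow.Array (values plus a type name)."""
    values: List[Any]
    type: str


class NullableInts:
    """Toy custom array: integers with None marking missing values."""

    def __init__(self, values):
        self._values = list(values)

    def __arrow_array__(self, type: Optional[str] = None) -> FakeArrowArray:
        # The object itself decides how it maps to Arrow; the optional
        # `type` lets the caller choose among several valid conversions.
        target = type or "int64"
        if target == "int64":
            return FakeArrowArray(self._values, "int64")
        if target == "string":
            converted = [None if v is None else str(v) for v in self._values]
            return FakeArrowArray(converted, "string")
        raise TypeError(f"cannot convert NullableInts to {target}")


def to_arrow(obj, type: Optional[str] = None) -> FakeArrowArray:
    """Hypothetical stand-in for pyarrow.array(..) with protocol dispatch."""
    hook = getattr(obj, "__arrow_array__", None)
    if hook is not None:
        # Delegate to the object, forwarding the user-requested type.
        return hook(type=type)
    # Fallback for plain sequences (no type inference in this sketch).
    return FakeArrowArray(list(obj), type or "unknown")
```

The point of the design is that the converting function needs no knowledge
of NullableInts: it only checks for the method and forwards the requested
type.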

But the above only covers the one-way path from a custom array to an Arrow
array, and is not enough for a full roundtrip.

For a full roundtrip in case of a pandas DataFrame, we will still need to
save information in metadata independently from __arrow_array__ and have
custom code in pyarrow to deal with pandas DataFrames (of which there is
already a lot). I mentioned this briefly in
https://github.com/pandas-dev/pandas/issues/20612/#issuecomment-489649556 /
https://issues.apache.org/jira/browse/ARROW-2428, but one option could be
to save the name of the pandas extension dtype in the pandas_metadata of an
arrow Table (just as already happens for currently supported types), and
when exporting back to pandas with to_pandas pyarrow could check if this
extension dtype name is registered with pandas and if so, call a method
there to construct it.

Joris

On Thu, 9 May 2019 at 17:38, Antoine Pitrou <anto...@python.org> wrote:

>
> Hi Joris,
>
> Do you have a signature for the __arrow_array__ method in mind?
>
> For example, let's say you want to roundtrip ExtensionArrays or other
> third-party data through Arrow.  How do you preserve the required metadata?
>
> Regards
>
> Antoine.
>
>
> On 09/05/2019 at 13:29, Joris Van den Bossche wrote:
> > Hi all,
> >
> > I want to propose an interface to allow custom array objects in Python to
> > define how they should be converted to Arrow arrays (e.g. in
> > pyarrow.array(..)). I opened
> > https://issues.apache.org/jira/browse/ARROW-5271 for this.
> > This would be similar to the numpy __array__ protocol (so we could e.g.
> > call it __arrow_array__).
> > Feedback / discussion very welcome!
> >
> > I am coming to this discussion specifically from the point of view of
> > pandas ExtensionArrays (github issue for this:
> > https://github.com/pandas-dev/pandas/issues/20612/#issuecomment-489649556).
> > Such a protocol would, for example, make it possible for pandas users to
> > save DataFrames with ExtensionArrays (e.g. the nullable integers) to
> > parquet, without the need for pyarrow to know about all those possible
> > different extension arrays. This would also be useful for projects
> > extending pandas such as GeoPandas <https://github.com/geopandas/geopandas>
> > and Fletcher <https://github.com/xhochy/fletcher>.
> > But I suppose it could also be of more general interest to other
> > array-like / pandas-like projects that want to interface with arrow.
> >
> > Sidenote: for the pandas case, I want to look at the full roundtrip, so
> > also the conversion back from an arrow Table to a DataFrame. For that
> > aspect there is https://issues.apache.org/jira/browse/ARROW-2428, but this
> > is much more specific to pandas and its ExtensionArrays.
> >
> > Regards,
> > Joris
> >
>
