Re: [Discuss] [Python] protocol for conversion to pyarrow Array

Joris Van den Bossche Thu, 09 May 2019 09:10:25 -0700

An additional question is at which "level" to provide such a hook to
third-party packages: I proposed it for Array, but what about chunked arrays,
columns or tables? At the least, returning a chunked array should probably
also be allowed.
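
For reference, a minimal sketch of how the hook quoted below might be
dispatched. It is an illustration only, not pyarrow code: ArrowArray and
to_arrow are hypothetical stand-ins for pyarrow.Array and pyarrow.array(..),
the data type is a plain string rather than a pyarrow.DataType, and
NullableInts is a made-up third-party array.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class ArrowArray:
    """Hypothetical stand-in for pyarrow.Array (not the real class)."""
    values: List[Any]
    type: str


def to_arrow(obj: Any, type: Optional[str] = None) -> ArrowArray:
    """Hypothetical stand-in for pyarrow.array(..) that honours the hook."""
    hook = getattr(obj, "__arrow_array__", None)
    if hook is not None:
        # The object knows how to convert itself; pass the requested type on.
        result = hook(type=type)
        if not isinstance(result, ArrowArray):
            raise TypeError("__arrow_array__ must return an ArrowArray")
        return result
    # No hook: fall back to treating obj as a plain sequence.
    return ArrowArray(list(obj), type or "inferred")


class NullableInts:
    """Toy third-party array: integer values plus a mask of missing entries."""

    def __init__(self, values, mask):
        self.values = list(values)
        self.mask = list(mask)

    def __arrow_array__(self, type: Optional[str] = None) -> ArrowArray:
        # Masked slots become None; default to int64 unless a type is requested.
        data = [None if m else v for v, m in zip(self.values, self.mask)]
        return ArrowArray(data, type or "int64")


arr = to_arrow(NullableInts([1, 2, 3], [False, True, False]))
```

In this sketch the converter never needs to know about NullableInts; it only
checks for the hook. Whether the hook may also return a chunked array is the
open question raised above.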


On Thu, 9 May 2019 at 18:06, Joris Van den Bossche <
jorisvandenboss...@gmail.com> wrote:

> The signature I had in mind is something like:
>
> def __arrow_array__(self, type: pyarrow.DataType = None) -> pyarrow.Array:
>
> where the function returns a pyarrow.Array and takes an optional data
> type (for cases where there are multiple ways to convert to a pyarrow Array;
> this is what the user can pass through the type argument of
> pyarrow.array(..) or through a specified schema).
>
> But the above only covers the one-way path from custom array to Arrow array;
> it is not enough for a full roundtrip.
>
> For a full roundtrip in the case of a pandas DataFrame, we will still need to
> save information in metadata independently of __arrow_array__, and have
> custom code in pyarrow to deal with pandas DataFrames (of which there is
> already a lot). I mentioned this briefly in
> https://github.com/pandas-dev/pandas/issues/20612/#issuecomment-489649556
> / https://issues.apache.org/jira/browse/ARROW-2428. One option could
> be to save the name of the pandas extension dtype in the pandas_metadata of
> an arrow Table (just as already happens for currently supported types). When
> exporting back to pandas with to_pandas, pyarrow could then check whether
> this extension dtype name is registered with pandas and, if so, call a
> method there to construct it.
>
> Joris
>
> On Thu, 9 May 2019 at 17:38, Antoine Pitrou <anto...@python.org> wrote:
>
>>
>> Hi Joris,
>>
>> Do you have a signature for the __arrow_array__ method in mind?
>>
>> For example, let's say you want to roundtrip ExtensionArrays or other
>> third-party data through Arrow.  How do you preserve the required
>> metadata?
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On 09/05/2019 at 13:29, Joris Van den Bossche wrote:
>> > Hi all,
>> >
>> > I want to propose an interface to allow custom array objects in Python to
>> > define how they should be converted to Arrow arrays (e.g. in
>> > pyarrow.array(..)). I opened
>> > https://issues.apache.org/jira/browse/ARROW-5271 for this.
>> > This would be similar to the numpy __array__ protocol (so we could e.g.
>> > call it __arrow_array__).
>> > Feedback / discussion very welcome!
>> >
>> > I am coming to this discussion specifically from the point of view of
>> > pandas ExtensionArrays (github issue for this:
>> > https://github.com/pandas-dev/pandas/issues/20612/#issuecomment-489649556).
>> > Such a protocol would, for example, make it possible for pandas users to
>> > save DataFrames with ExtensionArrays (e.g. the nullable integers) to
>> > parquet, without the need for pyarrow to know about all those possible
>> > extension arrays. This would also be useful for projects extending pandas
>> > such as GeoPandas <https://github.com/geopandas/geopandas> and Fletcher
>> > <https://github.com/xhochy/fletcher>.
>> > But I suppose it could also be of more general interest to other
>> > array-like / pandas-like projects that want to interface with arrow.
>> >
>> > Sidenote: for the pandas case, I want to look at the full roundtrip, so
>> > also the conversion back from an arrow Table to a DataFrame. For that
>> > aspect there is https://issues.apache.org/jira/browse/ARROW-2428, but this
>> > is much more specific to pandas and its ExtensionArrays.
>> >
>> > Regards,
>> > Joris
>> >
>>
>
