The signature I had in mind is something like:

    def __arrow_array__(self, type: pyarrow.DataType = None) -> pyarrow.Array:

where the function returns a pyarrow.Array and takes an optional data type (for cases where there are multiple ways to convert to a pyarrow Array; this is the type the user can pass via the type argument of pyarrow.array(..) or via a specified schema).

However, the above only covers the one-way path from custom array to Arrow array, and is not enough for a full roundtrip. For a full roundtrip in the case of a pandas DataFrame, we will still need to save information in metadata independently of __arrow_array__, and have custom code in pyarrow to deal with pandas DataFrames (of which there is already a lot).

I mentioned this briefly in
https://github.com/pandas-dev/pandas/issues/20612/#issuecomment-489649556 /
https://issues.apache.org/jira/browse/ARROW-2428, but one option could be to save the name of the pandas extension dtype in the pandas_metadata of an Arrow Table (just as already happens for the currently supported types). When exporting back to pandas with to_pandas, pyarrow could then check whether this extension dtype name is registered with pandas and, if so, call a method there to construct it.

Joris

On Thu, 9 May 2019 at 17:38, Antoine Pitrou <anto...@python.org> wrote:

> Hi Joris,
>
> Do you have a signature for __arrow_array__ method in mind?
>
> For example, let's say you want to roundtrip ExtensionArrays or other
> third-party data through Arrow. How do you preserve the required metadata?
>
> Regards
>
> Antoine.
>
>
> On 09/05/2019 at 13:29, Joris Van den Bossche wrote:
> > Hi all,
> >
> > I want to propose an interface to allow custom array objects in Python to
> > define how they should be converted to Arrow arrays (e.g. in
> > pyarrow.array(..)). I opened
> > https://issues.apache.org/jira/browse/ARROW-5271 for this.
> > This would be similar to the numpy __array__ protocol (so we could e.g.
> > call it __arrow_array__).
> > Feedback / discussion very welcome!
> >
> > I am coming to this discussion specifically from the point of view of
> > pandas ExtensionArrays (github issue for this:
> > https://github.com/pandas-dev/pandas/issues/20612/#issuecomment-489649556).
> > Such a protocol would, for example, make it possible for pandas users to
> > save DataFrames with ExtensionArrays (e.g. the nullable integers) to
> > parquet, without the need for pyarrow to know about all those possible
> > different extension arrays. This would also be useful for projects
> > extending pandas such as GeoPandas <https://github.com/geopandas/geopandas>
> > and Fletcher <https://github.com/xhochy/fletcher>.
> > But I suppose it could also be of more general interest to other
> > array-like / pandas-like projects that want to interface with arrow.
> >
> > Sidenote: for the pandas case, I want to look at the full roundtrip, so
> > also the conversion back from an arrow Table to DataFrame. For that aspect
> > there is https://issues.apache.org/jira/browse/ARROW-2428, but this is much
> > more specific to pandas and its ExtensionArrays.
> >
> > Regards,
> > Joris
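
To make the two ideas in the reply concrete (the __arrow_array__ hook, and storing the extension dtype name in metadata for the way back), here is a minimal sketch in plain Python. It deliberately does not import pyarrow or pandas: FakeArrowArray, to_arrow, table_from_column, column_to_pandas, DTYPE_REGISTRY, from_arrow_values and the dict-based "table" are all hypothetical stand-ins invented for this illustration, not existing pyarrow or pandas APIs. Only the __arrow_array__(self, type=None) signature comes from the proposal itself.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class FakeArrowArray:
    """Hypothetical stand-in for pyarrow.Array, so the sketch runs without pyarrow."""
    values: List[Any]
    type: str


class NullableIntArray:
    """Toy extension array: integer values plus a mask (True means missing)."""
    dtype_name = "toy_nullable_int"  # hypothetical extension dtype name

    def __init__(self, values, mask):
        self.values = list(values)
        self.mask = list(mask)

    def __arrow_array__(self, type=None):
        # Proposed hook: the custom array decides how it becomes an Arrow array.
        data = [None if missing else v for v, missing in zip(self.values, self.mask)]
        return FakeArrowArray(data, type or "int64")

    @classmethod
    def from_arrow_values(cls, values):
        # Hypothetical constructor used on the way back from Arrow.
        mask = [v is None for v in values]
        filled = [0 if v is None else v for v in values]
        return cls(filled, mask)


def to_arrow(obj, type=None):
    """Stand-in for pyarrow.array(..): try the protocol before the generic path."""
    hook = getattr(obj, "__arrow_array__", None)
    if hook is not None:
        return hook(type=type)
    return FakeArrowArray(list(obj), type or "inferred")


# Hypothetical registry: extension dtype name -> constructor from Arrow values.
DTYPE_REGISTRY: Dict[str, Callable[[List[Any]], Any]] = {
    NullableIntArray.dtype_name: NullableIntArray.from_arrow_values,
}


def table_from_column(name, obj):
    """Convert one column and record its extension dtype name in the metadata."""
    metadata = {name: getattr(obj, "dtype_name", None)}
    return {"columns": {name: to_arrow(obj)}, "pandas_metadata": metadata}


def column_to_pandas(table, name):
    """Rebuild the extension array if its dtype name is registered."""
    arr = table["columns"][name]
    dtype_name: Optional[str] = table["pandas_metadata"].get(name)
    construct = DTYPE_REGISTRY.get(dtype_name)
    if construct is None:
        return arr.values  # unknown dtype: fall back to plain values
    return construct(arr.values)


original = NullableIntArray([1, 2, 3], [False, True, False])
table = table_from_column("a", original)
restored = column_to_pandas(table, "a")
```

The point of the sketch is the split of responsibilities: the conversion to Arrow is owned by the array object via the hook, while the roundtrip relies on metadata stored next to the data plus a name lookup on the receiving side, independently of __arrow_array__.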