An additional question is at which "level" to provide such a hook to third-party packages: I proposed it for Array, but what about chunked arrays, columns or tables? Maybe returning a chunked array should at least be allowed as well.
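To make the question concrete, here is a rough sketch of how a dispatcher could accept either return type. It does not use pyarrow itself: `Array`, `ChunkedArray` and `array` are simplified stand-ins for `pyarrow.Array`, `pyarrow.ChunkedArray` and `pyarrow.array(..)`, and `NullableInts` / `PartitionedInts` are hypothetical third-party containers, so treat all of it as illustration and not as a proposed implementation.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Union


@dataclass
class Array:
    """Stand-in for pyarrow.Array (illustration only)."""
    values: List[Any]
    type: str


@dataclass
class ChunkedArray:
    """Stand-in for pyarrow.ChunkedArray (illustration only)."""
    chunks: List[Array]


def array(obj: Any, type: Optional[str] = None) -> Union[Array, ChunkedArray]:
    """Toy version of pyarrow.array(..) that honours the proposed hook."""
    hook = getattr(obj, "__arrow_array__", None)
    if hook is not None:
        result = hook(type=type)
        # Assumption: both a single array and a chunked array are accepted.
        if not isinstance(result, (Array, ChunkedArray)):
            raise TypeError(
                "__arrow_array__ must return an Array or a ChunkedArray"
            )
        return result
    # Fallback for plain sequences without the hook.
    return Array(list(obj), type or "inferred")


class NullableInts:
    """Hypothetical container that converts to a single Array."""

    def __init__(self, values):
        self._values = list(values)

    def __arrow_array__(self, type=None):
        return Array(self._values, type or "int64")


class PartitionedInts:
    """Hypothetical container that stores its data in several pieces."""

    def __init__(self, parts):
        self._parts = [list(p) for p in parts]

    def __arrow_array__(self, type=None):
        chunk_type = type or "int64"
        return ChunkedArray([Array(p, chunk_type) for p in self._parts])
```

With this shape, a container that already holds its data in pieces does not have to concatenate them just to satisfy the protocol; whether columns or tables need their own hook is a separate question.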

On Thu, 9 May 2019 at 18:06, Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote:

> The signature I had in mind is something like:
>
>     def __arrow_array__(self, type : pyarrow.DataType=None) -> pyarrow.Array:
>
> where the function returns a pyarrow.Array, and takes an optional data
> type (in case there are multiple ways to convert to a pyarrow Array, and
> what can be passed by the user in the type argument in pyarrow.array(..) or
> in a specified schema).
>
> But, the above is only for a one way path of custom array to Arrow array,
> and not enough for a full roundtrip.
>
> For a full roundtrip in case of a pandas DataFrame, we will still need to
> save information in metadata independently from __arrow_array__ and have
> custom code in pyarrow to deal with pandas DataFrames (of which there is
> already a lot). I mentioned this briefly in
> https://github.com/pandas-dev/pandas/issues/20612/#issuecomment-489649556
> / https://issues.apache.org/jira/browse/ARROW-2428, but one option could
> be to save the name of the pandas extension dtype in the pandas_metadata of
> an arrow Table (just as already happens for currently supported types), and
> when exporting back to pandas with to_pandas pyarrow could check if this
> extension dtype name is registered with pandas and if so, call a method
> there to construct it.
>
> Joris
>
> On Thu, 9 May 2019 at 17:38, Antoine Pitrou <anto...@python.org> wrote:
>
>> Hi Joris,
>>
>> Do you have a signature for the __arrow_array__ method in mind?
>>
>> For example, let's say you want to roundtrip ExtensionArrays or other
>> third-party data through Arrow. How do you preserve the required
>> metadata?
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On 09/05/2019 at 13:29, Joris Van den Bossche wrote:
>> > Hi all,
>> >
>> > I want to propose an interface to allow custom array objects in Python to
>> > define how they should be converted to Arrow arrays (e.g. in
>> > pyarrow.array(..)). I opened
>> > https://issues.apache.org/jira/browse/ARROW-5271 for this.
>> > This would be similar to the numpy __array__ protocol (so we could e.g.
>> > call it __arrow_array__).
>> > Feedback / discussion very welcome!
>> >
>> > I am coming to this discussion specifically from the point of view of
>> > pandas ExtensionArrays (github issue for this:
>> > https://github.com/pandas-dev/pandas/issues/20612/#issuecomment-489649556).
>> > Such a protocol would, for example, make it possible for pandas users to
>> > save DataFrames with ExtensionArrays (e.g. the nullable integers) to
>> > parquet, without the need for pyarrow to know about all those possible
>> > different extension arrays. This would also be useful for projects
>> > extending pandas such as GeoPandas <https://github.com/geopandas/geopandas>
>> > and Fletcher <https://github.com/xhochy/fletcher>.
>> > But I suppose it could also be of interest more generally to other
>> > array-like / pandas-like projects that want to interface with arrow.
>> >
>> > Sidenote: for the pandas case, I want to look at the full roundtrip, so
>> > also the conversion back from an arrow Table to DataFrame. For that aspect
>> > there is https://issues.apache.org/jira/browse/ARROW-2428, but this is
>> > much more specific to pandas and its ExtensionArrays.
>> >
>> > Regards,
>> > Joris