My initial idea was to not let this protocol pass metadata around (which indeed is not possible for arrays).
Currently, metadata are only saved at the level of a Table when converting from a pandas DataFrame (in Table.from_pandas()). That could continue to be the case, where Table.from_pandas both stores metadata about the original pandas dtype, and then uses the protocol to get an arrow array from the values of each column.

I don't think Arrow currently makes use of the Column's field metadata?

Joris

On Thu, 9 May 2019 at 18:20, Antoine Pitrou <anto...@python.org> wrote:

>
> Arrow arrays don't have metadata, so if you want to pass metadata around
> you should at least add a hook for columns as well.
>
> Regards
>
> Antoine.
>
>
> On 09/05/2019 at 18:10, Joris Van den Bossche wrote:
> > An additional question might be at which "level" to provide such a hook to
> > third-party packages: I proposed for Array, but what for chunked arrays,
> > columns or tables? Maybe at least returning a chunked array should also be
> > allowed.
> >
> > On Thu, 9 May 2019 at 18:06, Joris Van den Bossche <
> > jorisvandenboss...@gmail.com> wrote:
> >
> >> The signature I had in mind is something like:
> >>
> >> def __arrow_array__(self, type : pyarrow.DataType=None) -> pyarrow.Array:
> >>
> >> where the function returns a pyarrow.Array, and takes an optional data
> >> type (in case there are multiple ways to convert to a pyarrow Array, and
> >> what can be passed by the user in the type argument in pyarrow.array(..) or
> >> in a specified schema).
> >>
> >> But, the above is only for a one way path of custom array to Arrow array,
> >> and not enough for a full roundtrip.
> >>
> >> For a full roundtrip in case of a pandas DataFrame, we will still need to
> >> save information in metadata independently from __arrow_array__ and have
> >> custom code in pyarrow to deal with pandas DataFrames (of which there is
> >> already a lot). I mentioned this briefly in
> >> https://github.com/pandas-dev/pandas/issues/20612/#issuecomment-489649556
> >> / https://issues.apache.org/jira/browse/ARROW-2428, but one option could
> >> be to save the name of the pandas extension dtype in the pandas_metadata of
> >> an arrow Table (just as already happens for currently supported types), and
> >> when exporting back to pandas with to_pandas pyarrow could check if this
> >> extension dtype name is registered with pandas and if so, call a method
> >> there to construct it.
> >>
> >> Joris
> >>
> >> On Thu, 9 May 2019 at 17:38, Antoine Pitrou <anto...@python.org> wrote:
> >>
> >>>
> >>> Hi Joris,
> >>>
> >>> Do you have a signature for __arrow_array__ method in mind?
> >>>
> >>> For example, let's say you want to roundtrip ExtensionArrays or other
> >>> third-party data through Arrow. How do you preserve the required
> >>> metadata?
> >>>
> >>> Regards
> >>>
> >>> Antoine.
> >>>
> >>>
> >>> On 09/05/2019 at 13:29, Joris Van den Bossche wrote:
> >>>> Hi all,
> >>>>
> >>>> I want to propose an interface to allow custom array objects in Python to
> >>>> define how they should be converted to Arrow arrays (e.g. in
> >>>> pyarrow.array(..)). I opened
> >>>> https://issues.apache.org/jira/browse/ARROW-5271 for this.
> >>>> This would be similar to the numpy __array__ protocol (so we could e.g. call
> >>>> it __arrow_array__).
> >>>> Feedback / discussion very welcome!
> >>>>
> >>>> I am coming to this discussion specifically from the point of view of
> >>>> pandas ExtensionArrays (github issue for this:
> >>>> https://github.com/pandas-dev/pandas/issues/20612/#issuecomment-489649556
> >>>> ).
> >>>> Such a protocol would, for example, make it possible that pandas users can
> >>>> save DataFrames with ExtensionArrays (e.g. the nullable integers) to parquet,
> >>>> without the need for pyarrow to know about all those possible different
> >>>> extension arrays. This would also be useful for projects extending pandas
> >>>> such as GeoPandas <https://github.com/geopandas/geopandas> and Fletcher
> >>>> <https://github.com/xhochy/fletcher>.
> >>>> But I suppose it could also be of more general interest to other
> >>>> array-like / pandas-like projects that want to interface with arrow.
> >>>>
> >>>> Sidenote: for the pandas case, I want to look at the full roundtrip, so also
> >>>> the conversion back from an arrow Table to DataFrame. For that aspect there
> >>>> is https://issues.apache.org/jira/browse/ARROW-2428, but this is much more
> >>>> specific to pandas and its ExtensionArrays.
> >>>>
> >>>> Regards,
> >>>> Joris
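
For illustration only, here is a rough pure-Python sketch of the split discussed in this thread: the __arrow_array__ hook handles the one-way conversion of values, while the name of the extension dtype is stored separately at the table level and looked up in a registry on the way back. Only the __arrow_array__(self, type=None) signature comes from the thread. Everything else (PlainArray, to_plain_array, EXTENSION_DTYPES, NullableInts, table_from_frame, frame_from_table) is a hypothetical stand-in invented for this sketch, not pyarrow or pandas API, and the real metadata layout would likely differ.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


class PlainArray:
    """Hypothetical stand-in for pyarrow.Array: values plus a type name, no metadata."""

    def __init__(self, values: List[Any], type_name: str) -> None:
        self.values = values
        self.type_name = type_name


def to_plain_array(obj: Any, type_name: Optional[str] = None) -> PlainArray:
    """Hypothetical stand-in for pyarrow.array(..): defer to the object's hook if present."""
    hook = getattr(obj, "__arrow_array__", None)
    if hook is not None:
        return hook(type=type_name)
    return PlainArray(list(obj), type_name or "object")


# Hypothetical registry: extension dtype name -> constructor taking a PlainArray.
EXTENSION_DTYPES: Dict[Optional[str], Callable[[PlainArray], Any]] = {}


class NullableInts:
    """Toy third-party array: integers with None marking missing values."""

    dtype_name = "toy_nullable_int"

    def __init__(self, values: List[Optional[int]]) -> None:
        self.values = list(values)

    def __arrow_array__(self, type: Optional[str] = None) -> PlainArray:
        # One-way conversion only: nothing about the original dtype travels here.
        return PlainArray(self.values, type or "int64")

    @classmethod
    def from_plain(cls, arr: PlainArray) -> "NullableInts":
        return cls(arr.values)


EXTENSION_DTYPES[NullableInts.dtype_name] = NullableInts.from_plain


def table_from_frame(frame: Dict[str, Any]) -> Tuple[Dict[str, PlainArray], Dict[str, Optional[str]]]:
    """Stand-in for Table.from_pandas: values via the hook, dtype names in table-level metadata."""
    columns = {name: to_plain_array(col) for name, col in frame.items()}
    metadata = {name: getattr(col, "dtype_name", None) for name, col in frame.items()}
    return columns, metadata


def frame_from_table(columns: Dict[str, PlainArray], metadata: Dict[str, Optional[str]]) -> Dict[str, Any]:
    """Stand-in for to_pandas: rebuild a column only if its dtype name is registered."""
    frame: Dict[str, Any] = {}
    for name, arr in columns.items():
        constructor = EXTENSION_DTYPES.get(metadata.get(name))
        frame[name] = constructor(arr) if constructor else arr.values
    return frame


columns, metadata = table_from_frame({"a": NullableInts([1, None, 3]), "b": [1.5, 2.5]})
restored = frame_from_table(columns, metadata)
```

In this sketch the array hook never sees or returns metadata; column "a" comes back as a NullableInts only because its dtype name was recorded next to the table and is found in the registry, while column "b" falls back to a plain list.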