Re: [Discuss] [Python] protocol for conversion to pyarrow Array

Joris Van den Bossche Fri, 10 May 2019 00:00:52 -0700

My initial idea was to not let this protocol pass metadata around (which
indeed is not possible for arrays).


Currently, metadata are only saved at the level of a Table when converting
from a pandas DataFrame (in Table.from_pandas()). That could continue to be
the case, where Table.from_pandas both stores metadata about the original
pandas dtype, and then uses the protocol to get an arrow array from the
values of each column.
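
A minimal sketch of that split, using pure-Python stand-ins rather than real
pyarrow or pandas calls. FakeArrowArray, NullableInts, to_arrow_array and
table_from_frame are hypothetical names introduced only for illustration; the
only name taken from this thread is the proposed __arrow_array__ hook. The
point it tries to show is that the hook returns values only, while the dtype
information is recorded once at the table level.

```python
# Pure-Python stand-ins: nothing here calls pyarrow or pandas.
# All class and function names below are hypothetical.


class FakeArrowArray:
    """Stand-in for an Arrow array: values and a type, no metadata slot."""

    def __init__(self, values, type_name):
        self.values = list(values)
        self.type_name = type_name


class NullableInts:
    """Hypothetical third-party array that carries its own dtype name."""

    dtype_name = "Int64"

    def __init__(self, values):
        self.values = list(values)

    def __arrow_array__(self, type=None):
        # The hook only returns an array; it has nowhere to attach metadata.
        return FakeArrowArray(self.values, type or "int64")


def to_arrow_array(obj, type=None):
    """Use the object's hook if it defines one, else convert it as a plain sequence."""
    hook = getattr(obj, "__arrow_array__", None)
    if hook is not None:
        return hook(type=type)
    return FakeArrowArray(obj, type or "unknown")


def table_from_frame(frame):
    """Convert each column via the hook and store dtype names at table level."""
    columns = {name: to_arrow_array(col) for name, col in frame.items()}
    metadata = {
        name: getattr(col, "dtype_name", "object") for name, col in frame.items()
    }
    return {"columns": columns, "metadata": metadata}


table = table_from_frame({"a": NullableInts([1, None, 3]), "b": [0.5, 1.5, 2.5]})
print(table["metadata"])
```

In this sketch the per-column arrays stay metadata-free, which matches the
constraint that Arrow arrays cannot carry metadata themselves.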

I don't think Arrow currently makes use of the Column's field metadata?

Joris

On Thu, 9 May 2019 at 18:20, Antoine Pitrou <anto...@python.org> wrote:

>
> Arrow arrays don't have metadata, so if you want to pass metadata around
> you should at least add a hook for columns as well.
>
> Regards
>
> Antoine.
>
>
> On 09/05/2019 at 18:10, Joris Van den Bossche wrote:
> > An additional question might be at which "level" to provide such a hook to
> > third-party packages: I proposed it for Array, but what about chunked
> > arrays, columns or tables? Maybe at least returning a chunked array should
> > also be allowed.
> >
> > On Thu, 9 May 2019 at 18:06, Joris Van den Bossche <
> > jorisvandenboss...@gmail.com> wrote:
> >
> >> The signature I had in mind is something like:
> >>
> >> def __arrow_array__(self, type: pyarrow.DataType = None) -> pyarrow.Array:
> >>
> >> where the function returns a pyarrow.Array and takes an optional data
> >> type (for cases where there are multiple ways to convert to a pyarrow
> >> Array; this is what the user can pass in the type argument of
> >> pyarrow.array(..) or in a specified schema).
> >>
> >> But the above only covers the one-way path from a custom array to an Arrow
> >> array, and is not enough for a full roundtrip.
> >>
> >> For a full roundtrip in case of a pandas DataFrame, we will still need to
> >> save information in metadata independently from __arrow_array__ and have
> >> custom code in pyarrow to deal with pandas DataFrames (of which there is
> >> already a lot). I mentioned this briefly in
> >> https://github.com/pandas-dev/pandas/issues/20612/#issuecomment-489649556
> >> / https://issues.apache.org/jira/browse/ARROW-2428, but one option could
> >> be to save the name of the pandas extension dtype in the pandas_metadata of
> >> an arrow Table (just as already happens for currently supported types), and
> >> when exporting back to pandas with to_pandas pyarrow could check if this
> >> extension dtype name is registered with pandas and if so, call a method
> >> there to construct it.
> >>
> >> Joris
> >>
> >> On Thu, 9 May 2019 at 17:38, Antoine Pitrou <anto...@python.org> wrote:
> >>
> >>>
> >>> Hi Joris,
> >>>
> >>> Do you have a signature for the __arrow_array__ method in mind?
> >>>
> >>> For example, let's say you want to roundtrip ExtensionArrays or other
> >>> third-party data through Arrow.  How do you preserve the required
> >>> metadata?
> >>>
> >>> Regards
> >>>
> >>> Antoine.
> >>>
> >>>
> >>> On 09/05/2019 at 13:29, Joris Van den Bossche wrote:
> >>>> Hi all,
> >>>>
> >>>> I want to propose an interface to allow custom array objects in Python to
> >>>> define how they should be converted to Arrow arrays (e.g. in
> >>>> pyarrow.array(..)). I opened
> >>>> https://issues.apache.org/jira/browse/ARROW-5271 for this.
> >>>> This would be similar to the numpy __array__ protocol (so we could e.g.
> >>>> call it __arrow_array__).
> >>>> Feedback / discussion very welcome!
> >>>>
> >>>> I am coming to this discussion specifically from the point of view of
> >>>> pandas ExtensionArrays (github issue for this:
> >>>> https://github.com/pandas-dev/pandas/issues/20612/#issuecomment-489649556).
> >>>> Such a protocol would, for example, make it possible for pandas users to
> >>>> save DataFrames with ExtensionArrays (e.g. the nullable integers) to
> >>>> parquet, without the need for pyarrow to know about all those possible
> >>>> different extension arrays. This would also be useful for projects
> >>>> extending pandas such as GeoPandas
> >>>> <https://github.com/geopandas/geopandas> and Fletcher
> >>>> <https://github.com/xhochy/fletcher>.
> >>>> But I suppose it could also be of interest more generally to other
> >>>> array-like / pandas-like projects that want to interface with arrow.
> >>>>
> >>>> Sidenote: for the pandas case, I want to look at the full roundtrip, so
> >>>> also the conversion back from an arrow Table to DataFrame. For that aspect
> >>>> there is https://issues.apache.org/jira/browse/ARROW-2428, but this is
> >>>> much more specific to pandas and its ExtensionArrays.
> >>>>
> >>>> Regards,
> >>>> Joris
> >>>>
> >>>
> >>
> >
>
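
The round-trip idea quoted above (record the extension dtype name in the
table-level metadata, then look that name up in a registry when converting
back) could be sketched roughly as follows. This is an illustration only,
written with pure-Python stand-ins: EXTENSION_REGISTRY,
register_extension_dtype, NullableInts and column_from_table are hypothetical
names, not pandas or pyarrow API, and the table is modelled as a plain dict.

```python
# Hypothetical stand-ins only; this is not pandas or pyarrow API.
EXTENSION_REGISTRY = {}


def register_extension_dtype(name, constructor):
    """Record how to rebuild an extension array from plain values."""
    EXTENSION_REGISTRY[name] = constructor


class NullableInts:
    """Hypothetical extension array identified by its dtype name."""

    dtype_name = "Int64"

    def __init__(self, values):
        self.values = list(values)


register_extension_dtype(NullableInts.dtype_name, NullableInts)


def column_from_table(values, dtype_name):
    """Rebuild the original array type if its dtype name is registered."""
    constructor = EXTENSION_REGISTRY.get(dtype_name)
    if constructor is None:
        # Unknown or built-in dtype name: fall back to a plain list.
        return list(values)
    return constructor(values)


# A table modelled as plain columns plus table-level metadata.
table = {
    "columns": {"a": [1, None, 3], "b": [0.5, 1.5]},
    "metadata": {"a": "Int64", "b": "float64"},
}

frame = {
    name: column_from_table(values, table["metadata"][name])
    for name, values in table["columns"].items()
}
print(type(frame["a"]).__name__, type(frame["b"]).__name__)
```

Keying the lookup on the dtype name keeps the conversion code unaware of any
specific extension type, which is the property the proposal is after.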
