No concern from me.  It should probably be documented somewhere though :-)

Regards

Antoine.


On 16/08/2019 at 17:23, Joris Van den Bossche wrote:
> Coming back to this older thread, I have opened a PR with a proof of
> concept of the proposed protocol to convert third-party array objects to
> arrow: https://github.com/apache/arrow/pull/5106
> In the tests, I added the protocol to pandas' nullable integer array (which
> is currently not supported in the from_pandas conversion) and this converts
> now nicely without many changes.
> 
> Are there remaining concerns about such a protocol?
> 
> --
> 
> Note that the protocol is only for pandas -> arrow conversion (or other
> array-like objects -> arrow). The other way around (arrow -> pandas) is
> more complex and needs further discussion, and also involves the Arrow
> ExtensionTypes (as mentioned below by Wes).
> But I think the protocol will be useful in any case, and we can go ahead
> with that already (for example, not all pandas ExtensionArrays will need to
> map to an Arrow ExtensionType, e.g. the nullable integers simply map to
> arrow's int64, or fletcher's ExtensionArrays, which just wrap an arrow array).
> That said, I have been working on the arrow ExtensionTypes over the last few days,
> and have been keeping an overview of the issues and needed work in this
> google document:
> https://docs.google.com/document/d/1pr9PuBfXTdlUoAgyh9zPIKDJZalDLI6GuxqblMynMM8/edit?usp=sharing
> (feel free to comment on it). There is also an initial PR to extend the
> support for defining ExtensionTypes in Python (ARROW-5610
> <https://issues.apache.org/jira/browse/ARROW-5610>,
> https://github.com/apache/arrow/pull/5094).
> 
> Joris
> 
> On Fri, 17 May 2019 at 00:28, Wes McKinney <wesmck...@gmail.com> wrote:
> 
>> hi Joris,
>>
>> Somewhat related to this, I want to also point out that we have C++
>> extension types [1]. As part of this, it would also be good to define
>> and document a public API for users to create ExtensionArray
>> subclasses that can be serialized and deserialized using this
>> machinery.
>>
>> As a motivating example, suppose that a Java application has a special
>> data type that can be serialized as a Binary value in Arrow, and we
>> want to be able to receive this special object as a pandas
>> ExtensionArray column, which unboxes into a Python user-space type.
>>
>> The ExtensionType can be implemented in Java, and then on the Python
>> side the implementation can occur either in C++ or Python. An API will
>> need to be defined for serializer functions that let the pandas
>> ExtensionArray map the pandas-space type onto the Arrow-space
>> type. Does this seem like a project you might be able to help drive
>> forward? As a matter of sequencing, we do not yet have the capability
>> to interact with C++ ExtensionType in Python, so we might need to
>> first create callback machinery to enable Arrow extension types to be
>> defined in Python (that call into the C++ ExtensionType registry).
>>
>> - Wes
>>
>> [1]:
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/extension_type-test.cc
>>
>> On Fri, May 10, 2019 at 2:11 AM Joris Van den Bossche
>> <jorisvandenboss...@gmail.com> wrote:
>>>
>>> On Thu, 9 May 2019 at 21:38, Uwe L. Korn <uw...@xhochy.com> wrote:
>>>
>>>> +1 to the idea of adding a protocol to let other objects define their
>>>> way to Arrow structures. For pandas.Series I would expect that they
>>>> return an Arrow Column.
>>>>
>>>> For the Arrow->pandas conversion I have somewhat mixed feelings. In the
>>>> normal Fletcher case I would expect that we don't convert anything as
>>>> we represent anything from Arrow with it.
>>>
>>>
>>> Yes, you don't want to convert anything (apart from wrapping the arrow
>>> array into a FletcherArray). But how does Table.to_pandas know that?
>>> Maybe it doesn't need to know that. And then you might write a function
>>> in fletcher to convert a pyarrow Table to a pandas DataFrame with
>>> fletcher-backed columns. But if you want this roundtrip to happen
>>> automatically, without every project that defines an ExtensionArray
>>> and wants to interact with arrow (e.g. GeoPandas as well) needing
>>> its own "arrow-table-to-pandas-dataframe" converter, pyarrow needs
>>> to have some notion of how to convert back to a pandas ExtensionArray.
>>>
>>>
>>>> For the case where we want to restore the exact pandas DataFrame we had
>>>> before, this will become a bit more complicated, as we would either need
>>>> to have all third-party libraries support Arrow via a hook as proposed,
>>>> or we also define some kind of other protocol on the pandas side to
>>>> reconstruct ExtensionArrays from Arrow data.
>>>>
>>>
>>> That last one is basically what I proposed in
>>> https://github.com/pandas-dev/pandas/issues/20612/#issuecomment-489649556
>>>
>>> Thanks Antoine and Uwe for the discussion!
>>>
>>> Joris
>>
> 
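
For anyone reading this thread later, the pandas -> arrow direction discussed above can be pictured roughly as follows. This is only a sketch in plain Python, not the code from the PR: the hook name `__arrow_array__` and its `type` argument are assumptions (the thread does not name the method), and `SketchArrowArray`, `NullableIntArray` and `to_arrow` are hypothetical stand-ins so that the example runs without pyarrow or pandas installed.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class SketchArrowArray:
    """Hypothetical stand-in for an Arrow array: a type name plus values (None = null)."""
    type: str
    values: List[Optional[Any]]


class NullableIntArray:
    """Toy nullable integer container: data plus a boolean mask."""

    def __init__(self, data, mask):
        self.data = list(data)
        self.mask = list(mask)  # True marks a missing value

    def __arrow_array__(self, type=None):
        # Assumed hook name: the object itself decides how it maps
        # onto an Arrow-style array (masked slots become nulls).
        values = [None if m else v for v, m in zip(self.data, self.mask)]
        return SketchArrowArray(type or "int64", values)


def to_arrow(obj, type=None):
    """Convert obj, preferring the object's own hook over built-in rules."""
    hook = getattr(obj, "__arrow_array__", None)
    if hook is not None:
        result = hook(type=type)
        if not isinstance(result, SketchArrowArray):
            raise TypeError("__arrow_array__ must return a SketchArrowArray")
        return result
    # Fallback for plain sequences; no type inference is attempted here.
    return SketchArrowArray(type or "unknown", list(obj))


arr = to_arrow(NullableIntArray([1, 2, 3], [False, True, False]))
print(arr)
```

The point of this shape is that the converter never needs to know about the third-party class: any object exposing the hook is converted on its own terms, and everything else falls through to the default path. The arrow -> pandas direction has no counterpart in this sketch, which matches the open question in the thread.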
