hi Micah,

I have just updated my PR per your comments with more examples of
extension types.

https://github.com/apache/arrow/pull/4332

Are there more comments about this? I can start a vote in a couple of
days absent further opinions.

Can someone volunteer to review David's Java PR? I would like to move
this along so we have a chance of having working extension types in
the 0.14 release. A number of people are also interested in bridging
between pandas's ExtensionArray facility (for custom DataFrame column
types [1]) and Arrow's ExtensionType.

Thanks
Wes

[1]: https://pandas.pydata.org/pandas-docs/stable/development/extending.html

On Sat, May 18, 2019 at 6:25 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Hi Wes,
> Like I said, I think this approach looks good. What I'm looking for is a
> little more documentation and examples of how additional types would be
> handled. I think Tensor would be a good example. We also had questions
> about INET addresses previously, so maybe that would be another good
> illustrative example. Providing examples of serialized metadata in the docs
> would be useful (clarifying that these are opaque binary blobs that will be
> passed along to extension type factories?).
>
> In this regard, I think it might be good to provide further recommendations
> for naming extension types. What do you think about recommending that
> organizations/projects namespace them according to some convention, so that
> there aren't conflicts and extensions can be shared?
>
> Thanks,
> Micah
>
>
>
> On Sat, May 18, 2019 at 12:00 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>>
>>
>> On Sat, May 18, 2019, 1:58 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>>
>>> Hi Micah,
>>>
>>> The use cases I'm aware of are mostly coming from proprietary applications. 
>>> My idea was for the extension metadata to be as unobtrusive as possible. 
>>> The only alternative as I see it would be to have an Extension value in the 
>>> Type union which would be more intrusive to applications handling data for 
>>> which they have no special handling. That doesn't seem desirable if there 
>>> are alternatives.
>>
>>
>> The other (3rd) option would be to add an extra member to Field. This is 
>> also a bit more intrusive than having fields in the custom_metadata 
>> dictionary.
>>
>>>
>>> As an immediate use case we could use extension types to embed Tensor 
>>> values in Binary arrays.
>>>
>>> Wes
>>>
>>> On Sat, May 18, 2019, 12:19 PM Micah Kornfield <emkornfi...@gmail.com> 
>>> wrote:
>>>>
>>>> Hi Wes,
>>>> This approach seems reasonable to me.  I'm a little concerned we haven't
>>>> validated many use-cases against the approach (but I don't see any obvious
>>>> flaws).
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> On Fri, May 17, 2019 at 5:16 AM Wes McKinney <wesmck...@gmail.com> wrote:
>>>>
>>>> > As Micah brought up, as part of this we would like to formalize the
>>>> > use of "ARROW:" as a reserved metadata key prefix. This is similar to
>>>> > Apache Avro, which uses "avro." as a reserved prefix [1]. If someone
>>>> > has a different idea about what the prefix should be, I'm open to
>>>> > suggestions.
>>>> >
>>>> > [1] : https://avro.apache.org/docs/1.8.2/spec.html#Object+Container+Files
>>>> >
>>>> > On Thu, May 16, 2019 at 7:29 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>>> > >
>>>> > > hi folks,
>>>> > >
>>>> > > In a prior mailing list thread from February [1] I brought up some
>>>> > > work I'd done in C++ to create an API to define custom data types that
>>>> > > can be embedded in built-in Arrow logical types. These are serialized
>>>> > > through IPC by adding special fields to the `custom_metadata` member
>>>> > > of Field in the Flatbuffers metadata [2]. The idea is that if an
>>>> > > implementation does not understand the custom type, then they can
>>>> > > still interact with the underlying data if need be, or pass on the
>>>> > > extension metadata in subsequent IPC messages.
>>>> > >
>>>> > > David Li has put up a WIP PR to implement this for Java [4], so to
>>>> > > help the project move forward I think it's a good time to formalize
>>>> > > this, and if there are disagreements to hash them out now. I have just
>>>> > > opened a PR to the Arrow specification documents [3] that describes
>>>> > > the current state of C++ and also the WIP Java PR.
>>>> > >
>>>> > > Any thoughts about this? If there is consensus about this approach,
>>>> > > then I can hold a vote.
>>>> > >
>>>> > > Thanks
>>>> > > Wes
>>>> > >
>>>> > > [1]:
>>>> > https://lists.apache.org/thread.html/f1fc039471a8a9c06f2f9600296a20d4eb3fda379b23685f809118ee@%3Cdev.arrow.apache.org%3E
>>>> > > [2]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L291
>>>> > > [3]: https://github.com/apache/arrow/pull/4332
>>>> > > [4]: https://github.com/apache/arrow/pull/4251
>>>> >
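
For readers following the thread, here is a minimal sketch of the mechanism
discussed above: extension information travels in a field's custom_metadata
under "ARROW:"-prefixed keys, the serialized metadata is an opaque blob handed
to a registered factory, and a reader that does not know the type falls back
to the underlying storage type while leaving the metadata in place. This is
plain Python and not the Arrow API. The Field and TensorType classes, the
registry helpers, the two key names, and the namespaced type name
"example_org.tensor" are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, Union

# Assumed key names under the reserved "ARROW:" prefix (illustrative only).
EXT_NAME_KEY = "ARROW:extension:name"
EXT_META_KEY = "ARROW:extension:metadata"


@dataclass
class Field:
    """Toy stand-in for a schema field: name, storage type, custom metadata."""
    name: str
    storage_type: str
    custom_metadata: Dict[str, bytes] = field(default_factory=dict)


@dataclass
class TensorType:
    """Hypothetical extension type: fixed-shape tensors stored as binary."""
    storage_type: str
    shape: tuple
    dtype: str

    def serialize(self) -> bytes:
        # The blob format is private to this type; other readers treat it as opaque.
        return json.dumps({"shape": list(self.shape), "dtype": self.dtype}).encode()

    @classmethod
    def deserialize(cls, storage_type: str, blob: bytes) -> "TensorType":
        params = json.loads(blob.decode())
        return cls(storage_type, tuple(params["shape"]), params["dtype"])


# Extension name -> factory taking (storage_type, serialized_metadata).
REGISTRY: Dict[str, Callable[[str, bytes], object]] = {}


def register(ext_name: str, factory: Callable[[str, bytes], object]) -> None:
    REGISTRY[ext_name] = factory


def to_field(name: str, ext_name: str, ext_type: TensorType) -> Field:
    """Attach the extension name and serialized metadata to a storage-typed field."""
    metadata = {
        EXT_NAME_KEY: ext_name.encode(),
        EXT_META_KEY: ext_type.serialize(),
    }
    return Field(name, ext_type.storage_type, metadata)


def resolve(f: Field) -> Union[str, object]:
    """Return the extension type if its name is registered, else the storage type."""
    ext_name = f.custom_metadata.get(EXT_NAME_KEY)
    if ext_name is None:
        return f.storage_type
    factory = REGISTRY.get(ext_name.decode())
    if factory is None:
        # Unknown extension: keep working with the storage type. The metadata
        # stays on the field, so it can be passed on unchanged.
        return f.storage_type
    return factory(f.storage_type, f.custom_metadata.get(EXT_META_KEY, b""))
```

With this sketch, resolve(field) returns "binary" until
register("example_org.tensor", TensorType.deserialize) has been called, and a
TensorType with the original shape and dtype afterwards. In both cases the
field's custom_metadata is left untouched.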
