Thanks for pushing this along. I think it is important. Sorry I'm coming
late to the conversation. A couple of thoughts:

- Should we reconsider having this be an independent optional field as
opposed to overloading custom_metadata? It avoids having the weird string
prefixing behavior.
- I'd be inclined to be much more stringent about type naming. Maybe even
require the name to have multiple parts to force the issue?
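
To make both points concrete, here is a rough sketch in plain Python (no
Arrow API involved). It shows the string-prefix filtering that overloading
custom_metadata implies, and one possible multi-part naming check. The key
names under "ARROW:", the helper names, and the dot-separated naming rule are
all assumptions for illustration, not anything settled in the PR.

```python
# Hypothetical key names under the reserved "ARROW:" prefix (illustration only).
RESERVED_PREFIX = "ARROW:"
EXT_NAME_KEY = RESERVED_PREFIX + "extension:name"
EXT_METADATA_KEY = RESERVED_PREFIX + "extension:metadata"


def split_custom_metadata(custom_metadata):
    """Separate reserved "ARROW:" entries from user entries by string prefix."""
    reserved, user = {}, {}
    for key, value in custom_metadata.items():
        target = reserved if key.startswith(RESERVED_PREFIX) else user
        target[key] = value
    return reserved, user


def is_namespaced(type_name, min_parts=2):
    """Hypothetical strict rule: at least `min_parts` non-empty dot-separated parts."""
    parts = type_name.split(".")
    return len(parts) >= min_parts and all(parts)


# A Field's custom_metadata mixing extension entries with a user entry.
field_metadata = {
    EXT_NAME_KEY: "myorg.tensor",
    EXT_METADATA_KEY: "opaque-serialized-bytes",
    "owner": "analytics-team",
}
reserved, user = split_custom_metadata(field_metadata)
```

With a dedicated optional field the prefix scan would go away, since the
extension name and metadata would no longer share a namespace with user keys.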

On Mon, Jun 3, 2019 at 12:08 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Micah,
>
> I have just updated my PR per your comments with more examples of
> extension types.
>
> https://github.com/apache/arrow/pull/4332
>
> Are there more comments about this? I can start a vote in a couple of
> days absent further opinions.
>
> Can someone volunteer to review David's Java PR? I would like to move
> this along so we have a chance of having working extension types in
> the 0.14 release. A number of people are also interested in bridging
> between pandas's ExtensionArray facility (for custom DataFrame column
> types [1]) and Arrow's ExtensionType.
>
> Thanks
> Wes
>
> [1]:
> https://pandas.pydata.org/pandas-docs/stable/development/extending.html
>
> On Sat, May 18, 2019 at 6:25 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> >
> > Hi Wes,
> > Like I said, I think this approach looks good. What I'm looking for is a
> > little more documentation and examples of how additional types would be
> > handled. I think Tensor would be a good example. We also had questions
> > about INET addresses previously, so maybe that would be another good
> > illustrative example. Providing examples of serialized metadata in the
> > docs would be useful (clarifying that these are opaque binary blobs that
> > will be passed along to extension type factories?)
> >
> > In this regard, I think it might be good to provide further
> > recommendations for the names of extension types. What do you think about
> > recommending that organizations/projects namespace them according to some
> > convention, so that there aren't conflicts and extensions can be shared?
> >
> > Thanks,
> > Micah
> >
> >
> >
> > On Sat, May 18, 2019 at 12:00 PM Wes McKinney <wesmck...@gmail.com>
> wrote:
> >>
> >>
> >>
> >> On Sat, May 18, 2019, 1:58 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >>>
> >>> Hi Micah,
> >>>
> >>> The use cases I'm aware of are mostly coming from proprietary
> >>> applications. My idea was for the extension metadata to be as
> >>> unobtrusive as possible. The only alternative as I see it would be to
> >>> have an Extension value in the Type union which would be more intrusive
> >>> to applications handling data for which they have no special handling.
> >>> That doesn't seem desirable if there are alternatives.
> >>
> >>
> >> The other (3rd) option would be to add an extra member to Field. This
> >> is also a bit more intrusive than having fields in the custom_metadata
> >> dictionary.
> >>
> >>>
> >>> As an immediate use case we could use extension types to embed Tensor
> >>> values in Binary arrays.
> >>>
> >>> Wes
> >>>
> >>> On Sat, May 18, 2019, 12:19 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> >>>>
> >>>> Hi Wes,
> >>>> This approach seems reasonable to me.  I'm a little concerned we haven't
> >>>> validated many use-cases against the approach (but I don't see any
> >>>> obvious flaws).
> >>>>
> >>>> Thanks,
> >>>> Micah
> >>>>
> >>>> On Fri, May 17, 2019 at 5:16 AM Wes McKinney <wesmck...@gmail.com>
> wrote:
> >>>>
> >>>> > As Micah brought up, as part of this we would like to formalize the
> >>>> > use of "ARROW:" as a reserved metadata key prefix. This is similar to
> >>>> > Apache Avro which uses "avro." as a reserved prefix [1]. If someone
> >>>> > has a different idea about what the prefix should be I'm open to other
> >>>> > ideas
> >>>> >
> >>>> > [1] :
> https://avro.apache.org/docs/1.8.2/spec.html#Object+Container+Files
> >>>> >
> >>>> > On Thu, May 16, 2019 at 7:29 PM Wes McKinney <wesmck...@gmail.com>
> wrote:
> >>>> > >
> >>>> > > hi folks,
> >>>> > >
> >>>> > > In a prior mailing list thread from February [1] I brought up some
> >>>> > > work I'd done in C++ to create an API to define custom data types
> >>>> > > that can be embedded in built-in Arrow logical types. These are
> >>>> > > serialized through IPC by adding special fields to the
> >>>> > > `custom_metadata` member of Field in the Flatbuffers metadata [2].
> >>>> > > The idea is that if an implementation does not understand the
> >>>> > > custom type, then they can still interact with the underlying data
> >>>> > > if need be, or pass on the extension metadata in subsequent IPC
> >>>> > > messages.
> >>>> > >
> >>>> > > David Li has put up a WIP PR to implement this for Java [4], so to
> >>>> > > help the project move forward I think it's a good time to formalize
> >>>> > > this, and if there are disagreements to hash them out now. I have
> >>>> > > just opened a PR to the Arrow specification documents [3] that
> >>>> > > describes the current state of C++ and also the WIP Java PR.
> >>>> > >
> >>>> > > Any thoughts about this? If there is consensus about this solution
> >>>> > > approach then I can hold a vote.
> >>>> > >
> >>>> > > Thanks
> >>>> > > Wes
> >>>> > >
> >>>> > > [1]:
> >>>> >
> https://lists.apache.org/thread.html/f1fc039471a8a9c06f2f9600296a20d4eb3fda379b23685f809118ee@%3Cdev.arrow.apache.org%3E
> >>>> > > [2]:
> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L291
> >>>> > > [3]: https://github.com/apache/arrow/pull/4332
> >>>> > > [4]: https://github.com/apache/arrow/pull/4251
> >>>> >
>
