Great, thanks Jacques. I'll kick off a vote thread so we can hopefully
get this approved

On Fri, Jun 7, 2019 at 3:27 PM Jacques Nadeau <jacq...@apache.org> wrote:
>
> I'm good with this. The consistent separator is a good improvement.
>
> On Thu, Jun 6, 2019 at 1:06 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hey Jacques,
> >
> > On Thu, Jun 6, 2019 at 12:53 PM Jacques Nadeau <jacq...@apache.org> wrote:
> > >
> > > Thanks for pushing this along. I think it is important. Sorry I'm coming
> > > late to the conversation. Couple thoughts:
> > >
> > > - Should we reconsider having this be an independent optional field as
> > > opposed to overloading customer_metadata? It avoids having the weird
> > string
> > > prefixing behavior
> >
> > This is one option that we've discussed. The downside of this is that
> > it becomes another piece of metadata that Arrow implementations need
> > to mind when they are passing through IPC messages. The idea is that
> > "dumb" readers can simply ignore the metadata but pass it along in a
> > subsequent message. For example, suppose a simplistic data
> > service/microservice that evaluates a filter against record batches
> > coming through. There might be columns with extension types that come
> > through that the service does not recognize.
> >
> > In some implementations the custom_metadata member is preserved in
> > schemas and survives IPC round trips, but this is a feature that IMHO
> > should be implemented consistently in all Arrow implementations. For
> > example, I believe that Java drops the custom_metadata as soon as the
> > IPC protocol is parsed.
> >
> > Admittedly, this is not a huge issue, so if you had an extra member of
> > Field like
> >
> > table ExtensionType {
> >   name: string
> >   metadata: string
> > }
> >
> > ...
> >
> > table Field {
> >   ...
> >   custom_type : ExtensionType
> > }
> >
> > then that would work, too. It's more obtrusive to implementations as
> > readers that do not recognize a type should still mind this metadata
> > and pass it along in subsequent messages. If we embed in
> > custom_metadata then this happens automatically (assuming that
> > custom_metadata is preserved...)
> >
> > > - I'd be inclined to be much more stringent about type naming. Maybe even
> > > make the name multiple parts to force the issue?
> >
> > I just updated my PR https://github.com/apache/arrow/pull/4332 to say
> > also that colon ":" is the designated namespace separator and I've
> > made the metadata keys
> >
> > ARROW:extension:name
> > ARROW:extension:metadata
> >
> > As far as the actual type name, since it's application-defined, it
> > might be better to leave this up to the developer-user. If we defined
> > any "built-in extension types" (things like UUID come to mind) we
> > might want to have a pseudo-namespace like "builtin.uuid",
> > "builtin.ipv6", etc. for these
> >
> > Let me know what you think -- it would be great to start a vote on this
> > soon.
> >
> > Thanks
> > Wes
> >
> > >
> > > On Mon, Jun 3, 2019 at 12:08 PM Wes McKinney <wesmck...@gmail.com>
> > wrote:
> > >
> > > > hi Micah,
> > > >
> > > > I have just updated my PR per your comments with more examples of
> > > > extension types.
> > > >
> > > > https://github.com/apache/arrow/pull/4332
> > > >
> > > > Are there more comments about this? I can start a vote in a couple of
> > > > days absent further opinions.
> > > >
> > > > Can someone volunteer to review David's Java PR? I would like to move
> > > > this along so we have a chance of having working extension types in
> > > > the 0.14 release. A number of people are also interested in bridging
> > > > between pandas's ExtensionArray facility (for custom DataFrame column
> > > > types [1]) and Arrow's ExtensionType
> > > >
> > > > Thanks
> > > > Wes
> > > >
> > > > [1]:
> > > >
> > https://pandas.pydata.org/pandas-docs/stable/development/extending.html
> > > >
> > > > On Sat, May 18, 2019 at 6:25 PM Micah Kornfield <emkornfi...@gmail.com
> > >
> > > > wrote:
> > > > >
> > > > > Hi Wes,
> > > > > Like I said I think this approach looks good, I think what I'm
> > looking
> > > > for is a little more documentation/examples on how additional types
> > would
> > > > be handled.  I think Tensor would be a good example, we also had
> > questions
> > > > about INET addresses previously, maybe this would be a another good
> > > > illustrative example.  Providing examples of serialized metadata in the
> > > > docs would be useful (clarifying that these are opaque binary blobs,
> > that
> > > > will be passed along to extension type factories?)
> > > > >
> > > > > In this regard, I think it might be good to provide a further
> > > > recommendations for the name of extension types:  What do you think
> > about
> > > > recommend organization/projects namespace them to according to some
> > > > convention, so that there aren't conflicts and extensions can be
> > shared?
> > > > >
> > > > > Thanks,
> > > > > Micah
> > > > >
> > > > >
> > > > >
> > > > > On Sat, May 18, 2019 at 12:00 PM Wes McKinney <wesmck...@gmail.com>
> > > > wrote:
> > > > >>
> > > > >>
> > > > >>
> > > > >> On Sat, May 18, 2019, 1:58 PM Wes McKinney <wesmck...@gmail.com>
> > wrote:
> > > > >>>
> > > > >>> Hi Micah,
> > > > >>>
> > > > >>> The use cases I'm aware of are mostly coming from proprietary
> > > > applications. My idea was for the extension metadata to be as
> > unobtrusive
> > > > as possible. The only alternative as I see it would be to have an
> > Extension
> > > > value in the Type union which would be more intrusive to applications
> > > > handling data for which they have no special handling. That doesn't
> > seem
> > > > desirable if there are alternatives.
> > > > >>
> > > > >>
> > > > >> The other (3rd) option would be to add an extra member to Field.
> > This
> > > > is also a bit more intrusive than having fields in the custom_metadata
> > > > dictionary.
> > > > >>
> > > > >>>
> > > > >>> As an immediate use case we could use extension types to embed
> > Tensor
> > > > values in Binary arrays.
> > > > >>>
> > > > >>> Wes
> > > > >>>
> > > > >>> On Sat, May 18, 2019, 12:19 PM Micah Kornfield <
> > emkornfi...@gmail.com>
> > > > wrote:
> > > > >>>>
> > > > >>>> Hi Wes,
> > > > >>>> This approach seems reasonable to me.  I'm a little concerned we
> > > > haven't
> > > > >>>> validated many use-cases against the approach (but I don't see any
> > > > obvious
> > > > >>>> flaws).
> > > > >>>>
> > > > >>>> Thanks,
> > > > >>>> Micah
> > > > >>>>
> > > > >>>> On Fri, May 17, 2019 at 5:16 AM Wes McKinney <wesmck...@gmail.com
> > >
> > > > wrote:
> > > > >>>>
> > > > >>>> > As Micah brought up, as part of this we would like to formalize
> > the
> > > > >>>> > use of "ARROW:" as a reserved metadata key prefix. This is
> > similar
> > > > to
> > > > >>>> > Apache Avro which uses "avro." as a reserved prefix [1]. If
> > someone
> > > > >>>> > has a different idea about what the prefix should be I'm open to
> > > > other
> > > > >>>> > ideas
> > > > >>>> >
> > > > >>>> > [1] :
> > > > https://avro.apache.org/docs/1.8.2/spec.html#Object+Container+Files
> > > > >>>> >
> > > > >>>> > On Thu, May 16, 2019 at 7:29 PM Wes McKinney <
> > wesmck...@gmail.com>
> > > > wrote:
> > > > >>>> > >
> > > > >>>> > > hi folks,
> > > > >>>> > >
> > > > >>>> > > In a prior mailing list thread from February [1] I brought up
> > some
> > > > >>>> > > work I'd done in C++ to create an API to define custom data
> > types
> > > > that
> > > > >>>> > > can be embedded in built-in Arrow logical types. These are
> > > > serialized
> > > > >>>> > > through IPC by adding special fields to the `custom_metadata`
> > > > member
> > > > >>>> > > of Field in the Flatbuffers metadata [2]. The idea is that if
> > an
> > > > >>>> > > implementation does not understand the custom type, then they
> > can
> > > > >>>> > > still interact with the underlying data if need be, or pass
> > on the
> > > > >>>> > > extension metadata in subsequent IPC messages.
> > > > >>>> > >
> > > > >>>> > > David Li has put up a WIP PR to implement this for Java [4],
> > so to
> > > > >>>> > > help the project move forward I think it's a good time to
> > > > formalize
> > > > >>>> > > this, and if there are disagreements to hash them out now. I
> > have
> > > > just
> > > > >>>> > > opened a PR to the Arrow specification documents [3] that
> > > > describes
> > > > >>>> > > the current state of C++ and also the WIP Java PR.
> > > > >>>> > >
> > > > >>>> > > Any thought about this? If there is consensus about this
> > solution
> > > > >>>> > > approach then I can hold a vote.
> > > > >>>> > >
> > > > >>>> > > Thanks
> > > > >>>> > > Wes
> > > > >>>> > >
> > > > >>>> > > [1]:
> > > > >>>> >
> > > >
> > https://lists.apache.org/thread.html/f1fc039471a8a9c06f2f9600296a20d4eb3fda379b23685f809118ee@%3Cdev.arrow.apache.org%3E
> > > > >>>> > > [2]:
> > > > https://github.com/apache/arrow/blob/master/format/Schema.fbs#L291
> > > > >>>> > > [3]: https://github.com/apache/arrow/pull/4332
> > > > >>>> > > [4]: https://github.com/apache/arrow/pull/4251
> > > > >>>> >
> > > >
> >

Reply via email to