I'm good with this. The consistent separator is a good improvement.

On Thu, Jun 6, 2019 at 1:06 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hey Jacques,
>
> On Thu, Jun 6, 2019 at 12:53 PM Jacques Nadeau <jacq...@apache.org> wrote:
> >
> > Thanks for pushing this along. I think it is important. Sorry I'm coming
> > late to the conversation. Couple thoughts:
> >
> > - Should we reconsider having this be an independent optional field as
> > opposed to overloading customer_metadata? It avoids having the weird
> string
> > prefixing behavior
>
> This is one option that we've discussed. The downside of this is that
> it becomes another piece of metadata that Arrow implementations need
> to mind when they are passing through IPC messages. The idea is that
> "dumb" readers can simply ignore the metadata but pass it along in a
> subsequent message. For example, suppose a simplistic data
> service/microservice that evaluates a filter against record batches
> coming through. There might be columns with extension types that come
> through that the service does not recognize.
>
> In some implementations the custom_metadata member is preserved in
> schemas and survives IPC round trips, but this is a feature that IMHO
> should be implemented consistently in all Arrow implementations. For
> example, I believe that Java drops the custom_metadata as soon as the
> IPC protocol is parsed.
>
> Admittedly, this is not a huge issue, so if you had an extra member of
> Field like
>
> table ExtensionType {
>   name: string
>   metadata: string
> }
>
> ...
>
> table Field {
>   ...
>   custom_type : ExtensionType
> }
>
> then that would work, too. It's more obtrusive to implementations as
> readers that do not recognize a type should still mind this metadata
> and pass it along in subsequent messages. If we embed in
> custom_metadata then this happens automatically (assuming that
> custom_metadata is preserved...)
>
> > - I'd be inclined to be much more stringent about type naming. Maybe even
> > make the name multiple parts to force the issue?
>
> I just updated my PR https://github.com/apache/arrow/pull/4332 to say
> also that colon ":" is the designated namespace separator and I've
> made the metadata keys
>
> ARROW:extension:name
> ARROW:extension:metadata
>
> As far as the actual type name, since it's application-defined, it
> might be better to leave this up to the developer-user. If we defined
> any "built-in extension types" (things like UUID come to mind) we
> might want to have a pseudo-namespace like "builtin.uuid",
> "builtin.ipv6", etc. for these
>
> Let me know what you think -- it would be great to start a vote on this
> soon.
>
> Thanks
> Wes
>
> >
> > On Mon, Jun 3, 2019 at 12:08 PM Wes McKinney <wesmck...@gmail.com>
> wrote:
> >
> > > hi Micah,
> > >
> > > I have just updated my PR per your comments with more examples of
> > > extension types.
> > >
> > > https://github.com/apache/arrow/pull/4332
> > >
> > > Are there more comments about this? I can start a vote in a couple of
> > > days absent further opinions.
> > >
> > > Can someone volunteer to review David's Java PR? I would like to move
> > > this along so we have a chance of having working extension types in
> > > the 0.14 release. A number of people are also interested in bridging
> > > between pandas's ExtensionArray facility (for custom DataFrame column
> > > types [1]) and Arrow's ExtensionType
> > >
> > > Thanks
> > > Wes
> > >
> > > [1]:
> > >
> https://pandas.pydata.org/pandas-docs/stable/development/extending.html
> > >
> > > On Sat, May 18, 2019 at 6:25 PM Micah Kornfield <emkornfi...@gmail.com
> >
> > > wrote:
> > > >
> > > > Hi Wes,
> > > > Like I said I think this approach looks good, I think what I'm
> looking
> > > for is a little more documentation/examples on how additional types
> would
> > > be handled.  I think Tensor would be a good example, we also had
> questions
> > > about INET addresses previously, maybe this would be a another good
> > > illustrative example.  Providing examples of serialized metadata in the
> > > docs would be useful (clarifying that these are opaque binary blobs,
> that
> > > will be passed along to extension type factories?)
> > > >
> > > > In this regard, I think it might be good to provide a further
> > > recommendations for the name of extension types:  What do you think
> about
> > > recommend organization/projects namespace them to according to some
> > > convention, so that there aren't conflicts and extensions can be
> shared?
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > >
> > > >
> > > > On Sat, May 18, 2019 at 12:00 PM Wes McKinney <wesmck...@gmail.com>
> > > wrote:
> > > >>
> > > >>
> > > >>
> > > >> On Sat, May 18, 2019, 1:58 PM Wes McKinney <wesmck...@gmail.com>
> wrote:
> > > >>>
> > > >>> Hi Micah,
> > > >>>
> > > >>> The use cases I'm aware of are mostly coming from proprietary
> > > applications. My idea was for the extension metadata to be as
> unobtrusive
> > > as possible. The only alternative as I see it would be to have an
> Extension
> > > value in the Type union which would be more intrusive to applications
> > > handling data for which they have no special handling. That doesn't
> seem
> > > desirable if there are alternatives.
> > > >>
> > > >>
> > > >> The other (3rd) option would be to add an extra member to Field.
> This
> > > is also a bit more intrusive than having fields in the custom_metadata
> > > dictionary.
> > > >>
> > > >>>
> > > >>> As an immediate use case we could use extension types to embed
> Tensor
> > > values in Binary arrays.
> > > >>>
> > > >>> Wes
> > > >>>
> > > >>> On Sat, May 18, 2019, 12:19 PM Micah Kornfield <
> emkornfi...@gmail.com>
> > > wrote:
> > > >>>>
> > > >>>> Hi Wes,
> > > >>>> This approach seems reasonable to me.  I'm a little concerned we
> > > haven't
> > > >>>> validated many use-cases against the approach (but I don't see any
> > > obvious
> > > >>>> flaws).
> > > >>>>
> > > >>>> Thanks,
> > > >>>> Micah
> > > >>>>
> > > >>>> On Fri, May 17, 2019 at 5:16 AM Wes McKinney <wesmck...@gmail.com
> >
> > > wrote:
> > > >>>>
> > > >>>> > As Micah brought up, as part of this we would like to formalize
> the
> > > >>>> > use of "ARROW:" as a reserved metadata key prefix. This is
> similar
> > > to
> > > >>>> > Apache Avro which uses "avro." as a reserved prefix [1]. If
> someone
> > > >>>> > has a different idea about what the prefix should be I'm open to
> > > other
> > > >>>> > ideas
> > > >>>> >
> > > >>>> > [1] :
> > > https://avro.apache.org/docs/1.8.2/spec.html#Object+Container+Files
> > > >>>> >
> > > >>>> > On Thu, May 16, 2019 at 7:29 PM Wes McKinney <
> wesmck...@gmail.com>
> > > wrote:
> > > >>>> > >
> > > >>>> > > hi folks,
> > > >>>> > >
> > > >>>> > > In a prior mailing list thread from February [1] I brought up
> some
> > > >>>> > > work I'd done in C++ to create an API to define custom data
> types
> > > that
> > > >>>> > > can be embedded in built-in Arrow logical types. These are
> > > serialized
> > > >>>> > > through IPC by adding special fields to the `custom_metadata`
> > > member
> > > >>>> > > of Field in the Flatbuffers metadata [2]. The idea is that if
> an
> > > >>>> > > implementation does not understand the custom type, then they
> can
> > > >>>> > > still interact with the underlying data if need be, or pass
> on the
> > > >>>> > > extension metadata in subsequent IPC messages.
> > > >>>> > >
> > > >>>> > > David Li has put up a WIP PR to implement this for Java [4],
> so to
> > > >>>> > > help the project move forward I think it's a good time to
> > > formalize
> > > >>>> > > this, and if there are disagreements to hash them out now. I
> have
> > > just
> > > >>>> > > opened a PR to the Arrow specification documents [3] that
> > > describes
> > > >>>> > > the current state of C++ and also the WIP Java PR.
> > > >>>> > >
> > > >>>> > > Any thought about this? If there is consensus about this
> solution
> > > >>>> > > approach then I can hold a vote.
> > > >>>> > >
> > > >>>> > > Thanks
> > > >>>> > > Wes
> > > >>>> > >
> > > >>>> > > [1]:
> > > >>>> >
> > >
> https://lists.apache.org/thread.html/f1fc039471a8a9c06f2f9600296a20d4eb3fda379b23685f809118ee@%3Cdev.arrow.apache.org%3E
> > > >>>> > > [2]:
> > > https://github.com/apache/arrow/blob/master/format/Schema.fbs#L291
> > > >>>> > > [3]: https://github.com/apache/arrow/pull/4332
> > > >>>> > > [4]: https://github.com/apache/arrow/pull/4251
> > > >>>> >
> > >
>

Reply via email to