Great, thanks Jacques. I'll kick off a vote thread so we can hopefully get this approved
On Fri, Jun 7, 2019 at 3:27 PM Jacques Nadeau <jacq...@apache.org> wrote: > > I'm good with this. The consistent separator is a good improvement. > > On Thu, Jun 6, 2019 at 1:06 PM Wes McKinney <wesmck...@gmail.com> wrote: > > > hey Jacques, > > > > On Thu, Jun 6, 2019 at 12:53 PM Jacques Nadeau <jacq...@apache.org> wrote: > > > > > > Thanks for pushing this along. I think it is important. Sorry I'm coming > > > late to the conversation. Couple thoughts: > > > > > > - Should we reconsider having this be an independent optional field as > > > opposed to overloading customer_metadata? It avoids having the weird > > string > > > prefixing behavior > > > > This is one option that we've discussed. The downside of this is that > > it becomes another piece of metadata that Arrow implementations need > > to mind when they are passing through IPC messages. The idea is that > > "dumb" readers can simply ignore the metadata but pass it along in a > > subsequent message. For example, suppose a simplistic data > > service/microservice that evaluates a filter against record batches > > coming through. There might be columns with extension types that come > > through that the service does not recognize. > > > > In some implementations the custom_metadata member is preserved in > > schemas and survives IPC round trips, but this is a feature that IMHO > > should be implemented consistently in all Arrow implementations. For > > example, I believe that Java drops the custom_metadata as soon as the > > IPC protocol is parsed. > > > > Admittedly, this is not a huge issue, so if you had an extra member of > > Field like > > > > table ExtensionType { > > name: string > > metadata: string > > } > > > > ... > > > > table Field { > > ... > > custom_type : ExtensionType > > } > > > > then that would work, too. It's more obtrusive to implementations as > > readers that do not recognize a type should still mind this metadata > > and pass it along in subsequent messages. If we embed in > > custom_metadata then this happens automatically (assuming that > > custom_metadata is preserved...) > > > > > - I'd be inclined to be much more stringent about type naming. Maybe even > > > make the name multiple parts to force the issue? > > > > I just updated my PR https://github.com/apache/arrow/pull/4332 to say > > also that colon ":" is the designated namespace separator and I've > > made the metadata keys > > > > ARROW:extension:name > > ARROW:extension:metadata > > > > As far as the actual type name, since it's application-defined, it > > might be better to leave this up to the developer-user. If we defined > > any "built-in extension types" (things like UUID come to mind) we > > might want to have a pseudo-namespace like "builtin.uuid", > > "builtin.ipv6", etc. for these > > > > Let me know what you think -- it would be great to start a vote on this > > soon. > > > > Thanks > > Wes > > > > > > > > On Mon, Jun 3, 2019 at 12:08 PM Wes McKinney <wesmck...@gmail.com> > > wrote: > > > > > > > hi Micah, > > > > > > > > I have just updated my PR per your comments with more examples of > > > > extension types. > > > > > > > > https://github.com/apache/arrow/pull/4332 > > > > > > > > Are there more comments about this? I can start a vote in a couple of > > > > days absent further opinions. > > > > > > > > Can someone volunteer to review David's Java PR? I would like to move > > > > this along so we have a chance of having working extension types in > > > > the 0.14 release. A number of people are also interested in bridging > > > > between pandas's ExtensionArray facility (for custom DataFrame column > > > > types [1]) and Arrow's ExtensionType > > > > > > > > Thanks > > > > Wes > > > > > > > > [1]: > > > > > > https://pandas.pydata.org/pandas-docs/stable/development/extending.html > > > > > > > > On Sat, May 18, 2019 at 6:25 PM Micah Kornfield <emkornfi...@gmail.com > > > > > > > wrote: > > > > > > > > > > Hi Wes, > > > > > Like I said I think this approach looks good, I think what I'm > > looking > > > > for is a little more documentation/examples on how additional types > > would > > > > be handled. I think Tensor would be a good example, we also had > > questions > > > > about INET addresses previously, maybe this would be a another good > > > > illustrative example. Providing examples of serialized metadata in the > > > > docs would be useful (clarifying that these are opaque binary blobs, > > that > > > > will be passed along to extension type factories?) > > > > > > > > > > In this regard, I think it might be good to provide a further > > > > recommendations for the name of extension types: What do you think > > about > > > > recommend organization/projects namespace them to according to some > > > > convention, so that there aren't conflicts and extensions can be > > shared? > > > > > > > > > > Thanks, > > > > > Micah > > > > > > > > > > > > > > > > > > > > On Sat, May 18, 2019 at 12:00 PM Wes McKinney <wesmck...@gmail.com> > > > > wrote: > > > > >> > > > > >> > > > > >> > > > > >> On Sat, May 18, 2019, 1:58 PM Wes McKinney <wesmck...@gmail.com> > > wrote: > > > > >>> > > > > >>> Hi Micah, > > > > >>> > > > > >>> The use cases I'm aware of are mostly coming from proprietary > > > > applications. My idea was for the extension metadata to be as > > unobtrusive > > > > as possible. The only alternative as I see it would be to have an > > Extension > > > > value in the Type union which would be more intrusive to applications > > > > handling data for which they have no special handling. That doesn't > > seem > > > > desirable if there are alternatives. > > > > >> > > > > >> > > > > >> The other (3rd) option would be to add an extra member to Field. > > This > > > > is also a bit more intrusive than having fields in the custom_metadata > > > > dictionary. > > > > >> > > > > >>> > > > > >>> As an immediate use case we could use extension types to embed > > Tensor > > > > values in Binary arrays. > > > > >>> > > > > >>> Wes > > > > >>> > > > > >>> On Sat, May 18, 2019, 12:19 PM Micah Kornfield < > > emkornfi...@gmail.com> > > > > wrote: > > > > >>>> > > > > >>>> Hi Wes, > > > > >>>> This approach seems reasonable to me. I'm a little concerned we > > > > haven't > > > > >>>> validated many use-cases against the approach (but I don't see any > > > > obvious > > > > >>>> flaws). > > > > >>>> > > > > >>>> Thanks, > > > > >>>> Micah > > > > >>>> > > > > >>>> On Fri, May 17, 2019 at 5:16 AM Wes McKinney <wesmck...@gmail.com > > > > > > > wrote: > > > > >>>> > > > > >>>> > As Micah brought up, as part of this we would like to formalize > > the > > > > >>>> > use of "ARROW:" as a reserved metadata key prefix. This is > > similar > > > > to > > > > >>>> > Apache Avro which uses "avro." as a reserved prefix [1]. If > > someone > > > > >>>> > has a different idea about what the prefix should be I'm open to > > > > other > > > > >>>> > ideas > > > > >>>> > > > > > >>>> > [1] : > > > > https://avro.apache.org/docs/1.8.2/spec.html#Object+Container+Files > > > > >>>> > > > > > >>>> > On Thu, May 16, 2019 at 7:29 PM Wes McKinney < > > wesmck...@gmail.com> > > > > wrote: > > > > >>>> > > > > > > >>>> > > hi folks, > > > > >>>> > > > > > > >>>> > > In a prior mailing list thread from February [1] I brought up > > some > > > > >>>> > > work I'd done in C++ to create an API to define custom data > > types > > > > that > > > > >>>> > > can be embedded in built-in Arrow logical types. These are > > > > serialized > > > > >>>> > > through IPC by adding special fields to the `custom_metadata` > > > > member > > > > >>>> > > of Field in the Flatbuffers metadata [2]. The idea is that if > > an > > > > >>>> > > implementation does not understand the custom type, then they > > can > > > > >>>> > > still interact with the underlying data if need be, or pass > > on the > > > > >>>> > > extension metadata in subsequent IPC messages. > > > > >>>> > > > > > > >>>> > > David Li has put up a WIP PR to implement this for Java [4], > > so to > > > > >>>> > > help the project move forward I think it's a good time to > > > > formalize > > > > >>>> > > this, and if there are disagreements to hash them out now. I > > have > > > > just > > > > >>>> > > opened a PR to the Arrow specification documents [3] that > > > > describes > > > > >>>> > > the current state of C++ and also the WIP Java PR. > > > > >>>> > > > > > > >>>> > > Any thought about this? If there is consensus about this > > solution > > > > >>>> > > approach then I can hold a vote. > > > > >>>> > > > > > > >>>> > > Thanks > > > > >>>> > > Wes > > > > >>>> > > > > > > >>>> > > [1]: > > > > >>>> > > > > > > > https://lists.apache.org/thread.html/f1fc039471a8a9c06f2f9600296a20d4eb3fda379b23685f809118ee@%3Cdev.arrow.apache.org%3E > > > > >>>> > > [2]: > > > > https://github.com/apache/arrow/blob/master/format/Schema.fbs#L291 > > > > >>>> > > [3]: https://github.com/apache/arrow/pull/4332 > > > > >>>> > > [4]: https://github.com/apache/arrow/pull/4251 > > > > >>>> > > > > > > >