hi Jacques,

On Thu, Jun 6, 2019 at 12:53 PM Jacques Nadeau <jacq...@apache.org> wrote:
>
> Thanks for pushing this along. I think it is important. Sorry I'm coming
> late to the conversation. Couple thoughts:
>
> - Should we reconsider having this be an independent optional field as
> opposed to overloading custom_metadata? It avoids having the weird string
> prefixing behavior

This is one option that we've discussed. The downside of this is that
it becomes another piece of metadata that Arrow implementations need
to mind when they are passing through IPC messages. The idea is that
"dumb" readers can simply ignore the metadata but pass it along in a
subsequent message.

For example, consider a simple data service/microservice that
evaluates a filter against record batches coming through. There might
be columns with extension types that come through that the service
does not recognize.

In some implementations the custom_metadata member is preserved in
schemas and survives IPC round trips, but this is a feature that IMHO
should be implemented consistently in all Arrow implementations. For
example, I believe that Java drops the custom_metadata as soon as the
IPC protocol is parsed.

Admittedly, this is not a huge issue, so if you had an extra member of
Field like

    table ExtensionType {
      name: string;
      metadata: string;
    }

    ...

    table Field {
      ...
      custom_type: ExtensionType;
    }

then that would work, too. It's more obtrusive to implementations, as
readers that do not recognize a type should still mind this metadata
and pass it along in subsequent messages. If we embed in
custom_metadata then this happens automatically (assuming that
custom_metadata is preserved...)

> - I'd be inclined to be much more stringent about type naming. Maybe even
> make the name multiple parts to force the issue?

I just updated my PR https://github.com/apache/arrow/pull/4332 to also
say that colon ":" is the designated namespace separator, and I've made
the metadata keys

ARROW:extension:name
ARROW:extension:metadata

As far as the actual type name, since it's application-defined, it
might be better to leave this up to the developer-user. If we defined
any "built-in extension types" (things like UUID come to mind) we
might want to have a pseudo-namespace like "builtin.uuid", "builtin.ipv6", etc.
for these

Let me know what you think -- it would be great to start a vote on this soon.

Thanks
Wes

>
> On Mon, Jun 3, 2019 at 12:08 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hi Micah,
> >
> > I have just updated my PR per your comments with more examples of
> > extension types.
> >
> > https://github.com/apache/arrow/pull/4332
> >
> > Are there more comments about this? I can start a vote in a couple of
> > days absent further opinions.
> >
> > Can someone volunteer to review David's Java PR? I would like to move
> > this along so we have a chance of having working extension types in
> > the 0.14 release. A number of people are also interested in bridging
> > between pandas's ExtensionArray facility (for custom DataFrame column
> > types [1]) and Arrow's ExtensionType
> >
> > Thanks
> > Wes
> >
> > [1]:
> > https://pandas.pydata.org/pandas-docs/stable/development/extending.html
> >
> > On Sat, May 18, 2019 at 6:25 PM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> > >
> > > Hi Wes,
> > > Like I said, I think this approach looks good. I think what I'm looking
> > > for is a little more documentation/examples on how additional types would
> > > be handled. I think Tensor would be a good example; we also had questions
> > > about INET addresses previously, maybe this would be another good
> > > illustrative example. Providing examples of serialized metadata in the
> > > docs would be useful (clarifying that these are opaque binary blobs that
> > > will be passed along to extension type factories?)
> > >
> > > In this regard, I think it might be good to provide further
> > > recommendations for the name of extension types: What do you think about
> > > recommending that organizations/projects namespace them according to some
> > > convention, so that there aren't conflicts and extensions can be shared?
> > >
> > > Thanks,
> > > Micah
> > >
> > >
> > >
> > > On Sat, May 18, 2019 at 12:00 PM Wes McKinney <wesmck...@gmail.com>
> > > wrote:
> > >>
> > >>
> > >>
> > >> On Sat, May 18, 2019, 1:58 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > >>>
> > >>> Hi Micah,
> > >>>
> > >>> The use cases I'm aware of are mostly coming from proprietary
> > >>> applications. My idea was for the extension metadata to be as unobtrusive
> > >>> as possible. The only alternative as I see it would be to have an Extension
> > >>> value in the Type union which would be more intrusive to applications
> > >>> handling data for which they have no special handling. That doesn't seem
> > >>> desirable if there are alternatives.
> > >>
> > >>
> > >> The other (3rd) option would be to add an extra member to Field. This
> > >> is also a bit more intrusive than having fields in the custom_metadata
> > >> dictionary.
> > >>
> > >>>
> > >>> As an immediate use case we could use extension types to embed Tensor
> > >>> values in Binary arrays.
> > >>>
> > >>> Wes
> > >>>
> > >>> On Sat, May 18, 2019, 12:19 PM Micah Kornfield <emkornfi...@gmail.com>
> > >>> wrote:
> > >>>>
> > >>>> Hi Wes,
> > >>>> This approach seems reasonable to me. I'm a little concerned we haven't
> > >>>> validated many use-cases against the approach (but I don't see any obvious
> > >>>> flaws).
> > >>>>
> > >>>> Thanks,
> > >>>> Micah
> > >>>>
> > >>>> On Fri, May 17, 2019 at 5:16 AM Wes McKinney <wesmck...@gmail.com> wrote:
> > >>>>
> > >>>> > As Micah brought up, as part of this we would like to formalize the
> > >>>> > use of "ARROW:" as a reserved metadata key prefix. This is similar to
> > >>>> > Apache Avro which uses "avro." as a reserved prefix [1].
> > >>>> > If someone has a different idea about what the prefix should be
> > >>>> > I'm open to other ideas
> > >>>> >
> > >>>> > [1] :
> > >>>> > https://avro.apache.org/docs/1.8.2/spec.html#Object+Container+Files
> > >>>> >
> > >>>> > On Thu, May 16, 2019 at 7:29 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > >>>> > >
> > >>>> > > hi folks,
> > >>>> > >
> > >>>> > > In a prior mailing list thread from February [1] I brought up some
> > >>>> > > work I'd done in C++ to create an API to define custom data types that
> > >>>> > > can be embedded in built-in Arrow logical types. These are serialized
> > >>>> > > through IPC by adding special fields to the `custom_metadata` member
> > >>>> > > of Field in the Flatbuffers metadata [2]. The idea is that if an
> > >>>> > > implementation does not understand the custom type, then they can
> > >>>> > > still interact with the underlying data if need be, or pass on the
> > >>>> > > extension metadata in subsequent IPC messages.
> > >>>> > >
> > >>>> > > David Li has put up a WIP PR to implement this for Java [4], so to
> > >>>> > > help the project move forward I think it's a good time to formalize
> > >>>> > > this, and if there are disagreements to hash them out now. I have just
> > >>>> > > opened a PR to the Arrow specification documents [3] that describes
> > >>>> > > the current state of C++ and also the WIP Java PR.
> > >>>> > >
> > >>>> > > Any thought about this? If there is consensus about this solution
> > >>>> > > approach then I can hold a vote.
> > >>>> > >
> > >>>> > > Thanks
> > >>>> > > Wes
> > >>>> > >
> > >>>> > > [1]:
> > >>>> > > https://lists.apache.org/thread.html/f1fc039471a8a9c06f2f9600296a20d4eb3fda379b23685f809118ee@%3Cdev.arrow.apache.org%3E
> > >>>> > > [2]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L291
> > >>>> > > [3]: https://github.com/apache/arrow/pull/4332
> > >>>> > > [4]: https://github.com/apache/arrow/pull/4251
> > >>>> > >
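
To illustrate the pass-through behavior discussed in this thread, here is a minimal sketch in plain Python. It is not any Arrow implementation's API: the Field class, the registry, and the helper names are hypothetical stand-ins, and the UUID storage type is an assumption made only for the example. Only the two metadata key names come from the thread. It shows a "dumb" reader that copies custom_metadata along untouched, and an aware reader that falls back to the storage type when it does not recognize the extension name.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Key names proposed in the thread.
EXT_NAME_KEY = "ARROW:extension:name"
EXT_META_KEY = "ARROW:extension:metadata"


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field; not a real Arrow class."""
    name: str
    storage_type: str
    custom_metadata: Dict[str, str] = field(default_factory=dict)


def tag_extension(f: Field, ext_name: str, ext_metadata: str) -> Field:
    """Return a copy of f carrying the extension keys in custom_metadata."""
    md = dict(f.custom_metadata)
    md[EXT_NAME_KEY] = ext_name
    md[EXT_META_KEY] = ext_metadata
    return Field(f.name, f.storage_type, md)


def passthrough(f: Field) -> Field:
    """A 'dumb' reader: ignores the keys but preserves custom_metadata."""
    return Field(f.name, f.storage_type, dict(f.custom_metadata))


# Hypothetical registry mapping extension names to type factories.
REGISTRY: Dict[str, Callable[[str], str]] = {}


def resolve(f: Field) -> str:
    """Return the extension type if registered, else the storage type."""
    factory = REGISTRY.get(f.custom_metadata.get(EXT_NAME_KEY))
    if factory is None:
        return f.storage_type  # unknown extension: use the underlying data
    return factory(f.custom_metadata.get(EXT_META_KEY, ""))


# Assumed example: a UUID column stored as 16-byte fixed-size binary.
field_in = tag_extension(Field("id", "fixed_size_binary[16]"), "builtin.uuid", "")
field_out = passthrough(field_in)
```

With an independent Field member instead, passthrough would need to know about and copy that member explicitly, which is the extra obligation on readers described above.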