I'm good with this. The consistent separator is a good improvement. On Thu, Jun 6, 2019 at 1:06 PM Wes McKinney <wesmck...@gmail.com> wrote:
> hey Jacques, > > On Thu, Jun 6, 2019 at 12:53 PM Jacques Nadeau <jacq...@apache.org> wrote: > > > > Thanks for pushing this along. I think it is important. Sorry I'm coming > > late to the conversation. Couple thoughts: > > > > - Should we reconsider having this be an independent optional field as > > opposed to overloading customer_metadata? It avoids having the weird > string > > prefixing behavior > > This is one option that we've discussed. The downside of this is that > it becomes another piece of metadata that Arrow implementations need > to mind when they are passing through IPC messages. The idea is that > "dumb" readers can simply ignore the metadata but pass it along in a > subsequent message. For example, suppose a simplistic data > service/microservice that evaluates a filter against record batches > coming through. There might be columns with extension types that come > through that the service does not recognize. > > In some implementations the custom_metadata member is preserved in > schemas and survives IPC round trips, but this is a feature that IMHO > should be implemented consistently in all Arrow implementations. For > example, I believe that Java drops the custom_metadata as soon as the > IPC protocol is parsed. > > Admittedly, this is not a huge issue, so if you had an extra member of > Field like > > table ExtensionType { > name: string > metadata: string > } > > ... > > table Field { > ... > custom_type : ExtensionType > } > > then that would work, too. It's more obtrusive to implementations as > readers that do not recognize a type should still mind this metadata > and pass it along in subsequent messages. If we embed in > custom_metadata then this happens automatically (assuming that > custom_metadata is preserved...) > > > - I'd be inclined to be much more stringent about type naming. Maybe even > > make the name multiple parts to force the issue? > > I just updated my PR https://github.com/apache/arrow/pull/4332 to say > also that colon ":" is the designated namespace separator and I've > made the metadata keys > > ARROW:extension:name > ARROW:extension:metadata > > As far as the actual type name, since it's application-defined, it > might be better to leave this up to the developer-user. If we defined > any "built-in extension types" (things like UUID come to mind) we > might want to have a pseudo-namespace like "builtin.uuid", > "builtin.ipv6", etc. for these > > Let me know what you think -- it would be great to start a vote on this > soon. > > Thanks > Wes > > > > > On Mon, Jun 3, 2019 at 12:08 PM Wes McKinney <wesmck...@gmail.com> > wrote: > > > > > hi Micah, > > > > > > I have just updated my PR per your comments with more examples of > > > extension types. > > > > > > https://github.com/apache/arrow/pull/4332 > > > > > > Are there more comments about this? I can start a vote in a couple of > > > days absent further opinions. > > > > > > Can someone volunteer to review David's Java PR? I would like to move > > > this along so we have a chance of having working extension types in > > > the 0.14 release. A number of people are also interested in bridging > > > between pandas's ExtensionArray facility (for custom DataFrame column > > > types [1]) and Arrow's ExtensionType > > > > > > Thanks > > > Wes > > > > > > [1]: > > > > https://pandas.pydata.org/pandas-docs/stable/development/extending.html > > > > > > On Sat, May 18, 2019 at 6:25 PM Micah Kornfield <emkornfi...@gmail.com > > > > > wrote: > > > > > > > > Hi Wes, > > > > Like I said I think this approach looks good, I think what I'm > looking > > > for is a little more documentation/examples on how additional types > would > > > be handled. I think Tensor would be a good example, we also had > questions > > > about INET addresses previously, maybe this would be a another good > > > illustrative example. Providing examples of serialized metadata in the > > > docs would be useful (clarifying that these are opaque binary blobs, > that > > > will be passed along to extension type factories?) > > > > > > > > In this regard, I think it might be good to provide a further > > > recommendations for the name of extension types: What do you think > about > > > recommend organization/projects namespace them to according to some > > > convention, so that there aren't conflicts and extensions can be > shared? > > > > > > > > Thanks, > > > > Micah > > > > > > > > > > > > > > > > On Sat, May 18, 2019 at 12:00 PM Wes McKinney <wesmck...@gmail.com> > > > wrote: > > > >> > > > >> > > > >> > > > >> On Sat, May 18, 2019, 1:58 PM Wes McKinney <wesmck...@gmail.com> > wrote: > > > >>> > > > >>> Hi Micah, > > > >>> > > > >>> The use cases I'm aware of are mostly coming from proprietary > > > applications. My idea was for the extension metadata to be as > unobtrusive > > > as possible. The only alternative as I see it would be to have an > Extension > > > value in the Type union which would be more intrusive to applications > > > handling data for which they have no special handling. That doesn't > seem > > > desirable if there are alternatives. > > > >> > > > >> > > > >> The other (3rd) option would be to add an extra member to Field. > This > > > is also a bit more intrusive than having fields in the custom_metadata > > > dictionary. > > > >> > > > >>> > > > >>> As an immediate use case we could use extension types to embed > Tensor > > > values in Binary arrays. > > > >>> > > > >>> Wes > > > >>> > > > >>> On Sat, May 18, 2019, 12:19 PM Micah Kornfield < > emkornfi...@gmail.com> > > > wrote: > > > >>>> > > > >>>> Hi Wes, > > > >>>> This approach seems reasonable to me. I'm a little concerned we > > > haven't > > > >>>> validated many use-cases against the approach (but I don't see any > > > obvious > > > >>>> flaws). > > > >>>> > > > >>>> Thanks, > > > >>>> Micah > > > >>>> > > > >>>> On Fri, May 17, 2019 at 5:16 AM Wes McKinney <wesmck...@gmail.com > > > > > wrote: > > > >>>> > > > >>>> > As Micah brought up, as part of this we would like to formalize > the > > > >>>> > use of "ARROW:" as a reserved metadata key prefix. This is > similar > > > to > > > >>>> > Apache Avro which uses "avro." as a reserved prefix [1]. If > someone > > > >>>> > has a different idea about what the prefix should be I'm open to > > > other > > > >>>> > ideas > > > >>>> > > > > >>>> > [1] : > > > https://avro.apache.org/docs/1.8.2/spec.html#Object+Container+Files > > > >>>> > > > > >>>> > On Thu, May 16, 2019 at 7:29 PM Wes McKinney < > wesmck...@gmail.com> > > > wrote: > > > >>>> > > > > > >>>> > > hi folks, > > > >>>> > > > > > >>>> > > In a prior mailing list thread from February [1] I brought up > some > > > >>>> > > work I'd done in C++ to create an API to define custom data > types > > > that > > > >>>> > > can be embedded in built-in Arrow logical types. These are > > > serialized > > > >>>> > > through IPC by adding special fields to the `custom_metadata` > > > member > > > >>>> > > of Field in the Flatbuffers metadata [2]. The idea is that if > an > > > >>>> > > implementation does not understand the custom type, then they > can > > > >>>> > > still interact with the underlying data if need be, or pass > on the > > > >>>> > > extension metadata in subsequent IPC messages. > > > >>>> > > > > > >>>> > > David Li has put up a WIP PR to implement this for Java [4], > so to > > > >>>> > > help the project move forward I think it's a good time to > > > formalize > > > >>>> > > this, and if there are disagreements to hash them out now. I > have > > > just > > > >>>> > > opened a PR to the Arrow specification documents [3] that > > > describes > > > >>>> > > the current state of C++ and also the WIP Java PR. > > > >>>> > > > > > >>>> > > Any thought about this? If there is consensus about this > solution > > > >>>> > > approach then I can hold a vote. > > > >>>> > > > > > >>>> > > Thanks > > > >>>> > > Wes > > > >>>> > > > > > >>>> > > [1]: > > > >>>> > > > > > https://lists.apache.org/thread.html/f1fc039471a8a9c06f2f9600296a20d4eb3fda379b23685f809118ee@%3Cdev.arrow.apache.org%3E > > > >>>> > > [2]: > > > https://github.com/apache/arrow/blob/master/format/Schema.fbs#L291 > > > >>>> > > [3]: https://github.com/apache/arrow/pull/4332 > > > >>>> > > [4]: https://github.com/apache/arrow/pull/4251 > > > >>>> > > > > >