hi Micah,

I have just updated my PR per your comments with more examples of
extension types.

https://github.com/apache/arrow/pull/4332

Are there more comments about this? I can start a vote in a couple of
days absent further opinions.

Can someone volunteer to review David's Java PR? I would like to move
this along so we have a chance of having working extension types in
the 0.14 release. A number of people are also interested in bridging
between pandas's ExtensionArray facility (for custom DataFrame column
types [1]) and Arrow's ExtensionType.

Thanks
Wes

[1]: https://pandas.pydata.org/pandas-docs/stable/development/extending.html

On Sat, May 18, 2019 at 6:25 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Hi Wes,
> Like I said I think this approach looks good, I think what I'm looking for is
> a little more documentation/examples on how additional types would be
> handled. I think Tensor would be a good example, we also had questions about
> INET addresses previously, maybe this would be another good illustrative
> example. Providing examples of serialized metadata in the docs would be
> useful (clarifying that these are opaque binary blobs, that will be passed
> along to extension type factories?)
>
> In this regard, I think it might be good to provide further recommendations
> for the name of extension types: What do you think about recommending that
> organizations/projects namespace them according to some convention, so that
> there aren't conflicts and extensions can be shared?
>
> Thanks,
> Micah
>
>
>
> On Sat, May 18, 2019 at 12:00 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>>
>>
>> On Sat, May 18, 2019, 1:58 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>>
>>> Hi Micah,
>>>
>>> The use cases I'm aware of are mostly coming from proprietary applications.
>>> My idea was for the extension metadata to be as unobtrusive as possible.
>>> The only alternative as I see it would be to have an Extension value in the
>>> Type union which would be more intrusive to applications handling data for
>>> which they have no special handling. That doesn't seem desirable if there
>>> are alternatives.
>>
>>
>> The other (3rd) option would be to add an extra member to Field. This is
>> also a bit more intrusive than having fields in the custom_metadata
>> dictionary.
>>
>>>
>>> As an immediate use case we could use extension types to embed Tensor
>>> values in Binary arrays.
>>>
>>> Wes
>>>
>>> On Sat, May 18, 2019, 12:19 PM Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>>>
>>>> Hi Wes,
>>>> This approach seems reasonable to me. I'm a little concerned we haven't
>>>> validated many use-cases against the approach (but I don't see any obvious
>>>> flaws).
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> On Fri, May 17, 2019 at 5:16 AM Wes McKinney <wesmck...@gmail.com> wrote:
>>>>
>>>> > As Micah brought up, as part of this we would like to formalize the
>>>> > use of "ARROW:" as a reserved metadata key prefix. This is similar to
>>>> > Apache Avro which uses "avro." as a reserved prefix [1]. If someone
>>>> > has a different idea about what the prefix should be I'm open to other
>>>> > ideas.
>>>> >
>>>> > [1] : https://avro.apache.org/docs/1.8.2/spec.html#Object+Container+Files
>>>> >
>>>> > On Thu, May 16, 2019 at 7:29 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>>> > >
>>>> > > hi folks,
>>>> > >
>>>> > > In a prior mailing list thread from February [1] I brought up some
>>>> > > work I'd done in C++ to create an API to define custom data types that
>>>> > > can be embedded in built-in Arrow logical types. These are serialized
>>>> > > through IPC by adding special fields to the `custom_metadata` member
>>>> > > of Field in the Flatbuffers metadata [2]. The idea is that if an
>>>> > > implementation does not understand the custom type, then they can
>>>> > > still interact with the underlying data if need be, or pass on the
>>>> > > extension metadata in subsequent IPC messages.
>>>> > >
>>>> > > David Li has put up a WIP PR to implement this for Java [4], so to
>>>> > > help the project move forward I think it's a good time to formalize
>>>> > > this, and if there are disagreements to hash them out now. I have just
>>>> > > opened a PR to the Arrow specification documents [3] that describes
>>>> > > the current state of C++ and also the WIP Java PR.
>>>> > >
>>>> > > Any thoughts about this? If there is consensus about this solution
>>>> > > approach then I can hold a vote.
>>>> > >
>>>> > > Thanks
>>>> > > Wes
>>>> > >
>>>> > > [1]:
>>>> > https://lists.apache.org/thread.html/f1fc039471a8a9c06f2f9600296a20d4eb3fda379b23685f809118ee@%3Cdev.arrow.apache.org%3E
>>>> > > [2]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L291
>>>> > > [3]: https://github.com/apache/arrow/pull/4332
>>>> > > [4]: https://github.com/apache/arrow/pull/4251
>>>> >
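
For readers of the archive, the mechanism discussed in this thread can be
illustrated roughly as follows. This is only a sketch and not the Arrow API:
a Field is modeled as a plain Python dict, the helper names
(make_extension_field, resolve_type) and the "example_org.inet" type name are
hypothetical, and the two metadata keys are assumed to follow the "ARROW:"
prefix convention proposed above. It shows an extension type carried as
entries in custom_metadata, with a reader that does not know the type falling
back to the underlying storage type.

```python
# Hypothetical model of a Field as a dict; this is NOT the Arrow library API.
# Key names are an assumption based on the "ARROW:" reserved prefix.
EXT_NAME_KEY = "ARROW:extension:name"
EXT_META_KEY = "ARROW:extension:metadata"


def make_extension_field(name, storage_type, ext_name, ext_metadata):
    """Attach extension info to a field via its custom_metadata entries."""
    return {
        "name": name,
        "type": storage_type,  # built-in type that actually holds the data
        "custom_metadata": {
            EXT_NAME_KEY: ext_name,
            EXT_META_KEY: ext_metadata,  # opaque bytes, passed to the factory
        },
    }


def resolve_type(field, registry):
    """Return the extension type if a factory is registered for it,
    otherwise fall back to the plain storage type."""
    meta = field.get("custom_metadata", {})
    factory = registry.get(meta.get(EXT_NAME_KEY))
    if factory is None:
        # Unknown or absent extension: the data stays usable as storage type,
        # and custom_metadata is left intact so it can be passed along.
        return field["type"]
    return factory(field["type"], meta.get(EXT_META_KEY, b""))


# A namespaced extension name, as suggested in the thread (hypothetical).
field = make_extension_field("addr", "binary", "example_org.inet", b"v4")

# A reader with no registered factory sees only the storage type.
print(resolve_type(field, {}))

# A reader that registered a factory gets the richer type back.
registry = {"example_org.inet": lambda storage, meta: ("inet", storage, meta)}
print(resolve_type(field, registry))
```

Keeping the extension information in custom_metadata, rather than in the Type
union or as a new Field member, is what lets the first reader above ignore it
without any special handling.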