Thanks, Wes! I am most interested in the last option, adding Tensor as a logical type, but if it makes sense to embed tensors in a BinaryArray as a first step, that would still be useful too. I'll work on a design doc with a use case and report back. I know there are a lot of different efforts going on right now and I hate to pile more on, but I would appreciate any time for feedback and review.

Best Regards,
Bryan

On Mon, Mar 25, 2019 at 2:36 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Bryan,
>
> I agree this would be useful to work out.
>
> There's a few options:
>
> * Sending multiple tensors as a sequence of encapsulated IPC messages
>   (as described in
>   https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst).
>   There is no conflict with the columnar streaming protocol that
>   prevents this
> * Embedding tensors in BinaryArray columns in some way (e.g. as an
>   ExtensionType, which we have now in C++)
> * Adding Tensor as a logical type (this is essentially ARROW-1614)
>
> I would like to understand the use cases more precisely. Perhaps you
> can write a design document that describes the use cases in detail and
> proposed solution? This doesn't fall anywhere on my list of 2019
> priorities but I'm happy to give feedback on discussions and review
> PRs where relevant.
>
> In conjunction with embedding sequences of tensors in a BinaryArray,
> we would probably need to first develop a LargeBinaryArray with 64-bit
> offsets, so that buffers can be arbitrarily large (well, within 64-bit
> address space at least)
>
> - Wes
>
> On Fri, Mar 22, 2019 at 1:24 PM Bryan Cutler <cutl...@gmail.com> wrote:
> >
> > Hi All,
> >
> > Recently I have been working with the TensorFlow SIG-IO community to
> > introduce Apache Arrow based Datasets for bringing Arrow data into
> > TensorFlow. SIG-IO is a community maintained repository focused on
> > input/output support for TF, see https://github.com/tensorflow/io (a lot
> > of formats from contrib/ ended up here). Since it is community driven, if
> > anyone is interested, participation is highly encouraged!
> >
> > I'm bringing this up for a couple reasons. First, I want to make sure
> > that this stays in-line with any related efforts within the Arrow project
> > and welcome any feedback. Secondly, the initial response has been great and
> > people are excited about using Arrow and looking to use it in other areas
> > of TF, but I've noticed there has been some confusion about how Arrow
> > handles tensor data. Specifically, it gets assumed that tensors could be
> > part of a RecordBatch and could be readily used in an Arrow stream.
> >
> > I know we have talked about making tensors a logical type for columnar
> > data before in
> > https://lists.apache.org/thread.html/6cc86d50d92dbd21d6fc34e34485afb3cab4956fbc0d61ff9b99ea27@%3Cdev.arrow.apache.org%3E
> > and there is a JIRA ARROW-1614, but since there is work needed to fully
> > support the current spec for 1.0, I don't think it has moved forward much.
> > I'm wondering if maybe now is a better time to start working on this? I
> > think having built-in support for tensor columns would really help to
> > increase adoption of Arrow in frameworks that use tensor data. What are
> > other people's thoughts?
> >
> > Best Regards,
> > Bryan
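
To make the BinaryArray option in this thread a bit more concrete, here is a rough sketch in plain Python of what "embedding tensors as binary values" and the 32-bit offset limit could look like. It does not use the Arrow library at all. The byte layout (dimension count, then dimensions, then row-major float64 data) and the helper names `encode_tensor`, `decode_tensor`, `build_binary_column`, and `fits_in_int32_offsets` are all made up for illustration and are not part of any Arrow API or format; the only thing it borrows from Arrow is the idea of one data buffer addressed by an offsets buffer.

```python
import struct

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit offset can hold


def encode_tensor(shape, values):
    """Pack one tensor as bytes: ndim, the dims, then row-major float64 data.

    This byte layout is hypothetical; it is not an Arrow format.
    """
    count = 1
    for dim in shape:
        count *= dim
    if count != len(values):
        raise ValueError("shape does not match number of values")
    header = struct.pack("<q", len(shape)) + struct.pack(f"<{len(shape)}q", *shape)
    return header + struct.pack(f"<{len(values)}d", *values)


def decode_tensor(blob):
    """Inverse of encode_tensor: return (shape, values)."""
    (ndim,) = struct.unpack_from("<q", blob, 0)
    shape = struct.unpack_from(f"<{ndim}q", blob, 8)
    count = 1
    for dim in shape:
        count *= dim
    values = struct.unpack_from(f"<{count}d", blob, 8 + 8 * ndim)
    return tuple(shape), list(values)


def build_binary_column(blobs):
    """Concatenate blobs into one data buffer plus an offsets list.

    Element i occupies data[offsets[i]:offsets[i + 1]], which mirrors how a
    variable-length binary column is addressed.
    """
    offsets = [0]
    for blob in blobs:
        offsets.append(offsets[-1] + len(blob))
    return offsets, b"".join(blobs)


def fits_in_int32_offsets(offsets):
    """True if the final offset is representable as a signed 32-bit integer."""
    return offsets[-1] <= INT32_MAX


# Two small tensors of different shapes stored as elements of one column.
tensors = [((2, 3), [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]), ((2,), [7.0, 8.0])]
offsets, data = build_binary_column([encode_tensor(s, v) for s, v in tensors])
second = decode_tensor(data[offsets[1]:offsets[2]])
```

With 32-bit offsets, the data buffer of a single column tops out just under 2 GiB, which is only about 268 million float64 values across all tensors in that column. That is the motivation for the 64-bit-offset variant mentioned above.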