Hi Alenka,

Great work on the RFC, I'm super excited to see this! I was planning to open a similar RFC at some point over the next few weeks, so this just saved me a bunch of work. :D

At the Ray project [1], we've developed two tensor extension types (originally adapted from the tensor extension type in text_extensions_for_pandas [2]) that we've continuously extended: a fixed-shape tensor type [3] and a variable-shaped tensor type [4]. These extension types include both an Arrow side [5] and a Pandas side [6]. We would love to contribute anything upstream that's deemed appropriate for inclusion, to share what we've learned from our users running these extension types in production data processing and AI workloads, and to stay in the loop on this RFC as a stakeholder and dev resource.

One thing that I want to preemptively call out is the importance of zero-copy exchange with tensor libraries in the binding languages (e.g. NumPy ndarrays for Python): ideally, we would hand off the underlying ndarray data buffers directly to Arrow and vice versa whenever possible. Boolean data is the exception and requires a copy, since Arrow bit-packs booleans while NumPy stores one byte per element. This shouldn't impact the underlying extension type spec, just the to/from layer at the bindings level, where I imagine most of the complexity will lie.

Thanks again for pushing on this RFC, and I'll try to make time over the next few days to review the spec, C++ implementation, and Python example!

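To make the copy/no-copy distinction concrete, here is a minimal sketch that uses only the Python standard library. It is an illustration, not Ray's or Arrow's implementation: `array` and `memoryview` stand in for NumPy and Arrow buffers, and `pack_bools_lsb` is a hypothetical helper that mimics Arrow's least-significant-bit-first bitmap layout.

```python
from array import array


def pack_bools_lsb(flags):
    """Pack booleans into bytes, least-significant bit first.

    Hypothetical helper: it mimics Arrow's bit-packed boolean layout
    and always allocates a new buffer.
    """
    packed = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            packed[i // 8] |= 1 << (i % 8)
    return packed


# Numeric data: one float64 buffer holding two 2x3 tensors back to back.
values = array("d", range(12))
view = memoryview(values)   # no copy: the view shares the buffer
tensor_1 = view[6:12]       # the second tensor, still no copy
values[6] = 99.0            # write through the owning buffer...
assert tensor_1[0] == 99.0  # ...and the view observes the change

# Boolean data: a stand-in for a NumPy bool array (one element per flag).
flags = [True, False, True, True, False, False, False, False, True]
packed = pack_bools_lsb(flags)  # new, smaller buffer: a copy is unavoidable
assert len(packed) == 2
```

The numeric path only re-wraps existing memory, while the boolean path has to rewrite every element into a different layout, which is why it cannot be zero-copy.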
Cheers,
Clark

[1] https://github.com/ray-project/ray/tree/master
[2] https://github.com/CODAIT/text-extensions-for-pandas
[3] https://github.com/ray-project/ray/blob/ada5db71db36f672301639a61b5849fd4fd5914e/python/ray/air/util/tensor_extensions/arrow.py#L55-L525
[4] https://github.com/ray-project/ray/blob/ada5db71db36f672301639a61b5849fd4fd5914e/python/ray/air/util/tensor_extensions/arrow.py#L528-L809
[5] https://github.com/ray-project/ray/blob/ada5db71db36f672301639a61b5849fd4fd5914e/python/ray/air/util/tensor_extensions/arrow.py
[6] https://github.com/ray-project/ray/blob/ada5db71db36f672301639a61b5849fd4fd5914e/python/ray/air/util/tensor_extensions/pandas.py

On Thu, Feb 2, 2023 at 8:07 AM Alenka Frim <ale...@voltrondata.com.invalid> wrote:

> Hi all!
>
> There have been quite a lot of discussions connected to the tensor support
> in Arrow Tables/RecordBatches. Issues to add support for a column in an
> Arrow table that has value cells each containing a tensor value, with all
> tensors having the same shape/dimensions [1] and a separate one for varying
> shape [2] are already created in the Arrow repository.
>
> Rok Mihevc, Joris Van den Bossche and I would like to start a discussion
> about the specification for canonicalizing the fixed shape tensor type in
> Arrow:
>
> Fixed shape tensor
> ==================
>
> * Extension name: `arrow.fixed_shape_tensor`.
>
> * The storage type of the extension: ``FixedSizeList`` where:
>
>   * **value_type** is the data type of individual tensors and
>     is an instance of ``pyarrow.DataType`` or ``pyarrow.Field``.
>   * **list_size** is the product of all the elements in tensor shape.
>
> * Extension type parameters:
>
>   * **value_type** = Arrow DataType of the tensor elements
>   * **shape** = shape of the contained tensors as a tuple
>
> * Description of the serialization:
>
>   The metadata must be a valid JSON object including shape of
>   the contained tensors as an array with key "shape".
>
>   For example: `{ "shape": [2, 5]}`
>
> .. note::
>
>   Elements in a fixed shape tensor extension array are stored
>   in row-major/C-contiguous order.
>
> RFC umbrella issue [3] includes:
>
> - Specification for Tensor canonical type extension [4]
> - C++ implementation of the proposed specification [5]
> - Python example implementation of the proposed specification and usage
>   (only illustrative) [6]
>
> Open questions:
>
> - Should metadata include the "dim_names" key to pass dimension names when
>   creating the Arrow FixedShapeTensorArray? Do we standardize how to specify
>   those names and which names to use? Or the names shouldn't be standardized
>   and it would be up to the application to understand them.
>
>   An example for NCHW ordered data [7]: the application could pass
>   "dim_names": ["C", "H", "W"] when creating the Arrow FixedShapeTensorArray.
>
> - Should the implementation of the tensor extension type be in Arrow C++
>   or should it be implemented in the bindings separately?
>
> In the future we would like to canonicalize variable shape tensor type in
> Arrow also.
>
> Kind regards,
> Alenka
>
> [1]: https://github.com/apache/arrow/issues/15483
> [2]: https://github.com/apache/arrow/issues/24868
> [3]: https://github.com/apache/arrow/issues/33924
> [4]: https://github.com/apache/arrow/issues/33923
> [5]: https://github.com/apache/arrow/issues/15483
> [6]: https://github.com/apache/arrow/issues/33947
> [7]: https://machinelearning.wtf/terms/nchw/
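
For reference, the storage layout described in the quoted spec can be sketched in plain Python. This is only an illustration of the proposal's wording (list_size as the product of the shape, JSON metadata with a "shape" key, row-major element order). The function names are hypothetical and the sketch does not use the pyarrow API.

```python
import json
from math import prod


def fixed_shape_tensor_layout(shape):
    """Return (list_size, serialized metadata) for a tensor shape.

    Hypothetical helper: list_size is the FixedSizeList length, and the
    metadata is a JSON object carrying the shape under the "shape" key.
    """
    list_size = prod(shape)
    metadata = json.dumps({"shape": list(shape)})
    return list_size, metadata


def flat_index(index, shape):
    """Row-major (C-contiguous) offset of a multi-dimensional index."""
    offset = 0
    for i, dim in zip(index, shape):
        offset = offset * dim + i
    return offset


# The shape from the spec's example: each cell holds a 2x5 tensor.
list_size, metadata = fixed_shape_tensor_layout((2, 5))
assert list_size == 10

# A reader recovers the shape from the metadata alone.
assert json.loads(metadata)["shape"] == [2, 5]

# Element [1, 3] of a 2x5 tensor sits at offset 1 * 5 + 3 in its list.
assert flat_index((1, 3), (2, 5)) == 8
```

Because the element order is fixed as row-major, the shape in the metadata is all a consumer needs in order to reinterpret each fixed-size list as a tensor.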