Hi all!

There has been quite a lot of discussion about tensor support in Arrow Tables/RecordBatches. Issues have already been created in the Arrow repository to add support for a column whose cells each contain a tensor value: one for tensors that all have the same shape/dimensions [1] and a separate one for tensors of varying shape [2].

Rok Mihevc, Joris Van den Bossche and I would like to start a discussion about the specification for canonicalizing the fixed shape tensor type in Arrow:

Fixed shape tensor
==================

* Extension name: `arrow.fixed_shape_tensor`.

* The storage type of the extension: ``FixedSizeList`` where:

  * **value_type** is the data type of individual tensor elements and is an instance of ``pyarrow.DataType`` or ``pyarrow.Field``.
  * **list_size** is the product of all the elements in tensor shape.

* Extension type parameters:

  * **value_type** = Arrow DataType of the tensor elements
  * **shape** = shape of the contained tensors as a tuple

* Description of the serialization:

  The metadata must be a valid JSON object including shape of the contained tensors as an array with key "shape".

  For example: `{ "shape": [2, 5]}`

.. note::

  Elements in a fixed shape tensor extension array are stored in row-major/C-contiguous order.

RFC umbrella issue [3] includes:

- Specification for Tensor canonical type extension [4]
- C++ implementation of the proposed specification [5]
- Python example implementation of the proposed specification and usage (only illustrative) [6]

Open questions:

- Should metadata include the "dim_names" key to pass dimension names when creating the Arrow FixedShapeTensorArray? Do we standardize how to specify those names and which names to use? Or should the names not be standardized, leaving it up to the application to interpret them? An example for NCHW ordered data [7]: the application could pass "dim_names": ["C", "H", "W"] when creating the Arrow FixedShapeTensorArray.
- Should the implementation of the tensor extension type be in Arrow C++, or should it be implemented in the bindings separately?

In the future we would also like to canonicalize the variable shape tensor type in Arrow.

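To make the proposed layout a bit more concrete, here is a minimal sketch in plain Python (standard library only, no pyarrow). The helper names `serialize_metadata`, `list_size` and `flatten_row_major` are hypothetical and are not part of the proposal or of any Arrow API; they only illustrate how the "shape" metadata, the ``FixedSizeList`` list size and the row-major element order relate to each other.

```python
import json
import math


def serialize_metadata(shape):
    """Hypothetical helper: JSON object with the tensor shape under key "shape"."""
    return json.dumps({"shape": list(shape)})


def list_size(shape):
    """Hypothetical helper: list_size is the product of all elements in the shape."""
    return math.prod(shape)


def flatten_row_major(tensor):
    """Hypothetical helper: flatten a nested list in row-major (C-contiguous) order."""
    if not isinstance(tensor, list):
        return [tensor]
    flat = []
    for item in tensor:
        flat.extend(flatten_row_major(item))
    return flat


# One tensor of shape (2, 5), as it could appear in a single cell of the column.
shape = (2, 5)
tensor = [[0, 1, 2, 3, 4],
          [5, 6, 7, 8, 9]]

metadata = serialize_metadata(shape)        # serialized extension metadata
storage_values = flatten_row_major(tensor)  # values of one FixedSizeList slot

print(metadata)
print(list_size(shape), storage_values)
```

Each cell of the column would then hold `list_size(shape)` values in the storage array, and a reader could recover the tensor by reshaping those values with the shape taken from the metadata.
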
Kind regards,
Alenka

[1]: https://github.com/apache/arrow/issues/15483
[2]: https://github.com/apache/arrow/issues/24868
[3]: https://github.com/apache/arrow/issues/33924
[4]: https://github.com/apache/arrow/issues/33923
[5]: https://github.com/apache/arrow/issues/15483
[6]: https://github.com/apache/arrow/issues/33947
[7]: https://machinelearning.wtf/terms/nchw/