Hey all!

Besides the recently added FixedShapeTensor [1] canonical extension type,
there appears to be a need for the already proposed VariableShapeTensor
[2]. VariableShapeTensor would store tensors of variable shape but with a
uniform number of dimensions, dimension names, and dimension permutation.

There are existing examples of such types: Ray implements
ArrowVariableShapedTensorType [3] and PyTorch implements torch.nested [4].

I propose we discuss adding the text below to
format/CanonicalExtensions.rst (rendered as [5]) and a C++/Python
implementation as proposed in [6]. A vote can be called after discussion
here.

Variable shape tensor
=====================

* Extension name: `arrow.variable_shape_tensor`.

* The storage type of the extension is: ``StructArray`` where struct
  is composed of **data** and **shape** fields describing a single
  tensor per row:

  * **data** is a ``List`` holding the elements of a single tensor.
    The data type of the list elements is uniform across the entire
    column and is also provided in metadata.

  * **shape** is a ``FixedSizeList`` of the tensor shape, where the
    size of the list is equal to the number of dimensions of the
    tensor.

* Extension type parameters:

  * **value_type** = the Arrow data type of individual tensor elements.

  * **ndim** = the number of dimensions of the tensor.

  Optional parameters describing the logical layout:

  * **dim_names** = explicit names of the tensor dimensions,
    given as an array. Its length must equal the number of
    dimensions (**ndim**). ``dim_names`` can be used if the
    dimensions have well-known names and they map to the physical
    (row-major) layout.

  * **permutation** = indices of the desired ordering of the
    original dimensions, defined as an array.
    The indices contain a permutation of the values [0, 1, .., N-1],
    where N is the number of dimensions. The permutation indicates
    which dimension of the logical layout corresponds to which
    dimension of the physical tensor (the i-th dimension of the
    logical view corresponds to dimension ``permutation[i]`` of the
    physical tensor).
    Permutation can be useful when the logical order of the tensor
    is a permutation of the physical (row-major) order.
    When the logical and physical layouts are equal, the permutation
    is always [0, 1, .., N-1] and can therefore be left out.

* Description of the serialization:

  The metadata must be a valid JSON object that includes the number
  of dimensions of the contained tensors as an integer with key
  **"ndim"**, plus optional dimension names with key **"dim_names"**
  and the ordering of the dimensions with key **"permutation"**.

  - Example: ``{ "ndim": 2 }``

  - Example with ``dim_names`` metadata for NCHW ordered data:

    ``{ "ndim": 3, "dim_names": ["C", "H", "W"] }``

  - Example of a permuted 3-dimensional tensor:

    ``{ "ndim": 3, "permutation": [2, 0, 1] }``

    Shapes are stored in the physical layout, so an individual tensor
    with physical shape [100, 200, 500] would have the logical shape
    ``[500, 100, 200]``.
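To make the permutation semantics above concrete, here is a minimal sketch in plain Python (the function name is illustrative, not part of the proposal) that derives the logical shape from a physical shape and a permutation:

```python
def logical_shape(physical_shape, permutation):
    # The i-th dimension of the logical view corresponds to dimension
    # permutation[i] of the physical tensor.
    return [physical_shape[i] for i in permutation]

# Physical shape [100, 200, 500] with permutation [2, 0, 1]
# yields the logical shape [500, 100, 200], matching the example above.
print(logical_shape([100, 200, 500], [2, 0, 1]))
```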

.. note::

  Elements in a variable shape tensor extension array are stored
  in row-major/C-contiguous order.


[1] https://github.com/apache/arrow/issues/33924

[2] https://github.com/apache/arrow/issues/24868

[3]
https://github.com/ray-project/ray/blob/ada5db71db36f672301639a61b5849fd4fd5914e/python/ray/air/util/tensor_extensions/arrow.py#L528-L809

[4] https://pytorch.org/docs/stable/nested.html

[5]
https://github.com/apache/arrow/blob/db8d764ac3e47fa22df13b32fa77b3ad53166d58/docs/source/format/CanonicalExtensions.rst#variable-shape-tensor

[6] https://github.com/apache/arrow/pull/37166



Best,

Rok
