[VOTE][Format] Variable shape tensor canonical extension type

Rok Mihevc Wed, 27 Sep 2023 05:44:39 -0700

Hi all,

Following the discussion [1][2], I would like to propose a vote to add
the variable shape tensor canonical extension type language to
CanonicalExtensions.rst [3] as written below.
A draft C++ implementation and a Python wrapper can be seen here [2].


The vote will be open for at least 72 hours.

[ ] +1 Accept this proposal
[ ] +0
[ ] -1 Do not accept this proposal because...


[1] https://lists.apache.org/thread/qc9qho0fg5ph1dns4hjq56hp4tj7rk1k
[2] https://github.com/apache/arrow/pull/37166
[3] https://github.com/apache/arrow/blob/main/docs/source/format/CanonicalExtensions.rst


Variable shape tensor
=====================

* Extension name: `arrow.variable_shape_tensor`.

* The storage type of the extension is: ``StructArray`` where struct
  is composed of **data** and **shape** fields describing a single
  tensor per row:

  * **data** is a ``List`` holding tensor elements of a single tensor.
    Data type of the list elements is uniform across the entire column.
  * **shape** is a ``FixedSizeList<uint32>[ndim]`` of the tensor shape where
    the size of the list ``ndim`` is equal to the number of dimensions of
    the tensor.

* Extension type parameters:

  * **value_type** = the Arrow data type of individual tensor elements.

  Optional parameters describing the logical layout:

  * **dim_names** = explicit names of tensor dimensions
    as an array. The length of it should be equal to the shape
    length and equal to the number of dimensions.

    ``dim_names`` can be used if the dimensions have well-known
    names and they map to the physical layout (row-major).

  * **permutation**  = indices of the desired ordering of the
    original dimensions, defined as an array.

    The indices contain a permutation of the values [0, 1, .., N-1] where
    N is the number of dimensions. The permutation indicates which
    dimension of the logical layout corresponds to which dimension of the
    physical tensor (the i-th dimension of the logical view corresponds
    to the dimension with number ``permutation[i]`` of the physical
    tensor).

    Permutation can be useful in case the logical order of
    the tensor is a permutation of the physical order (row-major).

    When logical and physical layout are equal, the permutation will always
    be ([0, 1, .., N-1]) and can therefore be left out.

  * **uniform_dimensions** = indices of dimensions whose sizes are
    guaranteed to remain constant. Indices are a subset of all possible
    dimension indices ([0, 1, .., N-1]).
    The uniform dimensions must still be represented in the ``shape`` field,
    and must always be the same value for all tensors in the array -- this
    allows code to interpret the tensor correctly without accounting for
    uniform dimensions while still permitting optional optimizations that
    take advantage of the uniformity. ``uniform_dimensions`` can be left
    out, in which case it is assumed that all dimensions might be variable.

  * **uniform_shape** = shape of the dimensions that are guaranteed to stay
    constant over all tensors in the array, with the shape of the ragged
    dimensions set to 0.
    An array containing a tensor with shape (2, 3, 4) and
    ``uniform_dimensions`` (0, 2) would have ``uniform_shape`` (2, 0, 4).

* Description of the serialization:

  The metadata must be a valid JSON object that optionally includes
  dimension names with key **"dim_names"**, ordering of
  dimensions with key **"permutation"**, indices of dimensions whose sizes
  are guaranteed to remain constant with key **"uniform_dimensions"** and
  shape of those dimensions with key **"uniform_shape"**.
  Minimal metadata is an empty JSON object.

  - Example of minimal metadata is:

    ``{}``

  - Example with ``dim_names`` metadata for NCHW ordered data (the N
    dimension is the row axis of the array, so each tensor is CHW):

    ``{ "dim_names": ["C", "H", "W"] }``

  - Example with ``uniform_dimensions`` metadata for a set of color images
    with variable width (height and channels are uniform):

    ``{ "dim_names": ["H", "W", "C"], "uniform_dimensions": [0, 2] }``

  - Example of permuted 3-dimensional tensor:

    ``{ "permutation": [2, 0, 1] }``

    Given an individual tensor with physical shape ``[100, 200, 500]``,
    the shape of the logical layout would be ``[500, 100, 200]``.
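
The serialization rules above can be sketched in plain Python. This is
only an illustration and not part of the proposed specification text: the
helper names ``uniform_shape`` and ``build_metadata`` and their arguments
are hypothetical, and the validation shown is an assumption about what a
producer might reasonably check.

```python
import json


def uniform_shape(shape, uniform_dimensions):
    """Keep the sizes of uniform dimensions; set ragged dimensions to 0."""
    uniform = set(uniform_dimensions)
    return [size if i in uniform else 0 for i, size in enumerate(shape)]


def build_metadata(ndim, dim_names=None, permutation=None,
                   uniform_dimensions=None, sample_shape=None):
    """Hypothetical helper: serialize the optional parameters to JSON."""
    meta = {}
    if dim_names is not None:
        if len(dim_names) != ndim:
            raise ValueError("dim_names needs one entry per dimension")
        meta["dim_names"] = list(dim_names)
    if permutation is not None:
        # Must be a permutation of [0, 1, .., N-1].
        if sorted(permutation) != list(range(ndim)):
            raise ValueError("permutation must reorder [0, .., N-1]")
        meta["permutation"] = list(permutation)
    if uniform_dimensions is not None:
        if not set(uniform_dimensions) <= set(range(ndim)):
            raise ValueError("uniform_dimensions out of range")
        meta["uniform_dimensions"] = sorted(uniform_dimensions)
        if sample_shape is not None:
            meta["uniform_shape"] = uniform_shape(sample_shape,
                                                  uniform_dimensions)
    return json.dumps(meta)


# Minimal metadata is an empty JSON object.
print(build_metadata(3))
# Shape (2, 3, 4) with uniform dimensions (0, 2), as in the text above.
print(uniform_shape([2, 3, 4], [0, 2]))
```

Omitting every optional argument yields ``{}``, which matches the minimal
metadata case.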

.. note::

  With the exception of permutation, all other parameters and storage
  of VariableShapeTensor define the *physical* storage of the tensor.

  For example, consider a tensor with:
    shape = [10, 20, 30]
    dim_names = [x, y, z]
    permutation = [2, 0, 1]

  This means the logical tensor has names [z, x, y] and shape [30, 10, 20].

  Elements in a variable shape tensor extension array are stored
  in row-major/C-contiguous order.
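
The relationship between the physical and logical layouts in the note can
be checked with a small Python sketch. It is an illustration only: the
function names are hypothetical, and strides are counted in elements of
the ``data`` list rather than in bytes.

```python
def logical_view(shape, permutation, dim_names=None):
    """Logical dimension i corresponds to physical dimension permutation[i]."""
    logical_shape = [shape[p] for p in permutation]
    logical_names = None
    if dim_names is not None:
        logical_names = [dim_names[p] for p in permutation]
    return logical_shape, logical_names


def row_major_strides(shape):
    """Element strides of a row-major (C-contiguous) tensor."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides


def physical_offset(logical_index, shape, permutation):
    """Position in the flat data list of the element at logical_index."""
    strides = row_major_strides(shape)
    return sum(idx * strides[permutation[i]]
               for i, idx in enumerate(logical_index))


shape, names = logical_view([10, 20, 30], [2, 0, 1], ["x", "y", "z"])
print(shape, names)  # [30, 10, 20] ['z', 'x', 'y']
```

The data itself is never reordered. Only the index arithmetic changes, so
a reader that ignores ``permutation`` still sees a valid row-major tensor.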



Rok
