Re: [VOTE][Format] Variable shape tensor canonical extension type

Dewey Dunnington Fri, 29 Sep 2023 08:04:39 -0700

+1! Thank you for iterating on this with all of us!


On Fri, Sep 29, 2023 at 11:28 AM Alenka Frim
<ale...@voltrondata.com.invalid> wrote:
>
> +1
> Thanks for pushing this through!
>
> On Wed, Sep 27, 2023 at 2:44 PM Rok Mihevc <rok.mih...@gmail.com> wrote:
>
> > Hi all,
> >
> > Following the discussion [1][2] I would like to propose a vote to add
> > variable shape tensor canonical extension type language to
> > CanonicalExtensions.rst [3] as written below.
> > A draft C++ implementation and a Python wrapper can be seen here [2].
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Accept this proposal
> > [ ] +0
> > [ ] -1 Do not accept this proposal because...
> >
> >
> > [1] https://lists.apache.org/thread/qc9qho0fg5ph1dns4hjq56hp4tj7rk1k
> > [2] https://github.com/apache/arrow/pull/37166
> > [3] https://github.com/apache/arrow/blob/main/docs/source/format/CanonicalExtensions.rst
> >
> >
> > Variable shape tensor
> > =====================
> >
> > * Extension name: `arrow.variable_shape_tensor`.
> >
> > * The storage type of the extension is: ``StructArray`` where struct
> >   is composed of **data** and **shape** fields describing a single
> >   tensor per row:
> >
> >   * **data** is a ``List`` holding tensor elements of a single tensor.
> >     Data type of the list elements is uniform across the entire column.
> >   * **shape** is a ``FixedSizeList<uint32>[ndim]`` of the tensor shape where
> >     the size of the list ``ndim`` is equal to the number of dimensions of the
> >     tensor.
> >
> > * Extension type parameters:
> >
> >   * **value_type** = the Arrow data type of individual tensor elements.
> >
> >   Optional parameters describing the logical layout:
> >
> >   * **dim_names** = explicit names of tensor dimensions
> >     as an array. The length of it should be equal to the shape
> >     length and equal to the number of dimensions.
> >
> >     ``dim_names`` can be used if the dimensions have well-known
> >     names and they map to the physical layout (row-major).
> >
> >   * **permutation**  = indices of the desired ordering of the
> >     original dimensions, defined as an array.
> >
> >     The indices contain a permutation of the values [0, 1, .., N-1] where
> >     N is the number of dimensions. The permutation indicates which
> >     dimension of the logical layout corresponds to which dimension of the
> >     physical tensor (the i-th dimension of the logical view corresponds
> >     to the dimension with number ``permutation[i]`` of the physical tensor).
> >
> >     Permutation can be useful in case the logical order of
> >     the tensor is a permutation of the physical order (row-major).
> >
> >     When logical and physical layout are equal, the permutation will always
> >     be ([0, 1, .., N-1]) and can therefore be left out.
> >
> >   * **uniform_dimensions** = indices of dimensions whose sizes are
> >     guaranteed to remain constant. Indices are a subset of all possible
> >     dimension indices ([0, 1, .., N-1]).
> >     The uniform dimensions must still be represented in the ``shape`` field,
> >     and must always be the same value for all tensors in the array -- this
> >     allows code to interpret the tensor correctly without accounting for
> >     uniform dimensions while still permitting optional optimizations that
> >     take advantage of the uniformity. ``uniform_dimensions`` can be left out,
> >     in which case it is assumed that all dimensions might be variable.
> >
> >   * **uniform_shape** = shape of the dimensions that are guaranteed to stay
> >     constant over all tensors in the array, with the shape of the ragged
> >     dimensions set to 0.
> >     An array containing a tensor with shape (2, 3, 4) and
> >     ``uniform_dimensions`` (0, 2) would have ``uniform_shape`` (2, 0, 4).
> >
> > * Description of the serialization:
> >
> >   The metadata must be a valid JSON object that optionally includes
> >   dimension names with key **"dim_names"**, ordering of
> >   dimensions with key **"permutation"**, indices of dimensions whose sizes
> >   are guaranteed to remain constant with key **"uniform_dimensions"** and
> >   shape of those dimensions with key **"uniform_shape"**.
> >   Minimal metadata is an empty JSON object.
> >
> >   - Example of minimal metadata is:
> >
> >     ``{}``
> >
> >   - Example with ``dim_names`` metadata for NCHW ordered data (N is the
> >     row index of the array, so each tensor is CHW):
> >
> >     ``{ "dim_names": ["C", "H", "W"] }``
> >
> >   - Example with ``uniform_dimensions`` metadata for a set of color images
> >     with variable width:
> >
> >     ``{ "dim_names": ["H", "W", "C"], "uniform_dimensions": [0, 2] }``
> >
> >   - Example of permuted 3-dimensional tensor:
> >
> >     ``{ "permutation": [2, 0, 1] }``
> >
> >     The permutation is applied to the physical layout: an individual
> >     tensor with physical shape ``[100, 200, 500]`` has the logical
> >     shape ``[500, 100, 200]``.
> >
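
The serialization rules quoted above can be sketched in plain Python. This is only an illustration using the standard `json` module; the helper names `uniform_shape_from` and `validate_metadata` are hypothetical and not part of any Arrow API, and the check that `uniform_shape` has one entry per dimension is an assumption drawn from the (2, 0, 4) example.

```python
import json


def uniform_shape_from(shape, uniform_dimensions):
    """Keep sizes of uniform dimensions; set ragged dimensions to 0."""
    return [size if i in uniform_dimensions else 0 for i, size in enumerate(shape)]


def validate_metadata(serialized, ndim):
    """Parse metadata and check it against the quoted rules (hypothetical helper)."""
    meta = json.loads(serialized)
    if not isinstance(meta, dict):
        raise ValueError("metadata must be a JSON object")

    dim_names = meta.get("dim_names")
    if dim_names is not None and len(dim_names) != ndim:
        raise ValueError("dim_names must have one entry per dimension")

    permutation = meta.get("permutation")
    if permutation is not None and sorted(permutation) != list(range(ndim)):
        raise ValueError("permutation must be a permutation of [0, 1, .., N-1]")

    uniform_dimensions = meta.get("uniform_dimensions")
    if uniform_dimensions is not None:
        if not set(uniform_dimensions) <= set(range(ndim)):
            raise ValueError("uniform_dimensions must be a subset of [0, 1, .., N-1]")

    # Assumption: uniform_shape carries one entry per dimension.
    uniform_shape = meta.get("uniform_shape")
    if uniform_shape is not None and len(uniform_shape) != ndim:
        raise ValueError("uniform_shape must have one entry per dimension")

    return meta


print(uniform_shape_from([2, 3, 4], [0, 2]))  # [2, 0, 4]
print(validate_metadata("{}", 3))  # {}
```

An empty JSON object passes, matching the statement that minimal metadata is `{}`.
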
> > .. note::
> >
> >   With the exception of ``permutation``, all other parameters and the
> >   storage of VariableShapeTensor define the *physical* storage of the tensor.
> >
> >   For example, consider a tensor with:
> >     shape = [10, 20, 30]
> >     dim_names = [x, y, z]
> >     permutation = [2, 0, 1]
> >
> >   This means the logical tensor has names [z, x, y] and shape [30, 10, 20].
> >
> >   Elements in a variable shape tensor extension array are stored
> >   in row-major/C-contiguous order.
> >
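
The example in the note can be reproduced with a few lines of Python. This is a sketch of the rule that the i-th logical dimension is physical dimension `permutation[i]`; the function name `logical_view` is hypothetical.

```python
def logical_view(shape, permutation, dim_names=None):
    """Return the logical shape and names for a physical shape and permutation."""
    # The i-th logical dimension is the physical dimension permutation[i].
    logical_shape = [shape[p] for p in permutation]
    logical_names = None
    if dim_names is not None:
        logical_names = [dim_names[p] for p in permutation]
    return logical_shape, logical_names


shape, names = logical_view([10, 20, 30], [2, 0, 1], ["x", "y", "z"])
print(shape, names)  # [30, 10, 20] ['z', 'x', 'y']
```

With the identity permutation the logical and physical views coincide, which is why the parameter can be omitted in that case.
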
> >
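
To make the storage description concrete, here is a rough model of two rows using plain Python dicts in place of the ``StructArray``. It assumes each row's **data** list holds exactly the product of its **shape** entries, laid out in row-major order; `row_major_offset` and `get_element` are hypothetical helpers, not Arrow functions.

```python
from math import prod


def row_major_offset(index, shape):
    """Flat offset of a multi-dimensional index in row-major (C) order."""
    if len(index) != len(shape):
        raise IndexError("index must have one entry per dimension")
    offset = 0
    for i, size in zip(index, shape):
        if not 0 <= i < size:
            raise IndexError("index out of bounds")
        offset = offset * size + i
    return offset


def get_element(row, index):
    """Look up one tensor element in a row modelled as {'data': ..., 'shape': ...}."""
    shape = row["shape"]
    # Assumption: the flat data length matches the product of the shape.
    if len(row["data"]) != prod(shape):
        raise ValueError("data length does not match shape")
    return row["data"][row_major_offset(index, shape)]


# Two tensors with the same number of dimensions but different shapes.
rows = [
    {"data": [1, 2, 3, 4, 5, 6], "shape": [2, 3]},
    {"data": [7, 8, 9, 10], "shape": [1, 4]},
]

print(get_element(rows[0], (1, 2)))  # 6
print(get_element(rows[1], (0, 3)))  # 10
```
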
> >
> > Rok
> >
