Hi all,

Thank you Rok for all your valuable work on the Arrow tensors!
I think the proposed spec and implementation are good and I have no
comments on that.

In the PR you mention that "this [ragged dimensions] would be purely
metadata that would help converting arrow <-> jagged/ragged". Are there any
examples available to better understand this metadata and how it would be
used in the conversion you mention?

Thanks!
Alenka

On Wed, Sep 13, 2023 at 2:38 AM Rok Mihevc <rok.mih...@gmail.com> wrote:

> After some discussion on the PR
> [https://github.com/apache/arrow/pull/37166]
> we've altered the proposed type by removing the ndim parameter and
> adding a ragged_dimensions one.
> If there is no further feedback I'd like to call for a vote early next
> week. Proposed language now reads:
>
> Variable shape tensor
> =====================
>
> * Extension name: ``arrow.variable_shape_tensor``.
>
> * The storage type of the extension is: ``StructArray`` where struct
>   is composed of **data** and **shape** fields describing a single
>   tensor per row:
>
>   * **data** is a ``List`` holding tensor elements of a single tensor.
>     Data type of the list elements is uniform across the entire column
>     and also provided in metadata.
>   * **shape** is a ``FixedSizeList<uint32>[ndim]`` of the tensor shape,
>     where the size of the list ``ndim`` is equal to the number of
>     dimensions of the tensor.
>
> * Extension type parameters:
>
>   * **value_type** = the Arrow data type of individual tensor elements.
>
>   Optional parameters describing the logical layout:
>
>   * **dim_names** = explicit names to tensor dimensions
>     as an array. The length of it should be equal to the shape
>     length and equal to the number of dimensions.
>
>     ``dim_names`` can be used if the dimensions have well-known
>     names and they map to the physical layout (row-major).
>
>   * **permutation**  = indices of the desired ordering of the
>     original dimensions, defined as an array.
>
>     The indices contain a permutation of the values [0, 1, .., N-1] where
>     N is the number of dimensions. The permutation indicates which
>     dimension of the logical layout corresponds to which dimension of the
>     physical tensor (the i-th dimension of the logical view corresponds
>     to the dimension with number ``permutations[i]`` of the physical
>     tensor).
>
>     Permutation can be useful in case the logical order of
>     the tensor is a permutation of the physical order (row-major).
>
>     When logical and physical layout are equal, the permutation will always
>     be ([0, 1, .., N-1]) and can therefore be left out.
>
>   * **ragged_dimensions** = indices of ragged dimensions whose sizes may
>     differ. Dimensions where all elements have the same size are called
>     uniform dimensions. Indices are a subset of all possible dimension
>     indices ([0, 1, .., N-1]).
>     Ragged dimensions list can be left out. In that case all dimensions
>     are assumed ragged.
>
> * Description of the serialization:
>
>   The metadata must be a valid JSON object including number of
>   dimensions of the contained tensors as an integer with key **"ndim"**
>   plus optional dimension names with keys **"dim_names"** and ordering of
>   the dimensions with key **"permutation"**.
>
>   - Example with ``dim_names`` metadata for NCHW ordered data:
>
>     ``{ "dim_names": ["C", "H", "W"] }``
>
>   - Example with ``ragged_dimensions`` metadata for a set of color images
>     with variable width:
>
>     ``{ "dim_names": ["H", "W", "C"], "ragged_dimensions": [1] }``
>
>   - Example of permuted 3-dimensional tensor:
>
>     ``{ "permutation": [2, 0, 1] }``
>
>     For an individual tensor whose physical layout has shape
>     ``[100, 200, 500]``, the shape of the logical layout would be
>     ``[500, 100, 200]``.
>
> .. note::
>
>   Elements in a variable shape tensor extension array are stored
>   in row-major/C-contiguous order.
>
>
> Rok
>
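
A minimal sketch of how the quoted proposal could be read, in plain Python
rather than Arrow itself. The helper names ``logical_shape`` and
``uniform_sizes`` and the sample rows are hypothetical and only illustrate
the ``permutation`` and ``ragged_dimensions`` parameters. This is one
possible interpretation of the metadata, not the implementation in the PR.

```python
import json


def logical_shape(physical_shape, permutation=None):
    """Shape of the logical view of one tensor.

    Per the quoted text, the i-th logical dimension corresponds to the
    physical dimension with number permutation[i].
    """
    if permutation is None:
        # No permutation given: logical and physical layouts are equal.
        return list(physical_shape)
    if sorted(permutation) != list(range(len(physical_shape))):
        raise ValueError("permutation must contain each of 0..N-1 once")
    return [physical_shape[p] for p in permutation]


def uniform_sizes(shapes, ragged_dimensions=None):
    """Map each uniform dimension index to its common size.

    If ragged_dimensions is left out, all dimensions are assumed ragged,
    so nothing is reported as uniform.
    """
    ndim = len(shapes[0])
    if ragged_dimensions is None:
        ragged = set(range(ndim))
    else:
        ragged = set(ragged_dimensions)
    uniform = {}
    for dim in range(ndim):
        if dim in ragged:
            continue
        sizes = {shape[dim] for shape in shapes}
        if len(sizes) != 1:
            raise ValueError(f"dimension {dim} is declared uniform but varies")
        uniform[dim] = sizes.pop()
    return uniform


# Hypothetical rows mirroring the proposed storage: one struct per tensor
# with flat row-major "data" and a per-row "shape".
rows = [
    {"data": list(range(2 * 3 * 3)), "shape": [2, 3, 3]},
    {"data": list(range(2 * 5 * 3)), "shape": [2, 5, 3]},
]

# Metadata taken from the variable-width color image example in the quote.
metadata = json.loads(
    '{ "dim_names": ["H", "W", "C"], "ragged_dimensions": [1] }'
)

shapes = [row["shape"] for row in rows]
uniform = uniform_sizes(shapes, metadata.get("ragged_dimensions"))
permuted = logical_shape([100, 200, 500], [2, 0, 1])

print(uniform)   # sizes of the uniform dimensions H and C
print(permuted)  # logical shape for the permutation example
```

Under this reading, a consumer converting to a ragged/jagged representation
would only need per-row sizes for the dimensions listed in
``ragged_dimensions`` and could treat the remaining ones as fixed.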
