Worth noting that there were some minor changes made to the spec while the vote was active:
- The "uniform_dimensions" metadata key was removed, since this can also be
  inferred from the "uniform_shape" information
- The shape of non-constant dimensions in the "uniform_shape" entry is now
  represented by a "null" instead of "0"

(this is all about optional metadata)

Joris

On Fri, 6 Oct 2023 at 13:07, Rok Mihevc <rok.mih...@gmail.com> wrote:
>
> Hey All,
>
> We have 4 binding +1 votes, no non-binding +1 votes, and no -1 votes, so
> the vote passes.
>
> Thanks everyone for your work and participation on this!
>
> As a follow up we will:
> [ ] merge changes to the format (https://github.com/apache/arrow/pull/37166/files)
> [ ] merge C++ and Python implementation (https://github.com/apache/arrow/pull/38008)
>
>
> Rok
>
> On Mon, Oct 2, 2023 at 4:25 PM Rok Mihevc <rok.mih...@gmail.com> wrote:
>
> > +1
> > Thanks everyone for voting!
> >
> > I'd like to leave the vote open until Wednesday,
> >
> > Rok
> >
> > On Fri, Sep 29, 2023 at 8:58 PM Matt Topol <zotthewiz...@gmail.com> wrote:
> >
> >> +1
> >>
> >> Thanks for all the work here!
> >>
> >> On Fri, Sep 29, 2023 at 11:04 AM Dewey Dunnington
> >> <de...@voltrondata.com.invalid> wrote:
> >>
> >> > +1! Thank you for iterating on this with all of us!
> >> >
> >> > On Fri, Sep 29, 2023 at 11:28 AM Alenka Frim
> >> > <ale...@voltrondata.com.invalid> wrote:
> >> >
> >> > > +1
> >> > > Thanks for pushing this through!
> >> > >
> >> > > On Wed, Sep 27, 2023 at 2:44 PM Rok Mihevc <rok.mih...@gmail.com> wrote:
> >> > >
> >> > > > Hi all,
> >> > > >
> >> > > > Following the discussion [1][2] I would like to propose a vote to add
> >> > > > variable shape tensor canonical extension type language to
> >> > > > CanonicalExtensions.rst [3] as written below.
> >> > > > A draft C++ implementation and a Python wrapper can be seen here [2].
> >> > > >
> >> > > > The vote will be open for at least 72 hours.
> >> > > >
> >> > > > [ ] +1 Accept this proposal
> >> > > > [ ] +0
> >> > > > [ ] -1 Do not accept this proposal because...
> >> > > >
> >> > > >
> >> > > > [1] https://lists.apache.org/thread/qc9qho0fg5ph1dns4hjq56hp4tj7rk1k
> >> > > > [2] https://github.com/apache/arrow/pull/37166
> >> > > > [3] https://github.com/apache/arrow/blob/main/docs/source/format/CanonicalExtensions.rst
> >> > > >
> >> > > >
> >> > > > Variable shape tensor
> >> > > > =====================
> >> > > >
> >> > > > * Extension name: `arrow.variable_shape_tensor`.
> >> > > >
> >> > > > * The storage type of the extension is: ``StructArray`` where struct
> >> > > >   is composed of **data** and **shape** fields describing a single
> >> > > >   tensor per row:
> >> > > >
> >> > > >   * **data** is a ``List`` holding tensor elements of a single tensor.
> >> > > >     Data type of the list elements is uniform across the entire column.
> >> > > >   * **shape** is a ``FixedSizeList<uint32>[ndim]`` of the tensor shape
> >> > > >     where the size of the list ``ndim`` is equal to the number of
> >> > > >     dimensions of the tensor.
> >> > > >
> >> > > > * Extension type parameters:
> >> > > >
> >> > > >   * **value_type** = the Arrow data type of individual tensor elements.
> >> > > >
> >> > > >   Optional parameters describing the logical layout:
> >> > > >
> >> > > >   * **dim_names** = explicit names of tensor dimensions
> >> > > >     as an array. The length of it should be equal to the shape
> >> > > >     length and equal to the number of dimensions.
> >> > > >
> >> > > >     ``dim_names`` can be used if the dimensions have well-known
> >> > > >     names and they map to the physical layout (row-major).
> >> > > >
> >> > > >   * **permutation** = indices of the desired ordering of the
> >> > > >     original dimensions, defined as an array.
> >> > > >
> >> > > >     The indices contain a permutation of the values [0, 1, .., N-1] where
> >> > > >     N is the number of dimensions. The permutation indicates which
> >> > > >     dimension of the logical layout corresponds to which dimension of the
> >> > > >     physical tensor (the i-th dimension of the logical view corresponds
> >> > > >     to the dimension with number ``permutations[i]`` of the physical
> >> > > >     tensor).
> >> > > >
> >> > > >     Permutation can be useful in case the logical order of
> >> > > >     the tensor is a permutation of the physical order (row-major).
> >> > > >
> >> > > >     When logical and physical layout are equal, the permutation will always
> >> > > >     be ([0, 1, .., N-1]) and can therefore be left out.
> >> > > >
> >> > > >   * **uniform_dimensions** = indices of dimensions whose sizes are
> >> > > >     guaranteed to remain constant. Indices are a subset of all possible
> >> > > >     dimension indices ([0, 1, .., N-1]).
> >> > > >     The uniform dimensions must still be represented in the ``shape``
> >> > > >     field, and must always be the same value for all tensors in the
> >> > > >     array -- this allows code to interpret the tensor correctly without
> >> > > >     accounting for uniform dimensions while still permitting optional
> >> > > >     optimizations that take advantage of the uniformity.
> >> > > >     ``uniform_dimensions`` can be left out, in which case it is assumed
> >> > > >     that all dimensions might be variable.
> >> > > >
> >> > > >   * **uniform_shape** = shape of the dimensions that are guaranteed to stay
> >> > > >     constant over all tensors in the array, with the shape of the ragged
> >> > > >     dimensions set to 0.
> >> > > >     An array containing a tensor with shape (2, 3, 4) and
> >> > > >     ``uniform_dimensions`` (0, 2) would have ``uniform_shape`` (2, 0, 4).
> >> > > >
> >> > > > * Description of the serialization:
> >> > > >
> >> > > >   The metadata must be a valid JSON object, that optionally includes
> >> > > >   dimension names with keys **"dim_names"**, ordering of
> >> > > >   dimensions with key **"permutation"**, indices of dimensions whose sizes
> >> > > >   are guaranteed to remain constant with key **"uniform_dimensions"** and
> >> > > >   shape of those dimensions with key **"uniform_shape"**.
> >> > > >   Minimal metadata is an empty JSON object.
> >> > > >
> >> > > >   - Example of minimal metadata is:
> >> > > >
> >> > > >     ``{}``
> >> > > >
> >> > > >   - Example with ``dim_names`` metadata for NCHW ordered data:
> >> > > >
> >> > > >     ``{ "dim_names": ["C", "H", "W"] }``
> >> > > >
> >> > > >   - Example with ``uniform_dimensions`` metadata for a set of color images
> >> > > >     with variable width:
> >> > > >
> >> > > >     ``{ "dim_names": ["H", "W", "C"], "uniform_dimensions": [1] }``
> >> > > >
> >> > > >   - Example of permuted 3-dimensional tensor:
> >> > > >
> >> > > >     ``{ "permutation": [2, 0, 1] }``
> >> > > >
> >> > > >     This is the physical layout shape and the shape of the logical
> >> > > >     layout given an individual tensor of shape [100, 200, 500] would
> >> > > >     be ``[500, 100, 200]``.
> >> > > >
> >> > > > .. note::
> >> > > >
> >> > > >   With the exception of permutation all other parameters and storage
> >> > > >   of VariableShapeTensor define the *physical* storage of the tensor.
> >> > > >
> >> > > >   For example, consider a tensor with:
> >> > > >     shape = [10, 20, 30]
> >> > > >     dim_names = [x, y, z]
> >> > > >     permutations = [2, 0, 1]
> >> > > >
> >> > > >   This means the logical tensor has names [z, x, y] and shape [30, 10, 20].
> >> > > >
> >> > > >   Elements in a variable shape tensor extension array are stored
> >> > > >   in row-major/C-contiguous order.
> >> > > >
> >> > > >
> >> > > >
> >> > > > Rok
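
For anyone reading this thread in the archive, here is a rough Python sketch of how the optional metadata could be interpreted after the two changes noted at the top (no "uniform_dimensions" key, "null" for ragged entries in "uniform_shape"). It only uses the standard-library `json` module. The helper names `uniform_dimensions` and `logical_view` are hypothetical, chosen for illustration, and are not part of the Arrow C++ or Python API; the metadata string is likewise an invented example.

```python
import json


def uniform_dimensions(uniform_shape):
    """Infer which dimensions are uniform: the non-null entries of uniform_shape."""
    return [i for i, size in enumerate(uniform_shape) if size is not None]


def logical_view(shape, permutation=None, dim_names=None):
    """Return (logical_shape, logical_dim_names) for one physical tensor shape.

    The i-th logical dimension is physical dimension permutation[i].
    """
    ndim = len(shape)
    if permutation is None:
        # Logical and physical layout are equal, so the identity is implied.
        permutation = list(range(ndim))
    if sorted(permutation) != list(range(ndim)):
        raise ValueError("permutation must be a permutation of [0, 1, .., N-1]")
    logical_shape = [shape[p] for p in permutation]
    logical_names = None if dim_names is None else [dim_names[p] for p in permutation]
    return logical_shape, logical_names


# Hypothetical metadata: dimensions 0 and 2 are uniform, dimension 1 is ragged.
metadata = json.loads(
    '{"dim_names": ["x", "y", "z"],'
    ' "permutation": [2, 0, 1],'
    ' "uniform_shape": [10, null, 30]}'
)

physical_shape = [10, 20, 30]
print(uniform_dimensions(metadata["uniform_shape"]))
# [0, 2]
print(logical_view(physical_shape, metadata["permutation"], metadata["dim_names"]))
# ([30, 10, 20], ['z', 'x', 'y'])
```

The second result matches the worked example in the note of the proposal (names [z, x, y], shape [30, 10, 20]), and the first shows why the separate "uniform_dimensions" key was redundant.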