+1! I put together a quick R implementation as well to see how the permutation field fits with our native column-major storage [1]. It worked great! Thank you for all of your work assembling all of our collective opinions on this :-)
[1] https://gist.github.com/paleolimbot/c42f068c2b8b98255dbfbe379d905607

On Tue, Feb 21, 2023 at 8:39 AM Alenka Frim <ale...@voltrondata.com.invalid> wrote:

> Hi all,
>
> I would like to propose we vote on adding the fixed shape tensor canonical
> extension type with the following specification:
>
> Fixed shape tensor
> ==================
>
> * Extension name: `arrow.fixed_shape_tensor`.
>
> * The storage type of the extension: ``FixedSizeList`` where:
>
>   * **value_type** is the data type of individual tensors and
>     is an instance of ``pyarrow.DataType`` or ``pyarrow.Field``.
>   * **list_size** is the product of all the elements in tensor shape.
>
> * Extension type parameters:
>
>   * **value_type** = Arrow DataType of the tensor elements
>   * **shape** = shape of the contained tensors as an array
>
>   Optional parameters:
>
>   * **dim_names** = explicit names to tensor dimensions
>     as an array. The length of it should be equal to the shape
>     length and equal to the number of dimensions.
>
>     ``dim_names`` can be used if the dimensions have well-known
>     names and they map to the physical layout (row-major).
>
>   * **permutation** = indices of the desired ordering of the
>     original dimensions, defined as an array.
>
>     The indices contain a permutation of the values [0, 1, .., N-1] where
>     N is the number of dimensions. The permutation indicates which
>     dimension of the logical layout corresponds to which dimension of the
>     physical tensor (the i-th dimension of the logical view corresponds
>     to the dimension with number ``permutations[i]`` of the physical
>     tensor).
>
>     Permutation can be useful in case the logical order of
>     the tensor is a permutation of the physical order (row-major).
>
>     When logical and physical layout are equal, the permutation will always
>     be ([0, 1, .., N-1]) and can therefore be left out.
>
> * Description of the serialization:
>
>   The metadata must be a valid JSON object including shape of
>   the contained tensors as an array with key **"shape"** plus optional
>   dimension names with keys **"dim_names"** and ordering of the
>   dimensions with key **"permutation"**.
>
>   - Example: ``{ "shape": [2, 5]}``
>   - Example with ``dim_names`` metadata for NCHW ordered data:
>
>     ``{ "shape": [100, 200, 500], "dim_names": ["C", "H", "W"]}``
>
>   - Example of permuted 3-dimensional tensor:
>
>     ``{ "shape": [100, 200, 500], "permutation": [2, 0, 1]}``
>
> .. note::
>
>   Elements in a fixed shape tensor extension array are stored
>   in row-major/C-contiguous order.
>
>
> * The specification is submitted as a PR [1] to Canonical Extension Types
>   document under the format specifications directory [2].
>
> There are also two implementations submitted to Apache Arrow repository:
> * C++ implementation of the proposed specification [3]
> * Python example implementation of the proposed specification and usage
>   (only illustrative) [4]
>
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Accept this proposal
> [ ] +0
> [ ] -1 Do not accept this proposal because...
>
>
> Regards, Alenka
>
> [1]: https://github.com/apache/arrow/pull/33925/files
> [2]: https://github.com/apache/arrow/blob/main/docs/source/format/CanonicalExtensions.rst
> [3]: https://github.com/apache/arrow/pull/8510/files
> [4]: https://github.com/apache/arrow/pull/33948/files
>
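
For anyone following the thread, here is a minimal Python sketch of how the metadata described in the proposal might be built and read back. It is only an illustration under stated assumptions: the helper names (`make_metadata`, `list_size`, `logical_shape`) are hypothetical and belong neither to pyarrow nor to the proposal, and the sketch assumes that "shape" describes the physical (row-major) layout, so that the i-th logical dimension has the size of physical dimension `permutation[i]`.

```python
import json
from math import prod


def make_metadata(shape, dim_names=None, permutation=None):
    """Serialize the extension parameters to a JSON string (hypothetical helper)."""
    ndim = len(shape)
    metadata = {"shape": list(shape)}
    if dim_names is not None:
        # The proposal asks for one name per dimension.
        if len(dim_names) != ndim:
            raise ValueError("dim_names must have one entry per dimension")
        metadata["dim_names"] = list(dim_names)
    if permutation is not None:
        # The proposal asks for a permutation of [0, 1, .., N-1].
        if sorted(permutation) != list(range(ndim)):
            raise ValueError("permutation must be a permutation of [0, 1, .., N-1]")
        metadata["permutation"] = list(permutation)
    return json.dumps(metadata)


def list_size(shape):
    """list_size of the FixedSizeList storage: product of all shape elements."""
    return prod(shape)


def logical_shape(shape, permutation=None):
    """Shape of the logical view, assuming `shape` is the physical shape."""
    if permutation is None:
        # Omitted permutation means logical and physical layouts are equal.
        return list(shape)
    return [shape[i] for i in permutation]


# The permuted 3-dimensional example from the proposal.
params = json.loads(make_metadata([100, 200, 500], permutation=[2, 0, 1]))
print(list_size(params["shape"]))
print(logical_shape(params["shape"], params["permutation"]))
```

Under the assumption above, the permuted example has a storage `list_size` of 10000000 and a logical shape of [500, 100, 200]; leaving `permutation` out gives back the physical shape unchanged.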