Hi all!

There have been quite a few discussions about tensor support in Arrow
Tables/RecordBatches. The Arrow repository already has issues for adding a
column type whose cells each contain a tensor value: one for tensors that
all share the same shape/dimensions [1], and a separate one for tensors of
varying shape [2].

Rok Mihevc, Joris Van den Bossche and I would like to start a discussion
about the specification for canonicalizing the fixed shape tensor type in
Arrow:

Fixed shape tensor
==================

* Extension name: ``arrow.fixed_shape_tensor``.

* The storage type of the extension: ``FixedSizeList`` where:

  * **value_type** is the data type of individual tensors and
    is an instance of ``pyarrow.DataType`` or ``pyarrow.Field``.

  * **list_size** is the product of all the elements in tensor shape.

* Extension type parameters:

  * **value_type** = Arrow DataType of the tensor elements

  * **shape** = shape of the contained tensors as a tuple

* Description of the serialization:

  The metadata must be a valid JSON object that includes the shape of
  the contained tensors as an array under the key "shape".

  For example: ``{"shape": [2, 5]}``
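To make the storage and serialization rules above concrete, here is a minimal sketch in plain Python. It only illustrates the arithmetic and the JSON layout described in the proposal; the helper names (``storage_list_size``, ``serialize_metadata``, ``deserialize_metadata``) are hypothetical and are not part of any Arrow API.

```python
import json
from math import prod


def storage_list_size(shape):
    """Number of values stored per tensor: the product of all shape entries."""
    return prod(shape)


def serialize_metadata(shape):
    """Encode the extension metadata as a JSON object with a "shape" key."""
    return json.dumps({"shape": list(shape)})


def deserialize_metadata(serialized):
    """Recover the tensor shape as a tuple from the JSON metadata."""
    return tuple(json.loads(serialized)["shape"])


shape = (2, 5)
list_size = storage_list_size(shape)   # list_size of the FixedSizeList storage
metadata = serialize_metadata(shape)   # '{"shape": [2, 5]}'
```

With shape (2, 5), each cell of the storage ``FixedSizeList`` would hold 10 values, and the shape round-trips through the serialized metadata.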

.. note::

  Elements in a fixed shape tensor extension array are stored
  in row-major/C-contiguous order.
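As an illustration of what row-major/C-contiguous order implies for lookups, the sketch below computes where a tensor element would sit in the flat storage values. It assumes the flat values of all tensors are laid out back to back, one ``list_size`` block per row; the function names are hypothetical and a plain Python list stands in for the Arrow values buffer.

```python
from math import prod


def flat_index(index, shape):
    """Row-major offset of a multi-dimensional index within one tensor."""
    offset = 0
    for i, dim in zip(index, shape):
        if not 0 <= i < dim:
            raise IndexError(f"index {index} is out of bounds for shape {shape}")
        # The last dimension varies fastest in row-major order.
        offset = offset * dim + i
    return offset


def tensor_element(storage, row, index, shape):
    """Look up one element of the tensor stored in the given row."""
    list_size = prod(shape)
    return storage[row * list_size + flat_index(index, shape)]


shape = (2, 3)
# Two tensors of shape (2, 3), stored back to back:
# row 0 is [[0, 1, 2], [3, 4, 5]], row 1 is [[6, 7, 8], [9, 10, 11]]
storage = list(range(12))
value = tensor_element(storage, row=1, index=(1, 0), shape=shape)
```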

RFC umbrella issue [3] includes:

   - Specification for Tensor canonical type extension [4]
   - C++ implementation of the proposed specification [5]
   - Python example implementation of the proposed specification and usage
     (only illustrative) [6]

Open questions:

   - Should the metadata include a "dim_names" key to pass dimension names when
     creating the Arrow FixedShapeTensorArray? Should we standardize how those
     names are specified and which names to use, or should the names be left
     unstandardized, with the application responsible for interpreting them?

     An example for NCHW ordered data [7]: the application could pass
     "dim_names": ["C", "H", "W"] when creating the Arrow FixedShapeTensorArray.

   - Should the implementation of the tensor extension type be in Arrow C++
     or should it be implemented in the bindings separately?
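To make the "dim_names" question easier to discuss, here is a rough sketch of what the serialized metadata could look like if the key were added. This is purely an assumption for illustration: "dim_names" is not part of the proposed specification, the rule that there must be one name per dimension is my own guess at a sensible constraint, and the shape (3, 224, 224) is just a made-up example.

```python
import json


def serialize_metadata(shape, dim_names=None):
    """Hypothetical metadata encoding with an optional "dim_names" key."""
    metadata = {"shape": list(shape)}
    if dim_names is not None:
        # Assumed constraint: exactly one name per tensor dimension.
        if len(dim_names) != len(shape):
            raise ValueError("dim_names must have one entry per dimension")
        metadata["dim_names"] = list(dim_names)
    return json.dumps(metadata)


# For NCHW data the N dimension is the row index of the array,
# so each stored tensor only carries the C, H and W dimensions.
chw_metadata = serialize_metadata((3, 224, 224), dim_names=["C", "H", "W"])
plain_metadata = serialize_metadata((2, 5))
```

Leaving the key optional keeps the metadata from the current proposal valid as-is.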

In the future we would also like to canonicalize a variable shape tensor
type in Arrow.

Kind regards, Alenka

[1]: https://github.com/apache/arrow/issues/15483

[2]: https://github.com/apache/arrow/issues/24868

[3]: https://github.com/apache/arrow/issues/33924

[4]: https://github.com/apache/arrow/issues/33923

[5]: https://github.com/apache/arrow/issues/15483

[6]: https://github.com/apache/arrow/issues/33947
[7]: https://machinelearning.wtf/terms/nchw/
