A short update on the state of this discussion:

* There is an ongoing thread on "GH-33923: [Docs] Tensor canonical
  extension type specification" [1]. The discussion is now mostly down to
  how logical layout (strides) information should be encoded, if at all,
  and more input would be most welcome.
* There were also two ad hoc Zulip discussions. The first, around the
  tensor canonical extension type proposal [2], mostly mirrored the
  GitHub discussion [1]. The second [3] was about support in languages
  with column-major layouts (R, Julia, ...) and whether zero-copy
  exchange is possible there.
* We seem to have consensus around requiring zero-copy exchange, and an
  open discussion about how to store the logical layout.
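
To make the layout question concrete, here is a minimal NumPy sketch. It is
an illustration only, not part of the proposal, and the variable names are
made up for this example. It shows that a column-major tensor and the
row-major tensor with reversed dimensions sit on the same buffer, so only
the logical layout (dimension order or strides) has to be communicated for
the exchange to stay zero-copy.

```python
import numpy as np

# A 2x3 tensor stored column-major (Fortran order), the way R or Julia
# would lay it out in memory.
col_major = np.asfortranarray(np.arange(6, dtype=np.int64).reshape(2, 3))

# Strides are in bytes: moving along axis 0 skips one int64 (8 bytes),
# moving along axis 1 skips a whole column (16 bytes).
strides = col_major.strides

# The transpose is a view over the same buffer. It has shape (3, 2) and is
# row-major (C-contiguous), so a row-major consumer can read it directly.
row_major_view = col_major.T

# No data was copied: both arrays share the same memory.
shares_buffer = np.shares_memory(col_major, row_major_view)

# The flat buffer order as a row-major reader sees it.
flat = row_major_view.reshape(-1)
```

Whether such a dimension permutation is stored in the extension type
metadata, and in what form, is exactly the open question.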

[1] https://github.com/apache/arrow/pull/33925
[2] https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/Canonical.20extension.20type.20for.20tensors
[3] https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/Row.2Fcolumn-major.20in.20R.2FJulia

Rok

On Tue, Feb 7, 2023 at 7:32 PM Quentin Lhoest <quen...@huggingface.co> wrote:

> Hi,
>
> If I remember correctly one can already pass `types_mapper`
> to `pa.Table.to_pandas`, to allow Ray or HF Datasets to define
> their own pandas extension types associated to the arrow
> extension types. I guess this could also be used until there is a decision
> to include those types in Arrow or not ?
>
>
> On Feb 3, 2023, at 3:26 PM, Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote:
> >
> > On Thu, 2 Feb 2023 at 16:06, Clark Zinzow <clarkzin...@gmail.com> wrote:
> >>
> >> Hi Alenka,
> >>
> >> Great work on the RFC, I'm super excited to see this! I was planning to
> >> open a similar RFC at some point over the next few weeks, so this just
> >> saved me a bunch of work. :D
> >>
> >> At the Ray project [1], we've developed two tensor extension types
> >> (originally adapted from the tensor extension type in
> >> text_extension_for_pandas [2]) that we've continuously extended: a
> >> fixed-shape tensor type [3] and a variable-shaped tensor type [4]. These
> >> extension types include both an Arrow side [5] and a Pandas side [6]. We
> >> would love to contribute anything upstream that's deemed appropriate for
> >> inclusion, to share our learnings from our users using this extension
> >> type in production data processing and AI workloads, and to hopefully
> >> stay in the loop for this RFC as a stakeholder and dev resource.
> >
> > Thanks for the feedback, Clark!
> > We had a look at the Ray implementation before, and as far as I know
> > the spec itself should be mostly compatible with what you did there.
> > The main difference is that the current proposal uses a fixed size
> > list type instead of variable size list (this is for the case of fixed
> > shape tensors!). But given the fixed size of the tensors, the only
> > difference is that this avoids the offsets array, and the actual child
> > array with the flat tensor values should be identical.
> >
> > I think one important question for downstream projects like Ray to be
> > able to adopt this canonical extension type, is the python interface
> > we provide. If the extension type is implemented in (and registered
> > by) the Arrow C++ codebase, we can provide a ExtensionType/Array
> > subclass in pyarrow, and I think it should be possible to provide more
> > or less the same features as what you implemented (eg zero-copy
> > conversion to/from numpy arrays).
> > But as you mention, apart from the arrow side, you also implemented an
> > equivalent pandas ExtensionDtype, so that a pyarrow.Table with this
> > tensor type can be converted to/from a pandas.DataFrame. For this
> > side, I am less sure we want to implement that in pyarrow itself. So
> > for this topic, we might need to look for some solution that is
> > sufficient for a project like Ray (and HuggingFace's Datasets project
> > has the same problem). Currently the pandas ExtensionDtype is returned
> > by pyarrow.ExtensionType.to_pandas_dtype [1]. But this ties both
> > classes together, and makes it difficult to implement the pandas
> > ExtensionDtype externally for a pyarrow.ExtensionType subclass defined
> > in pyarrow itself.
> >
> > Joris
> >
> > [1] https://github.com/ray-project/ray/blob/ada5db71db36f672301639a61b5849fd4fd5914e/python/ray/air/util/tensor_extensions/arrow.py#L90-L99