On Thu, 2 Feb 2023 at 16:06, Clark Zinzow <clarkzin...@gmail.com> wrote:
>
> Hi Alenka,
>
> Great work on the RFC, I'm super excited to see this! I was planning to
> open a similar RFC at some point over the next few weeks, so this just
> saved me a bunch of work. :D
>
> At the Ray project [1], we've developed two tensor extension types
> (originally adapted from the tensor extension type in
> text_extension_for_pandas [2]) that we've continuously extended: a
> fixed-shape tensor type [3] and a variable-shaped tensor type [4]. These
> extension types include both an Arrow side [5] and a Pandas side [6]. We
> would love to contribute anything upstream that's deemed appropriate for
> inclusion, to share our learnings from our users using this extension type
> in production data processing and AI workloads, and to hopefully stay in
> the loop for this RFC as a stakeholder and dev resource.

Thanks for the feedback, Clark!

We had a look at the Ray implementation before, and as far as I know the spec itself should be mostly compatible with what you did there. The main difference is that the current proposal uses a fixed size list type instead of a variable size list (this is for the case of fixed shape tensors!). But given the fixed size of the tensors, the only difference is that this avoids the offsets array; the actual child array with the flat tensor values should be identical.

I think one important question for downstream projects like Ray to be able to adopt this canonical extension type is the Python interface we provide. If the extension type is implemented in (and registered by) the Arrow C++ codebase, we can provide an ExtensionType/Array subclass in pyarrow, and I think it should be possible to provide more or less the same features as what you implemented (e.g. zero-copy conversion to/from numpy arrays).

But as you mention, apart from the Arrow side, you also implemented an equivalent pandas ExtensionDtype, so that a pyarrow.Table with this tensor type can be converted to/from a pandas.DataFrame. For this side, I am less sure we want to implement that in pyarrow itself. So for this topic, we might need to look for some solution that is sufficient for a project like Ray (and HuggingFace's Datasets project has the same problem).

Currently the pandas ExtensionDtype is returned by pyarrow.ExtensionType.to_pandas_dtype [1]. But this ties both classes together, and makes it difficult to implement the pandas ExtensionDtype externally for a pyarrow.ExtensionType subclass defined in pyarrow itself.

Joris

[1] https://github.com/ray-project/ray/blob/ada5db71db36f672301639a61b5849fd4fd5914e/python/ray/air/util/tensor_extensions/arrow.py#L90-L99