Hi,

If I remember correctly, one can already pass `types_mapper` to `pa.Table.to_pandas`, which lets Ray or HF Datasets define their own pandas extension types associated with the Arrow extension types. I guess this could also be used until there is a decision on whether to include those types in Arrow?
> On Feb 3, 2023, at 3:26 PM, Joris Van den Bossche
> <jorisvandenboss...@gmail.com> wrote:
>
> On Thu, 2 Feb 2023 at 16:06, Clark Zinzow <clarkzin...@gmail.com> wrote:
>>
>> Hi Alenka,
>>
>> Great work on the RFC, I'm super excited to see this! I was planning to
>> open a similar RFC at some point over the next few weeks, so this just
>> saved me a bunch of work. :D
>>
>> At the Ray project [1], we've developed two tensor extension types
>> (originally adapted from the tensor extension type in
>> text_extension_for_pandas [2]) that we've continuously extended: a
>> fixed-shape tensor type [3] and a variable-shaped tensor type [4]. These
>> extension types include both an Arrow side [5] and a Pandas side [6]. We
>> would love to contribute anything upstream that's deemed appropriate for
>> inclusion, to share our learnings from our users using this extension type
>> in production data processing and AI workloads, and to hopefully stay in
>> the loop for this RFC as a stakeholder and dev resource.
>
> Thanks for the feedback, Clark!
> We had a look at the Ray implementation before, and as far as I know
> the spec itself should be mostly compatible with what you did there.
> The main difference is that the current proposal uses a fixed size
> list type instead of variable size list (this is for the case of fixed
> shape tensors!). But given the fixed size of the tensors, the only
> difference is that this avoids the offsets array, and the actual child
> array with the flat tensor values should be identical.
>
> I think one important question for downstream projects like Ray to be
> able to adopt this canonical extension type, is the python interface
> we provide. If the extension type is implemented in (and registered
> by) the Arrow C++ codebase, we can provide a ExtensionType/Array
> subclass in pyarrow, and I think it should be possible to provide more
> or less the same features as what you implemented (eg zero-copy
> conversion to/from numpy arrays).
> But as you mention, apart from the arrow side, you also implemented an
> equivalent pandas ExtensionDtype, so that a pyarrow.Table with this
> tensor type can be converted to/from a pandas.DataFrame. For this
> side, I am less sure we want to implement that in pyarrow itself. So
> for this topic, we might need to look for some solution that is
> sufficient for a project like Ray (and HuggingFace's Datasets project
> has the same problem). Currently the pandas ExtensionDtype is returned
> by pyarrow.ExtensionType.to_pandas_dtype [1]. But this ties both
> classes together, and makes it difficult to implement the pandas
> ExtensionDtype externally for a pyarrow.ExtensionType subclass defined
> in pyarrow itself.
>
> Joris
>
> [1]
> https://github.com/ray-project/ray/blob/ada5db71db36f672301639a61b5849fd4fd5914e/python/ray/air/util/tensor_extensions/arrow.py#L90-L99