A short update on the state of this discussion:
* There is an ongoing thread on "GH-33923: [Docs] Tensor canonical
extension type specification" [1]. The discussion is now mostly down to how
logical layout (strides) information would be encoded (if at all), and
more input would be most welcome.
* There were also two ad hoc Zulip discussions. The first, about the tensor
canonical extension type proposal [2], mostly mirrored the GitHub
discussion [1]. The second [3] was about support in languages with
column-major layouts (R, Julia, ...) and whether zero-copy exchange is
possible there.
* We seem to have consensus around requiring zero-copy exchange and an open
discussion about how to store logical layout.
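
To make the strides question concrete, here is a minimal pure-Python sketch.
It is not taken from the proposal, and the helper names `strides` and
`offset` are made up for illustration. It computes strides for row-major and
column-major layouts, then checks that a column-major tensor of shape (2, 3)
addresses a flat buffer exactly like a row-major tensor of shape (3, 2) with
its two dimensions swapped. That equivalence is what would allow zero-copy
exchange with column-major languages, assuming the layout is recorded
somewhere in the type:

```python
from itertools import product


def strides(shape, order="C", itemsize=1):
    """Strides of a dense tensor, in units of `itemsize` bytes.

    order="C" is row-major (last dimension varies fastest),
    order="F" is column-major (first dimension varies fastest).
    """
    # Walk the dimensions from fastest-varying to slowest-varying.
    dims = list(shape) if order == "F" else list(reversed(shape))
    out, step = [], itemsize
    for d in dims:
        out.append(step)
        step *= d
    return tuple(out) if order == "F" else tuple(reversed(out))


def offset(index, strides_):
    """Position in the flat buffer of the element at a logical index."""
    return sum(i * s for i, s in zip(index, strides_))


shape = (2, 3)
row_major = strides(shape, "C")   # (3, 1)
col_major = strides(shape, "F")   # (1, 2)

# A column-major (2, 3) tensor and a row-major (3, 2) tensor with swapped
# indices hit the same buffer positions, so no values need to be moved.
swapped = strides((3, 2), "C")
same_buffer = all(
    offset((i, j), col_major) == offset((j, i), swapped)
    for i, j in product(range(2), range(3))
)
```

Under this reading, a row-major-only type could still carry column-major
data by storing the reversed shape plus a dimension permutation, while
arbitrary strides would need to be stored explicitly.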

[1] https://github.com/apache/arrow/pull/33925
[2] https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/Canonical.20extension.20type.20for.20tensors
[3] https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/Row.2Fcolumn-major.20in.20R.2FJulia

Rok

On Tue, Feb 7, 2023 at 7:32 PM Quentin Lhoest <quen...@huggingface.co>
wrote:

> Hi,
>
> If I remember correctly, one can already pass `types_mapper`
> to `pa.Table.to_pandas`, which allows Ray or HF Datasets to define
> their own pandas extension types associated with the Arrow
> extension types. I guess this could also be used until there is a
> decision on whether to include those types in Arrow?
>
> > On Feb 3, 2023, at 3:26 PM, Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote:
> >
> > On Thu, 2 Feb 2023 at 16:06, Clark Zinzow <clarkzin...@gmail.com> wrote:
> >>
> >> Hi Alenka,
> >>
> >> Great work on the RFC, I'm super excited to see this! I was planning to
> >> open a similar RFC at some point over the next few weeks, so this just
> >> saved me a bunch of work. :D
> >>
> >> At the Ray project [1], we've developed two tensor extension types
> >> (originally adapted from the tensor extension type in
> >> text_extension_for_pandas [2]) that we've continuously extended: a
> >> fixed-shape tensor type [3] and a variable-shaped tensor type [4]. These
> >> extension types include both an Arrow side [5] and a Pandas side [6]. We
> >> would love to contribute anything upstream that's deemed appropriate for
> >> inclusion, to share our learnings from our users using this extension
> >> type in production data processing and AI workloads, and to hopefully
> >> stay in the loop for this RFC as a stakeholder and dev resource.
> >
> > Thanks for the feedback, Clark!
> > We had a look at the Ray implementation before, and as far as I know
> > the spec itself should be mostly compatible with what you did there.
> > The main difference is that the current proposal uses a fixed size
> > list type instead of variable size list (this is for the case of fixed
> > shape tensors!). But given the fixed size of the tensors, the only
> > difference is that this avoids the offsets array, and the actual child
> > array with the flat tensor values should be identical.
> >
> > I think one important question for downstream projects like Ray to be
> > able to adopt this canonical extension type is the Python interface
> > we provide. If the extension type is implemented in (and registered
> > by) the Arrow C++ codebase, we can provide an ExtensionType/Array
> > subclass in pyarrow, and I think it should be possible to provide more
> > or less the same features as what you implemented (e.g. zero-copy
> > conversion to/from NumPy arrays).
> > But as you mention, apart from the arrow side, you also implemented an
> > equivalent pandas ExtensionDtype, so that a pyarrow.Table with this
> > tensor type can be converted to/from a pandas.DataFrame. For this
> > side, I am less sure we want to implement that in pyarrow itself. So
> > for this topic, we might need to look for some solution that is
> > sufficient for a project like Ray (and HuggingFace's Datasets project
> > has the same problem). Currently the pandas ExtensionDtype is returned
> > by pyarrow.ExtensionType.to_pandas_dtype [1]. But this ties both
> > classes together, and makes it difficult to implement the pandas
> > ExtensionDtype externally for a pyarrow.ExtensionType subclass defined
> > in pyarrow itself.
> >
> > Joris
> >
> > [1] https://github.com/ray-project/ray/blob/ada5db71db36f672301639a61b5849fd4fd5914e/python/ray/air/util/tensor_extensions/arrow.py#L90-L99
>
>
