Hi,

If I remember correctly, one can already pass `types_mapper`
to `pa.Table.to_pandas`, which would allow Ray or HF Datasets to define
their own pandas extension types associated with the Arrow
extension types. I guess this could also be used until there is a decision
on whether to include those types in Arrow?

> On Feb 3, 2023, at 3:26 PM, Joris Van den Bossche 
> <jorisvandenboss...@gmail.com> wrote:
> 
> On Thu, 2 Feb 2023 at 16:06, Clark Zinzow <clarkzin...@gmail.com> wrote:
>> 
>> Hi Alenka,
>> 
>> Great work on the RFC, I'm super excited to see this! I was planning to
>> open a similar RFC at some point over the next few weeks, so this just
>> saved me a bunch of work. :D
>> 
>> At the Ray project [1], we've developed two tensor extension types
>> (originally adapted from the tensor extension type in
>> text_extension_for_pandas [2]) that we've continuously extended: a
>> fixed-shape tensor type [3] and a variable-shaped tensor type [4]. These
>> extension types include both an Arrow side [5] and a Pandas side [6]. We
>> would love to contribute anything upstream that's deemed appropriate for
>> inclusion, to share our learnings from our users using this extension type
>> in production data processing and AI workloads, and to hopefully stay in
>> the loop for this RFC as a stakeholder and dev resource.
> 
> Thanks for the feedback, Clark!
> We had a look at the Ray implementation before, and as far as I know
> the spec itself should be mostly compatible with what you did there.
> The main difference is that the current proposal uses a fixed size
> list type instead of variable size list (this is for the case of fixed
> shape tensors!). But given the fixed size of the tensors, the only
> difference is that this avoids the offsets array, and the actual child
> array with the flat tensor values should be identical.
> 
> I think one important question for downstream projects like Ray to be
> able to adopt this canonical extension type, is the Python interface
> we provide. If the extension type is implemented in (and registered
> by) the Arrow C++ codebase, we can provide an ExtensionType/Array
> subclass in pyarrow, and I think it should be possible to provide more
> or less the same features as what you implemented (e.g. zero-copy
> conversion to/from numpy arrays).
> But as you mention, apart from the arrow side, you also implemented an
> equivalent pandas ExtensionDtype, so that a pyarrow.Table with this
> tensor type can be converted to/from a pandas.DataFrame. For this
> side, I am less sure we want to implement that in pyarrow itself. So
> for this topic, we might need to look for some solution that is
> sufficient for a project like Ray (and HuggingFace's Datasets project
> has the same problem). Currently the pandas ExtensionDtype is returned
> by pyarrow.ExtensionType.to_pandas_dtype [1]. But this ties both
> classes together, and makes it difficult to implement the pandas
> ExtensionDtype externally for a pyarrow.ExtensionType subclass defined
> in pyarrow itself.
> 
> Joris
> 
> [1] 
> https://github.com/ray-project/ray/blob/ada5db71db36f672301639a61b5849fd4fd5914e/python/ray/air/util/tensor_extensions/arrow.py#L90-L99
