Hi all,

Thank you all for participating in the discussion. The feedback was very helpful!
I have updated the spec according to the discussion here and in the PR [1], plus the talk we had with Rok and Joris. The change can be found in the "Description of the serialization" section, where dim_names and permutations are now included as *optional* metadata.

Please have a look at the PR [1] and leave comments or suggest changes. Once that is ready, I will send the new version to the ML for a vote.

Rok has also created a Google document titled "Memory representations of tensors in different languages" [2], where he summarizes how other projects and languages represent tensors/n-dim arrays. It gives a nice broader picture of the topic.

[1] https://github.com/apache/arrow/pull/33925#
[2] https://docs.google.com/document/d/1BG10KyDr62e0_WZqVaHcz90SnnLYmiVryZaayoKpmIA/edit?usp=sharing

All well,
Alenka

On Tue, Feb 14, 2023 at 1:00 PM Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote:

> On Tue, 7 Feb 2023 at 19:32, Quentin Lhoest <quen...@huggingface.co> wrote:
> >
> > Hi,
> >
> > If I remember correctly one can already pass `types_mapper`
> > to `pa.Table.to_pandas`, to allow Ray or HF Datasets to define
> > their own pandas extension types associated to the arrow
> > extension types. I guess this could also be used until there is a
> > decision to include those types in Arrow or not?
> >
>
> Yes, that's correct (although we should verify this also works to
> override this for extension types, i.e. that `types_mapper` gets the
> priority in deciding the resulting pandas extension dtype).
> For packages like Ray or HF Datasets, that might be a good enough
> solution; for end-users this is less convenient because you need to
> specify this any time you do a conversion from arrow to pandas, while
> with the `to_pandas_dtype` mechanism this gets used by default.
>
> Joris
>