Re: [VOTE][Format] Fixed shape tensor Canonical Extension Type

2023-03-15 Thread Alenka Frim
Hi all, Thanks Rok for your view on the Parquet topic. Thank you all for joining in on the discussion and the designing of the spec for the support of fixed shape tensors in Apache Arrow! With 3 binding +1 votes, 2 non-binding +1 votes, and no -1 vote, the vote has passed. The PR with the speci

Re: [VOTE][Format] Fixed shape tensor Canonical Extension Type

2023-03-15 Thread Rok Mihevc
Looking at the fixed-size-list memory layout [1], I think we had better proceed with this proposal and instead optimize the Parquet reader/writer, e.g. [2]. Best, Rok [1] https://arrow.apache.org/docs/format/Columnar.html#fixed-size-list-layout [2] https://github.com/apache/arrow/issues/34510#issuecomment
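For readers unfamiliar with the fixed-size-list layout referenced in [1], the following is a minimal pure-Python sketch of the idea, not Arrow code. It assumes no null entries and models the child values buffer as a plain list; the helper name `tensor_slice` is hypothetical. Because every entry has the same length, no offsets buffer is needed and the i-th tensor can be located by arithmetic alone.

```python
from math import prod


def tensor_slice(values, shape, i):
    """Return the flat values of the i-th tensor in a contiguous buffer.

    Hypothetical helper: every tensor occupies exactly prod(shape) slots,
    so its position follows from the index alone.
    """
    list_size = prod(shape)
    start = i * list_size
    return values[start:start + list_size]


# Three 2x3 tensors stored back to back in one flat values buffer.
shape = (2, 3)
values = list(range(18))
second = tensor_slice(values, shape, 1)
```

Here `second` holds the six values of the middle tensor, read straight out of the shared buffer without any per-row offsets.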

Re: [VOTE][Format] Fixed shape tensor Canonical Extension Type

2023-03-15 Thread Alenka Frim
Thank you for the clarification Adam. All the observations and conclusions you have made are very valuable. Certainly, the fact that Parquet reading is not as fast as it could be is a consequence of choosing a FixedSizeList type as a storage type for the fixed shape tensor extension. Despite that

Re: [VOTE][Format] Fixed shape tensor Canonical Extension Type

2023-03-13 Thread Adam Lippai
Hi Alenka, We didn’t discuss or benchmark the alternative formats. My understanding is that the best should be similar to a primitive double Arrow column. Currently the parquet (de)serialization takes 3x longer than desired for the new Tensor type. That sounds more than “chasing the last 20% of p

Re: [VOTE][Format] Fixed shape tensor Canonical Extension Type

2023-03-13 Thread Alenka Frim
Hi Adam, you are referring to the issue you raised on the Arrow repo [1] that turned into a good discussion about FixedSizeList and the current conversion to Parquet. Please correct me if I am wrong, but the outcome of the discussion was that the conversion is still pretty fast (much faster than

Re: [VOTE][Format] Fixed shape tensor Canonical Extension Type

2023-03-10 Thread Adam Lippai
Since the specification explicitly mentions FixedSizeList, but the current conversion to/from parquet is expensive compared to doubles and other primitives (the nested type needs repetition and definition levels) should we discuss what’s the recommendation when integrating with other non-arrow syst
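To illustrate why a nested type costs more than a primitive column here, the sketch below computes repetition levels for a single-level list column in the Dremel style that Parquet uses. It is a simplification under stated assumptions: rows are non-null and non-empty, definition levels (which depend on the schema's nullability) are left out, and the function name `repetition_levels` is hypothetical.

```python
def repetition_levels(rows):
    """Repetition levels for a single-level list column.

    Level 0 marks the first value of a new row; level 1 marks a value
    that continues the current row.
    """
    levels = []
    for row in rows:
        for position, _ in enumerate(row):
            levels.append(0 if position == 0 else 1)
    return levels


# Two rows, each a flattened tensor of three doubles.
rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
levels = repetition_levels(rows)
```

One level entry is produced per stored value, whereas a flat required column of doubles carries no levels at all. For a fixed-size list the pattern is fully determined by the list size, which is presumably the kind of regularity a reader or writer optimization could exploit.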

Re: [VOTE][Format] Fixed shape tensor Canonical Extension Type

2023-03-07 Thread Alenka Frim
> > Just one comment, though: since we also define a separate "Tensor" IPC > structure in Arrow, maybe we should state the relationship somewhere in the > documentation? (Even if the answer is "no relationship".) > Agree David, thanks for bringing it up. I will add the information about "no relat

Re: [VOTE][Format] Fixed shape tensor Canonical Extension Type

2023-03-07 Thread Joris Van den Bossche
+1 (binding) On Tue, 7 Mar 2023 at 23:35, David Li wrote: > > +1 (binding) > > Just one comment, though: since we also define a separate "Tensor" IPC > structure in Arrow, maybe we should state the relationship somewhere in the > documentation? (Even if the answer is "no relationship".) > > On

Re: [VOTE][Format] Fixed shape tensor Canonical Extension Type

2023-03-07 Thread David Li
+1 (binding) Just one comment, though: since we also define a separate "Tensor" IPC structure in Arrow, maybe we should state the relationship somewhere in the documentation? (Even if the answer is "no relationship".) On Mon, Mar 6, 2023, at 18:58, Rok Mihevc wrote: > +1 > > Thanks for the disc

Re: [VOTE][Format] Fixed shape tensor Canonical Extension Type

2023-03-06 Thread Rok Mihevc
+1 Thanks for the discussion everyone! Rok On Mon, Mar 6, 2023 at 8:29 PM Dewey Dunnington wrote: > +1 (non-binding)! > > On Mon, Mar 6, 2023 at 9:59 AM Nic Crane wrote: > > > +1 > > > > On Mon, 6 Mar 2023 at 12:41, Alenka Frim > > > wrote: > > > > > Hi all, > > > > > > I am starting

Re: [VOTE][Format] Fixed shape tensor Canonical Extension Type

2023-03-06 Thread Dewey Dunnington
+1 (non-binding)! On Mon, Mar 6, 2023 at 9:59 AM Nic Crane wrote: > +1 > > On Mon, 6 Mar 2023 at 12:41, Alenka Frim > wrote: > > > Hi all, > > > > I am starting a new voting thread with this email as the first voting > > thread [1] opened up new > > comments and suggestions and we wanted to tak

Re: [VOTE][Format] Fixed shape tensor Canonical Extension Type

2023-03-06 Thread Nic Crane
+1 On Mon, 6 Mar 2023 at 12:41, Alenka Frim wrote: > Hi all, > > I am starting a new voting thread with this email as the first voting > thread [1] opened up new > comments and suggestions and we wanted to take time to see how that > evolves. > > *I would like to propose we vote on adding the fi

[VOTE][Format] Fixed shape tensor Canonical Extension Type

2023-03-06 Thread Alenka Frim
Hi all, I am starting a new voting thread with this email as the first voting thread [1] opened up new comments and suggestions and we wanted to take time to see how that evolves. *I would like to propose we vote on adding the fixed shape tensor canonical extension type* *with the following speci

Re: [VOTE] Format: Fixed shape tensor Canonical Extension Type

2023-03-06 Thread Alenka Frim
m: Alenka Frim > Sent: Tuesday, February 28, 2023 4:19 AM > To: dev@arrow.apache.org > Subject: Re: [VOTE] Format: Fixed shape tensor Canonical Extension Type > > This was actually already meant as the voting thread, but given it sparked > some more discussion, let's give this a

Re: [VOTE] Format: Fixed shape tensor Canonical Extension Type

2023-02-28 Thread Kevin Gurney
: dev@arrow.apache.org Subject: Re: [VOTE] Format: Fixed shape tensor Canonical Extension Type This was actually already meant as the voting thread, but given it sparked some more discussion, let's give this a few more days, and then re-start with a new vote thread. *So if someone still has comme

Re: [VOTE] Format: Fixed shape tensor Canonical Extension Type

2023-02-28 Thread Alenka Frim
I recognize that this proposal is already nearing the > voting phase. > > [1] https://lists.apache.org/thread/bblcwwq7gl1x2hsr1qsormv9f3vr23jn > > Best Regards, > > Kevin Gurney > > > From: Rok Mihevc > Sent: Thursday, February 23,

Re: [VOTE] Format: Fixed shape tensor Canonical Extension Type

2023-02-24 Thread Kevin Gurney
, Kevin Gurney From: Rok Mihevc Sent: Thursday, February 23, 2023 8:12 AM To: dev@arrow.apache.org Subject: Re: [VOTE] Format: Fixed shape tensor Canonical Extension Type That makes sense indeed. Do we have any more comments on the language of the proposal [1] or should we

Re: [VOTE] Format: Fixed shape tensor Canonical Extension Type

2023-02-23 Thread Rok Mihevc
That makes sense indeed. Do we have any more comments on the language of the proposal [1] or should we proceed to vote? Rok [1] https://github.com/apache/arrow/pull/33925/files On Wed, Feb 22, 2023 at 2:13 PM Antoine Pitrou wrote: > > That's a good point. > > Regards > > Antoine. > > > Le 22/0

Re: [VOTE] Format: Fixed shape tensor Canonical Extension Type

2023-02-22 Thread Antoine Pitrou
That's a good point. Regards Antoine. Le 22/02/2023 à 14:11, Dewey Dunnington a écrit : I don't think having both dimension names and permutation is redundant...dimension names can also serve as human-readable tags that help a human interpret the values. If reading a NetCDF, for example, on

Re: [VOTE] Format: Fixed shape tensor Canonical Extension Type

2023-02-22 Thread Dewey Dunnington
I don't think having both dimension names and permutation is redundant...dimension names can also serve as human-readable tags that help a human interpret the values. If reading a NetCDF, for example, one might store the dimension variable names. When determining type equality it may be useful that

Re: [VOTE] Format: Fixed shape tensor Canonical Extension Type

2023-02-22 Thread Rok Mihevc
> > > > > > > Should we rule that `dim_names` and `permutation` are mutually > exclusive? > > > > > > > Since `dim_names` have to "map to the physical layout (row-major)" that > > means permutation will always be trivial which indeed makes it > unnecessary > > to store both. > > I don't think it is

Re: [VOTE] Format: Fixed shape tensor Canonical Extension Type

2023-02-21 Thread Alenka Frim
> I would say "the data type of individual tensor elements". > (so that people don't try to make it e.g. List(float64)). Also, I don't think any reference to pyarrow should be made here. Good catch! I have updated the text with: * **value_type** is the data type of individual tensor elements

Re: [VOTE] Format: Fixed shape tensor Canonical Extension Type

2023-02-21 Thread Dewey Dunnington
+1! I put together a quick R implementation as well to see how the permutation field fits with our native column-major storage [1]. It worked great! Thank you for all of your work assembling all of our collective opinions on this :-) [1] https://gist.github.com/paleolimbot/c42f068c2b8b98255dbfbe37
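As a rough illustration of how a permutation can describe column-major data on top of row-major storage, here is a pure-Python sketch. It assumes one particular reading of the field, namely that entry i of the permutation names the physical dimension that logical dimension i corresponds to; both function names are hypothetical and this is not the R implementation linked above.

```python
def logical_shape(physical_shape, permutation):
    """Shape seen by the user, given the row-major physical shape."""
    if sorted(permutation) != list(range(len(physical_shape))):
        raise ValueError("permutation must rearrange 0..N-1")
    return [physical_shape[p] for p in permutation]


def logical_index_to_offset(index, physical_shape, permutation):
    """Flat buffer offset of a logical index, with row-major physical storage."""
    # Put each logical coordinate onto the physical dimension it maps to.
    physical_index = [0] * len(physical_shape)
    for logical_dim, physical_dim in enumerate(permutation):
        physical_index[physical_dim] = index[logical_dim]
    # Standard row-major offset computation over the physical shape.
    offset = 0
    for size, coord in zip(physical_shape, physical_index):
        offset = offset * size + coord
    return offset


# A 2x3 column-major matrix has the same bytes as a 3x2 row-major one.
shape_seen = logical_shape([3, 2], [1, 0])
offset = logical_index_to_offset((1, 2), [3, 2], [1, 0])
```

Under this reading, element (row 1, column 2) of the 2x3 matrix sits at offset `1 + 2 * 2`, which is exactly the column-major position, so no data has to be copied or transposed.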

Re: [VOTE] Format: Fixed shape tensor Canonical Extension Type

2023-02-21 Thread Joris Van den Bossche
On Tue, 21 Feb 2023 at 18:00, Rok Mihevc wrote: > > > > > Should we rule that `dim_names` and `permutation` are mutually exclusive? > > > > Since `dim_names` have to "map to the physical layout (row-major)" that > means permutation will always be trivial which indeed makes it unnecessary > to stor

Re: [VOTE] Format: Fixed shape tensor Canonical Extension Type

2023-02-21 Thread Rok Mihevc
> > Should we rule that `dim_names` and `permutation` are mutually exclusive? > Since `dim_names` have to "map to the physical layout (row-major)" that means permutation will always be trivial which indeed makes it unnecessary to store both. (This makes me think about extension type implementation

Re: [VOTE] Format: Fixed shape tensor Canonical Extension Type

2023-02-21 Thread Antoine Pitrou
Hi Alenka, Le 21/02/2023 à 13:38, Alenka Frim a écrit : Fixed shape tensor == * Extension name: `arrow.fixed_shape_tensor`. * The storage type of the extension: ``FixedSizeList`` where: * **value_type** is the data type of individual tensors and is an instance of ``

[VOTE] Format: Fixed shape tensor Canonical Extension Type

2023-02-21 Thread Alenka Frim
Hi all, I would like to propose we vote on adding the fixed shape tensor canonical extension type with the following specification: Fixed shape tensor == * Extension name: `arrow.fixed_shape_tensor`. * The storage type of the extension: ``FixedSizeList`` where: * **value_type
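The proposal is cut off here, so the following is only a sketch of the kind of consistency check an implementation might run. It assumes, beyond what the snippet shows, that the type's parameters are carried as a JSON string with a `shape` key plus optional `dim_names` and `permutation` keys, and that the storage list size equals the product of the shape. The function name `check_tensor_type` is hypothetical.

```python
import json
from math import prod


def check_tensor_type(list_size, metadata_json):
    """Validate assumed tensor metadata against the storage list size."""
    meta = json.loads(metadata_json)
    shape = meta["shape"]
    ndim = len(shape)

    # Assumption: each list entry holds exactly one flattened tensor.
    if prod(shape) != list_size:
        raise ValueError("list size does not match the product of shape")

    dim_names = meta.get("dim_names")
    if dim_names is not None and len(dim_names) != ndim:
        raise ValueError("dim_names must have one entry per dimension")

    permutation = meta.get("permutation")
    if permutation is not None and sorted(permutation) != list(range(ndim)):
        raise ValueError("permutation must rearrange 0..N-1")

    return meta


meta = check_tensor_type(6, '{"shape": [2, 3], "dim_names": ["row", "col"]}')
```

A mismatch, such as a list size of 5 with shape `[2, 3]`, raises `ValueError` instead of silently producing a type whose storage cannot hold the declared tensor.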