Hi all,
Thanks Rok for your view on the Parquet topic.
Thank you all for joining the discussion and helping design
the spec for fixed shape tensor support in Apache Arrow!
With 3 binding +1 votes, 2 non-binding +1 votes, and no -1 vote, the
vote has passed.
The PR with the speci
Looking at the fixed-size-list memory layout [1], I think we had better proceed
with this proposal and instead optimize the Parquet reader/writer, e.g. [2].
Best,
Rok
[1]
https://arrow.apache.org/docs/format/Columnar.html#fixed-size-list-layout
[2] https://github.com/apache/arrow/issues/34510#issuecomment
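For readers unfamiliar with the layout referenced in [1], here is a minimal
sketch in plain Python (not pyarrow) of how a fixed-size-list values buffer
can be addressed. It assumes no nulls and ignores the validity bitmap; the
helper names `tensor_slot` and `element` are made up for illustration and are
not part of any Arrow API.

```python
from math import prod


def tensor_slot(values, shape, i):
    """Return the flat elements of the i-th tensor.

    In a fixed-size-list layout the child values are contiguous and every
    slot has the same length, so slot i starts at i * list_size.
    """
    list_size = prod(shape)
    start = i * list_size
    return values[start:start + list_size]


def element(values, shape, i, index):
    """Look up one element of the i-th tensor, assuming row-major order."""
    list_size = prod(shape)
    offset = 0
    for dim, idx in zip(shape, index):
        offset = offset * dim + idx  # standard row-major flattening
    return values[i * list_size + offset]


# Hypothetical data: two tensors of shape (2, 3) stored back to back.
shape = (2, 3)
values = list(range(12))

second_tensor = tensor_slot(values, shape, 1)
last_element = element(values, shape, 1, (1, 2))
```

Because no offsets buffer is needed, the in-memory side is essentially one
flat primitive buffer, which is consistent with the suggestion above to look
for the cost in the Parquet reader/writer rather than in the storage type.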
Thank you for the clarification Adam.
All the observations and conclusions you have made are very valuable.
Certainly, the fact that Parquet reading is not as fast as it could be is a
consequence of choosing a FixedSizeList type as a storage type for
the fixed shape tensor extension.
Despite that
Hi Alenka,
We didn’t discuss or benchmark the alternative formats. My understanding is
that the best case should be similar to a primitive double Arrow column.
Currently the Parquet (de)serialization takes 3x longer than desired for
the new Tensor type. That sounds like more than “chasing the last 20% of
p
Hi Adam,
you are referring to the issue you raised on the Arrow repo [1] that turned
into a good discussion about FixedSizeList and the current conversion
to Parquet.
Please correct me if I am wrong, but the outcome of the discussion was that
the conversion is still pretty fast (much faster than
Since the specification explicitly mentions FixedSizeList, but the current
conversion to/from Parquet is expensive compared to doubles and other
primitives (the nested type needs repetition and definition levels), should
we discuss what’s the recommendation when integrating with other non-Arrow
syst
>
> Just one comment, though: since we also define a separate "Tensor" IPC
> structure in Arrow, maybe we should state the relationship somewhere in the
> documentation? (Even if the answer is "no relationship".)
>
Agree David, thanks for bringing it up.
I will add the information about "no relat
+1 (binding)
On Tue, 7 Mar 2023 at 23:35, David Li wrote:
>
> +1 (binding)
>
> Just one comment, though: since we also define a separate "Tensor" IPC
> structure in Arrow, maybe we should state the relationship somewhere in the
> documentation? (Even if the answer is "no relationship".)
>
> On
+1 (binding)
Just one comment, though: since we also define a separate "Tensor" IPC
structure in Arrow, maybe we should state the relationship somewhere in the
documentation? (Even if the answer is "no relationship".)
On Mon, Mar 6, 2023, at 18:58, Rok Mihevc wrote:
> +1
>
> Thanks for the disc
+1
Thanks for the discussion everyone!
Rok
On Mon, Mar 6, 2023 at 8:29 PM Dewey Dunnington
wrote:
> +1 (non-binding)!
>
> On Mon, Mar 6, 2023 at 9:59 AM Nic Crane wrote:
>
> > +1
> >
> > On Mon, 6 Mar 2023 at 12:41, Alenka Frim
> > wrote:
> >
> > > Hi all,
> > >
> > > I am starting
+1 (non-binding)!
On Mon, Mar 6, 2023 at 9:59 AM Nic Crane wrote:
> +1
>
> On Mon, 6 Mar 2023 at 12:41, Alenka Frim
> wrote:
>
> > Hi all,
> >
> > I am starting a new voting thread with this email as the first voting
> > thread [1] opened up new
> > comments and suggestions and we wanted to tak
+1
On Mon, 6 Mar 2023 at 12:41, Alenka Frim
wrote:
> Hi all,
>
> I am starting a new voting thread with this email as the first voting
> thread [1] opened up new
> comments and suggestions and we wanted to take time to see how that
> evolves.
>
> *I would like to propose we vote on adding the fi
Hi all,
I am starting a new voting thread with this email as the first voting
thread [1] opened up new
comments and suggestions and we wanted to take time to see how that evolves.
*I would like to propose we vote on adding the fixed shape tensor canonical
extension type*
*with the following speci
> From: Alenka Frim
> Sent: Tuesday, February 28, 2023 4:19 AM
> To: dev@arrow.apache.org
> Subject: Re: [VOTE] Format: Fixed shape tensor Canonical Extension Type
>
> This was actually already meant as the voting thread, but given it sparked
> some more discussion, let's give this a
To: dev@arrow.apache.org
Subject: Re: [VOTE] Format: Fixed shape tensor Canonical Extension Type
This was actually already meant as the voting thread, but given it sparked
some more discussion, let's give this a few more days, and then re-start
with a new vote thread.
*So if someone still has comme
I recognize that this proposal is already nearing the
> voting phase.
>
> [1] https://lists.apache.org/thread/bblcwwq7gl1x2hsr1qsormv9f3vr23jn
>
> Best Regards,
>
> Kevin Gurney
>
>
> From: Rok Mihevc
> Sent: Thursday, February 23,
,
Kevin Gurney
From: Rok Mihevc
Sent: Thursday, February 23, 2023 8:12 AM
To: dev@arrow.apache.org
Subject: Re: [VOTE] Format: Fixed shape tensor Canonical Extension Type
That makes sense indeed.
Do we have any more comments on the language of the proposal [1] or should
we
That makes sense indeed.
Do we have any more comments on the language of the proposal [1] or should
we proceed to vote?
Rok
[1] https://github.com/apache/arrow/pull/33925/files
On Wed, Feb 22, 2023 at 2:13 PM Antoine Pitrou wrote:
>
> That's a good point.
>
> Regards
>
> Antoine.
>
>
> Le 22/0
That's a good point.
Regards
Antoine.
Le 22/02/2023 à 14:11, Dewey Dunnington a écrit :
I don't think having both dimension names and permutation is
redundant...dimension names can also serve as human-readable tags that help
a human interpret the values. If reading a NetCDF, for example, on
I don't think having both dimension names and permutation is
redundant...dimension names can also serve as human-readable tags that help
a human interpret the values. If reading a NetCDF, for example, one might
store the dimension variable names. When determining type equality it may
be useful that
>
> > >
> > > Should we rule that `dim_names` and `permutation` are mutually
> exclusive?
> > >
> >
> > Since `dim_names` have to "map to the physical layout (row-major)" that
> > means permutation will always be trivial which indeed makes it
> unnecessary
> > to store both.
>
> I don't think it is
> I would say "the data type of individual tensor elements".
> (so that people don't try to make it e.g. List(float64)).
Also, I don't think any reference to pyarrow should be made here.
Good catch! I have updated the text with:
* **value_type** is the data type of individual tensor elements
+1! I put together a quick R implementation as well to see how the
permutation field fits with our native column-major storage [1]. It worked
great! Thank you for all of your work assembling all of our collective
opinions on this :-)
[1] https://gist.github.com/paleolimbot/c42f068c2b8b98255dbfbe37
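To make the column-major case concrete, here is a small plain-Python sketch
of how a permutation can expose a column-major matrix through row-major
physical storage. It assumes that `permutation[i]` names the physical
dimension that becomes logical dimension `i`, and that `dim_names` describe
the physical layout, as quoted elsewhere in this thread. The function names
are hypothetical and this is not the R implementation from the gist.

```python
def logical_view(shape, permutation, dim_names=None):
    """Derive the logical shape and names from the physical (row-major) ones."""
    logical_shape = tuple(shape[p] for p in permutation)
    logical_names = None
    if dim_names is not None:
        logical_names = [dim_names[p] for p in permutation]
    return logical_shape, logical_names


def logical_element(values, shape, permutation, logical_index):
    """Fetch one element addressed by its logical index."""
    # Translate the logical index into a physical index.
    physical_index = [0] * len(shape)
    for logical_axis, physical_axis in enumerate(permutation):
        physical_index[physical_axis] = logical_index[logical_axis]
    # Row-major offset into the physical buffer.
    offset = 0
    for dim, idx in zip(shape, physical_index):
        offset = offset * dim + idx
    return values[offset]


# The 2x3 matrix [[1, 2, 3], [4, 5, 6]] as column-major storage would hold it.
values = [1, 4, 2, 5, 3, 6]
physical_shape = (3, 2)           # read as row-major: 3 columns of 2 rows
physical_names = ["col", "row"]   # names follow the physical layout
permutation = [1, 0]              # logical (row, col) order

shape, names = logical_view(physical_shape, permutation, physical_names)
matrix = [
    [logical_element(values, physical_shape, permutation, (r, c))
     for c in range(shape[1])]
    for r in range(shape[0])
]
```

Under these assumptions a non-trivial permutation and dimension names can
coexist: the names label the stored dimensions while the permutation recovers
the logical order.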
On Tue, 21 Feb 2023 at 18:00, Rok Mihevc wrote:
>
> >
> > Should we rule that `dim_names` and `permutation` are mutually exclusive?
> >
>
> Since `dim_names` have to "map to the physical layout (row-major)" that
> means permutation will always be trivial which indeed makes it unnecessary
> to stor
>
> Should we rule that `dim_names` and `permutation` are mutually exclusive?
>
Since `dim_names` have to "map to the physical layout (row-major)" that
means permutation will always be trivial which indeed makes it unnecessary
to store both.
(This makes me think about extension type implementation
Hi Alenka,
Le 21/02/2023 à 13:38, Alenka Frim a écrit :
Fixed shape tensor
==================
* Extension name: `arrow.fixed_shape_tensor`.
* The storage type of the extension: ``FixedSizeList`` where:
* **value_type** is the data type of individual tensors and
is an instance of ``
Hi all,
I would like to propose we vote on adding the fixed shape tensor canonical
extension type
with the following specification:
Fixed shape tensor
==================
* Extension name: `arrow.fixed_shape_tensor`.
* The storage type of the extension: ``FixedSizeList`` where:
* **value_type