> I think Parquet is a better place for the variant spec than Arrow.
Parquet is upstream of nearly every project (other than ORC)
log4j is that too - but it doesn't mean that it is the right place.
The key question is: what does it mean for Parquet to have a variant type in
there? Does it actually make se
As the discussions in the Spark community (
https://lists.apache.org/thread/0k5oj3mn0049fcxoxm3gx3d7r28gw4rj) and in
the Parquet community (
https://lists.apache.org/thread/6h58hj39lhqtcyd2hlsyvqm4lzdh4b9z) continue
to decide the spec location, I would like to discuss some of the
implementation de
Thank you Gang, that sounds like a good idea to me as well
On Fri, Aug 23, 2024 at 8:47 AM Aihua Xu
wrote:
> Thanks Gang for initiating the discussion.
>
> On Fri, Aug 23, 2024 at 2:22 AM Gang Wu wrote:
>
>> Thanks Aihua!
>>
>> I've started the discussion in dev@parquet:
>> https://lists.apac
Thanks Gang for initiating the discussion.
On Fri, Aug 23, 2024 at 2:22 AM Gang Wu wrote:
> Thanks Aihua!
>
> I've started the discussion in dev@parquet:
> https://lists.apache.org/thread/6h58hj39lhqtcyd2hlsyvqm4lzdh4b9z
>
> Best,
> Gang
>
> On Fri, Aug 23, 2024 at 12:53 PM Aihua Xu wrote:
>
>>
Thanks Aihua!
I've started the discussion in dev@parquet:
https://lists.apache.org/thread/6h58hj39lhqtcyd2hlsyvqm4lzdh4b9z
Best,
Gang
On Fri, Aug 23, 2024 at 12:53 PM Aihua Xu wrote:
> From this thread
> https://lists.apache.org/thread/0k5oj3mn0049fcxoxm3gx3d7r28gw4rj, it seems
> the Spark community
From this thread
https://lists.apache.org/thread/0k5oj3mn0049fcxoxm3gx3d7r28gw4rj, it seems
the Spark community is leaning toward moving to Parquet.
Gang, can you help start a discussion in the parquet community on adopting
and maintaining such Variant spec?
On Thu, Aug 22, 2024 at 8:08 AM Curt Hagenl
This seems to straddle that line, in that you can also view this as a way
to represent semi-structured data in a manner that allows for more
efficient querying and computation by breaking out some of its components
into a more structured form.
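(As a rough illustration of "breaking out some of its components into a more structured form" - often called shredding - the sketch below splits semi-structured records into typed columns plus an untyped residual. The field names, column layout, and JSON residual are invented for illustration only; they do not reflect the actual Variant spec encoding.)

```python
# Hypothetical sketch of shredding: frequently queried fields of a
# semi-structured value are pulled out into typed columns, while the
# remainder is kept as an opaque blob (JSON here, standing in for a
# binary variant encoding). Not the real spec - illustration only.
import json

records = [
    {"id": 1, "price": 9.99, "tags": ["a", "b"]},
    {"id": 2, "price": 4.50, "note": "clearance"},
]

# Fields assumed to be common/hot enough to shred into typed columns.
shredded_fields = ["id", "price"]

columns = {f: [] for f in shredded_fields}
residual = []  # everything not shredded stays semi-structured

for rec in records:
    for f in shredded_fields:
        columns[f].append(rec.get(f))
    rest = {k: v for k, v in rec.items() if k not in shredded_fields}
    residual.append(json.dumps(rest))

# The typed columns can now be scanned and filtered without parsing
# the residual blob at all.
print(columns["price"])  # [9.99, 4.5]
```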
(I also happen to want a canonical Arrow representatio
Thanks Fokko for providing the discussion from dev@spark!
Happy to see consensus from the creators and looking forward to the next
step!
Best,
Gang
On Thu, Aug 22, 2024 at 4:12 PM Fokko Driesprong wrote:
> Removing the Arrow dev-list from the CC since that's not helpful at this
> point.
>
> Th
Ah, thanks. I've tried to find a rationale and ended up on
https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34 . Is it
a good description of what you're after?
If so, then I don't think Arrow is a good match. This seems mostly to be
a marshalling format for semi-structured data
Removing the Arrow dev-list from the CC since that's not helpful at this
point.
This thread focuses on: Should we fork the spec into Iceberg, or are we
okay with having this inside a different project? Spark is not preferred,
so Parquet and Arrow are suggested as alternatives. Reading the thread,
I personally believe Arrow is a better choice since we will eventually have the
same memory layout but different physical layouts in Parquet, ORC, or other
file formats.
One concern I have about this option is whether the Arrow community is willing
to make this happen and maintain this specific
It seems that we have reached a consensus to some extent that there
should be a new home for the variant spec. The pending question
is whether Parquet or Arrow is a better choice. As a committer in the Arrow,
Parquet, and ORC communities, I am neutral on the choice and happy to
help with the movement
>
> That being said, I think the most important consideration for now is where
> are the current maintainers / contributors to the variant type. If most of
> them are already PMC members / committers on a project, it becomes a bit
> easier. Otherwise if there isn't much overlap with a project's exi
In being more engine and format agnostic, I agree the Arrow project might
be a good host for such a specification. It seems like we want to move away
from hosting in Spark to make it engine agnostic. But moving into Iceberg
might make it less format agnostic, as I understand multiple formats might
My $0.02 (as an Apache Spark PMC member):
It'd be very unfortunate if multiple variant specs emerge at the
physical storage layer. The most important thing is interoperability at the
physical storage layer, since that's by far the most expensive to
"convert". Forking will inevitably lead to
I think Parquet might be a better home over Arrow. Ryan already brought up
interesting points, especially with all of the storage related details and
discussions, like shredding.
Another aspect to this is that while working on Variant, we had ideas of
adding a Variant logical type to Parquet. We t
> Parquet is upstream of nearly every project (other than ORC)
I disagree with this statement. There is a difference between being
upstream and being the internal format in use. For example, DataFusion,
DuckDB, Ray, etc. all have Parquet upstream, but all of them use Arrow as
the internal memory
I think Parquet is a better place for the variant spec than Arrow. Parquet
is upstream of nearly every project (other than ORC) so it is a good place
to standardize and facilitate discussions across communities. There are
also existing relationships and connections to the Parquet community
because
+1 to using Arrow to house the spec. In the interest of expediency I
wonder if we could even store it there "on the side" while we figure out
how to integrate the variant data type with Arrow.
I have a question for those more familiar with the variant spec. Do we
think it could be introduced as
Hi all,
I am one of the main developers for Variant in Apache Spark. David Cashman
(another one of the main Variant developers) and I have been working on
Variant in Spark for a while, and we are excited by the interest from the
Iceberg community!
We have attended some of the Iceberg dev Variant
+ dev@arrow
Thanks for all the valuable suggestions! I am inclined toward Micah's idea that
Arrow might be a better host compared to Parquet.
To give more context, I am taking the initiative to add the geometry type
to both Parquet and ORC. I'd like to do the same thing for variant type in
that varia
Thanks all for your discussion.
The Apache Paimon community is also considering support for this
Variant type; without a doubt, we hope to maintain consistency with
Iceberg.
Not only the Paimon community but also various computing engines, such
as Flink and StarRocks, will need to adapt to this type.
>
> That's fair @Micah, so far all the discussions have been direct and off the
> dev list. Would you like to make the request on the public Spark Dev list?
> I would be glad to co-sign, I can also draft up a quick email if you don't
> have time.
I think once we come to consensus, if you have band
>
> I think the Parquet community is the most neutral option available. Would
> anyone else support asking the Spark and Parquet communities to maintain
> the variant spec in Parquet?
This makes sense to me. I'll reiterate that Arrow might be a better
potential home for this for a few different
I would agree that Parquet seems like a reasonable option in terms of fit
and neutrality.
I'd love to get any feedback from others, but assuming there's
general consensus, I feel like we need to engage with those communities and
have an open conversation about the discussions we've had and why we
I support that whole-heartedly. Parquet would be a great neutral location
for the spec.
On Thu, Aug 15, 2024 at 1:17 PM Ryan Blue
wrote:
> I think it's a good idea to reach out to the Spark community and make sure
> we are in agreement. Up until now I think we've been thinking more
> abstractly
I think it's a good idea to reach out to the Spark community and make sure
we are in agreement. Up until now I think we've been thinking more
abstractly about what makes sense but before we make any decision we should
definitely collaborate with the other communities.
I'd also like to suggest an a
From the iceberg-rust perspective, it could be extremely challenging to keep
track of both the Spark and Iceberg specifications. Having a single source of
truth would be much better. I believe this change will also benefit Delta Lake
if they implement the same approach. Perhaps we can try co
+1 on posting this discussion to dev@spark ML
> I don't think there is anything that would stop us from moving to a joint
project in the future
My concern is that if we don't do this from day 1, we will never ever do
this.
Best,
Gang
On Thu, Aug 15, 2024 at 11:08 PM Russell Spitzer
wrote:
> T
That's fair @Micah, so far all the discussions have been direct and off the
dev list. Would you like to make the request on the public Spark Dev list?
I would be glad to co-sign, I can also draft up a quick email if you don't
have time.
On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield
wrote:
> I
>
> I agree that it would be beneficial to make a sub-project, the main
> problem is political and not logistic. I've been asking for movement from
> other related projects for a month and we simply haven't gotten anywhere.
I just wanted to double check that these issues were brought directly to
@Gang Wu
I agree that it would be beneficial to make a sub-project, the main problem
is political and not logistic. I've been asking for movement from other
related projects for a month and we simply haven't gotten anywhere. I
don't think there is anything that would stop us from moving to a joint
I’m on board with copying the spec into our repository. However, as we’ve
talked about, it’s not just a straightforward copy—there are already some
divergences. Some of them are under discussion. Iceberg is definitely the
best place for these specs. Engines like Trino and Flink can then rely on
the
Sorry for chiming in late.
From the discussion in
https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq, I don't
quite understand why it is logistically complicated to create a sub-project
to hold the variant spec and impl.
IMHO, copying the variant type spec into Apache Iceberg has so
+1 for copying the spec into our repository, I think we need to own it
fully as a part of the table spec, and we can build compatibility through
tests.
-Jack
On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer
wrote:
> I'm not really in favor of linking and annotating as that just makes
> things m
I'm not really in favor of linking and annotating as that just makes things
more complicated and still is essentially forking just with more steps. If
we just track our annotations / modifications to a single commit/version
then we have the same issue again but now you have to go to multiple
sourc
I'm really excited about the introduction of variant type to Iceberg, but I
want to raise concerns about forking the spec.
I feel like preemptively forking would create the situation where we end up
diverging because there's little reason to work with both communities to
evolve in a way that benef
+1 to what's already being said here. It is good to copy the spec to
Iceberg and add context that's specific to Iceberg, but at the same time,
we should maintain compatibility.
Kind regards,
Fokko
On Wed, Aug 14, 2024 at 3:30 PM Manu Zhang wrote:
> +1 to copy the spec into our repository. I think
+1 to copy the spec into our repository. I think the best way to keep
compatibility is building integration tests.
Thanks,
Manu
On Wed, Aug 14, 2024 at 8:27 PM Péter Váry
wrote:
> Thanks Russell and Aihua for pushing Variant support!
>
> Given the differences between the supported types and the
Thanks Russell and Aihua for pushing Variant support!
Given the differences between the supported types and the lack of interest
from the other project, I think it is reasonable to duplicate the
specification to our repository.
I would place very strong emphasis on sticking to the Spark spec as muc
Thanks Russell for bringing this up.
This is the main blocker to moving forward with Variant support in
Iceberg, and hopefully we can reach a consensus. To me, it also makes
more sense to move the spec into Iceberg rather than having the Spark
engine own it while we try to keep it compatible with Spark
Hi Y’all,
We’ve hit a bit of a roadblock with the Variant Proposal. While we were
hoping to move the Variant and Shredding specifications from Spark into
Iceberg, there doesn’t seem to be a lot of interest in that. Unfortunately,
I think we have a number of issues with just linking to the Spark pro