Re: [DISCUSS] Variant Spec Location

2024-08-28 Thread Steve Loughran
> I think Parquet is a better place for the variant spec than Arrow. Parquet is upstream of nearly every project (other than ORC) log4j is that, but it doesn't mean that it is the right place. What is key is: what does it mean for Parquet to have a variant type in there? Does it actually make se

Re: [DISCUSS] Variant Spec Location

2024-08-28 Thread Aihua Xu
As the discussions in the Spark community ( https://lists.apache.org/thread/0k5oj3mn0049fcxoxm3gx3d7r28gw4rj) and in the Parquet community ( https://lists.apache.org/thread/6h58hj39lhqtcyd2hlsyvqm4lzdh4b9z) continue to decide the spec location, I would like to discuss some of the implementation de

Re: [DISCUSS] Variant Spec Location

2024-08-23 Thread Julien Le Dem
Thank you Gang, that sounds like a good idea to me as well. On Fri, Aug 23, 2024 at 8:47 AM Aihua Xu wrote: > Thanks Gang for initiating the discussion. > > On Fri, Aug 23, 2024 at 2:22 AM Gang Wu wrote: > >> Thanks Aihua! >> >> I've started the discussion in dev@parquet: >> https://lists.apac

Re: [DISCUSS] Variant Spec Location

2024-08-23 Thread Aihua Xu
Thanks Gang for initiating the discussion. On Fri, Aug 23, 2024 at 2:22 AM Gang Wu wrote: > Thanks Aihua! > > I've started the discussion in dev@parquet: > https://lists.apache.org/thread/6h58hj39lhqtcyd2hlsyvqm4lzdh4b9z > > Best, > Gang > > On Fri, Aug 23, 2024 at 12:53 PM Aihua Xu wrote: > >>

Re: [DISCUSS] Variant Spec Location

2024-08-23 Thread Gang Wu
Thanks Aihua! I've started the discussion in dev@parquet: https://lists.apache.org/thread/6h58hj39lhqtcyd2hlsyvqm4lzdh4b9z Best, Gang On Fri, Aug 23, 2024 at 12:53 PM Aihua Xu wrote: > From this thread > https://lists.apache.org/thread/0k5oj3mn0049fcxoxm3gx3d7r28gw4rj, seems > Spark community

Re: [DISCUSS] Variant Spec Location

2024-08-22 Thread Aihua Xu
From this thread https://lists.apache.org/thread/0k5oj3mn0049fcxoxm3gx3d7r28gw4rj, seems Spark community is leaning toward moving to Parquet. Gang, can you help start a discussion in the parquet community on adopting and maintaining such Variant spec? On Thu, Aug 22, 2024 at 8:08 AM Curt Hagenl

Re: [DISCUSS] Variant Spec Location

2024-08-22 Thread Curt Hagenlocher
This seems to straddle that line, in that you can also view this as a way to represent semi-structured data in a manner that allows for more efficient querying and computation by breaking out some of its components into a more structured form. (I also happen to want a canonical Arrow representatio

Re: [DISCUSS] Variant Spec Location

2024-08-22 Thread Gang Wu
Thanks Fokko for providing the discussion from dev@spark! Happy to see consensus from the creators and looking forward to the next step! Best, Gang On Thu, Aug 22, 2024 at 4:12 PM Fokko Driesprong wrote: > Removing the Arrow dev-list from the CC since that's not helpful at this > point. > > Th

Re: [DISCUSS] Variant Spec Location

2024-08-22 Thread Antoine Pitrou
Ah, thanks. I've tried to find a rationale and ended up on https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34 . Is it a good description of what you're after? If so, then I don't think Arrow is a good match. This seems mostly to be a marshalling format for semi-structured data

Re: [DISCUSS] Variant Spec Location

2024-08-22 Thread Fokko Driesprong
Removing the Arrow dev-list from the CC since that's not helpful at this point. This thread focuses on: Should we fork the spec into Iceberg, or are we okay with having this inside a different project? Spark is not preferred, so Parquet and Arrow are suggested as alternatives. Reading the thread,

Re: [DISCUSS] Variant Spec Location

2024-08-22 Thread Xuanwo
I personally believe arrow is a better choice since we will eventually have the same memory layout but different physical layouts in Parquet, ORC, or other file formats. One concern about this option I have is whether the Arrow community is willing to make this happen and maintain this specific

Re: [DISCUSS] Variant Spec Location

2024-08-21 Thread Gang Wu
It seems that we have reached a consensus to some extent that there should be a new home for the variant spec. The pending question is whether Parquet or Arrow is a better choice. As a committer in the Arrow, Parquet and ORC communities, I am neutral on the choice and happy to help with the movement

Re: [DISCUSS] Variant Spec Location

2024-08-16 Thread Micah Kornfield
> > That being said, I think the most important consideration for now is where > are the current maintainers / contributors to the variant type. If most of > them are already PMC members / committers on a project, it becomes a bit > easier. Otherwise if there isn't much overlap with a project's exi

Re: [DISCUSS] Variant Spec Location

2024-08-16 Thread Will Jones
In being more engine and format agnostic, I agree the Arrow project might be a good host for such a specification. It seems like we want to move away from hosting in Spark to make it engine agnostic. But moving into Iceberg might make it less format agnostic, as I understand multiple formats might

Re: [DISCUSS] Variant Spec Location

2024-08-16 Thread Reynold Xin
My $0.02 (as an Apache Spark PMC member): It'd be very unfortunate if multiple variant specs emerge at the physical storage layer. The most important thing is interoperability at the physical storage layer, since that's by far the most expensive to "convert". Forking will inevitably lead to

Re: [DISCUSS] Variant Spec Location

2024-08-16 Thread Gene Pang
I think Parquet might be a better home over Arrow. Ryan already brought up interesting points, especially with all of the storage related details and discussions, like shredding. Another aspect to this is that while working on Variant, we had ideas of adding a Variant logical type to Parquet. We t

Re: [DISCUSS] Variant Spec Location

2024-08-16 Thread Weston Pace
> Parquet is upstream of nearly every project (other than ORC) I disagree with this statement. There is a difference between being upstream and being the internal format in use. For example, datafusion, duckdb, ray, etc. all have parquet upstream but all of them use Arrow as the internal memory

Re: [DISCUSS] Variant Spec Location

2024-08-16 Thread Ryan Blue
I think Parquet is a better place for the variant spec than Arrow. Parquet is upstream of nearly every project (other than ORC) so it is a good place to standardize and facilitate discussions across communities. There are also existing relationships and connections to the Parquet community because

Re: [DISCUSS] Variant Spec Location

2024-08-16 Thread Weston Pace
+1 to using Arrow to house the spec. In the interest of expediency I wonder if we could even store it there "on the side" while we figure out how to integrate the variant data type with Arrow. I have a question for those more familiar with the variant spec. Do we think it could be introduced as

Re: [DISCUSS] Variant Spec Location

2024-08-16 Thread Gene Pang
Hi all, I am one of the main developers for Variant in Apache Spark. David Cashman (another one of the main Variant developers) and I have been working on Variant in Spark for a while, and we are excited by the interest from the Iceberg community! We have attended some of the Iceberg dev Variant

Re: [DISCUSS] Variant Spec Location

2024-08-15 Thread Gang Wu
+ dev@arrow Thanks for all the valuable suggestions! I am inclined toward Micah's idea that Arrow might be a better host compared to Parquet. To give more context, I am taking the initiative to add the geometry type to both Parquet and ORC. I'd like to do the same thing for variant type in that varia

Re: [DISCUSS] Variant Spec Location

2024-08-15 Thread Jingsong Li
Thanks all for your discussion. The Apache Paimon community is also considering support for this Variant type; without a doubt, we hope to maintain consistency with Iceberg. Not only the Paimon community, but also various computing engines need to adapt to this type, such as Flink and StarRocks.

Re: [DISCUSS] Variant Spec Location

2024-08-15 Thread Micah Kornfield
> > Thats fair @Micah, so far all the discussions have been direct and off the > dev list. Would you like to make the request on the public Spark Dev list? > I would be glad to co-sign, I can also draft up a quick email if you don't > have time. I think once we come to consensus, if you have band

Re: [DISCUSS] Variant Spec Location

2024-08-15 Thread Micah Kornfield
> > I think the Parquet community is the most neutral option available. Would > anyone else support asking the Spark and Parquet communities to maintain > the variant spec in Parquet? This makes sense to me. I'll reiterate that Arrow might be a better potential home for this for a few different

Re: [DISCUSS] Variant Spec Location

2024-08-15 Thread Daniel Weeks
I would agree that Parquet seems like a reasonable option in terms of fit and neutrality. I'd love to get any feedback from others, but assuming there's general consensus, I feel like we need to engage with those communities and have an open conversation about the discussions we've had and why we

Re: [DISCUSS] Variant Spec Location

2024-08-15 Thread Russell Spitzer
I support that whole-heartedly. Parquet would be a great neutral location for the spec. On Thu, Aug 15, 2024 at 1:17 PM Ryan Blue wrote: > I think it's a good idea to reach out to the Spark community and make sure > we are in agreement. Up until now I think we've been thinking more > abstractly

Re: [DISCUSS] Variant Spec Location

2024-08-15 Thread Ryan Blue
I think it's a good idea to reach out to the Spark community and make sure we are in agreement. Up until now I think we've been thinking more abstractly about what makes sense but before we make any decision we should definitely collaborate with the other communities. I'd also like to suggest an a

Re: [DISCUSS] Variant Spec Location

2024-08-15 Thread Xuanwo
From the iceberg-rust perspective, it could be extremely challenging to keep track of both the Spark and Iceberg specifications. Having a single source of truth would be much better. I believe this change will also benefit Delta Lake if they implement the same approach. Perhaps we can try co

Re: [DISCUSS] Variant Spec Location

2024-08-15 Thread Gang Wu
+1 on posting this discussion to dev@spark ML > I don't think there is anything that would stop us from moving to a joint project in the future My concern is that if we don't do this from day 1, we will never ever do this. Best, Gang On Thu, Aug 15, 2024 at 11:08 PM Russell Spitzer wrote: > T

Re: [DISCUSS] Variant Spec Location

2024-08-15 Thread Russell Spitzer
That's fair, @Micah. So far all the discussions have been direct and off the dev list. Would you like to make the request on the public Spark Dev list? I would be glad to co-sign; I can also draft up a quick email if you don't have time. On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield wrote: > I

Re: [DISCUSS] Variant Spec Location

2024-08-15 Thread Micah Kornfield
> > I agree that it would be beneficial to make a sub-project, the main > problem is political and not logistic. I've been asking for movement from > other relative projects for a month and we simply haven't gotten anywhere. I just wanted to double check that these issues were brought directly to

Re: [DISCUSS] Variant Spec Location

2024-08-15 Thread Russell Spitzer
@Gang Wu I agree that it would be beneficial to make a sub-project; the main problem is political and not logistical. I've been asking for movement from other related projects for a month and we simply haven't gotten anywhere. I don't think there is anything that would stop us from moving to a join

Re: [DISCUSS] Variant Spec Location

2024-08-14 Thread Yufei Gu
I’m on board with copying the spec into our repository. However, as we’ve talked about, it’s not just a straightforward copy—there are already some divergences. Some of them are under discussion. Iceberg is definitely the best place for these specs. Engines like Trino and Flink can then rely on the

Re: [DISCUSS] Variant Spec Location

2024-08-14 Thread Gang Wu
Sorry for chiming in late. From the discussion in https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq, I don't quite understand why it is logistically complicated to create a sub-project to hold the variant spec and impl. IMHO, copying the variant type spec into Apache Iceberg has so

Re: [DISCUSS] Variant Spec Location

2024-08-14 Thread Jack Ye
+1 for copying the spec into our repository, I think we need to own it fully as a part of the table spec, and we can build compatibility through tests. -Jack On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer wrote: > I'm not really in favor of linking and annotating as that just makes > things m

Re: [DISCUSS] Variant Spec Location

2024-08-14 Thread Russell Spitzer
I'm not really in favor of linking and annotating as that just makes things more complicated and still is essentially forking just with more steps. If we just track our annotations / modifications to a single commit/version then we have the same issue again but now you have to go to multiple sourc

Re: [DISCUSS] Variant Spec Location

2024-08-14 Thread Daniel Weeks
I'm really excited about the introduction of variant type to Iceberg, but I want to raise concerns about forking the spec. I feel like preemptively forking would create the situation where we end up diverging because there's little reason to work with both communities to evolve in a way that benef

Re: [DISCUSS] Variant Spec Location

2024-08-14 Thread Fokko Driesprong
+1 to what's already being said here. It is good to copy the spec to Iceberg and add context that's specific to Iceberg, but at the same time, we should maintain compatibility. Kind regards, Fokko Op wo 14 aug 2024 om 15:30 schreef Manu Zhang : > +1 to copy the spec into our repository. I think

Re: [DISCUSS] Variant Spec Location

2024-08-14 Thread Manu Zhang
+1 to copy the spec into our repository. I think the best way to keep compatibility is building integration tests. Thanks, Manu On Wed, Aug 14, 2024 at 8:27 PM Péter Váry wrote: > Thanks Russell and Aihua for pushing Variant support! > > Given the differences between the supported types and the

Re: [DISCUSS] Variant Spec Location

2024-08-14 Thread Péter Váry
Thanks Russell and Aihua for pushing Variant support! Given the differences between the supported types and the lack of interest from the other project, I think it is reasonable to duplicate the specification to our repository. I would give very strong emphasis on sticking to the Spark spec as muc

Re: [DISCUSS] Variant Spec Location

2024-08-13 Thread Aihua Xu
Thanks Russell for bringing this up. This is the main blocker to moving forward with Variant support in Iceberg, and hopefully we can reach a consensus. To me, it also makes more sense to move the spec into Iceberg rather than having the Spark engine own it while we try to keep it compatible with Spark

[DISCUSS] Variant Spec Location

2024-08-12 Thread Russell Spitzer
Hi Y’all, We’ve hit a bit of a roadblock with the Variant Proposal: while we were hoping to move the Variant and Shredding specifications from Spark into Iceberg, there doesn’t seem to be a lot of interest in that. Unfortunately, I think we have a number of issues with just linking to the Spark pro