Re: [DISCUSS] Variant Spec Location

2024-08-22 Thread Aihua Xu
From this thread https://lists.apache.org/thread/0k5oj3mn0049fcxoxm3gx3d7r28gw4rj, it seems the Spark community is leaning toward moving to Parquet. Gang, can you help start a discussion in the Parquet community on adopting and maintaining such a Variant spec? On Thu, Aug 22, 2024 at 8:08 AM Curt Hagenl

Re: Iceberg python library sync

2024-08-22 Thread Jun H.
Hi everyone, FYI, the next community python library sync meeting will be on Tuesday (08/27/2024) at 9 AM (US/Pacific). Here is the meeting agenda for the topics to discuss: https://docs.google.com/document/d/1oMKodaZJrOJjPfc8PDVAoTdl02eGQKHlhwuggiw7s9U/edit#bookmark=kix.76h0j5pwz1gg. Please feel f

Re: Type promotion in v3

2024-08-22 Thread Ryan Blue
Thanks for the discussion, everyone. I think the back and forth between Fokko and Micah helped me understand Micah's position more clearly. I can see how some of the challenges that I raised would be solved, like moving the previous field into the metadata of the transformation. I agree with a lot o

Re: clarification on changelog behavior for equality deletes

2024-08-22 Thread Péter Váry
Thanks for the details! I agree that the first iteration of the planning doesn't need to contain all of the options. In the long run it would be nice to provide the same configuration options for the BaseIncrementalChangelogScan that we have for the create_changelog_view, so we could rebase the i

Re: [VOTE] Spec changes in preparation for v3

2024-08-22 Thread Ryan Blue
With 14 +1 votes (9 binding), this passed. I'll merge the PR. Thanks, everyone! On Tue, Aug 20, 2024 at 7:27 AM Eduard Tudenhöfner wrote: > +1 > > On Tue, Aug 20, 2024 at 4:16 AM xianjin wrote: > >> +1 (non-binding) >> Sent from my iPhone >> >> On Aug 20, 2024, at 7:56 AM, Manu Zhang wrote: >>

Re: clarification on changelog behavior for equality deletes

2024-08-22 Thread Wing Yew Poon
Just a note that the functionality to compute net changes was added by Yufei only in Iceberg 1.4.0, in #7326 . On Thu, Aug 22, 2024 at 12:48 PM Wing Yew Poon wrote: > Peter, > > The Spark procedure is implemented by CreateChangelogViewProcedure.java >

Re: clarification on changelog behavior for equality deletes

2024-08-22 Thread Wing Yew Poon
Peter, The Spark procedure is implemented by CreateChangelogViewProcedure.java . This was already added by Yufei in Iceberg 1.2.0. ChangelogIterator

Re: clarification on changelog behavior for equality deletes

2024-08-22 Thread Péter Váry
That's good info. I didn't know that we already have the Spark procedure at hand. How does Spark calculate the `changelog_view`? Do we already have an implementation at hand somewhere? Could it be reused? Anyways, if we want to reuse the new changelogscan for the changelog_view as well, then I agr

Re: [DISCUSS] Adding RemovePartitionSpecsUpdate update type to REST

2024-08-22 Thread Ryan Blue
+1 On Tue, Aug 20, 2024 at 1:56 AM Fokko Driesprong wrote: > +1 Thanks for working on this > > On Tue, Aug 20, 2024 at 04:16, xianjin wrote: > >> +1 from my side as well. >> >> Sent from my iPhone >> >> On Aug 20, 2024, at 9:09 AM, Yufei Gu wrote: >> >> >> >> +1, the new spec looks good to me.

RE: Re: [DISCUSS] Implementing a table-level statistics file to store column statistics

2024-08-22 Thread Guy Khazma
Hi All, Thanks for the discussion. As @karuppayya’s PR recently got merged and it collects the NDV stats on a table level, I would like to revisit the partition stats vs table stats discussion and raise a few points for discussion: 1. The current action collects the NDV stats on a table l

[VOTE] Release Apache Iceberg 1.6.1 RC2

2024-08-22 Thread Carl Steinbach
Hi Everyone, I propose that we release the following RC as the official Apache Iceberg 1.6.1 release. The commit ID is 8e9d59d299be42b0bca9461457cd1e95dbaad086 * This corresponds to the tag: apache-iceberg-1.6.1-rc2 * https://github.com/apache/iceberg/commits/apache-iceberg-1.6.1-rc2 * https://gi

Re: clarification on changelog behavior for equality deletes

2024-08-22 Thread Steve Zhang
Yeah, agreed on this. I think that for the changelog scan to convert per-snapshot scans to tasks, option (b) with the complete history is the right way, while there should be an option to configure whether net/squashed changes are desired. Also, in Spark's create_changelog_view, the net changes and compute update cannot

Re: [VOTE] Release Apache Iceberg 1.6.1 RC1

2024-08-22 Thread Carl Steinbach
Working on it now. - Carl On Thu, Aug 22, 2024 at 5:13 AM Jean-Baptiste Onofré wrote: > Yeah, it makes sense (and it was what I expected to be honest :) ). > > Eduard already reviewed and merged, so I think we are good for a new > RC. I guess Carl will prepare a new one soon. > > Thanks ! > > R

Re: clarification on changelog behavior for equality deletes

2024-08-22 Thread Steven Wu
> It should emit changes for each snapshot in the requested range. Wing Yew has a good point here. +1 On Thu, Aug 22, 2024 at 8:46 AM Wing Yew Poon wrote: > First, thank you all for your responses to my question. > > For Peter's question, I believe that (b) is the correct behavior. It is > als

Re: clarification on changelog behavior for equality deletes

2024-08-22 Thread Wing Yew Poon
First, thank you all for your responses to my question. For Peter's question, I believe that (b) is the correct behavior. It is also the current behavior when using copy-on-write (deletes and updates are still supported but not using delete files). A changelog scan is an incremental scan over mult

Re: [DISCUSS] Variant Spec Location

2024-08-22 Thread Curt Hagenlocher
This seems to straddle that line, in that you can also view this as a way to represent semi-structured data in a manner that allows for more efficient querying and computation by breaking out some of its components into a more structured form. (I also happen to want a canonical Arrow representatio

Re: clarification on changelog behavior for equality deletes

2024-08-22 Thread Steven Wu
Peter, good question. In this case, (b) is the complete change history. (a) is the squashed version. I would probably check how other changelog systems deal with this scenario. On Thu, Aug 22, 2024 at 3:49 AM Péter Váry wrote: > Technically different, but somewhat similar question: > > What is

Re: [DISCUSS] Variant Spec Location

2024-08-22 Thread Gang Wu
Thanks Fokko for providing the discussion from dev@spark! Happy to see consensus from the creators and looking forward to the next step! Best, Gang On Thu, Aug 22, 2024 at 4:12 PM Fokko Driesprong wrote: > Removing the Arrow dev-list from the CC since that's not helpful at this > point. > > Th

Re: [VOTE] Release Apache Iceberg 1.6.1 RC1

2024-08-22 Thread Jean-Baptiste Onofré
Yeah, it makes sense (and it was what I expected to be honest :) ). Eduard already reviewed and merged, so I think we are good for a new RC. I guess Carl will prepare a new one soon. Thanks ! Regards JB On Thu, Aug 22, 2024 at 11:54 AM Driesprong, Fokko wrote: > > It was not correctly backport

Re: [VOTE] Release Apache Iceberg 1.6.1 RC1

2024-08-22 Thread Eduard Tudenhöfner
Thanks Fokko, I've just reviewed/merged it. We should be good to do another RC. On Thu, Aug 22, 2024 at 11:54 AM Driesprong, Fokko wrote: > It was not correctly backported, I do think we want to add this since it > fixes a CVE as mentioned earlier. I've created a PR: > https://github.com/apache

Re: [EXTERNAL] Re: Iceberg-arrow vectorized read bug

2024-08-22 Thread Eduard Tudenhöfner
Hey Steve, I spent some time today and investigated the issue. I think you're on the right track with #10953 and I left a few comments but overall I think it's close. Thanks again for your patience on this one. Eduard On Fri, Aug 16, 2024 at 10:14 

Re: clarification on changelog behavior for equality deletes

2024-08-22 Thread Péter Váry
Technically different, but somewhat similar question: What is the expected behaviour when the `IncrementalScan` is created not for a single snapshot, but for multiple snapshots? S1 added PK1-V1; S2 updated PK1-V1 to PK1-V1b (removed PK1-V1 and added PK1-V1b); S3 updated PK1-V1b to PK1-V1c (removed P
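
To make the two readings discussed in this thread concrete, here is a minimal sketch of the scenario above. It is not Iceberg's implementation: the event tuples, `SNAPSHOTS`, `complete_history`, and `net_changes` are hypothetical names chosen for illustration, and the S3 events assume the same "remove old value, add new value" pattern that S2 spells out. The complete history emits every change per snapshot, while the squashed (net) view cancels an insert against a later delete of the same row within the requested range.

```python
# Hypothetical per-snapshot change events: (operation, key, value).
# S3 is assumed to follow the same delete-then-insert pattern as S2.
SNAPSHOTS = [
    ("S1", [("INSERT", "PK1", "V1")]),
    ("S2", [("DELETE", "PK1", "V1"), ("INSERT", "PK1", "V1b")]),
    ("S3", [("DELETE", "PK1", "V1b"), ("INSERT", "PK1", "V1c")]),
]


def complete_history(snapshots):
    """Emit every change for each snapshot in the range, in order."""
    return [
        (snapshot_id, op, key, value)
        for snapshot_id, changes in snapshots
        for op, key, value in changes
    ]


def net_changes(snapshots):
    """Squash the range: an INSERT and a DELETE of the same row cancel out."""
    pending = []
    for _, changes in snapshots:
        for op, key, value in changes:
            opposite = ("DELETE" if op == "INSERT" else "INSERT", key, value)
            if opposite in pending:
                pending.remove(opposite)  # the pair has no net effect
            else:
                pending.append((op, key, value))
    return pending


if __name__ == "__main__":
    for row in complete_history(SNAPSHOTS):
        print("complete:", row)
    for row in net_changes(SNAPSHOTS):
        print("net:     ", row)
```

Under these assumptions the complete history contains five change rows across S1 to S3, whereas the net view contains a single insert of PK1 with the final value.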

Re: [VOTE] Release Apache Iceberg 1.6.1 RC1

2024-08-22 Thread Driesprong, Fokko
It was not correctly backported. I do think we want to add this, since it fixes a CVE as mentioned earlier. I've created a PR: https://github.com/apache/iceberg/pull/10988 Kind regards, Fokko On Thu, Aug 22, 2024 at 11:35, Jean-Baptiste Onofré wrote: > Hi guys, > > FYI, the reason I mentioned ORC

Re: [VOTE] Release Apache Iceberg 1.6.1 RC1

2024-08-22 Thread Jean-Baptiste Onofré
Hi guys, FYI, the reason I mentioned the ORC update is that the PR is "flagged" with milestone 1.6.1. So it's a bit surprising not to have it in 1.6.1. We should at least update the PR/issue by removing the 1.6.1 milestone, else it would not be "accurate". Regards JB On Thu, Aug 22, 2024 at 12:04 A

Re: [DISCUSS] Variant Spec Location

2024-08-22 Thread Antoine Pitrou
Ah, thanks. I've tried to find a rationale and ended up on https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34 . Is it a good description of what you're after? If so, then I don't think Arrow is a good match. This seems mostly to be a marshalling format for semi-structured data

Re: [DISCUSS] Variant Spec Location

2024-08-22 Thread Fokko Driesprong
Removing the Arrow dev-list from the CC since that's not helpful at this point. This thread focuses on: Should we fork the spec into Iceberg, or are we okay with having this inside a different project? Spark is not preferred, so Parquet and Arrow are suggested as alternatives. Reading the thread,

Re: [DISCUSS] Variant Spec Location

2024-08-22 Thread Xuanwo
I personally believe Arrow is a better choice since we will eventually have the same memory layout but different physical layouts in Parquet, ORC, or other file formats. One concern I have about this option is whether the Arrow community is willing to make this happen and maintain this specific

Re: [VOTE] Release Apache Iceberg 1.6.1 RC1

2024-08-22 Thread Eduard Tudenhöfner
Ah sorry I totally missed that message from JB mentioning ORC. I also didn't see the ORC update in the RC. On Thu, Aug 22, 2024 at 12:06 AM Piotr Findeisen wrote: > Hi Eduard, > > JB wrote > > For the record (maybe it helps users/reviewers), this release includes: >> - ORC 1.9.4 update >> - intr