Re: [VOTE] Deletion Vectors in V3

2024-10-31 Thread Micah Kornfield
+1 (non-binding) On Thu, Oct 31, 2024 at 4:05 PM Steve Zhang wrote: > +1 (non-binding) > > Thanks, > Steve Zhang > > > > On Oct 31, 2024, at 3:41 PM, rdb...@gmail.com wrote: > > +1 > > Thanks, Anton! > > On Wed, Oct 30, 2024 at 11:58 PM Fokko Driesprong > wrote: > >> +1 >> >> I had to read up a

Re: [VOTE] Deletion Vectors in V3

2024-10-31 Thread rdb...@gmail.com
+1 Thanks, Anton! On Wed, Oct 30, 2024 at 11:58 PM Fokko Driesprong wrote: > +1 > > I had to read up a bit, thanks for driving this Anton. > > Kind regards, > Fokko > > Op do 31 okt 2024 om 07:53 schreef Piotr Findeisen < > piotr.findei...@gmail.com>: > >> Thank you Anton, >> >> +1 (non-binding

Re: Need clarification on max number of columns a Iceberg table can have

2024-10-31 Thread Pucheng Yang
Thank you very much! I will try to document this on the website. On Thu, Oct 31, 2024 at 6:06 PM Szehon Ho wrote: > Yes, that is correct! > > Thanks > Szehon > > On Thu, Oct 31, 2024 at 5:58 PM Pucheng Yang > wrote: > >> Hi Szehon, >> >> Thanks for getting back to me so quickly! >> >> No, I don

Re: Need clarification on max number of columns a Iceberg table can have

2024-10-31 Thread Szehon Ho
Yes, that is correct! Thanks Szehon On Thu, Oct 31, 2024 at 5:58 PM Pucheng Yang wrote: > Hi Szehon, > > Thanks for getting back to me so quickly! > > No, I don't see anywhere that is failing. My question is more of a general > question after browsing all the issues. So from what you said, it s

Re: Need clarification on max number of columns a Iceberg table can have

2024-10-31 Thread Pucheng Yang
Hi Szehon, Thanks for getting back to me so quickly! No, I don't see anywhere that is failing. My question is more of a general question after browsing all the issues. So from what you said, it seems Iceberg in theory can support a very large number of columns (say 100K) w/o hitting any hard limi

Re: Need clarification on max number of columns a Iceberg table can have

2024-10-31 Thread Szehon Ho
Hi Pucheng There were some parts in the implementation where column field ids collided with partition field ids. https://github.com/apache/iceberg/pull/10020 introduced mechanisms for affected code to get unique ids, and known places have been fixed. (Particularly the Spark procedure rewrite_posit

Re: [VOTE] Release Apache Iceberg 1.7.0 RC0

2024-10-31 Thread Chertara, Rahil
+1 (non-binding) Checked the following: * Verified the signatures * Verified the checksums * Verified the license documentation * Verified the iceberg spark release binary by downloading the runtime jar from the 1.7.0 release repo. Ran a basic iceberg workflow (DDL, DML, DQL) usin

Need clarification on max number of columns a Iceberg table can have

2024-10-31 Thread Pucheng Yang
Hey community, I was following https://github.com/apache/iceberg/issues/9220 (Max number of columns) and went down the rabbit hole, and I found there are a lot of discussions about issues with tables having more than 1k columns. However, after reviewing discussions, it is still a little confusing to me

Re: [VOTE] Deletion Vectors in V3

2024-10-31 Thread Steve Zhang
+1 (non-binding) Thanks, Steve Zhang > On Oct 31, 2024, at 3:41 PM, rdb...@gmail.com wrote: > > +1 > > Thanks, Anton! > > On Wed, Oct 30, 2024 at 11:58 PM Fokko Driesprong > wrote: >> +1 >> >> I had to read up a bit, thanks for driving this Anton. >> >> Kind regar

Re: REST catalog removes void transform

2024-10-31 Thread rdb...@gmail.com
Vladimir, what is the context in which you want to maintain a partition spec with only void transforms? Is this in a v2 table? In a v2 table, the catalog should be free to remove void transforms. They are required for v1. On Wed, Oct 30, 2024 at 5:00 AM Vladimir Ozerov wrote: > Hi, > > When a us

Re: [DISCUSS] - Deprecate Equality Deletes

2024-10-31 Thread Micah Kornfield
I agree that equality deletes have their place in streaming. I think the ultimate decision here is how opinionated Iceberg wants to be on its use-cases. If it really wants to stick to its origins of "slow moving data", then removing equality deletes would be inline with this. I think the other h

Re: [DISCUSS] - Deprecate Equality Deletes

2024-10-31 Thread Alexander Jo
Hey all, Just to throw my 2 cents in, I agree with Steven and others that we do need some kind of replacement before deprecating equality deletes. They certainly have their problems, and do significantly increase complexity as they are now, but the writing of position deletes is too expensive for

Re: [DISCUSS] Partial Metadata Loading

2024-10-31 Thread Daniel Weeks
Eric, With respect to the credential endpoint, I believe there is important context missing that probably should have been captured in the doc. The credential endpoint is unlike other use cases because the fundamental issue is that refresh is an operation that happens across distributed workers.

[DISCUSS] Change Behavior for SchemaUpdate.UnionByName

2024-10-31 Thread Rocco Varela
Hi everyone, Apologies if this is landing twice, my first attempt got lost somewhere in transit :) I have a PR that attempts to address https://github.com/apache/iceberg/issues/4849, basically adding logic to ignore downcasting column types when "mergeSchema" is set when an existing column type i

Re: [DISCUSS] Partial Metadata Loading

2024-10-31 Thread Eric Maynard
Thanks for this breakdown Dan. I share your concerns about the complexity this might impose on the client. On some of your other notes, I have some thoughts below: Several Apache Polaris (Incubating) committers were in the recent sync on this proposal, so I want to share one perspective related

Re: [DISCUSS] Change Behavior for SchemaUpdate.UnionByName

2024-10-31 Thread Russell Spitzer
I'm in favor of 1 since previously these inputs would have thrown an exception that wasn't really that helpful. @Test public void testDowncastoLongToInt() { Schema currentSchema = new Schema(required(1, "aCol", LongType.get())); Schema newSchema = new Schema(required(1, "aCol", IntegerType.get

Re: [DISCUSS] Partial Metadata Loading

2024-10-31 Thread Daniel Weeks
I'd like to clarify my concerns here because I think there are more aspects to this than we've captured. *Partial metadata loads adds significant complexity to the protocol* Iceberg metadata is a complicated structure and finding a way to represent how and what we want to piece apart is non-trivia

Re: [DISCUSS] - Deprecate Equality Deletes

2024-10-31 Thread Steven Wu
We probably all agree with the downside of equality deletes: it postpones all the work on the read path. In theory, we can implement position deletes only in the Flink streaming writer. It would require the tracking of last committed data files per key, which can be stored in Flink state (checkpoi
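The per-key tracking idea in the message above can be sketched roughly as follows. This is a toy illustration in plain Python, not Flink or Iceberg code: the class name `KeyPositionIndex` and its `upsert` method are hypothetical, and an in-memory dict stands in for the Flink state (checkpointing is not modeled). The point it tries to show is that if the writer remembers where the latest row for each key landed, it can emit a position delete for the old row without scanning any data files.

```python
from typing import Dict, Hashable, List, Tuple

# (data file name, row position within that file); names are illustrative only
Position = Tuple[str, int]


class KeyPositionIndex:
    """Hypothetical writer-side index: key -> location of its latest row."""

    def __init__(self) -> None:
        # Stand-in for engine state; a real writer would persist this.
        self._last: Dict[Hashable, Position] = {}
        # Next free row position per data file being written.
        self._next_pos: Dict[str, int] = {}

    def upsert(self, key: Hashable, data_file: str) -> List[Position]:
        """Record a new row for `key` appended to `data_file`.

        Returns the position deletes needed to retire the previous
        row for the same key (empty if the key is new).
        """
        deletes: List[Position] = []
        previous = self._last.get(key)
        if previous is not None:
            deletes.append(previous)
        pos = self._next_pos.get(data_file, 0)
        self._next_pos[data_file] = pos + 1
        self._last[key] = (data_file, pos)
        return deletes
```

The cost of this approach is the size of the index itself, which grows with the number of distinct keys.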

Re: [VOTE] Release Apache Iceberg 1.7.0 RC0

2024-10-31 Thread Russell Spitzer
@Manu Zhang You are definitely right, I'll get that in before I do docs release. If you don't mind I'll skip it in the source though so I don't have to do another full release. On Thu, Oct 31, 2024 at 7:35 AM Jean-Baptiste Onofré wrote: > +1 (non binding) > > I checked the LICENSE and NOTICE, a

Re: [DISCUSS] - Deprecate Equality Deletes

2024-10-31 Thread Piotr Findeisen
Hi, Thank you Russell for bringing up this topic and nice write-up. From the perspective of engines like Trino, equality deletes bring little value and add a lot of complications, so +1 from me on this. I understand they exist for a reason though. Maybe it was just a lazy choice that we should just revis

Re: [DISCUSS] - Deprecate Equality Deletes

2024-10-31 Thread Russell Spitzer
For users of Equality Deletes, what are the key benefits to Equality Deletes that you would like to preserve and could you please share some concrete examples of the queries you want to run (and the schemas and data sizes you would like to run them against) and the latencies that would be acceptabl

Re: [DISCUSS] - Deprecate Equality Deletes

2024-10-31 Thread Jason Fine
Hi, Representing Upsolver here, we also make use of Equality Deletes to deliver high frequency low latency updates to our clients at scale. We have customers using them at scale and demonstrating the need and viability. We automate the process of converting them into positional deletes (or fully a

Re: [DISCUSS] - Deprecate Equality Deletes

2024-10-31 Thread Jean-Baptiste Onofré
Hi Russell Thanks for the nice writeup and the proposal. I agree with your analysis, and I have the same feeling. However, I think there are more than Flink that write equality delete files. So, I agree to deprecate in V3, but maybe be more "flexible" about removal in V4 in order to give time to

Re: [VOTE] Release Apache Iceberg 1.7.0 RC0

2024-10-31 Thread Jean-Baptiste Onofré
+1 (non binding) I checked the LICENSE and NOTICE, and they both look good to me (the same as in previous releases), so not a blocker for me. I also checked: - the planned Avro readers used in both Flink and Spark, they are actually used - Signature and hash are good - No binary file found in the

Re: [VOTE] Release Apache Iceberg 1.7.0 RC0

2024-10-31 Thread Ajantha Bhat
Hi Russell, The LICENSE and NOTICE issue turned out to be a false positive. When I used the Archive Utility app on Mac to unzip, it automatically overwrote identical files without any prompts. However, using the command line to unzip provided a prompt for identical files. I can confirm both the fi

Re: [DISCUSS] - Deprecate Equality Deletes

2024-10-31 Thread Maximilian Michels
How would Flink or other engines support Upserts or Deletions without equality deletes? The only option would be to use positional deletes, but that requires scanning all data files to find the correct positions. IMHO the separation between deciding to delete based on an equality field and applying
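The trade-off raised in the message above can be illustrated with a small sketch. It is plain Python over a toy model (data files as lists of row dicts), not Iceberg's API; the function names `resolve_to_positions`, `read_with_equality_delete`, and `read_with_position_deletes` are hypothetical. It only aims to show where the scan happens: an equality delete defers the matching to readers, while producing positional deletes requires finding every matching row up front.

```python
from typing import Any, Dict, List, Tuple

# Toy model: file name -> rows; a row's position is its index in the list.
DataFiles = Dict[str, List[dict]]


def resolve_to_positions(files: DataFiles, field: str, value: Any) -> List[Tuple[str, int]]:
    """Writer-side cost: scan every file to turn `field == value` into positions."""
    hits = []
    for name, rows in files.items():
        for pos, row in enumerate(rows):
            if row.get(field) == value:
                hits.append((name, pos))
    return hits


def read_with_equality_delete(files: DataFiles, field: str, value: Any) -> List[dict]:
    """Reader-side cost: every read re-applies the equality predicate."""
    return [row for rows in files.values() for row in rows if row.get(field) != value]


def read_with_position_deletes(files: DataFiles, deletes: List[Tuple[str, int]]) -> List[dict]:
    """Reader only skips the listed (file, position) pairs."""
    deleted = set(deletes)
    return [
        row
        for name, rows in files.items()
        for pos, row in enumerate(rows)
        if (name, pos) not in deleted
    ]
```

Both read paths return the same rows; they differ in who pays for locating the deleted rows and when.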