Re: [DISCUSS] REST: OAuth2 Authentication Guide

2024-11-01 Thread Christian Thiel
Thank you for your feedback, everyone!

It would be great if we could get some more eyes from the community on the 
server-side token exchange section at the bottom of the document. Are we aware 
of any OAuth2-secured implementations that provide tokens from the resource 
server to the client apart from Tabular?

https://docs.google.com/document/d/1buW9PCNoHPeP7Br5_vZRTU-_3TExwLx6bs075gi94xc/edit?usp=sharing

@Dmitri I agree that I wasn’t quite clear that catalogs/IdPs might choose 
not to implement all of OAuth2. As you suggested, I added a hint intended 
for users at the top of the document.
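
For readers less familiar with what the server-side token exchange section
covers, below is a minimal sketch of an RFC 8693 token-exchange request
expressed with java.net.http. The IdP endpoint, scope, and token values are
placeholders; whether a given catalog or resource server issues or accepts
tokens this way is exactly the open question above.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TokenExchangeSketch {
  public static void main(String[] args) throws Exception {
    // RFC 8693 token exchange: trade a subject token for a new access token.
    // All values below are placeholders for illustration only.
    String form =
        "grant_type=urn%3Aietf%3Aparams%3Aoauth%3Agrant-type%3Atoken-exchange"
            + "&subject_token=eyJhbGciOi...placeholder..."
            + "&subject_token_type=urn%3Aietf%3Aparams%3Aoauth%3Atoken-type%3Aaccess_token"
            + "&scope=catalog";

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://idp.example.com/oauth2/token")) // hypothetical IdP endpoint
        .header("Content-Type", "application/x-www-form-urlencoded")
        .POST(HttpRequest.BodyPublishers.ofString(form))
        .build();

    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    // The JSON response carries access_token, issued_token_type, token_type, etc.
    System.out.println(response.body());
  }
}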


Re: [DISCUSS] Change Behavior for SchemaUpdate.UnionByName

2024-11-01 Thread Rocco Varela
Thanks Russell and Fokko.

I updated my PR with the suggested updates.

Cheers,

--Rocco

On Fri, Nov 1, 2024 at 3:01 AM Fokko Driesprong  wrote:

> Hey Rocco,
>
> Thanks for raising this. I don't have any strong feelings about this, and
> I agree with Russell that it should not throw an exception.
>
> I guess there was no strong reason behind how it is today, but it's just
> because we leverage the UpdateSchema API, which raises an exception when
> doing the downcast.
>
> Also, on the Python side, it will throw an exception when you take a union
> of an int and long, but it is pretty straightforward to loosen that
> requirement: https://github.com/apache/iceberg-python/pull/1283
> 
>
> Kind regards,
> Fokko
>
> Op do 31 okt 2024 om 19:00 schreef Russell Spitzer <
> russell.spit...@gmail.com>:
>
>> I'm in favor of 1 since previously these inputs would have thrown an
>> exception that wasn't really that helpful.
>>
>> @Test
>> public void testDowncastLongToInt() {
>>   Schema currentSchema = new Schema(required(1, "aCol", LongType.get()));
>>   Schema newSchema = new Schema(required(1, "aCol", IntegerType.get()));
>>
>>   assertThatThrownBy(() -> new SchemaUpdate(currentSchema, 
>> 1).unionByNameWith(newSchema).apply());
>> }
>>
>> I think removing states in which the API would fail is good and although
>> we didn't document it exactly this way before, I'm having a hard time
>> thinking of a case in which throwing an exception here would have been
>> preferable to a noop? Does anyone else have strong feelings around this?
>>
>>
>> On Thu, Oct 31, 2024 at 12:40 PM Rocco Varela 
>> wrote:
>>
>>> Hi everyone,
>>>
>>> Apologize if this is landing twice, my first attempt got lost somewhere
>>> in transit :)
>>>
>>> I have a PR that attempts to address
>>> https://github.com/apache/iceberg/issues/4849, basically adding logic
>>> to ignore downcasting column types when "mergeSchema" is set when an
>>> existing column type is long and the new schema has an int type for the
>>> same column.
>>>
>>> My solution involves updates to UnionByNameVisitor and this may end up
>>> changing the behavior of our public api in a way that hasn't previously
>>> been documented.
>>>
>>> A question raised during the review is whether we should do one of the
>>> following:
>>>
>>>1. Update our docs in UpdateSchema.unionByNameWith
>>> and callout something like
>>>"We ignore differences in type if the new type is narrower than the
>>>existing type", or
>>>2. We add a new api UpdateSchema.unionByNameWith(Schema newSchema,
>>>boolean ignoreTypeNarrowing)
>>>
>>>
>>> Any feedback would be appreciated, thanks for your time.
>>>
>>> Cheers,
>>>
>>> --Rocco
>>>
>>
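
To make the two options above concrete, here is a minimal sketch against the
public UpdateSchema API, assuming an already-loaded table whose existing
"aCol" column is long. The commented-out boolean overload is the hypothetical
option-2 signature and does not exist in the API today.

import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.types.Types;

import static org.apache.iceberg.types.Types.NestedField.required;

public class UnionByNameSketch {
  static void mergeNarrowerSchema(Table table) {
    // The incoming schema declares "aCol" as int, narrower than the existing long.
    Schema newSchema = new Schema(required(1, "aCol", Types.IntegerType.get()));

    // Option 1 (documented behavior change): narrowing differences are ignored,
    // so the int column is treated as a no-op instead of failing the update.
    table.updateSchema().unionByNameWith(newSchema).commit();

    // Option 2 (hypothetical overload, not in the API today): make the choice
    // explicit at the call site.
    // table.updateSchema().unionByNameWith(newSchema, /* ignoreTypeNarrowing */ true).commit();
  }
}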


Re: [VOTE] Release Apache Iceberg 1.7.0 RC0

2024-11-01 Thread Steven Wu
+1 (binding)

Verified signature, checksum, license. Did Flink SQL local testing with the
runtime jar.

Didn't run build because Azure FileIO testing requires Docker environment.

On Fri, Nov 1, 2024 at 5:02 AM Fokko Driesprong  wrote:

> Thanks Russel for running this release!
>
> +1 (binding)
>
> Checked signatures, checksum, licenses and did some local testing.
>
> Kind regards,
> Fokko
>
>
> Op do 31 okt 2024 om 17:47 schreef Russell Spitzer <
> russell.spit...@gmail.com>:
>
>> @Manu Zhang  You are definitely right, I'll get
>> that in before I do docs release. If you don't mind I'll skip it in the
>> source though so I don't have to do another full release.
>>
>> On Thu, Oct 31, 2024 at 7:35 AM Jean-Baptiste Onofré 
>> wrote:
>>
>>> +1 (non binding)
>>>
>>> I checked the LICENSE and NOTICE, and they both look good to me (the
>>> same as in previous releases), so not a blocker for me.
>>>
>>> I also checked:
>>> - the planned Avro readers used in both Flink and Spark, they are
>>> actually used
>>> - Signature and hash are good
>>> - No binary file found in the source distribution
>>> - ASF header is present in all expected file
>>> - Build is OK
>>> - Tested using Spark SQL with JDBC Catalog and Apache Polaris without
>>> problem
>>>
>>> Thanks !
>>>
>>> Regards
>>> JB
>>>
>>> On Wed, Oct 30, 2024 at 11:06 PM Russell Spitzer
>>>  wrote:
>>> >
>>> > Hey Y'all,
>>> >
>>> > I propose that we release the following RC as the official Apache
>>> Iceberg 1.7.0 release.
>>> >
>>> > The commit ID is 91e04c9c88b63dc01d6c8e69dfdc8cd27ee811cc
>>> > * This corresponds to the tag: apache-iceberg-1.7.0-rc0
>>> > * https://github.com/apache/iceberg/commits/apache-iceberg-1.7.0-rc0
>>> > *
>>> https://github.com/apache/iceberg/tree/91e04c9c88b63dc01d6c8e69dfdc8cd27ee811cc
>>> >
>>> > The release tarball, signature, and checksums are here:
>>> > *
>>> https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-1.7.0-rc0
>>> >
>>> > You can find the KEYS file here:
>>> > * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>>> >
>>> > Convenience binary artifacts are staged on Nexus. The Maven repository
>>> URL is:
>>> > * https://repository.apache.org/content/repositories/orgapacheiceberg-
>>> /
>>> >
>>> > Please download, verify, and test.
>>> >
>>> > Please vote in the next 72 hours.
>>> >
>>> > [ ] +1 Release this as Apache Iceberg 1.7.0
>>> > [ ] +0
>>> > [ ] -1 Do not release this because...
>>> >
>>> > Only PMC members have binding votes, but other community members are
>>> encouraged to cast
>>> > non-binding votes. This vote will pass if there are 3 binding +1 votes
>>> and more binding
>>> > +1 votes than -1 votes.
>>>
>>


Re: [DISCUSS] - Deprecate Equality Deletes

2024-11-01 Thread Steven Wu
Shani,

That is a good point. It is certainly a limitation for the Flink job to
track the inverted index internally (which is what I had in mind). It can't
be shared/synchronized with other Flink jobs or other engines writing to
the same table.

Thanks,
Steven

On Fri, Nov 1, 2024 at 10:50 AM Shani Elharrar 
wrote:

> Even if Flink can create this state, it would have to be maintained
> against the Iceberg table, we wouldn't like duplicates (keys) if other
> systems / users update the table (e.g manual insert / updates using DML).
>
> Shani.
>
> On 1 Nov 2024, at 18:32, Steven Wu  wrote:
>
> 
> > Add support for inverted indexes to reduce the cost of position lookup.
> This is fairly tricky to implement for streaming use cases without an
> external system.
>
> Anton, that is also what I was saying earlier. In Flink, the inverted
> index of (key, committed data files) can be tracked in Flink state.
>
> On Fri, Nov 1, 2024 at 2:16 AM Anton Okolnychyi 
> wrote:
>
>> I was a bit skeptical when we were adding equality deletes, but nothing
>> beats their performance during writes. We have to find an alternative
>> before deprecating.
>>
>> We are doing a lot of work to improve streaming, like reducing the cost
>> of commits, enabling a large (potentially infinite) number of snapshots,
>> changelog reads, and so on. It is a project goal to excel in streaming.
>>
>> I was going to focus on equality deletes after completing the DV work. I
>> believe we have these options:
>>
>> - Revisit the existing design of equality deletes (e.g. add more
>> restrictions, improve compaction, offer new writers).
>> - Standardize on the view-based approach [1] to handle streaming upserts
>> and CDC use cases, potentially making this part of the spec.
>> - Add support for inverted indexes to reduce the cost of position lookup.
>> This is fairly tricky to implement for streaming use cases without an
>> external system. Our runtime filtering in Spark today is equivalent to
>> looking up positions in an inverted index represented by another Iceberg
>> table. That may still not be enough for some streaming use cases.
>>
>> [1] - https://www.tabular.io/blog/hello-world-of-cdc/
>>
>> - Anton
>>
>> чт, 31 жовт. 2024 р. о 21:31 Micah Kornfield 
>> пише:
>>
>>> I agree that equality deletes have their place in streaming.  I think
>>> the ultimate decision here is how opinionated Iceberg wants to be on its
>>> use-cases.  If it really wants to stick to its origins of "slow moving
>>> data", then removing equality deletes would be inline with this.  I think
>>> the other high level question is how much we allow for partially compatible
>>> features (the row lineage use-case feature was explicitly approved
>>> excluding equality deletes, and people seemed OK with it at the time.  If
>>> all features need to work together, then maybe we need to rethink the
>>> design here so it can be forward compatible with equality deletes).
>>>
>>> I think one issue with equality deletes as stated in the spec is that
>>> they are overly broad.  I'd be interested if people have any use cases that
>>> differ, but I think one way of narrowing (and probably a necessary building
>>> block for building something better)  the specification scope on equality
>>> deletes is to focus on upsert/Streaming deletes.  Two proposals in this
>>> regard are:
>>>
>>> 1.  Require that equality deletes can only correspond to unique
>>> identifiers for the table.
>>> 2.  Consider requiring that for equality deletes on partitioned tables,
>>> that the primary key must contain a partition column (I believe Flink at
>>> least already does this).  It is less clear to me that this would meet all
>>> existing use-cases.  But having this would allow for better incremental
>>> data-structures, which could then be partition based.
>>>
>>> Narrow scope to unique identifiers would allow for further building
>>> blocks already mentioned, like a secondary index (possible via LSM tree),
>>> that would allow for better performance overall.
>>>
>>> I generally agree with the sentiment that we shouldn't deprecate them
>>> until there is a viable replacement.  With all due respect to my employer,
>>> let's not fall into the Google trap [1] :)
>>>
>>> Cheers,
>>> Micah
>>>
>>> [1] https://goomics.net/50/
>>>
>>>
>>>
>>> On Thu, Oct 31, 2024 at 12:35 PM Alexander Jo 
>>> wrote:
>>>
 Hey all,

 Just to throw my 2 cents in, I agree with Steven and others that we do
 need some kind of replacement before deprecating equality deletes.
 They certainly have their problems, and do significantly increase
 complexity as they are now, but the writing of position deletes is too
 expensive for certain pipelines.

 We've been investigating using equality deletes for some of our
 workloads at Starburst, the key advantage we were hoping to leverage is
 cheap, effectively random access lookup deletes.
 Say you have a UUID column that's unique in a table and want to de

Re: [DISCUSS] - Deprecate Equality Deletes

2024-11-01 Thread Shani Elharrar
Even if Flink can create this state, it would have to be maintained against
the Iceberg table, we wouldn't like duplicates (keys) if other systems /
users update the table (e.g manual insert / updates using DML).

Shani.

On 1 Nov 2024, at 18:32, Steven Wu  wrote:

Re: [DISCUSS] - Deprecate Equality Deletes

2024-11-01 Thread Steven Wu
Fundamentally, it is very difficult to write position deletes with
concurrent writers and conflicts, even for batch jobs, as the inverted index
may become invalid/stale.

The position deletes are created during the write phase. But conflicts are
only detected at the commit stage. I assume the batch job should fail in
this case.

On Fri, Nov 1, 2024 at 10:57 AM Steven Wu  wrote:

> Shani,
>
> That is a good point. It is certainly a limitation for the Flink job to
> track the inverted index internally (which is what I had in mind). It can't
> be shared/synchronized with other Flink jobs or other engines writing to
> the same table.
>
> Thanks,
> Steven
>
> On Fri, Nov 1, 2024 at 10:50 AM Shani Elharrar 
> wrote:
>
>> Even if Flink can create this state, it would have to be maintained
>> against the Iceberg table, we wouldn't like duplicates (keys) if other
>> systems / users update the table (e.g manual insert / updates using DML).
>>
>> Shani.
>>
>> On 1 Nov 2024, at 18:32, Steven Wu  wrote:
>>
>> 
>> > Add support for inverted indexes to reduce the cost of position lookup.
>> This is fairly tricky to implement for streaming use cases without an
>> external system.
>>
>> Anton, that is also what I was saying earlier. In Flink, the inverted
>> index of (key, committed data files) can be tracked in Flink state.
>>
>> On Fri, Nov 1, 2024 at 2:16 AM Anton Okolnychyi 
>> wrote:
>>
>>> I was a bit skeptical when we were adding equality deletes, but nothing
>>> beats their performance during writes. We have to find an alternative
>>> before deprecating.
>>>
>>> We are doing a lot of work to improve streaming, like reducing the cost
>>> of commits, enabling a large (potentially infinite) number of snapshots,
>>> changelog reads, and so on. It is a project goal to excel in streaming.
>>>
>>> I was going to focus on equality deletes after completing the DV work. I
>>> believe we have these options:
>>>
>>> - Revisit the existing design of equality deletes (e.g. add more
>>> restrictions, improve compaction, offer new writers).
>>> - Standardize on the view-based approach [1] to handle streaming upserts
>>> and CDC use cases, potentially making this part of the spec.
>>> - Add support for inverted indexes to reduce the cost of position
>>> lookup. This is fairly tricky to implement for streaming use cases without
>>> an external system. Our runtime filtering in Spark today is equivalent to
>>> looking up positions in an inverted index represented by another Iceberg
>>> table. That may still not be enough for some streaming use cases.
>>>
>>> [1] - https://www.tabular.io/blog/hello-world-of-cdc/
>>>
>>> - Anton
>>>
>>> чт, 31 жовт. 2024 р. о 21:31 Micah Kornfield 
>>> пише:
>>>
 I agree that equality deletes have their place in streaming.  I think
 the ultimate decision here is how opinionated Iceberg wants to be on its
 use-cases.  If it really wants to stick to its origins of "slow moving
 data", then removing equality deletes would be inline with this.  I think
 the other high level question is how much we allow for partially compatible
 features (the row lineage use-case feature was explicitly approved
 excluding equality deletes, and people seemed OK with it at the time.  If
 all features need to work together, then maybe we need to rethink the
 design here so it can be forward compatible with equality deletes).

 I think one issue with equality deletes as stated in the spec is that
 they are overly broad.  I'd be interested if people have any use cases that
 differ, but I think one way of narrowing (and probably a necessary building
 block for building something better)  the specification scope on equality
 deletes is to focus on upsert/Streaming deletes.  Two proposals in this
 regard are:

 1.  Require that equality deletes can only correspond to unique
 identifiers for the table.
 2.  Consider requiring that for equality deletes on partitioned tables,
 that the primary key must contain a partition column (I believe Flink at
 least already does this).  It is less clear to me that this would meet all
 existing use-cases.  But having this would allow for better incremental
 data-structures, which could then be partition based.

 Narrow scope to unique identifiers would allow for further building
 blocks already mentioned, like a secondary index (possible via LSM tree),
 that would allow for better performance overall.

 I generally agree with the sentiment that we shouldn't deprecate them
 until there is a viable replacement.  With all due respect to my employer,
 let's not fall into the Google trap [1] :)

 Cheers,
 Micah

 [1] https://goomics.net/50/



 On Thu, Oct 31, 2024 at 12:35 PM Alexander Jo <
 alex...@starburstdata.com> wrote:

> Hey all,
>
> Just to throw my 2 cents in, I agree with Steven and others that we do
> need

Re: [DISCUSS] - Deprecate Equality Deletes

2024-11-01 Thread Shani Elharrar
I understand how it makes sense for batch jobs, but it hurts streaming jobs:
equality deletes work much better for streaming (which has strict SLAs on
delays), and to reduce the performance penalty, systems can rewrite the
equality deletes to positional deletes.

Shani.

On 1 Nov 2024, at 20:06, Steven Wu  wrote:

Re: [VOTE] Release Apache Iceberg 1.7.0 RC0

2024-11-01 Thread Fokko Driesprong
Thanks Russell for running this release!

+1 (binding)

Checked signatures, checksum, licenses and did some local testing.

Kind regards,
Fokko


Op do 31 okt 2024 om 17:47 schreef Russell Spitzer <
russell.spit...@gmail.com>:

> @Manu Zhang  You are definitely right, I'll get
> that in before I do docs release. If you don't mind I'll skip it in the
> source though so I don't have to do another full release.
>
> On Thu, Oct 31, 2024 at 7:35 AM Jean-Baptiste Onofré 
> wrote:
>
>> +1 (non binding)
>>
>> I checked the LICENSE and NOTICE, and they both look good to me (the
>> same as in previous releases), so not a blocker for me.
>>
>> I also checked:
>> - the planned Avro readers used in both Flink and Spark, they are
>> actually used
>> - Signature and hash are good
>> - No binary file found in the source distribution
>> - ASF header is present in all expected file
>> - Build is OK
>> - Tested using Spark SQL with JDBC Catalog and Apache Polaris without
>> problem
>>
>> Thanks !
>>
>> Regards
>> JB
>>
>> On Wed, Oct 30, 2024 at 11:06 PM Russell Spitzer
>>  wrote:
>> >
>> > Hey Y'all,
>> >
>> > I propose that we release the following RC as the official Apache
>> Iceberg 1.7.0 release.
>> >
>> > The commit ID is 91e04c9c88b63dc01d6c8e69dfdc8cd27ee811cc
>> > * This corresponds to the tag: apache-iceberg-1.7.0-rc0
>> > * https://github.com/apache/iceberg/commits/apache-iceberg-1.7.0-rc0
>> > *
>> https://github.com/apache/iceberg/tree/91e04c9c88b63dc01d6c8e69dfdc8cd27ee811cc
>> >
>> > The release tarball, signature, and checksums are here:
>> > *
>> https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-1.7.0-rc0
>> >
>> > You can find the KEYS file here:
>> > * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>> >
>> > Convenience binary artifacts are staged on Nexus. The Maven repository
>> URL is:
>> > * https://repository.apache.org/content/repositories/orgapacheiceberg-
>> /
>> >
>> > Please download, verify, and test.
>> >
>> > Please vote in the next 72 hours.
>> >
>> > [ ] +1 Release this as Apache Iceberg 1.7.0
>> > [ ] +0
>> > [ ] -1 Do not release this because...
>> >
>> > Only PMC members have binding votes, but other community members are
>> encouraged to cast
>> > non-binding votes. This vote will pass if there are 3 binding +1 votes
>> and more binding
>> > +1 votes than -1 votes.
>>
>


Re: [DISCUSS] - Deprecate Equality Deletes

2024-11-01 Thread Steven Wu
> Add support for inverted indexes to reduce the cost of position lookup.
This is fairly tricky to implement for streaming use cases without an
external system.

Anton, that is also what I was saying earlier. In Flink, the inverted index
of (key, committed data files) can be tracked in Flink state.
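
For illustration, a minimal sketch of the kind of keyed state this would use,
assuming an upsert stream of (key, newly written data file) pairs keyed by the
table's identifier field. It only shows where the inverted index would live;
emitting actual position deletes, checkpoint alignment with Iceberg commits,
and conflict handling are all omitted.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class InvertedIndexSketch
    extends KeyedProcessFunction<String, Tuple2<String, String>, String> {

  // Per-key inverted index entry: the committed data file that last wrote this
  // key. Lives in Flink keyed state, so it is checkpointed with the job.
  private transient ValueState<String> lastCommittedFile;

  @Override
  public void open(Configuration parameters) {
    lastCommittedFile = getRuntimeContext().getState(
        new ValueStateDescriptor<>("last-committed-file", String.class));
  }

  @Override
  public void processElement(Tuple2<String, String> upsert, Context ctx, Collector<String> out)
      throws Exception {
    String previousFile = lastCommittedFile.value();
    if (previousFile != null) {
      // The previous copy of this key lives in previousFile, so a position
      // delete (rather than an equality delete) could target that file.
      out.collect("delete key " + upsert.f0 + " from " + previousFile);
    }
    // Remember where the new copy of the key was written.
    lastCommittedFile.update(upsert.f1);
  }
}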

On Fri, Nov 1, 2024 at 2:16 AM Anton Okolnychyi 
wrote:

> I was a bit skeptical when we were adding equality deletes, but nothing
> beats their performance during writes. We have to find an alternative
> before deprecating.
>
> We are doing a lot of work to improve streaming, like reducing the cost
> of commits, enabling a large (potentially infinite) number of snapshots,
> changelog reads, and so on. It is a project goal to excel in streaming.
>
> I was going to focus on equality deletes after completing the DV work. I
> believe we have these options:
>
> - Revisit the existing design of equality deletes (e.g. add more
> restrictions, improve compaction, offer new writers).
> - Standardize on the view-based approach [1] to handle streaming upserts
> and CDC use cases, potentially making this part of the spec.
> - Add support for inverted indexes to reduce the cost of position lookup.
> This is fairly tricky to implement for streaming use cases without an
> external system. Our runtime filtering in Spark today is equivalent to
> looking up positions in an inverted index represented by another Iceberg
> table. That may still not be enough for some streaming use cases.
>
> [1] - https://www.tabular.io/blog/hello-world-of-cdc/
>
> - Anton
>
> чт, 31 жовт. 2024 р. о 21:31 Micah Kornfield  пише:
>
>> I agree that equality deletes have their place in streaming.  I think the
>> ultimate decision here is how opinionated Iceberg wants to be on its
>> use-cases.  If it really wants to stick to its origins of "slow moving
>> data", then removing equality deletes would be inline with this.  I think
>> the other high level question is how much we allow for partially compatible
>> features (the row lineage use-case feature was explicitly approved
>> excluding equality deletes, and people seemed OK with it at the time.  If
>> all features need to work together, then maybe we need to rethink the
>> design here so it can be forward compatible with equality deletes).
>>
>> I think one issue with equality deletes as stated in the spec is that
>> they are overly broad.  I'd be interested if people have any use cases that
>> differ, but I think one way of narrowing (and probably a necessary building
>> block for building something better)  the specification scope on equality
>> deletes is to focus on upsert/Streaming deletes.  Two proposals in this
>> regard are:
>>
>> 1.  Require that equality deletes can only correspond to unique
>> identifiers for the table.
>> 2.  Consider requiring that for equality deletes on partitioned tables,
>> that the primary key must contain a partition column (I believe Flink at
>> least already does this).  It is less clear to me that this would meet all
>> existing use-cases.  But having this would allow for better incremental
>> data-structures, which could then be partition based.
>>
>> Narrow scope to unique identifiers would allow for further building
>> blocks already mentioned, like a secondary index (possible via LSM tree),
>> that would allow for better performance overall.
>>
>> I generally agree with the sentiment that we shouldn't deprecate them
>> until there is a viable replacement.  With all due respect to my employer,
>> let's not fall into the Google trap [1] :)
>>
>> Cheers,
>> Micah
>>
>> [1] https://goomics.net/50/
>>
>>
>>
>> On Thu, Oct 31, 2024 at 12:35 PM Alexander Jo 
>> wrote:
>>
>>> Hey all,
>>>
>>> Just to throw my 2 cents in, I agree with Steven and others that we do
>>> need some kind of replacement before deprecating equality deletes.
>>> They certainly have their problems, and do significantly increase
>>> complexity as they are now, but the writing of position deletes is too
>>> expensive for certain pipelines.
>>>
>>> We've been investigating using equality deletes for some of our
>>> workloads at Starburst, the key advantage we were hoping to leverage is
>>> cheap, effectively random access lookup deletes.
>>> Say you have a UUID column that's unique in a table and want to delete a
>>> row by UUID. With position deletes each delete is expensive without an
>>> index on that UUID.
>>> With equality deletes each delete is cheap; reads and compaction are
>>> expensive, but when updates are frequent and reads are sporadic that's a
>>> reasonable tradeoff.
>>>
>>> Pretty much what Jason and Steven have already said.
>>>
>>> Maybe there are some incremental improvements on equality deletes or
>>> tips from similar systems that might alleviate some of their problems?
>>>
>>> - Alex Jo
>>>
>>> On Thu, Oct 31, 2024 at 10:58 AM Steven Wu  wrote:
>>>
 We probably all agree with the downside of equality deletes: it
 postpones all the work on the read path.

 

Re: [DISCUSS] - Deprecate Equality Deletes

2024-11-01 Thread Anton Okolnychyi
Steven, do you have any pointers? In particular, I am curious to learn
where the state will be stored, whether it will be distributed, the lookup
cost, how to incrementally maintain that index, etc.

- Anton

пт, 1 лист. 2024 р. о 17:32 Steven Wu  пише:

> > Add support for inverted indexes to reduce the cost of position lookup.
> This is fairly tricky to implement for streaming use cases without an
> external system.
>
> Anton, that is also what I was saying earlier. In Flink, the inverted
> index of (key, committed data files) can be tracked in Flink state.
>
> On Fri, Nov 1, 2024 at 2:16 AM Anton Okolnychyi 
> wrote:
>
>> I was a bit skeptical when we were adding equality deletes, but nothing
>> beats their performance during writes. We have to find an alternative
>> before deprecating.
>>
>> We are doing a lot of work to improve streaming, like reducing the cost
>> of commits, enabling a large (potentially infinite) number of snapshots,
>> changelog reads, and so on. It is a project goal to excel in streaming.
>>
>> I was going to focus on equality deletes after completing the DV work. I
>> believe we have these options:
>>
>> - Revisit the existing design of equality deletes (e.g. add more
>> restrictions, improve compaction, offer new writers).
>> - Standardize on the view-based approach [1] to handle streaming upserts
>> and CDC use cases, potentially making this part of the spec.
>> - Add support for inverted indexes to reduce the cost of position lookup.
>> This is fairly tricky to implement for streaming use cases without an
>> external system. Our runtime filtering in Spark today is equivalent to
>> looking up positions in an inverted index represented by another Iceberg
>> table. That may still not be enough for some streaming use cases.
>>
>> [1] - https://www.tabular.io/blog/hello-world-of-cdc/
>>
>> - Anton
>>
>> чт, 31 жовт. 2024 р. о 21:31 Micah Kornfield 
>> пише:
>>
>>> I agree that equality deletes have their place in streaming.  I think
>>> the ultimate decision here is how opinionated Iceberg wants to be on its
>>> use-cases.  If it really wants to stick to its origins of "slow moving
>>> data", then removing equality deletes would be inline with this.  I think
>>> the other high level question is how much we allow for partially compatible
>>> features (the row lineage use-case feature was explicitly approved
>>> excluding equality deletes, and people seemed OK with it at the time.  If
>>> all features need to work together, then maybe we need to rethink the
>>> design here so it can be forward compatible with equality deletes).
>>>
>>> I think one issue with equality deletes as stated in the spec is that
>>> they are overly broad.  I'd be interested if people have any use cases that
>>> differ, but I think one way of narrowing (and probably a necessary building
>>> block for building something better)  the specification scope on equality
>>> deletes is to focus on upsert/Streaming deletes.  Two proposals in this
>>> regard are:
>>>
>>> 1.  Require that equality deletes can only correspond to unique
>>> identifiers for the table.
>>> 2.  Consider requiring that for equality deletes on partitioned tables,
>>> that the primary key must contain a partition column (I believe Flink at
>>> least already does this).  It is less clear to me that this would meet all
>>> existing use-cases.  But having this would allow for better incremental
>>> data-structures, which could then be partition based.
>>>
>>> Narrow scope to unique identifiers would allow for further building
>>> blocks already mentioned, like a secondary index (possible via LSM tree),
>>> that would allow for better performance overall.
>>>
>>> I generally agree with the sentiment that we shouldn't deprecate them
>>> until there is a viable replacement.  With all due respect to my employer,
>>> let's not fall into the Google trap [1] :)
>>>
>>> Cheers,
>>> Micah
>>>
>>> [1] https://goomics.net/50/
>>>
>>>
>>>
>>> On Thu, Oct 31, 2024 at 12:35 PM Alexander Jo 
>>> wrote:
>>>
 Hey all,

 Just to throw my 2 cents in, I agree with Steven and others that we do
 need some kind of replacement before deprecating equality deletes.
 They certainly have their problems, and do significantly increase
 complexity as they are now, but the writing of position deletes is too
 expensive for certain pipelines.

 We've been investigating using equality deletes for some of our
 workloads at Starburst, the key advantage we were hoping to leverage is
 cheap, effectively random access lookup deletes.
 Say you have a UUID column that's unique in a table and want to delete
 a row by UUID. With position deletes each delete is expensive without an
 index on that UUID.
 With equality deletes each delete is cheap and while reads/compaction
 is expensive but when updates are frequent and reads are sporadic that's a
 reasonable tradeoff.

 Pretty much what Jason and Steven have already 

Re: [DISCUSS] Partial Metadata Loading

2024-11-01 Thread Dmitri Bourlatchkov
Hello All,

This is an interesting discussion and I'd like to offer my perspective.

When a REST Catalog is involved, the metadata is loaded and modified via
the catalog API. So control over the metadata is delegated to the catalog.

I'd argue that in this situation, catalogs should have the flexibility to
optimize metadata operations internally. In other words, if a particular
use case does not require access to some pieces of metadata, the catalog
should not have to provide them. For example, querying a particular snapshot
does not require knowledge of other snapshots.

I understand that the current metadata representation evolved to support
certain use cases. Still, as far as API v2 is concerned, would it have to
match what was happening in API v1? I think this is an opportunity to
design API v2 in a more flexible and extensible manner.

On the point of complexity (and I think adoption concerns are but a
consequence of complexity): I believe that if the API is modelled to supply the
information required for particular use cases, as opposed to representing a
particular state of the table as a whole, the complexity can be reduced.

In other words, I propose to make API v2 such that it focuses on what
clients (engines) require for operation as opposed to what the table
metadata has in its totality at any moment in time. In a way, API v2
outputs do not have to be exact chunks of metadata carved out of physical
files, but may be defined differently, linking to server-side metadata only
conceptually.

More specifically, if the client queries a table, it declares this intent
in API and receives the information required for the query. The client
should be prepared to receive more information than it needs (in case the
server does not support metadata slicing), but that should not add
complexity as discarding unused data should not be hard if the data
structures allow for slicing. In effect, actual runtime efficiencies will
be defined by the combined efforts of the client (engine) and catalog. At
the same time, neither the client nor the catalog is required to implement
advanced use cases.
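
As a purely hypothetical sketch of that shape (none of these names exist in the
REST spec or any implementation), the client names the pieces of metadata its
use case needs and tolerates a server that returns more than it asked for:

import java.util.Map;
import java.util.Set;

interface PartialMetadataClient {

  // Hypothetical "declare intent" load, e.g.
  // load("db.tbl", Set.of("schema", "current-snapshot"))
  TableSlice load(String tableName, Set<String> requestedParts);

  // Hypothetical change check: return only the latest snapshot id so the
  // client can compare it against point X without fetching full metadata.
  long latestSnapshotId(String tableName);

  // Whatever subset the server chose to return; a server that cannot slice
  // may return everything, and the client simply ignores the extra parts.
  record TableSlice(Map<String, Object> parts) {
    boolean has(String part) {
      return parts.containsKey(part);
    }
  }
}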

Similarly, if the client is only interested in knowing whether a table has
changed since point X (time or snapshot), that is also expressed in the API
request. It may be a separate endpoint, or it may be possible to implement
it as, for example, returning the latest snapshot ID.

I understand, there are use cases where engines want to operate directly on
metadata files in storage. That is fine too, IMO, I am not proposing to
change the Iceberg file format spec. At the same time catalogs do not have
to be limited to fetching data for the REST API from those files. Catalogs
may choose to have additional storage partitioned and indexed differently
than plain files.

This is all very high level, of course, and it requires a lot of additional
thinking about how to design API v2, but I believe we could achieve a more
supportable and adoptable API v2 this way.

Cheers,
Dmitri.

On Thu, Oct 31, 2024 at 2:41 PM Daniel Weeks  wrote:

> Eric,
>
> With respect to the credential endpoint, I believe there is important
> context missing that probably should have been captured in the doc.  The
> credential endpoint is unlike other use cases because the fundamental issue
> is that refresh is an operation that happens across distributed workers.
> Workers in spark/flink/trino/etc. all need to refresh credentials for long
> running operations and results in orders of magnitude higher request rates
> than a table load.  We originally expected to use the table load even for
> this, but the concern was it would effectively DDOS the catalog.
>
> If there are specific cases that have solid justification like the above,
> I think we should add specific endpoints, but those should be used
> sparingly.
>
> > In other words -- if it's true that "partial metadata doesn't align with
> primary use cases", it seems true that "full metadata doesn't align with 
> *almost
> all* use cases".
>
> I don't find this argument compelling.  Are you saying that any case where
> everything from a response isn't fully used, you should optimize that
> request so that a client can only request the specific information it will
> use?  Generally, we want a surface area that can address most use cases and
> as a consequence, not every request is going to perfectly match the
> specific needs of the client.
>
>  -Dan
>
>
> On Thu, Oct 31, 2024 at 11:03 AM Eric Maynard 
> wrote:
>
>> Thanks for this breakdown Dan.
>>
>> I share your concerns about the complexity this might impose on the
>> client. On some of your other notes, I have some thoughts below:
>>
>>
>> Several Apache Polaris (Incubating) committers were in the recent sync on
>> this proposal, so I want to share one perspective related to the last point
>> re: *Partial metadata impedes adoption*.
>>
>> Personally, I feel better about the prospect of Polaris supporting a
>> flexible loadTableV2-type API as opposed to having to keep adding 

Re: [VOTE] Release Apache Iceberg 1.7.0 RC0

2024-11-01 Thread Daniel Weeks
+1 (binding)

Verified sigs/sums/license/build/test

Also did some manual verification using spark and everything checks out.

-Dan

On Fri, Nov 1, 2024 at 10:52 AM Steven Wu  wrote:

> +1 (binding)
>
> Verified signature, checksum, license. Did Flink SQL local testing with
> the runtime jar.
>
> Didn't run build because Azure FileIO testing requires Docker environment.
>
> On Fri, Nov 1, 2024 at 5:02 AM Fokko Driesprong  wrote:
>
>> Thanks Russel for running this release!
>>
>> +1 (binding)
>>
>> Checked signatures, checksum, licenses and did some local testing.
>>
>> Kind regards,
>> Fokko
>>
>>
>> Op do 31 okt 2024 om 17:47 schreef Russell Spitzer <
>> russell.spit...@gmail.com>:
>>
>>> @Manu Zhang  You are definitely right, I'll
>>> get that in before I do docs release. If you don't mind I'll skip it in the
>>> source though so I don't have to do another full release.
>>>
>>> On Thu, Oct 31, 2024 at 7:35 AM Jean-Baptiste Onofré 
>>> wrote:
>>>
 +1 (non binding)

 I checked the LICENSE and NOTICE, and they both look good to me (the
 same as in previous releases), so not a blocker for me.

 I also checked:
 - the planned Avro readers used in both Flink and Spark, they are
 actually used
 - Signature and hash are good
 - No binary file found in the source distribution
 - ASF header is present in all expected file
 - Build is OK
 - Tested using Spark SQL with JDBC Catalog and Apache Polaris without
 problem

 Thanks !

 Regards
 JB

 On Wed, Oct 30, 2024 at 11:06 PM Russell Spitzer
  wrote:
 >
 > Hey Y'all,
 >
 > I propose that we release the following RC as the official Apache
 Iceberg 1.7.0 release.
 >
 > The commit ID is 91e04c9c88b63dc01d6c8e69dfdc8cd27ee811cc
 > * This corresponds to the tag: apache-iceberg-1.7.0-rc0
 > * https://github.com/apache/iceberg/commits/apache-iceberg-1.7.0-rc0
 > *
 https://github.com/apache/iceberg/tree/91e04c9c88b63dc01d6c8e69dfdc8cd27ee811cc
 >
 > The release tarball, signature, and checksums are here:
 > *
 https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-1.7.0-rc0
 >
 > You can find the KEYS file here:
 > * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
 >
 > Convenience binary artifacts are staged on Nexus. The Maven
 repository URL is:
 > *
 https://repository.apache.org/content/repositories/orgapacheiceberg-
 /
 >
 > Please download, verify, and test.
 >
 > Please vote in the next 72 hours.
 >
 > [ ] +1 Release this as Apache Iceberg 1.7.0
 > [ ] +0
 > [ ] -1 Do not release this because...
 >
 > Only PMC members have binding votes, but other community members are
 encouraged to cast
 > non-binding votes. This vote will pass if there are 3 binding +1
 votes and more binding
 > +1 votes than -1 votes.

>>>


Re: [DISCUSS] REST: OAuth2 Authentication Guide

2024-11-01 Thread Dmitri Bourlatchkov
Hi Christian,

Thanks for pushing this initiative forward. I think it is quite useful.

I added some rather minor comments to the doc.

One bigger aspect of this, I guess, is that the doc currently talks about
what clients should do. This is important, of course. However, if a client
is able to obtain an access token via OAuth2 flows, it does not
automatically mean that any catalog implementation will be able to accept
such a token. What do you think about adding a section to clarify that
servers are free to support integrations with the IdPs of their choosing, that
this is not guaranteed, and that users should check the documentation of the
catalog implementation for what exactly is supported with respect to
OAuth2?

WDYT?

Thanks,
Dmitri.
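
As a concrete point of comparison for such a section, here is a minimal sketch
of the client-side difference between the client-credential flow and a token
obtained out of band (e.g. via an authorization-code or device flow against an
IdP). The catalog URI and secrets are placeholders; the "uri", "credential",
and "token" property names follow the documented REST catalog options, but
readers should check the documentation of the Iceberg version and catalog
implementation they run, which is exactly the caveat suggested above.

import java.util.Map;
import org.apache.iceberg.rest.RESTCatalog;

public class RestAuthSketch {
  public static void main(String[] args) {
    // Client-credential flow handled by the client itself -- what most clients
    // implement today and what is unsuitable for human principals.
    RESTCatalog clientCredentials = new RESTCatalog();
    clientCredentials.initialize("prod", Map.of(
        "uri", "https://catalog.example.com/iceberg",
        "credential", "my-client-id:my-client-secret"));

    // Bearer token obtained outside the client via a richer OAuth2 flow and
    // passed in directly; the server must be willing to accept it.
    RESTCatalog externalToken = new RESTCatalog();
    externalToken.initialize("prod", Map.of(
        "uri", "https://catalog.example.com/iceberg",
        "token", "eyJhbGciOi...placeholder..."));
  }
}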

On Fri, Oct 25, 2024 at 12:25 PM Christian Thiel
 wrote:

> Thanks everyone for your Feedback in the Catalog Sync and afterwards!
>
> I tried to address most of the Feedback and updated the Document.
>
>
>
>- The updated Document can be found here [1]:
>
>
> https://docs.google.com/document/d/1buW9PCNoHPeP7Br5_vZRTU-_3TExwLx6bs075gi94xc/edit?usp=sharing
>- It is linked now to an improvement (#11286 [2]) which also contains
>much of the Motivation which does not need to be part of the docs itself.
>
>
>
> I would ask all interested parties to leave comments in the google doc. If
> further clarification is needed, we can also discuss it again in the
> catalog sync.
>
>
>
> [1]
> https://docs.google.com/document/d/1buW9PCNoHPeP7Br5_vZRTU-_3TExwLx6bs075gi94xc/edit?usp=sharing
>
> [2] https://github.com/apache/iceberg/issues/11286
>
>
>
>
>
>
>
> *From: *Yufei Gu 
> *Date: *Saturday, 12. October 2024 at 12:30
> *To: *dev@iceberg.apache.org 
> *Subject: *Re: [DISCUSS] REST: OAuth2 Authentication Guide
>
> Thanks Christian. Nice write-up! Authentication is essential to a
> production env. It's great to document it well given a lot of people don't
> necessarily have enough OAuth2 knowledge. Looking forward to the doc PRs
> and other client side changes.
>
>
> Yufei
>
>
>
>
>
> On Wed, Sep 18, 2024 at 8:31 AM Dmitri Bourlatchkov
>  wrote:
>
> Hi Christian,
>
>
>
> Very nice proposal. Thanks for putting it together! I added some comments
> to the doc.
>
>
>
> I think it is related to PR #10753 [4], which proposes some foundational
> refactoring to the java REST client to enable further enhancements in
> OAuth2 flows.
>
>
>
> Cheers,
>
> Dmitri.
>
>
>
> [4] https://github.com/apache/iceberg/pull/10753
>
>
>
> On Wed, Sep 18, 2024 at 4:12 AM Christian Thiel
>  wrote:
>
> Dear everyone,
>
>
> the Iceberg REST specification allows for different ways of
> Authentication, OAuth2 is one of them. Until recently the OAuth2 /token
> endpoint was part of the REST-spec together with datatypes required for the
> client-credential flow. Both have since been removed from the spec for
> security reasons [2].
>
> Probably because it was a part of the spec before, clients today typically
> only implement the client-credential flow. This falls short of OAuth2's
> capabilities and is unsuitable for human users. Common IdPs do not
> implement the client-credential flow for principals of human users for good
> reasons.
>
>
>
> To mitigate this problem, we propose an extension of the Iceberg
> documentation in 3 steps. This proposal is neither an extension of the
> Iceberg REST Catalog specification nor OAuth2 itself. The Iceberg REST
> specification already specifies OAuth2 Authentication [3], which includes
> all the flows mentioned in the document of this proposal [1].
>
>
>
> My proposal to go forward is as follows:
>
>1. Use this proposals Google Doc for alignment in the community: [1]
>
>
> https://docs.google.com/document/d/1A6bJfSzkTzDWUIegdckSsoaeFxZl1Qn5htI1jzyBQss/edit?usp=sharing
>Discuss in a catalog sync in 1-2 weeks.
>2. Condense consensus found in Google Doc to .md and add it to docs
>3. Implement additional flows in the iceberg-(java, python, rust ..)
>packages.
>For Java there is already a PR that goes in this direction which could
>use some more attention: https://github.com/apache/iceberg/pull/10753
>For other languages I am not aware of any initiatives.
>4. Encourage clients to allow configuration of new flows for users
>
> Any feedback welcome!
>
> Thanks
> - Christian
>
> [1]:
> https://docs.google.com/document/d/1A6bJfSzkTzDWUIegdckSsoaeFxZl1Qn5htI1jzyBQss/edit?usp=sharing
> [2]:
> https://docs.google.com/document/d/1Xi5MRk8WdBWFC3N_eSmVcrLhk3yu5nJ9x_wC0ec6kVQ
>
> [3]:
> https://github.com/apache/iceberg/blob/ed73ec43dd25c9023069ea1d3381a6d9229be53a/open-api/rest-catalog-open-api.yaml#L61
>
>
>
>


Re: [VOTE] Release Apache Iceberg 1.7.0 RC0

2024-11-01 Thread Amogh Jahagirdar
+1 (binding)

Verified signature/checksum/license/build/test with JDK17

Thanks for driving the release Russell!

Amogh Jahagirdar

On Fri, Nov 1, 2024 at 12:36 PM Daniel Weeks  wrote:

> +1 (binding)
>
> Verified sigs/sums/license/build/test
>
> Also did some manual verification using spark and everything checks out.
>
> -Dan
>
> On Fri, Nov 1, 2024 at 10:52 AM Steven Wu  wrote:
>
>> +1 (binding)
>>
>> Verified signature, checksum, license. Did Flink SQL local testing with
>> the runtime jar.
>>
>> Didn't run build because Azure FileIO testing requires Docker environment.
>>
>> On Fri, Nov 1, 2024 at 5:02 AM Fokko Driesprong  wrote:
>>
>>> Thanks Russel for running this release!
>>>
>>> +1 (binding)
>>>
>>> Checked signatures, checksum, licenses and did some local testing.
>>>
>>> Kind regards,
>>> Fokko
>>>
>>>
>>> Op do 31 okt 2024 om 17:47 schreef Russell Spitzer <
>>> russell.spit...@gmail.com>:
>>>
 @Manu Zhang  You are definitely right, I'll
 get that in before I do docs release. If you don't mind I'll skip it in the
 source though so I don't have to do another full release.

 On Thu, Oct 31, 2024 at 7:35 AM Jean-Baptiste Onofré 
 wrote:

> +1 (non binding)
>
> I checked the LICENSE and NOTICE, and they both look good to me (the
> same as in previous releases), so not a blocker for me.
>
> I also checked:
> - the planned Avro readers used in both Flink and Spark, they are
> actually used
> - Signature and hash are good
> - No binary file found in the source distribution
> - ASF header is present in all expected file
> - Build is OK
> - Tested using Spark SQL with JDBC Catalog and Apache Polaris without
> problem
>
> Thanks !
>
> Regards
> JB
>
> On Wed, Oct 30, 2024 at 11:06 PM Russell Spitzer
>  wrote:
> >
> > Hey Y'all,
> >
> > I propose that we release the following RC as the official Apache
> Iceberg 1.7.0 release.
> >
> > The commit ID is 91e04c9c88b63dc01d6c8e69dfdc8cd27ee811cc
> > * This corresponds to the tag: apache-iceberg-1.7.0-rc0
> > * https://github.com/apache/iceberg/commits/apache-iceberg-1.7.0-rc0
> > *
> https://github.com/apache/iceberg/tree/91e04c9c88b63dc01d6c8e69dfdc8cd27ee811cc
> >
> > The release tarball, signature, and checksums are here:
> > *
> https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-1.7.0-rc0
> >
> > You can find the KEYS file here:
> > * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
> >
> > Convenience binary artifacts are staged on Nexus. The Maven
> repository URL is:
> > *
> https://repository.apache.org/content/repositories/orgapacheiceberg-
> /
> >
> > Please download, verify, and test.
> >
> > Please vote in the next 72 hours.
> >
> > [ ] +1 Release this as Apache Iceberg 1.7.0
> > [ ] +0
> > [ ] -1 Do not release this because...
> >
> > Only PMC members have binding votes, but other community members are
> encouraged to cast
> > non-binding votes. This vote will pass if there are 3 binding +1
> votes and more binding
> > +1 votes than -1 votes.
>



Re: [DISCUSS] Change Behavior for SchemaUpdate.UnionByName

2024-11-01 Thread Fokko Driesprong
Hey Rocco,

Thanks for raising this. I don't have any strong feelings about this, and I
agree with Russell that it should not throw an exception.

I guess there was no strong reason behind how it is today, but it's just
because we leverage the UpdateSchema API, which raises an exception when
doing the downcast.

Also, on the Python side, it will throw an exception when you take a union
of an int and long, but it is pretty straightforward to loosen that
requirement: https://github.com/apache/iceberg-python/pull/1283


Kind regards,
Fokko

Op do 31 okt 2024 om 19:00 schreef Russell Spitzer <
russell.spit...@gmail.com>:

> I'm in favor of 1 since previously these inputs would have thrown an
> exception that wasn't really that helpful.
>
> @Test
> public void testDowncastLongToInt() {
>   Schema currentSchema = new Schema(required(1, "aCol", LongType.get()));
>   Schema newSchema = new Schema(required(1, "aCol", IntegerType.get()));
>
>   assertThatThrownBy(() -> new SchemaUpdate(currentSchema, 
> 1).unionByNameWith(newSchema).apply());
> }
>
> I think removing states in which the API would fail is good and although
> we didn't document it exactly this way before, I'm having a hard time
> thinking of a case in which throwing an exception here would have been
> preferable to a noop? Does anyone else have strong feelings around this?
>
>
> On Thu, Oct 31, 2024 at 12:40 PM Rocco Varela 
> wrote:
>
>> Hi everyone,
>>
>> Apologize if this is landing twice, my first attempt got lost somewhere
>> in transit :)
>>
>> I have a PR that attempts to address
>> https://github.com/apache/iceberg/issues/4849, basically adding logic to
>> ignore downcasting column types when "mergeSchema" is set when an existing
>> column type is long and the new schema has an int type for the same column.
>>
>> My solution involves updates to UnionByNameVisitor and this may end up
>> changing the behavior of our public api in a way that hasn't previously
>> been documented.
>>
>> A question raised during the review is whether we should do one of the
>> following:
>>
>>1. Update our docs in UpdateSchema.unionByNameWith
>> and callout something like "We
>>ignore differences in type if the new type is narrower than the existing
>>type", or
>>2. We add a new api UpdateSchema.unionByNameWith(Schema newSchema,
>>boolean ignoreTypeNarrowing)
>>
>>
>> Any feedback would be appreciated, thanks for your time.
>>
>> Cheers,
>>
>> --Rocco
>>
>


Re: [DISCUSS] - Deprecate Equality Deletes

2024-11-01 Thread Anton Okolnychyi
I was a bit skeptical when we were adding equality deletes, but nothing
beats their performance during writes. We have to find an alternative
before deprecating.

We are doing a lot of work to improve streaming, like reducing the cost
of commits, enabling a large (potentially infinite) number of snapshots,
changelog reads, and so on. It is a project goal to excel in streaming.

I was going to focus on equality deletes after completing the DV work. I
believe we have these options:

- Revisit the existing design of equality deletes (e.g. add more
restrictions, improve compaction, offer new writers).
- Standardize on the view-based approach [1] to handle streaming upserts
and CDC use cases, potentially making this part of the spec.
- Add support for inverted indexes to reduce the cost of position lookup.
This is fairly tricky to implement for streaming use cases without an
external system. Our runtime filtering in Spark today is equivalent to
looking up positions in an inverted index represented by another Iceberg
table. That may still not be enough for some streaming use cases.

[1] - https://www.tabular.io/blog/hello-world-of-cdc/

- Anton

чт, 31 жовт. 2024 р. о 21:31 Micah Kornfield  пише:

> I agree that equality deletes have their place in streaming.  I think the
> ultimate decision here is how opinionated Iceberg wants to be on its
> use-cases.  If it really wants to stick to its origins of "slow moving
> data", then removing equality deletes would be inline with this.  I think
> the other high level question is how much we allow for partially compatible
> features (the row lineage use-case feature was explicitly approved
> excluding equality deletes, and people seemed OK with it at the time.  If
> all features need to work together, then maybe we need to rethink the
> design here so it can be forward compatible with equality deletes).
>
> I think one issue with equality deletes as stated in the spec is that they
> are overly broad.  I'd be interested if people have any use cases that
> differ, but I think one way of narrowing (and probably a necessary building
> block for building something better)  the specification scope on equality
> deletes is to focus on upsert/Streaming deletes.  Two proposals in this
> regard are:
>
> 1.  Require that equality deletes can only correspond to unique
> identifiers for the table.
> 2.  Consider requiring that for equality deletes on partitioned tables,
> that the primary key must contain a partition column (I believe Flink at
> least already does this).  It is less clear to me that this would meet all
> existing use-cases.  But having this would allow for better incremental
> data-structures, which could then be partition based.
>
> Narrow scope to unique identifiers would allow for further building blocks
> already mentioned, like a secondary index (possible via LSM tree), that
> would allow for better performance overall.
>
> I generally agree with the sentiment that we shouldn't deprecate them
> until there is a viable replacement.  With all due respect to my employer,
> let's not fall into the Google trap [1] :)
>
> Cheers,
> Micah
>
> [1] https://goomics.net/50/
>
>
>
> On Thu, Oct 31, 2024 at 12:35 PM Alexander Jo 
> wrote:
>
>> Hey all,
>>
>> Just to throw my 2 cents in, I agree with Steven and others that we do
>> need some kind of replacement before deprecating equality deletes.
>> They certainly have their problems, and do significantly increase
>> complexity as they are now, but the writing of position deletes is too
>> expensive for certain pipelines.
>>
>> We've been investigating using equality deletes for some of our workloads
>> at Starburst, the key advantage we were hoping to leverage is cheap,
>> effectively random access lookup deletes.
>> Say you have a UUID column that's unique in a table and want to delete a
>> row by UUID. With position deletes each delete is expensive without an
>> index on that UUID.
>> With equality deletes each delete is cheap; reads and compaction are
>> expensive, but when updates are frequent and reads are sporadic that's a
>> reasonable tradeoff.
>>
>> Pretty much what Jason and Steven have already said.
>>
>> Maybe there are some incremental improvements on equality deletes or tips
>> from similar systems that might alleviate some of their problems?
>>
>> - Alex Jo
>>
>> On Thu, Oct 31, 2024 at 10:58 AM Steven Wu  wrote:
>>
>>> We probably all agree with the downside of equality deletes: it
>>> postpones all the work on the read path.
>>>
>>> In theory, we can implement position deletes only in the Flink streaming
>>> writer. It would require the tracking of last committed data files per key,
>>> which can be stored in Flink state (checkpointed). This is obviously quite
>>> expensive/challenging, but possible.
>>>
>>> I like to echo one benefit of equality deletes that Russel called out in
>>> the original email. Equality deletes would never have conflicts. that is
>>> important for streaming writers (Flink, Kafk