Re: [VOTE] Release Apache Iceberg 1.2.0 RC1

2023-03-15 Thread Szehon Ho
Hi,

One note on this release: I ran some simple Spark SQL using a local Spark,
like "insert into table select 1".  I found that any of these operations now
spawns 200 executors and takes a while to finish.

== Physical Plan ==
AppendData org.apache.spark.sql.execution.datasources.v2.DataSourceV2Strategy$$Lambda$4700/0x000801b1b040@2934b897, IcebergWrite(table=iceberg.szho.test, format=PARQUET)
+- AdaptiveSparkPlan isFinalPlan=false
   +- Exchange hashpartitioning(a#413, 200), REPARTITION_BY_NUM, [id=#363]
      +- Project [1 AS id#412, b AS a#413]
         +- Scan OneRowRelation[]

I think it's expected, due to the distribution mode default change, which
penalizes smaller jobs.  It'd be nice to have some doc guidance for a more
pleasant experience for new users, maybe a note in the getting-started guide
on how to reduce the number of executors or turn off the distribution mode.
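
For example, a possible workaround sketch (assuming Spark SQL;
write.distribution-mode is the Iceberg table property behind the new
default, and the 200 tasks come from Spark's default shuffle partitions):

-- disable the write distribution shuffle for this table
ALTER TABLE iceberg.szho.test
SET TBLPROPERTIES ('write.distribution-mode' = 'none');

-- or shrink the shuffle for small jobs in the session
SET spark.sql.shuffle.partitions = 8;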

That said, I'm +1 (non-binding).

   - Verified signature
   - Verified checksum
   - Ran RAT license check
   - Ran build and tests (some AWS tests failed to create an embedded Jetty
   server because of a keystore issue, probably a local environment error)
   - Ran simple operations on Spark


Thanks
Szehon

On Wed, Mar 15, 2023 at 8:54 AM Eduard Tudenhoefner 
wrote:

> +1 (non-binding)
>
>- validated checksum and signature
>- checked license docs & ran RAT checks
>- ran build and tests with JDK11
>- integrated into Trino / Presto and our internal platform
>- ran a few manual steps in Spark 3.3
>
>
> Just FYI that the release notes will usually be available once voting on
> the RC passed and artifacts are publicly available.
>
> Thanks
> Eduard
>
> On Tue, Mar 14, 2023 at 5:19 AM Jack Ye  wrote:
>
>> Hi Everyone,
>>
>> I propose that we release the following RC as the official Apache Iceberg
>> 1.2.0 release.
>>
>> The commit ID is e340ad5be04e902398c576f431810c3dfa4fe717
>> * This corresponds to the tag: apache-iceberg-1.2.0-rc1
>> * https://github.com/apache/iceberg/commits/apache-iceberg-1.2.0-rc1
>> *
>> https://github.com/apache/iceberg/tree/e340ad5be04e902398c576f431810c3dfa4fe717
>>
>> The release tarball, signature, and checksums are here:
>> * https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-1.2.0-rc1
>>
>> You can find the KEYS file here:
>> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>>
>> Convenience binary artifacts are staged on Nexus. The Maven repository
>> URL is:
>> *
>> https://repository.apache.org/content/repositories/orgapacheiceberg-1121/
>>
>> Please download, verify, and test.
>>
>> Please vote in the next 72 hours.
>>
>> [ ] +1 Release this as Apache Iceberg 1.2.0
>> [ ] +0
>> [ ] -1 Do not release this because...
>>
>


Re: [Discuss] Allow all users who have Committed to the project to run CI without Approval

2023-03-29 Thread Szehon Ho
+1

Thanks
Szehon

On Wed, Mar 29, 2023 at 10:27 AM Eduard Tudenhoefner 
wrote:

> +1 for "Only requires approval first time"
>
> On Wed, Mar 29, 2023 at 6:32 PM John Zhuge  wrote:
>
>> +1 for "Only requires approval first time"
>>
>> On Wed, Mar 29, 2023 at 9:03 AM Ajantha Bhat 
>> wrote:
>>
>>>
>>> Thanks Russell for creating the ticket.
>>>
>>> +1 for going back to "Only requires approval first time"
>>>
>>> I think in the ticket we also have to clearly say that we take the
>>> responsibility of actively monitoring the workflows for abuse
>>> as mentioned in https://infra.apache.org/github-actions-policy.html
>>>
>>> Thanks,
>>>
>>> Ajantha
>>>
>>> On Wed, Mar 29, 2023 at 9:24 PM Steven Wu  wrote:
>>>
 +1 for "only requires approval for the first time". Thanks, Russell!

 On Wed, Mar 29, 2023 at 8:35 AM Ryan Blue  wrote:

> +1 Thanks for doing this, Russell!
>
> On Wed, Mar 29, 2023 at 8:31 AM Jack Ye  wrote:
>
>> +1 for "Only requires approval first time", thank you for submitting
>> the ticket Russell!
>>
>> Best,
>> Jack Ye
>>
>> On Wed, Mar 29, 2023 at 8:13 AM YoungXinLer-邮箱
>> <524022...@qq.com.invalid> wrote:
>>
>>> +1 for "Only requires approval first time"
>>>
>>>
>>> -- Original --
>>> *From:* "Russell Spitzer";
>>> *Date:* Wednesday, March 29, 2023, 10:54 PM
>>> *To:* "dev";
>>> *Subject:* [Discuss] Allow all users who have Committed to the
>>> project to run CI without Approval
>>>
>>> Recent moves by Apache Infra have changed the policy on github
>>> actions from "Only requires approval first time" to "Requires approval
>>> every time".
>>> I think this is a big step backwards in terms of getting folks
>>> involved in the project and in terms of the amount of committer busy work
>>> required to validate new pull requests.
>>>
>>> I've created a new Infra ticket
>>> https://issues.apache.org/jira/browse/INFRA-24400 to change our
>>> behavior back to the old standard.
>>>
>>> I'd like to make sure folks are generally in favor of changing the
>>> default back, please respond to this thread if you are in support of
>>> going back to "Only requires approval first time" or if you don't
>>> believe this is a good idea please respond as well.
>>>
>>>
>>> Thanks for your time,
>>> Russ
>>
>>
>
> --
> Ryan Blue
> Tabular
>
 --
>> John Zhuge
>>
>


Re: [VOTE] Release Apache Iceberg 1.2.1 RC2

2023-04-06 Thread Szehon Ho
+1 (non-binding)

Verified signature
Verified checksum
Verified License
Built and ran tests
Ran simple queries on Spark 3.3.

Thanks Dan for the release,
Szehon

On Thu, Apr 6, 2023 at 12:04 PM Daniel Weeks  wrote:

> Hi Everyone,
>
> I propose that we release the following RC as the official Apache Iceberg
> 1.2.1 release.
>
> The commit ID is 4e2cdccd7453603af42a090fc5530f2bd20cf1be
> * This corresponds to the tag: apache-iceberg-1.2.1-rc2
> * https://github.com/apache/iceberg/commits/apache-iceberg-1.2.1-rc2
> *
> https://github.com/apache/iceberg/tree/4e2cdccd7453603af42a090fc5530f2bd20cf1be
>
> The release tarball, signature, and checksums are here:
> * https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-1.2.1-rc2
>
> You can find the KEYS file here:
> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>
> Convenience binary artifacts are staged on Nexus. The Maven repository URL
> is:
> *
> https://repository.apache.org/content/repositories/orgapacheiceberg-1131/
>
> Please download, verify, and test.
>
> Please vote in the next 72 hours.
>
> [ ] +1 Release this as Apache Iceberg 1.2.1
> [ ] +0
> [ ] -1 Do not release this because...
>


Re: Welcome new PMC members!

2023-04-12 Thread Szehon Ho
Nice, congratulations guys!
Szehon

On Wed, Apr 12, 2023 at 12:35 AM Gidon Gershinsky  wrote:

> Congrats Fokko, Steven, Yufei!
>
> Cheers, Gidon
>
>
> On Wed, Apr 12, 2023 at 7:14 AM Ajantha Bhat 
> wrote:
>
>> Congratulations to all.
>>
>> On Wed, Apr 12, 2023 at 8:51 AM OpenInx  wrote:
>>
>>> Congrats !
>>>
>>> On Wed, Apr 12, 2023 at 10:25 AM Junjie Chen 
>>> wrote:
>>>
 Congratulations to all of you!

 On Wed, Apr 12, 2023 at 10:07 AM Reo Lei  wrote:

> Congratulations!!!
>
> yuxia wrote on Wednesday, April 12, 2023 at 09:19:
>
>> Congratulations to all!
>>
>> Best regards,
>> Yuxia
>>
>> --
>> *From: *"Russell Spitzer" 
>> *To: *"dev" 
>> *Sent: *Wednesday, April 12, 2023, 6:13:01 AM
>> *Subject: *Re: Welcome new PMC members!
>>
>> Great news, Congratulations to all!
>>
>> On Apr 11, 2023, at 5:11 PM, Dmitri Bourlatchkov <
>> dmitri.bourlatch...@dremio.com.INVALID> wrote:
>>
>> Congratulations Fokko, Steven, and Yufei!
>>
>> On Tue, Apr 11, 2023 at 5:22 PM Ryan Blue  wrote:
>>
>>> Hi everyone!
>>>
>>> I want to congratulate 3 new PMC members, Fokko Driesprong, Steven
>>> Wu, and Yufei Gu. Thanks for all your contributions!
>>>
>>> I was going to wait a little longer to announce, but since they're
>>> in our board report it's already out.
>>>
>>> Ryan
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>>
>>

 --
 Best Regards

>>>


Re: [Proposal] Partition stats in Iceberg

2023-05-02 Thread Szehon Ho
Yea I agree, I had a handy query for the last update time of a partition:

SELECT
  e.data_file.partition,
  MAX(s.committed_at) AS last_modified_time
FROM db.table.snapshots s
JOIN db.table.entries e
  ON s.snapshot_id = e.snapshot_id
GROUP BY e.data_file.partition

It's a bit lengthy currently.

I have indeed been thinking about adding these fields to the Partitions
table directly, after Ajantha's pending changes to add delete files to that
table.

Thanks
Szehon

On Tue, May 2, 2023 at 4:08 PM Ryan Blue  wrote:

> Pucheng,
>
> Rather than using the changelog, I'd just look at the metadata tables. You
> should be able to query the all_entries metadata table to see file
> additions or deletions for a given snapshot. Then from there you can join
> to the snapshots table for timestamps and aggregate to the partition level.
>
> Ryan
>
> On Fri, Apr 28, 2023 at 12:49 PM Pucheng Yang 
> wrote:
>
>> Hi Ajantha and the community,
>>
>> I am interested and I am wondering where we can see the latest progress
>> of this feature?
>>
>> Regarding the partition stats in Iceberg, I am specifically curious
>> whether we can consider a new field called "last modified time" to be
>> included in the partition stats (or have a pluggable way to allow users to
>> configure the partition stats they need). My use case is to find out
>> whether a partition has changed between two snapshots (old and new) in a
>> quick and lightweight way. The community previously suggested I use the
>> change log (CDC), but I think that is too heavy (since it requires running
>> a Spark SQL procedure) and it overdoes the work (I don't need to know
>> which rows changed, just true or false for whether a partition changed).
>>
>> Thanks
>>
>> On Tue, Feb 7, 2023 at 11:36 AM Mayur Srivastava <
>> mayur.srivast...@twosigma.com> wrote:
>>
>>> Thanks Ajantha.
>>>
>>>
>>>
>>> > It should be very easy to add a few more fields to it like the latest
>>> sequence number or last modified time per partition.
>>>
>>>
>>>
>>> Among sequence number and modified time, which one do you think is more
>>> likely to be available in Iceberg partition stats? Note that we would like
>>> to avoid compaction changing the sequence number or modified time stats.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Mayur
>>>
>>>
>>>
>>> *From:* Ajantha Bhat 
>>> *Sent:* Tuesday, February 7, 2023 10:02 AM
>>> *To:* dev@iceberg.apache.org
>>> *Subject:* Re: [Proposal] Partition stats in Iceberg
>>>
>>>
>>>
>>> Hi Hrishi and Mayur, thanks for the inputs.
>>>
>>> To get things moving I have frozen the scope of phase 1 implementation.
>>> (Recently added the delete file stats to phase 1 too). You can find the
>>> scope in the "Design for approval" section of the design doc.
>>>
>>> That said, once we have phase 1 implemented, It should be very easy to
>>> add a few more fields to it like the latest sequence number or last
>>> modified time per partition.
>>> I will be opening up the discussion about phase 2 schema again once
>>> phase 1 implementation is done.
>>>
>>> Thanks,
>>> Ajantha
>>>
>>>
>>>
>>> On Tue, Feb 7, 2023 at 8:15 PM Mayur Srivastava <
>>> mayur.srivast...@twosigma.com> wrote:
>>>
>>> +1 for the initiative.
>>>
>>>
>>>
>>> We've been exploring options for storing last-modified-time per
>>> partition. It's an important building block for data pipelines, especially
>>> if there is a dependency between jobs with strong consistency requirements.
>>>
>>>
>>>
>>> Is partition stats a good place for storing last-modified-time per
>>> partition?
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Mayur
>>>
>>>
>>>
>>> *From:* Ajantha Bhat 
>>> *Sent:* Monday, January 23, 2023 11:56 AM
>>> *To:* dev@iceberg.apache.org
>>> *Subject:* Re: [Proposal] Partition stats in Iceberg
>>>
>>>
>>>
>>> Hi All,
>>>
>>> In the same design document (
>>> https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing
>>> ),
>>> I have added a section called
>>> *"Design for approval".  *It also contains a potential PR breakdown for
>>> the phase 1 implementation and future development scope.
>>> Please take a look and please vote if you think the design is ok.
>>>
>>> Thanks,
>>> Ajantha
>>>
>>>
>>>
>>> On Mon, Dec 5, 2022 at 8:37 PM Ajantha Bhat 
>>> wrote:
>>>
>>> A big thanks to everyone who was involved in the review and the
>>> discussions so far.
>>>
>>> Please find the meeting minutes from the last iceberg sync about the
>>> partition stats.
>>> a. Writers should not write the partition stats or any stats as of
>>> now.
>>> Because it requires bumping the spec to V3. (We can have it as
>>> part of the v3 spec later on. But not anytime soon).
>>> b. So, there can be an async way of generating the stats like
>>> ANALYZE table or call procedure.
>>> Which will compute the stats till the current snapshot and store
>>> it as a partition stats file.
>>> c. In phase 1, partition stats will just store the row_count and
>>> file_count per partition.

Re: [Proposal] Partition stats in Iceberg

2023-05-02 Thread Szehon Ho
> Does snapshot expiration need to be disabled for this to work? Thanks,
> Mayur
>

Yes, the snapshot that last updated the partition needs to be around for
this to work.

> Szehon, the query you shared requires a SparkSQL job to be run, which means
> latency will be high. However, I am glad you are also thinking of
> adding these directly to the partition table and it seems we share the same
> interests.


Yea, the partitions table currently still goes through SparkSQL, so it will
be the same.  Maybe you mean adding this to partition stats?  We do need to
reconcile the partitions table and partition stats at some point though.  Not
sure if it was designed/discussed yet; I think there were some thoughts on
short-circuiting the Partitions table to read from partition stats, if stats
exist for the current snapshot.

Thanks
Szehon

On Tue, May 2, 2023 at 4:34 PM Pucheng Yang 
wrote:

> Thanks Ryan and Szehon!
>
> Szehon, the query you shared requires a SparkSQL job to be run which means
> latency will be high. However, I am glad you are also thinking of
> adding these directly to the partition table and it seems we share the same
> interests. I am looking forward to the work in the phase 2 implementation.
> Let me know if I can help, thanks.
>
> On Tue, May 2, 2023 at 4:28 PM Szehon Ho  wrote:
>
>> Yea I agree, I had a handy query for the last update time of a partition:
>>
>> SELECT
>>   e.data_file.partition,
>>   MAX(s.committed_at) AS last_modified_time
>> FROM db.table.snapshots s
>> JOIN db.table.entries e
>>   ON s.snapshot_id = e.snapshot_id
>> GROUP BY e.data_file.partition
>>
>> It's a bit lengthy currently.
>>
>> I have indeed been thinking about adding these fields to the Partitions
>> table directly, after Ajantha's pending changes to add delete files to
>> that table.
>>
>> Thanks
>> Szehon
>>
>> On Tue, May 2, 2023 at 4:08 PM Ryan Blue  wrote:
>>
>>> Pucheng,
>>>
>>> Rather than using the changelog, I'd just look at the metadata tables.
>>> You should be able to query the all_entries metadata table to see file
>>> additions or deletions for a given snapshot. Then from there you can join
>>> to the snapshots table for timestamps and aggregate to the partition level.
>>>
>>> Ryan
>>>
>>> On Fri, Apr 28, 2023 at 12:49 PM Pucheng Yang
>>>  wrote:
>>>
>>>> Hi Ajantha and the community,
>>>>
>>>> I am interested and I am wondering where we can see the latest progress
>>>> of this feature?
>>>>
>>>> Regarding the partition stats in Iceberg, I am specifically curious
>>>> whether we can consider a new field called "last modified time" to be
>>>> included in the partition stats (or have a pluggable way to allow users
>>>> to configure the partition stats they need). My use case is to find out
>>>> whether a partition has changed between two snapshots (old and new) in a
>>>> quick and lightweight way. The community previously suggested I use the
>>>> change log (CDC), but I think that is too heavy (since it requires
>>>> running a Spark SQL procedure) and it overdoes the work (I don't need to
>>>> know which rows changed, just true or false for whether a partition
>>>> changed).
>>>>
>>>> Thanks
>>>>
>>>> On Tue, Feb 7, 2023 at 11:36 AM Mayur Srivastava <
>>>> mayur.srivast...@twosigma.com> wrote:
>>>>
>>>>> Thanks Ajantha.
>>>>>
>>>>>
>>>>>
>>>>> > It should be very easy to add a few more fields to it like the
>>>>> latest sequence number or last modified time per partition.
>>>>>
>>>>>
>>>>>
>>>>> Among sequence number and modified time, which one do you think is
>>>>> more likely to be available in Iceberg partition stats? Note that we
>>>>> would like to avoid compaction changing the sequence number or modified
>>>>> time stats.
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Mayur
>>>>>
>>>>>
>>>>>
>>>>> *From:* Ajantha Bhat 
>>>>> *Sent:* Tuesday, February 7, 2023 10:02 AM
>>>>> *To:* dev@iceberg.apache.org
>>>>> *Subject:* Re: [Proposal] Partition stats in Iceberg
>>>>>
>

Re: tradeoffs between serializable vs snapshot isolation for single writer

2023-05-04 Thread Szehon Ho
Hi,

I believe it only matters if you have conflicting commits.  For the 
single-writer case, I think you are right and it should not matter, so you 
may save slightly on performance by switching to snapshot isolation.  The 
checks are metadata checks though, so I would not expect a significant 
performance difference.

In general, the isolation levels in Iceberg work by checking before commit 
whether there are any conflicting changes to the data files about to be 
committed since the operation first started (i.e., the starting snapshot 
id).  So if there is a failure due to the isolation level, the error bubbles 
back to the application to try again, hence 'optimistic' rather than 
pessimistic concurrency control.

Note, metadata conflicts are automatically retried and should rarely bubble 
up to the user, so error handling is only required in case of a data 
isolation level conflict (i.e., you delete a file that is currently being 
rewritten by another operation).
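
To make that concrete, the isolation level is a per-operation table property
(a sketch, assuming Spark SQL; the table name is illustrative, and these
Iceberg properties default to serializable):

ALTER TABLE db.tbl SET TBLPROPERTIES (
  'write.delete.isolation-level' = 'snapshot',
  'write.update.isolation-level' = 'snapshot',
  'write.merge.isolation-level' = 'snapshot'
);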

Hope that helps
Szehon 

> On May 4, 2023, at 12:19 PM, Nirav Patel  wrote:
> 
> I am trying to ingest data into iceberg table using spark streaming. There 
> are no multiple writers to same data at the moment. According to iceberg api 
> 
>  default isolation level for table is serializable . I want to understand if 
> there is only a single application (single spark streaming job in my case) 
> writing to iceberg table is there any advantage or disadvantage over using 
> serializable or a snapshot isolation ? Is there any performance impact of 
> using serializable when only one application is writing to table? Also it 
> seems iceberg allows all writers to write into snapshot and use OCC to decide 
> if one needs to retry because it was late. In this case how it is 
> serializable at all? isn't serilizability achieved via pessimistic 
> concurrency control? Would like to understand how iceberg implement 
> serializable isolation level and how it is different than snapshot isolation ?
> 
> Thanks



Re: tradeoffs between serializable vs snapshot isolation for single writer

2023-05-04 Thread Szehon Ho
Whoops, I didn’t see Ryan answer already. 

> On May 4, 2023, at 3:18 PM, Szehon Ho  wrote:
> 
> Hi,
> 
> I believe it only matters if you have conflicting commits.  For the 
> single-writer case, I think you are right and it should not matter, so you 
> may save slightly on performance by switching to snapshot isolation.  The 
> checks are metadata checks though, so I would not expect a significant 
> performance difference.
> 
> In general, the isolation levels in Iceberg work by checking before commit 
> whether there are any conflicting changes to the data files about to be 
> committed since the operation first started (i.e., the starting snapshot 
> id).  So if there is a failure due to the isolation level, the error 
> bubbles back to the application to try again, hence 'optimistic' rather 
> than pessimistic concurrency control.
> 
> Note, metadata conflicts are automatically retried and should rarely bubble 
> up to the user, so error handling is only required in case of a data 
> isolation level conflict (i.e., you delete a file that is currently being 
> rewritten by another operation).
> 
> Hope that helps
> Szehon 
> 
>> On May 4, 2023, at 12:19 PM, Nirav Patel  wrote:
>> 
>> I am trying to ingest data into iceberg table using spark streaming. There 
>> are no multiple writers to same data at the moment. According to iceberg api 
>> <https://iceberg.apache.org/javadoc/0.11.0/org/apache/iceberg/IsolationLevel.html#:%7E:text=Both%20of%20them%20provide%20a,environments%20with%20many%20concurrent%20writers.>
>>  default isolation level for table is serializable . I want to understand if 
>> there is only a single application (single spark streaming job in my case) 
>> writing to iceberg table is there any advantage or disadvantage over using 
>> serializable or a snapshot isolation ? Is there any performance impact of 
>> using serializable when only one application is writing to table? Also it 
>> seems iceberg allows all writers to write into snapshot and use OCC to 
>> decide if one needs to retry because it was late. In this case how it is 
>> serializable at all? isn't serilizability achieved via pessimistic 
>> concurrency control? Would like to understand how iceberg implement 
>> serializable isolation level and how it is different than snapshot isolation 
>> ?
>> 
>> Thanks
> 



Re: Welcome new committers and PMC!

2023-05-05 Thread Szehon Ho
Thanks all, really appreciate it, and congrats to Eduard and Amogh!

Szehon

On Fri, May 5, 2023 at 12:37 AM Mingliang Liu  wrote:

> Congrats! All well deserved.
>
> On Thu, May 4, 2023 at 11:50 PM Eduard Tudenhoefner 
> wrote:
>
>> Thanks everyone, and also congrats to Amogh and Szehon!
>>
>> On Fri, May 5, 2023 at 3:27 AM Junjie Chen 
>> wrote:
>>
>>> Congrats, Amogh, Eduard and Szehon!
>>>
>>> On Thu, May 4, 2023 at 11:30 PM Dmitri Bourlatchkov
>>>  wrote:
>>>
 Congrats, Eduard , Amogh, and Szehon!

 On Wed, May 3, 2023 at 3:07 PM Ryan Blue  wrote:

> Hi everyone,
>
> I want to congratulate Amogh and Eduard, who were just added as
> Iceberg committers, and Szehon, who was just added to the PMC. Thanks for
> all your contributions!
>
> Ryan
>
> --
> Ryan Blue
>

>>>
>>> --
>>> Best Regards
>>>
>>


Re: [VOTE] Release Apache Iceberg 1.3.0 RC0

2023-05-24 Thread Szehon Ho
+1 (binding)

1. Verified signatures
2. Verified checksum
3. Verified license documentation
4. Built and ran tests
5. Ran simple tests on Spark 3.4
- Created a simple table and checked metadata tables
- Ran a 'delete from' statement to generate position deletes, and ran
rewrite_position_delete_files

Thanks
Szehon

On Tue, May 23, 2023 at 1:21 PM Anton Okolnychyi
 wrote:

> Hi Everyone,
>
> I propose that we release the following RC as the official Apache Iceberg
> 1.3.0 release.
>
> The commit ID is 7dbdfd33a667a721fbb21c7c7d06fec9daa30b88
> * This corresponds to the tag: apache-iceberg-1.3.0-rc0
> * https://github.com/apache/iceberg/commits/apache-iceberg-1.3.0-rc0
> *
> https://github.com/apache/iceberg/tree/7dbdfd33a667a721fbb21c7c7d06fec9daa30b88
>
> The release tarball, signature, and checksums are here:
> * https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-1.3.0-rc0
>
> You can find the KEYS file here:
> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>
> Convenience binary artifacts are staged on Nexus. The Maven repository URL
> is:
> *
> https://repository.apache.org/content/repositories/orgapacheiceberg-1134/
>
> Please download, verify, and test.
>
> Please vote in the next 72 hours. (Weekends excluded)
>
> [ ] +1 Release this as Apache Iceberg 1.3.0
> [ ] +0
> [ ] -1 Do not release this because...
>
> Only PMC members have binding votes, but other community members are
> encouraged to cast
> non-binding votes. This vote will pass if there are 3 binding +1 votes and
> more binding
> +1 votes than -1 votes.
>
> - Anton
>


Re: [DISCUSS] Default format version for new tables?

2023-05-24 Thread Szehon Ho
Hi,

I'm +1 to making v2 the default, say after this release.

It seems most of the features brought up as concerns on the Spark side in
the thread Gabor linked have been implemented (like the position delete
lifecycle).

But Anton's point is also good.  Even if some delete file features are
missing, v2 is not only about delete files (which are not produced by
default in Spark, and Flink(?)), but also brings the fixes for partition
spec evolution / snapshot id inheritance.  Hence it makes sense to me, from
that angle.
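
For reference, v2 can already be chosen explicitly today, either per table
or via a catalog-level default (a sketch, assuming Spark SQL; the table name
is illustrative, and the catalog default uses the 'table-default.' property
prefix from the thread Gabor linked):

CREATE TABLE db.tbl (id bigint) USING iceberg
TBLPROPERTIES ('format-version' = '2');

-- catalog-wide default via Spark config:
-- spark.sql.catalog.my_catalog.table-default.format-version=2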

Thanks
Szehon

On Wed, May 24, 2023 at 12:34 AM Gabor Kaszab
 wrote:

> Hey Anton,
>
> Just adding a note that back around January the same topic was brought up
> on this mail list. There the conclusion was to use the 'table-default.'
> catalog level property to create V2 tables by default.
> https://lists.apache.org/thread/9ct0p817qxqqdnv7nb35kghsfygjkqdf
>
> I'm not saying that we shouldn't default to V2 just drawing attention to
> this previous conversation.
>
> Cheers,
> Gabor
>
> On Wed, May 24, 2023 at 12:04 AM Anton Okolnychyi
>  wrote:
>
>> Hi folks,
>>
>> Would it be appropriate for us to consider changing the default table
>> format version for new tables from v1 to v2?
>>
>> I don’t think defaulting to v2 tables means all readers have to support
>> delete files. DELETE, UPDATE, MERGE operations will only produce delete
>> files if configured explicitly.
>>
>> The primary reason I am starting this thread is to avoid our workarounds
>> in v1 spec evolution, and snapshot ID inheritance. The latter is critical
>> for the performance of rewriting manifests.
>>
>> Any thoughts?
>>
>> - Anton
>
>


Re: Iceberg old partition gc

2023-06-02 Thread Szehon Ho
I think this violates Iceberg's assumption of immutable snapshots.  That
would require modifying the old snapshot to no longer point to those gc'ed
data files; otherwise I'm not sure how you can time-travel to read from that
snapshot if some of its files are deleted.

That being said, I also had this thought at some point, to keep snapshot
info around longer.  I expect most organizations operate in a mode where
they expire snapshots after a few days, and reasonably expect any
time-travel or snapshot-related operation (like CDC) to happen within this
timeframe.  And of course, use tags to keep snapshots from expiring.
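
For example, the tag DDL looks roughly like this (a sketch, assuming Spark
SQL with Iceberg's branching/tagging extensions; the table name, tag name,
snapshot version, and retention are illustrative):

ALTER TABLE db.tbl CREATE TAG `end-of-quarter`
AS OF VERSION 8 RETAIN 365 DAYS;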

But there are some use-cases where keeping snapshot metadata for a period
longer than when it could be read could be interesting.  For example, if I
want to know info about the snapshot that added each data file, we have
probably lost most of that snapshot metadata, as it was added long ago.  One
example is the frequent ask to find each partition's last modified time (see
an earlier email thread).

I haven't thought it completely through, but it crossed my mind that a
'soft' mode of ExpireSnapshots may be useful, where we can delete data files
but just mark the snapshot's metadata files as expired without physically
deleting them, and so retain the ability to answer these questions.  It
could be done by adding an 'expired-snapshots' list to metadata.json.  That
being said, it's a singular use case and I'm not sure if anyone else has
interest or another use-case?  It would add a bit of complexity.

Thanks
Szehon

On Fri, Jun 2, 2023 at 7:12 AM Pucheng Yang 
wrote:

> Ryan,
>
> One use case is the user might need to time travel to a certain snapshot.
> However, such a snapshot is expired due to the snapshot expiration
> that only retains the latest snapshot operation, and this operation's only
> intent is to remove the gc partition. It seems a little overkill to me.
>
> I hope my explanation makes sense to you.
>
> On Thu, Jun 1, 2023 at 3:39 PM Ryan Blue  wrote:
>
>> Pucheng,
>>
>> What is the use case around keeping the snapshot longer? We don't often
>> have people ask to keep snapshots that can't be read, so it sounds like you
>> might have something specific in mind?
>>
>> Ryan
>>
>> On Wed, May 31, 2023 at 8:19 PM Pucheng Yang 
>> wrote:
>>
>>> Hi community,
>>>
>>> In my organization, a big portion of the datasets are partitioned by
>>> date, normally we keep the latest X dates of partition for a given dataset.
>>>
>>> One issue that always bothers me is if I want to delete a partition
>>> that should be GC, I will run SQL query "delete from tbl where dt = ..."
>>> and do snapshot expiration to keep the latest snapshot to make sure that
>>> partition data is physically removed. However, the downside of this
>>> approach is that the table snapshot history will be completely lost.
>>>
>>> I wonder if anyone else in the community has the same pain point? How do
>>> you solve this? I would love to understand if there is a solution to this
>>> otherwise we can brainstorm if there is a way to solve this.
>>>
>>> Thanks!
>>>
>>> Pucheng
>>>
>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>


Re: Iceberg old partition gc

2023-06-02 Thread Szehon Ho
Yea, for the original use case in this thread, agree it's delete (soft) +
expire (physical, permanent).

I guess I should have phrased my thought better, I was replying to Ryan's
question above

>  We don't often have people ask to keep snapshots that can't be read


and had thought it'd be nice to have an ExpireSnapshots mode where we
keep older metadata for longer periods of time beyond physical expiration.

But the main use case I had was table historical analysis (last update time
for each partition, how many snapshots this table ever had, for example);
it's more of a nice-to-have and I'm definitely not sure it is a very
compelling use-case.  Another option, I guess, is that a custom catalog can
keep around this historical information.

Thanks
Szehon

On Fri, Jun 2, 2023 at 10:28 PM Russell Spitzer 
wrote:

> I think "soft-mode" is really just doing the delete. You can then recover
> the snapshot if you happen to have accidentally TTL'd a partition.
>
> On Fri, Jun 2, 2023 at 8:51 AM Szehon Ho  wrote:
>
>> I think this violates Iceberg's assumption of immutable snapshots.  That
>> would require modifying the old snapshot to no longer point to those gc'ed
>> data files; otherwise I'm not sure how you can time-travel to read from
>> that snapshot if some of its files are deleted.
>>
>> That being said, I also had this thought at some point, to keep snapshot
>> info around longer.  I expect most organizations operate in a mode where
>> they expire snapshots after a few days, and reasonably expect any
>> time-travel or snapshot-related operation (like CDC) to happen within this
>> timeframe.  And of course, use tags to keep snapshots from expiring.
>>
>> But there are some use-cases where keeping snapshot metadata for a period
>> longer than when it could be read could be interesting.  For example, if I
>> want to know info about the snapshot that added each data file, we have
>> probably lost most of that snapshot metadata, as it was added long ago.
>> One example is the frequent ask to find each partition's last modified
>> time (see an earlier email thread).
>>
>> I haven't thought it completely through, but it crossed my mind that a
>> 'soft' mode of ExpireSnapshots may be useful, where we can delete data
>> files but just mark the snapshot's metadata files as expired without
>> physically deleting them, and so retain the ability to answer these
>> questions.  It could be done by adding an 'expired-snapshots' list to
>> metadata.json.  That being said, it's a singular use case and I'm not sure
>> if anyone else has interest or another use-case?  It would add a bit of
>> complexity.
>>
>> Thanks
>> Szehon
>> Szehon
>>
>> On Fri, Jun 2, 2023 at 7:12 AM Pucheng Yang 
>> wrote:
>>
>>> Ryan,
>>>
>>> One use case is the user might need to time travel to a certain
>>> snapshot. However, such a snapshot is expired due to the snapshot
>>> expiration that only retains the latest snapshot operation, and this
>>> operation's only intent is to remove the gc partition. It seems a little
>>> overkill to me.
>>>
>>> I hope my explanation makes sense to you.
>>>
>>> On Thu, Jun 1, 2023 at 3:39 PM Ryan Blue  wrote:
>>>
>>>> Pucheng,
>>>>
>>>> What is the use case around keeping the snapshot longer? We don't often
>>>> have people ask to keep snapshots that can't be read, so it sounds like you
>>>> might have something specific in mind?
>>>>
>>>> Ryan
>>>>
>>>> On Wed, May 31, 2023 at 8:19 PM Pucheng Yang
>>>>  wrote:
>>>>
>>>>> Hi community,
>>>>>
>>>>> In my organization, a big portion of the datasets are partitioned by
>>>>> date, normally we keep the latest X dates of partition for a given 
>>>>> dataset.
>>>>>
>>>>> One issue that always bothers me is if I want to delete a partition
>>>>> that should be GC, I will run SQL query "delete from tbl where dt = ..."
>>>>> and do snapshot expiration to keep the latest snapshot to make sure that
>>>>> partition data is physically removed. However, the downside of this
>>>>> approach is the table snapshot history will be completely lost..
>>>>>
>>>>> I wonder if anyone else in the community has the same pain point? How
>>>>> do you solve this? I would love to understand if there is a solution to
>>>>> this otherwise we can brainstorm if there is a way to solve this.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Pucheng
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>


Re: Iceberg old partition gc

2023-06-03 Thread Szehon Ho
>
> @Szehon, I am wondering if we can create materialized views for metadata
> tables to support infinite history on metadata tables (like snapshots or
> partitions). Obviously, materialized views can't be used for time travel or
> rollback. They are only meant for maintaining long/infinite histories.


Yea, that's a good idea.  There are definitely options, like building a tool
outside Iceberg (dumping the metadata from time to time into a materialized
view), or building a history-preserving catalog layer that saves old
snapshot metadata, rather than building it into the Iceberg spec itself to
keep expired metadata files.
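
As a sketch of the "dump from time to time" option (assuming Spark SQL; the
archive table and names are illustrative, and the archive table is assumed
to have been created beforehand with the snapshots table's schema), an
incremental copy of the snapshots metadata table could look like:

INSERT INTO archive.tbl_snapshots
SELECT * FROM db.tbl.snapshots s
WHERE s.committed_at > (SELECT COALESCE(MAX(committed_at),
                                        TIMESTAMP '1970-01-01')
                        FROM archive.tbl_snapshots);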

Thanks
Szehon

On Sat, Jun 3, 2023 at 10:06 AM Steven Wu  wrote:

> > the main use case I had was table historical analysis (last update time
> for each partitions, how many snapshots did this table ever have, for
> example),
>
> Partition level stats can probably help with questions like "last update
> time for each partition".
>
> @Szehon, I am wondering if we can create materialized views for metadata
> tables to support infinite history on metadata tables (like snapshots or
> partitions). Obviously, materialized views can't be used for time travel or
> rollback. They are only meant for maintaining long/infinite histories.
>
> > One use case is the user might need to time travel to a certain
> snapshot. However, such a snapshot is expired due to the snapshot
> expiration that only retains the latest snapshot operation, and this
> operation's only intent is to remove the gc partition. It seems a little
> overkill to me.
>
> @Pucheng, usually people keep Iceberg snapshot history (for time travel or
> rollback) for a few days (like 7). Very long history can burden the
> metadata system. tagging can extend the history with selective snapshots.
>
> It seems that you are saying that purging actions of old partitions are
> creating new snapshots, which are taking up some space in the snapshot
> history. But if snapshot expiration is time based (like 7 days), this
> shouldn't be a problem, right?
>
> On Fri, Jun 2, 2023 at 6:17 PM Szehon Ho  wrote:
>
>> Yea, for the original use case in this thread, agree it's delete (soft) +
>> expire (physical, permanent).
>>
>> I guess I should have phrased my thought better, I was replying to Ryan's
>> question above
>>
>>>  We don't often have people ask to keep snapshots that can't be read
>>
>>
>> and had thought it'd be nice to have an ExpireSnapshots mode where we
>> keep older metadata for longer periods of time beyond physical expiration.
>>
>> But the main use case I had was table historical analysis (last update
>> time for each partition, how many snapshots this table ever had, for
>> example); it's more of a nice-to-have and I'm definitely not sure it is a
>> very compelling use-case.  Another option, I guess, is that a custom
>> catalog can keep around this historical information.
>>
>> Thanks
>> Szehon
>>
>> On Fri, Jun 2, 2023 at 10:28 PM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>>> I think "soft-mode" is really just doing the delete. You can then
>>> recover the snapshot if you happen to have accidentally TTL'd a partition.
>>>
>>> On Fri, Jun 2, 2023 at 8:51 AM Szehon Ho 
>>> wrote:
>>>
>>>> I think this violates Iceberg’s assumption of immutable snapshots.
>>>> That would require modifying the old snapshot to no longer point to those
>>>> gc’ed data files, else not sure how you can time-travel to read from that
>>>> snapshot, if some of its files are deleted?
>>>>
>>>> That being said, I also had this thought at some point, to keep
>>>> snapshot info around longer.  I expect most organizations operate in a mode
>>>> where they expire snapshots after a few days, and reasonably expect any
>>>> time-travel or snapshot-related operation (like CDC) to happen within this
>>>> timeframe.   And of course, use tags to keep the snapshot from expiration.
>>>>
>>>> But there are some use-cases where keeping more snapshot metadata for a
>>>> period longer than when it could be read could be interesting.  For
>>>> example, if I want to know info about the snapshot that added each data
>>>> file, we probably have lost most of those snapshot metadata as they were
>>>> added long ago.  Example, the frequent ask to find each partition's last
>>>> modified time, (in an earlier email thread).
>>>>
>>>> I haven't thought it completely through, but i

Re: allowing configs to be specified in SQLConf for Spark reads/writes

2023-06-21 Thread Szehon Ho
Hi,

Yea, it's definitely an issue.

Fwiw, I was looking at reviving the old effort in Spark to pass in configs
dynamically in a Spark SQL statement, which is probably the cleanest
solution.  (https://github.com/apache/spark/pull/34072 was the old effort,
and I made https://github.com/apache/spark/pull/41683 based on a
suggestion from the Spark community.)  Will keep the list posted.

Thanks
Szehon

On Fri, Jun 16, 2023 at 1:02 PM Wing Yew Poon 
wrote:

> Hi,
> I recently put up a PR, https://github.com/apache/iceberg/pull/7790, to
> allow the write mode (copy-on-write/merge-on-read) to be specified in
> SQLConf. The use case is explained in the PR.
> Cheng Pan has an open PR, https://github.com/apache/iceberg/pull/7733, to
> allow locality to be specified in SQLConf.
> In the recent past, https://github.com/apache/iceberg/pull/6838/ was a PR
> to allow the write distribution mode to be specified in SQLConf. This was
> merged.
> Cheng Pan asks if there is any guidance on when we should allow configs to
> be specified in SQLConf.
> Thanks,
> Wing Yew
>
> ps. The above open PRs could use reviews by committers.
>
>


Re: allowing configs to be specified in SQLConf for Spark reads/writes

2023-06-26 Thread Szehon Ho
Hi,

Yea that sounds good to me.

Btw, that being said, I'm not opposed to making some of the options in this
thread, especially write options, configurable as SQL conf either.  (Not
sure this mechanism can support write conf without some changes to the
parser.)  And in any case, it could be cascading: sql_dynamic_conf >
sql_conf > table_conf.
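
As an illustration of the session-conf layer (a sketch; the conf key shown
here is the one I believe PR #6838 added, spark.sql.iceberg.distribution-mode,
and the table names are illustrative):

SET spark.sql.iceberg.distribution-mode = none;
INSERT INTO db.tbl SELECT * FROM staging;
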
Not sure what others think.

Thanks,
Szehon

On Sat, Jun 24, 2023 at 5:29 PM Manu Zhang  wrote:

> If the Spark community doesn’t accept this solution, how about adding it
> as an extension in Iceberg? I’m also wondering what people here think about
> it.
>
> Thanks for reviving the effort.
> Manu
>
> Szehon Ho 于2023年6月22日 周四00:45写道:
>
>> Hi,
>>
>> Yea, it's definitely an issue.
>>
>> Fwiw, I was looking at reviving the old effort in Spark to pass in
>> configs dynamically in a Spark SQL statement, which is probably the cleanest
>> solution.  (https://github.com/apache/spark/pull/34072 was the old
>> effort, and I made https://github.com/apache/spark/pull/41683 based on
>> a suggestion from the Spark community.)  Will keep the list posted.
>>
>> Thanks
>> Szehon
>>
>> On Fri, Jun 16, 2023 at 1:02 PM Wing Yew Poon 
>> wrote:
>>
>>> Hi,
>>> I recently put up a PR, https://github.com/apache/iceberg/pull/7790, to
>>> allow the write mode (copy-on-write/merge-on-read) to be specified in
>>> SQLConf. The use case is explained in the PR.
>>> Cheng Pan has an open PR, https://github.com/apache/iceberg/pull/7733,
>>> to allow locality to be specified in SQLConf.
>>> In the recent past, https://github.com/apache/iceberg/pull/6838/ was a
>>> PR to allow the write distribution mode to be specified in SQLConf. This
>>> was merged.
>>> Cheng Pan asks if there is any guidance on when we should allow configs
>>> to be specified in SQLConf.
>>> Thanks,
>>> Wing Yew
>>>
>>> ps. The above open PRs could use reviews by committers.
>>>
>>>


[DISCUSS] Apache Iceberg Release 1.3.1

2023-07-06 Thread Szehon Ho
Hi

I wanted to start a discussion on whether it's the right time for 1.3.1, a
patch release of 1.3.0.  It was started based on the issue found
by Xiangyang (@ConeyLiu):
https://github.com/apache/iceberg/pull/7931#pullrequestreview-1507935277.

Do people have any other bug fixes that should be included?  Also let me
know if anyone wants to be a release manager; if not, I can give it a
shot as well.

Thanks,
Szehon


Re: [DISCUSS] Apache Iceberg Release 1.3.1

2023-07-07 Thread Szehon Ho
Thanks a lot Eduard!  I think https://github.com/apache/iceberg/pull/7933
is also a good candidate.

Thanks,
Szehon

On Fri, Jul 7, 2023 at 9:07 AM Eduard Tudenhoefner 
wrote:

> +1 for a 1.3.1 release. I've created a 1.3.1 Milestone
> <https://github.com/apache/iceberg/pulls?q=is%3Apr+milestone%3A%22Iceberg+1.3.1%22+is%3Aclosed>
> and it would be great to also get #7621
> <https://github.com/apache/iceberg/pull/7621> in.
>
> Eduard
>
> On Fri, Jul 7, 2023 at 5:52 PM Ryan Blue  wrote:
>
>> +1 for a 1.3.1 to fix the Hive issue.
>>
>> For the Nessie changes, those seem outside what we would normally put in
>> a patch release. Patch releases are for bug fixes and aren't usually a time
>> to get other changes in for convenience. I can understand wanting to
>> unblock a Trino issue, but it doesn't seem like a good choice to me.
>>
>> In addition, why not put some of these classes in the Nessie project
>> itself? Could NessieUtil go there so that you aren't waiting on Iceberg
>> releases to fix third-party projects?
>>
>> Ryan
>>
>> On Thu, Jul 6, 2023 at 9:02 PM Jean-Baptiste Onofré 
>> wrote:
>>
>>> Hi,
>>>
>>> It sounds good to me to have 1.3.1.
>>>
>>> Thanks !
>>> Regards
>>> JB
>>>
>>> On Fri, Jul 7, 2023 at 12:53 AM Szehon Ho 
>>> wrote:
>>> >
>>> > Hi
>>> >
>>> > I wanted to start a discussion on whether it's the right time for
>>> 1.3.1, a patch release of 1.3.0.  It was started based on the issue found
>>> by Xiangyang (@ConeyLiu) :
>>> https://github.com/apache/iceberg/pull/7931#pullrequestreview-1507935277
>>> .
>>> >
>>> > Do people have any other bug fixes that should be included?  Also let
>>> me know, if anyone wants to be a release manager?  If not, I can give it a
>>> shot as well.
>>> >
>>> > Thanks,
>>> > Szehon
>>>
>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>


Re: [DISCUSS] Apache Iceberg Release 1.3.1

2023-07-10 Thread Szehon Ho
Thanks Eduard!  Merged all your backport prs, I will commit the last one
probably tomorrow and then we can start the release.

Thanks
Szehon

On Sun, Jul 9, 2023 at 11:53 PM Eduard Tudenhoefner 
wrote:

> I created a 1.3.x <https://github.com/apache/iceberg/tree/1.3.x> branch,
> so that we can start backporting those bug fixes.
>
> Eduard
>
> On Fri, Jul 7, 2023 at 6:52 PM Szehon Ho  wrote:
>
>> Thanks a lot Eduard!  I think https://github.com/apache/iceberg/pull/7933
>> is also a good candidate as well.
>>
>> Thanks,
>> Szehon
>>
>> On Fri, Jul 7, 2023 at 9:07 AM Eduard Tudenhoefner 
>> wrote:
>>
>>> +1 for a 1.3.1 release. I've created a 1.3.1 Milestone
>>> <https://github.com/apache/iceberg/pulls?q=is%3Apr+milestone%3A%22Iceberg+1.3.1%22+is%3Aclosed>
>>> and it would be great to also get #7621
>>> <https://github.com/apache/iceberg/pull/7621> in.
>>>
>>> Eduard
>>>
>>> On Fri, Jul 7, 2023 at 5:52 PM Ryan Blue  wrote:
>>>
>>>> +1 for a 1.3.1 to fix the Hive issue.
>>>>
>>>> For the Nessie changes, those seem outside what we would normally put
>>>> in a patch release. Patch releases are for bug fixes and aren't usually a
>>>> time to get other changes in for convenience. I can understand wanting to
>>>> unblock a Trino issue, but it doesn't seem like a good choice to me.
>>>>
>>>> In addition, why not put some of these classes in the Nessie project
>>>> itself? Could NessieUtil go there so that you aren't waiting on Iceberg
>>>> releases to fix third-party projects?
>>>>
>>>> Ryan
>>>>
>>>> On Thu, Jul 6, 2023 at 9:02 PM Jean-Baptiste Onofré 
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> It sounds good to me to have 1.3.1.
>>>>>
>>>>> Thanks !
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On Fri, Jul 7, 2023 at 12:53 AM Szehon Ho 
>>>>> wrote:
>>>>> >
>>>>> > Hi
>>>>> >
>>>>> > I wanted to start a discussion on whether it's the right time for
>>>>> 1.3.1, a patch release of 1.3.0.  It was started based on the issue found
>>>>> by Xiangyang (@ConeyLiu) :
>>>>> https://github.com/apache/iceberg/pull/7931#pullrequestreview-1507935277
>>>>> .
>>>>> >
>>>>> > Do people have any other bug fixes that should be included?  Also
>>>>> let me know, if anyone wants to be a release manager?  If not, I can give
>>>>> it a shot as well.
>>>>> >
>>>>> > Thanks,
>>>>> > Szehon
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>


Re: [DISCUSS] Apache Iceberg Release 1.3.1

2023-07-12 Thread Szehon Ho
Hi guys

Just an update on this.  Another issue came up with the new 1.3.0 procedure
rewrite_position_delete_files (thanks Fokko for adding it to the milestone).
I'm working on that; hopefully I can finish in the next day or two, in time
for this release.

Milestone for reference:
https://github.com/apache/iceberg/milestones/Iceberg%201.3.1

Thanks
Szehon

On Mon, Jul 10, 2023 at 11:14 AM Szehon Ho  wrote:

> Thanks Eduard!  Merged all your backport prs, I will commit the last one
> probably tomorrow and then we can start the release.
>
> Thanks
> Szehon
>
> On Sun, Jul 9, 2023 at 11:53 PM Eduard Tudenhoefner 
> wrote:
>
>> I created a 1.3.x <https://github.com/apache/iceberg/tree/1.3.x> branch,
>> so that we can start backporting those bug fixes.
>>
>> Eduard
>>
>> On Fri, Jul 7, 2023 at 6:52 PM Szehon Ho  wrote:
>>
>>> Thanks a lot Eduard!  I think
>>> https://github.com/apache/iceberg/pull/7933 is also a good candidate as
>>> well.
>>>
>>> Thanks,
>>> Szehon
>>>
>>> On Fri, Jul 7, 2023 at 9:07 AM Eduard Tudenhoefner 
>>> wrote:
>>>
>>>> +1 for a 1.3.1 release. I've created a 1.3.1 Milestone
>>>> <https://github.com/apache/iceberg/pulls?q=is%3Apr+milestone%3A%22Iceberg+1.3.1%22+is%3Aclosed>
>>>> and it would be great to also get #7621
>>>> <https://github.com/apache/iceberg/pull/7621> in.
>>>>
>>>> Eduard
>>>>
>>>> On Fri, Jul 7, 2023 at 5:52 PM Ryan Blue  wrote:
>>>>
>>>>> +1 for a 1.3.1 to fix the Hive issue.
>>>>>
>>>>> For the Nessie changes, those seem outside what we would normally put
>>>>> in a patch release. Patch releases are for bug fixes and aren't usually a
>>>>> time to get other changes in for convenience. I can understand wanting to
>>>>> unblock a Trino issue, but it doesn't seem like a good choice to me.
>>>>>
>>>>> In addition, why not put some of these classes in the Nessie project
>>>>> itself? Could NessieUtil go there so that you aren't waiting on Iceberg
>>>>> releases to fix third-party projects?
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Thu, Jul 6, 2023 at 9:02 PM Jean-Baptiste Onofré 
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> It sounds good to me to have 1.3.1.
>>>>>>
>>>>>> Thanks !
>>>>>> Regards
>>>>>> JB
>>>>>>
>>>>>> On Fri, Jul 7, 2023 at 12:53 AM Szehon Ho 
>>>>>> wrote:
>>>>>> >
>>>>>> > Hi
>>>>>> >
>>>>>> > I wanted to start a discussion on whether it's the right time for
>>>>>> 1.3.1, a patch release of 1.3.0.  It was started based on the issue found
>>>>>> by Xiangyang (@ConeyLiu) :
>>>>>> https://github.com/apache/iceberg/pull/7931#pullrequestreview-1507935277
>>>>>> .
>>>>>> >
>>>>>> > Do people have any other bug fixes that should be included?  Also
>>>>>> let me know, if anyone wants to be a release manager?  If not, I can give
>>>>>> it a shot as well.
>>>>>> >
>>>>>> > Thanks,
>>>>>> > Szehon
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>


Re: [DISCUSS] Apache Iceberg Release 1.3.1

2023-07-14 Thread Szehon Ho
Thanks all for the interest and help.  The last change on the 1.3.1
milestone is now merged; I will work on the release candidate and send a
vote next week.

Thanks
Szehon

On Wed, Jul 12, 2023 at 1:18 PM Fokko Driesprong  wrote:

> Hi Szehon,
>
> Thank you for the updates. I'm in favor of 1.3.1 as well. I got notified
> of a discrepancy <https://github.com/apache/iceberg/pull/8049> in Java's
> TableMetadata reader today. I have a fix here
> <https://github.com/apache/iceberg/pull/8050> against the master branch.
> Once that is in, I think it would be great to backport this to 1.3.x as
> well.
>
> Kind regards,
> Fokko
>
> Op wo 12 jul 2023 om 22:09 schreef Szehon Ho :
>
>> Hi guys
>>
>> Just an update on this.  Another issue came up about the new 1.3.0
>> function rewrite_position_deletes (thanks Fokko for adding to the
>> milestone).  I'm working on that, hopefully can finish in next day or two,
>> for this release.
>>
>> Milestone for reference:
>> https://github.com/apache/iceberg/milestones/Iceberg%201.3.1
>>
>> Thanks
>> Szehon
>>
>> On Mon, Jul 10, 2023 at 11:14 AM Szehon Ho 
>> wrote:
>>
>>> Thanks Eduard!  Merged all your backport prs, I will commit the last one
>>> probably tomorrow and then we can start the release.
>>>
>>> Thanks
>>> Szehon
>>>
>>> On Sun, Jul 9, 2023 at 11:53 PM Eduard Tudenhoefner 
>>> wrote:
>>>
>>>> I created a 1.3.x <https://github.com/apache/iceberg/tree/1.3.x>
>>>> branch, so that we can start backporting those bug fixes.
>>>>
>>>> Eduard
>>>>
>>>> On Fri, Jul 7, 2023 at 6:52 PM Szehon Ho 
>>>> wrote:
>>>>
>>>>> Thanks a lot Eduard!  I think
>>>>> https://github.com/apache/iceberg/pull/7933 is also a good candidate
>>>>> as well.
>>>>>
>>>>> Thanks,
>>>>> Szehon
>>>>>
>>>>> On Fri, Jul 7, 2023 at 9:07 AM Eduard Tudenhoefner 
>>>>> wrote:
>>>>>
>>>>>> +1 for a 1.3.1 release. I've created a 1.3.1 Milestone
>>>>>> <https://github.com/apache/iceberg/pulls?q=is%3Apr+milestone%3A%22Iceberg+1.3.1%22+is%3Aclosed>
>>>>>> and it would be great to also get #7621
>>>>>> <https://github.com/apache/iceberg/pull/7621> in.
>>>>>>
>>>>>> Eduard
>>>>>>
>>>>>> On Fri, Jul 7, 2023 at 5:52 PM Ryan Blue  wrote:
>>>>>>
>>>>>>> +1 for a 1.3.1 to fix the Hive issue.
>>>>>>>
>>>>>>> For the Nessie changes, those seem outside what we would normally
>>>>>>> put in a patch release. Patch releases are for bug fixes and aren't usually
>>>>>>> a time to get other changes in for convenience. I can understand wanting to
>>>>>>> unblock a Trino issue, but it doesn't seem like a good choice to me.
>>>>>>>
>>>>>>> In addition, why not put some of these classes in the Nessie project
>>>>>>> itself? Could NessieUtil go there so that you aren't waiting on Iceberg
>>>>>>> releases to fix third-party projects?
>>>>>>>
>>>>>>> Ryan
>>>>>>>
>>>>>>> On Thu, Jul 6, 2023 at 9:02 PM Jean-Baptiste Onofré 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> It sounds good to me to have 1.3.1.
>>>>>>>>
>>>>>>>> Thanks !
>>>>>>>> Regards
>>>>>>>> JB
>>>>>>>>
>>>>>>>> On Fri, Jul 7, 2023 at 12:53 AM Szehon Ho 
>>>>>>>> wrote:
>>>>>>>> >
>>>>>>>> > Hi
>>>>>>>> >
>>>>>>>> > I wanted to start a discussion on whether it's the right time for
>>>>>>>> 1.3.1, a patch release of 1.3.0.  It was started based on the issue 
>>>>>>>> found
>>>>>>>> by Xiangyang (@ConeyLiu) :
>>>>>>>> https://github.com/apache/iceberg/pull/7931#pullrequestreview-1507935277
>>>>>>>> .
>>>>>>>> >
>>>>>>>> > Do people have any other bug fixes that should be included?  Also
>>>>>>>> let me know, if anyone wants to be a release manager?  If not, I can 
>>>>>>>> give
>>>>>>>> it a shot as well.
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> > Szehon
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>>>
>>>>>>


[VOTE] Release Apache Iceberg 1.3.1 RC1

2023-07-17 Thread Szehon Ho
Hi Everyone,

I propose that we release the following RC as the official Apache Iceberg
1.3.1 release.

The commit ID is 62c34711c3f22e520db65c51255512f6cfe622c4
* This corresponds to the tag: apache-iceberg-1.3.1-rc1
* https://github.com/apache/iceberg/commits/apache-iceberg-1.3.1-rc1
*
https://github.com/apache/iceberg/tree/62c34711c3f22e520db65c51255512f6cfe622c4

The release tarball, signature, and checksums are here:
* https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-1.3.1-rc1

You can find the KEYS file here:
* https://dist.apache.org/repos/dist/dev/iceberg/KEYS

Convenience binary artifacts are staged on Nexus. The Maven repository URL
is:
* https://repository.apache.org/content/repositories/orgapacheiceberg-1141/

This release includes several important bug fixes over 1.3.0, including:
* Fix Spark RewritePositionDeleteFiles failure for certain partition types
(#8059)
* Fix Spark RewriteDataFiles concurrency edge-case on commit timeouts
(#7933)
* Table Metadata parser now accepts null current-snapshot-id, properties,
snapshots fields (#8064)
* FlinkCatalog creation no longer creates the default database (#8039)
* Fix loading certain V1 table branch snapshots using snapshot references
(#7621)
* Fix Spark partition-level DELETE operations for WAP branches (#7900)
* Fix HiveCatalog deleting metadata on failures in checking lock status
(#7931)

Please download, verify, and test.

Please vote in the next 72 hours. (Weekends excluded)

[ ] +1 Release this as Apache Iceberg 1.3.1
[ ] +0
[ ] -1 Do not release this because...

Only PMC members have binding votes, but other community members are
encouraged to cast
non-binding votes. This vote will pass if there are 3 binding +1 votes and
more binding
+1 votes than -1 votes.

Thanks
Szehon


Re: [VOTE] Release Apache Iceberg 1.3.1 RC1

2023-07-24 Thread Szehon Ho
questStage.java:39)
>> at
>> software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
>> at
>> software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
>> at
>> software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
>> at
>> software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
>> at
>> software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:73)
>> at
>> software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:42)
>> at
>> software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:78)
>> at
>> software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:40)
>> at
>> software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:50)
>> at
>> software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:36)
>> at
>> software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:81)
>> ... 23 more
>>
>> Best,
>>
>> Yufei
>>
>>
>> On Sun, Jul 23, 2023 at 9:01 PM Daniel Weeks  wrote:
>>
>>> +1 (binding)
>>>
>>> Validated  license/sigs/sums/build/test.
>>>
>>> (Had the same problem with some of the S3 containerized tests as 1.3.0)
>>>
>>> -Dan
>>>
>>> On Wed, Jul 19, 2023 at 9:29 AM Eduard Tudenhoefner 
>>> wrote:
>>>
>>>> +1 (non-binding)
>>>>
>>>> * validated checksum and signature
>>>> * checked license docs & ran RAT checks
>>>> * ran build and tests with JDK11
>>>> * built new docker images and ran through
>>>> https://iceberg.apache.org/spark-quickstart/
>>>> <https://iceberg.apache.org/spark-quickstart/>
>>>>
>>>> One thing I noticed is that some tests don't work when running the
>>>> build with *JDK17* (e.g. running *./gradlew build
>>>> :iceberg-flink:iceberg-flink-runtime-1.17:integrationTest -x test*
>>>> fails). This is not related to this release, but I just wanted to mention
>>>> this in case anyone else runs into this.
>>>>
>>>>
>>>> Eduard
>>>>
>>>> On Mon, Jul 17, 2023 at 8:01 PM Szehon Ho 
>>>> wrote:
>>>>
>>>>> Hi Everyone,
>>>>>
>>>>> I propose that we release the following RC as the official Apache
>>>>> Iceberg 1.3.1 release.
>>>>>
>>>>> The commit ID is 62c34711c3f22e520db65c51255512f6cfe622c4
>>>>> * This corresponds to the tag: apache-iceberg-1.3.1-rc1
>>>>> * https://github.com/apache/iceberg/commits/apache-iceberg-1.3.1-rc1
>>>>> *
>>>>> https://github.com/apache/iceberg/tree/62c34711c3f22e520db65c51255512f6cfe622c4
>>>>>
>>>>> The release tarball, signature, and checksums are here:
>>>>> *
>>>>> https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-1.3.1-rc1
>>>>>
>>>>> You can find the KEYS file here:
>>>>> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>>>>>
>>>>> Convenience binary artifacts are staged on Nexus. The Maven repository
>>>>> URL is:
>>>>> *
>>>>> https://repository.apache.org/content/repositories/orgapacheiceberg-1141/
>>>>>
>>>>> This release includes several important bug fixes over 1.3.0,
>>>>> including:
>>>>> * Fix Spark RewritePositionDeleteFiles failure for certain partition
>>>>> types (#8059)
>>>>> * Fix Spark RewriteDataFiles concurrency edge-case on commit timeouts
>>>>> (#7933)
>>>>> * Table Metadata parser now accepts null current-snapshot-id,
>>>>> properties, snapshots fields (#8064)
>>>>> * FlinkCatalog creation no longer creates the default database (#8039)
>>>>> * Fix loading certain V1 table branch snapshots using snapshot
>>>>> references (#7621)
>>>>> * Fix Spark partition-level DELETE operations for WAP branches (#7900)
>>>>> * Fix HiveCatalog deleting metadata on failures in checking lock
>>>>> status (#7931)
>>>>>
>>>>> Please download, verify, and test.
>>>>>
>>>>> Please vote in the next 72 hours. (Weekends excluded)
>>>>>
>>>>> [ ] +1 Release this as Apache Iceberg 1.3.1
>>>>> [ ] +0
>>>>> [ ] -1 Do not release this because...
>>>>>
>>>>> Only PMC members have binding votes, but other community members are
>>>>> encouraged to cast
>>>>> non-binding votes. This vote will pass if there are 3 binding +1 votes
>>>>> and more binding
>>>>> +1 votes than -1 votes.
>>>>>
>>>>> Thanks
>>>>> Szehon
>>>>>
>>>>
>
> --
> Ryan Blue
> Tabular
>


[PASSED][VOTE] Release Apache Iceberg 1.3.1 RC1

2023-07-24 Thread Szehon Ho
Thanks everyone who participated in the vote for Release Apache Iceberg 1.3.1 RC1.

The vote result is:

+1: 4 (binding), 2 (non-binding)
+0: 0 (binding), 0 (non-binding)
-1: 0 (binding), 0 (non-binding)

Therefore, the release candidate has passed.

I will work on finalizing the release.

Thanks
Szehon

On Mon, Jul 24, 2023 at 2:21 PM Szehon Ho  wrote:

> +1 (binding)
>
> 1. Verify signatures
> 2. Verify checksums
> 3. Verify license documentation
> 4. Built and ran tests, only failure is TestS3RestSigner
> 5. Ran simple queries against Spark 3.4
>
> Thanks
> Szehon
>
> On Mon, Jul 24, 2023 at 11:58 AM Ryan Blue  wrote:
>
>> +1 (binding)
>>
>> On Mon, Jul 24, 2023 at 10:44 AM Yufei Gu  wrote:
>>
>>> +1 (binding)
>>>
>>> Verified signature, checksum
>>> Verified License
>>> Built and ran tests
>>> Ran queries on Spark 3.3_2.12
>>>
>>> The test TestS3RestSigner still failed locally for me, as it did for version
>>> 1.3.0. As Eduard mentioned, it's due to Docker on Mac not being able to
>>> resolve "localhost". Given this is a maintenance version, +1 for the
>>> release.
>>>
>>> Here is the stack of the failure.
>>>
>>> > Task :iceberg-aws:test FAILED
>>>
>>> TestS3RestSigner > validatePutObject FAILED
>>> software.amazon.awssdk.core.exception.SdkClientException: Received
>>> an UnknownHostException when attempting to interact with a service. See
>>> cause for the exact endpoint that is failing to resolve. If this is
>>> happening on an endpoint that previously worked, there may be a network
>>> connectivity issue or your DNS cache could be storing endpoints for too
>>> long.
>>> at
>>> app//software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:111)
>>> at
>>> app//software.amazon.awssdk.awscore.interceptor.HelpfulUnknownHostExceptionInterceptor.modifyException(HelpfulUnknownHostExceptionInterceptor.java:59)
>>> at
>>> app//software.amazon.awssdk.core.interceptor.ExecutionInterceptorChain.modifyException(ExecutionInterceptorChain.java:202)
>>> at
>>> app//software.amazon.awssdk.core.internal.http.pipeline.stages.utils.ExceptionReportingUtils.runModifyException(ExceptionReportingUtils.java:54)
>>> at
>>> app//software.amazon.awssdk.core.internal.http.pipeline.stages.utils.ExceptionReportingUtils.reportFailureToInterceptors(ExceptionReportingUtils.java:38)
>>> at
>>> app//software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:39)
>>> at
>>> app//software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
>>> at
>>> app//software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:193)
>>> at
>>> app//software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103)
>>> at
>>> app//software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:171)
>>> at
>>> app//software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:82)
>>> at
>>> app//software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:179)
>>> at
>>> app//software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:76)
>>> at
>>> app//software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
>>> at
>>> app//software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:56)
>>> at
>>> app//software.amazon.awssdk.services.s3.DefaultS3Client.createBucket(DefaultS3Client.java:1149)
>>> at
>>> app//org.apache.iceberg.aws.s3.signer.TestS3RestSigner.before(TestS3RestSigner.java:141)
>>>
>>> Caused by:
>>> software.amazon.awssdk.core.exception.SdkClientException: Unable
>>> to execute HTTP request: iceberg-s3-signer-test.localhost
>>> at
>>> app//software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:11

[ANNOUNCE] Apache Iceberg release 1.3.1

2023-07-25 Thread Szehon Ho
I'm pleased to announce the release of Apache Iceberg 1.3.1!

Apache Iceberg is an open table format for huge analytic datasets. Iceberg
delivers high query performance for tables with tens of petabytes of data,
along with atomic commits, concurrent writes, and SQL-compatible table
evolution.

This release can be downloaded from:
https://www.apache.org/dyn/closer.cgi/iceberg/apache-iceberg-1.3.1/apache-iceberg-1.3.1.tar.gz

Release notes: https://iceberg.apache.org/releases/#131-release

Java artifacts are available from Maven Central.

Thanks to everyone for contributing!

Szehon


Re: Proposal to fix the docs - this time it'll be different

2023-07-27 Thread Szehon Ho
Hi

I'm ok with putting things back in the Iceberg repo; it gets more visibility
on PRs.  I guess it used to be a bit distracting, but now with more
projects in Iceberg (pyiceberg, rust) we have to use tags to filter
through all the mail anyway.

Just wanted to +1 Fokko/Ryan's suggestion to avoid versioned doc
directories; I had a lot of difficulty with this part while doing the last
release: https://github.com/apache/iceberg/issues/8151, as did Anton when
I consulted him offline.

For me, replacing the 'latest' branch with a tag would be the biggest win,
as it caused me the most trouble.  If we can avoid versioned docs and use
tags across the board, that would be even better.  I do think all the
versions are already tagged in GitHub on every release, if that is your
question?

Thanks,
Szehon

On Thu, Jul 27, 2023 at 2:31 AM Brian Olsen  wrote:

> Thanks Fokko,
>
> Yeah, I think to address that we would need to switch to tagging that
> prefixes the project name as a namespace within the tag space
> (e.g. pyIceberg-0.4.0, rust-0.0.1, etc…). But certainly this would result
> in an explosion of tags as we continue to introduce more projects. I’m not
> sure this makes it difficult to find things; as long as you search by the
> prefix in GitHub, it should be easy enough to find. Has anyone
> else worked on a project where this type of tagging is applied? Are there
> any performance, searching, or other implications we are missing?
>
> Bits
>
> On Thu, Jul 27, 2023 at 4:18 AM Fokko Driesprong  wrote:
>
>> Hey Brian,
>>
>> Thanks for raising this. As a release manager, I can confirm that the
>> current structure is confusing, and I can also see the community
>> struggling with this because they are willing to contribute to the docs,
>> but cannot always find the place where to do this. I think the complexity
>> of the current website mostly comes from the versioned docs. It would be
>> great if we can find a way to make this easier. Instead of using the
>> branches, we could also use the release tags and build the docs for those
>> versions.
>>
>> I think switching to mkdocs-material is a great idea. We currently also
>> use this for PyIceberg, and it works really well. My main concern is around
>> merging everything together. Should we combine Java and Python in the same
>> documentation? They have a different versioning scheme, so that would
>> create a matrix of versions. Go and Rust
>> are also in the making,
>> so that would explode at some point.
>>
>> Cheers, Fokko
>>
>> Ps. Currently, PyIceberg uses the gh-pages branch for publishing the docs
>> .
>>
>>
>> Op do 27 jul 2023 om 00:04 schreef Brian Olsen :
>>
>>> Hey all,
>>>
>>> I have some proposals I'd like to make to fixing the docs. I would want
>>> to do this in two phases.
>>>
>>> The first phase I'm proposing that we locate all the documentation
>>> (reference docs, website, and pyIceberg) back into the apache/iceberg
>>> repository. I explain my reasoning in the attached document. This phase
>>> would also update us from Hugo to MkDocs but keep all the content the same.
>>>
>>> The second phase is focused on iteratively building out the content
>>> that we've marked missing in the proposal that Sam R. created along
>>> with a recent community member, Mahfuza. We will also restructure the
>>> content to following the diátaxis method (https://diataxis.fr/).
>>>
>>>
>>> https://docs.google.com/document/d/1WJXzcwC6isfoywcLY2lZ9gZN6i0JU1I2SIZmCkumbZc/edit#heading=h.gli9mc2ghfz1
>>>
>>> Let me know what you think and bring on the questions and criticisms
>>> please! :)
>>>
>>> Bits
>>>
>>


Table owned locations

2023-08-29 Thread Szehon Ho
Hi all,

As you know, there is a recurring Iceberg issue where delete orphan file
operations may inadvertently delete another table's data if the two tables
are misconfigured to share the same location.
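
A minimal sketch of the failure mode, assuming Iceberg's Spark Actions API
(the catalog type, warehouse path, and table names here are illustrative
assumptions, not from any particular report):

import java.util.concurrent.TimeUnit;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class OrphanCleanupHazard {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("orphans").getOrCreate();
    HadoopCatalog catalog = new HadoopCatalog(
        spark.sparkContext().hadoopConfiguration(), "hdfs://warehouse");
    Table tableA = catalog.loadTable(TableIdentifier.of("db", "a"));

    // Deletes every file under tableA.location() that no snapshot of A
    // references. If another table was misconfigured to share that same
    // location, its live data files look like "orphans" to A and get
    // deleted too -- the exact hazard described above.
    SparkActions.get(spark)
        .deleteOrphanFiles(tableA)
        .olderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(3))
        .execute();
  }
}

An 'owned.locations' style declaration would let such an action verify that
the prefix it is about to clean is exclusively owned by the table before
deleting anything.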

A while back, Anton had a proposal for 'owned.locations' in:
https://github.com/apache/iceberg/issues/4159 for this and similar issues.

I would like to continue progress on this and made a design doc to organize
the idea a little better.  Please take a look and comment if interested:
https://docs.google.com/document/d/1pTJPQaHwyO0NFlLcHIrXq4gBazJmAyPnigmOPMbBRR0/edit?usp=sharing

I will also try to mention it briefly in tomorrow's community sync.

Thanks
Szehon


Re: Spec change for multi-arg transform

2024-01-28 Thread Szehon Ho
Hi,

This would not retrofit existing partition transforms, but just allow the
creation of new multi-arg transforms.  Is the concern that some
implementations never expect new transforms to be added?  Old
implementations would indeed not be able to read Iceberg tables created
with the new transforms (this is the case even today, without
multi-arg transforms).
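
For intuition only, here is a toy sketch of what a two-argument transform
could look like. The interface and names below are hypothetical; the merged
spec change only allows transforms to declare multiple source fields and
does not define any Java API:

import java.util.List;

interface MultiArgTransform<T> {
  String name();                       // e.g. "bucketV2" or "zorder"
  int arity();                         // number of source columns consumed
  T apply(List<Object> sourceValues);  // one partition value from N columns
}

final class TwoColumnBucket implements MultiArgTransform<Integer> {
  private final int numBuckets;

  TwoColumnBucket(int numBuckets) {
    this.numBuckets = numBuckets;
  }

  @Override
  public String name() {
    return "bucket2"; // hypothetical stand-in for something like bucketV2
  }

  @Override
  public int arity() {
    return 2;
  }

  @Override
  public Integer apply(List<Object> values) {
    // Combine both source values into one stable, non-negative bucket id.
    int combined = 31 * values.get(0).hashCode() + values.get(1).hashCode();
    return Math.floorMod(combined, numBuckets);
  }
}

An old reader that only knows single-argument transforms has no way to
evaluate such a partition function, which is exactly the compatibility
concern discussed in this thread.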

For the change, there were discussions about this in:
1. GitHub: https://github.com/apache/iceberg/issues/8258 (the author made an
end-to-end reference implementation there for a sample transform)
2. Google Doc discussion:
https://docs.google.com/document/d/1aDoZqRgvDOOUVAGhvKZbp5vFstjsAMY4EFCyjlxpaaw/edit#heading=h.si1nr6ftu79b
3. August 2023 meetup:
https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit#heading=h.5jsm89ozy58
(there was general consensus there)

As there seemed to be enough consensus at the design discussion and meetup,
we did not think to go for a vote, though I realize this will have missed
some in the community who did not attend.  I am still not entirely sure
this is a large enough spec change for a vote (given my understanding of
the impact), but we definitely should have sent an email to the dev list to
get more eyes on the above discussions and collect further concerns.

Thanks,
Szehon


On Sat, Jan 27, 2024 at 10:06 AM Micah Kornfield 
wrote:

> I think this is a good idea but I have concerns about compatibility. IMO,
> I think changing the cardinality of input columns is a large enough change
> that trying to retrofit it into V1 or V2 of the specification will cause
> pain for implementations not relying on reference implementation.  I
>
> As a secondary concern, I think it would be worthwhile for PMC to
> formalize the process around specification changes as these have broader
> implications for Iceberg adoption.  A model that I've seen work reasonably
> well in other communities is the following:
>
> 1.  Discussion of overall features on the mailing list (this can also be a
> pointer to the GitHub issue).
> 2.  2 reference implementations demonstrating the change is viable (it
> seems like PyIceberg is close to being fully functional enough that this
> will be viable in the near term).
> 3.  A formal vote adopting the change.
>
> But really any statement of policy around how specification changes
> occur (and what changes will be considered for backporting to finalized
> specifications) would be useful.
>
> Thanks,
> Micah
>
> On Sat, Jan 27, 2024 at 2:55 AM 叶先进  wrote:
>
>> Hi,
>>
>> This is just a heads up. Szehon and I recently made a spec change to include
>> multi-arg transforms: https://github.com/apache/iceberg/pull/8579.
>> I am sending this to get input from others who did not review the PR before
>> the Iceberg 1.5 release. Any concerns/suggestions are appreciated.
>>
>> After this change, we are working to get the API/Core and engine changes
>> into Iceberg, and more importantly the concrete multi-arg transforms,
>> such as bucketV2 or zorder.
>>
>


Re: Spec change for multi-arg transform

2024-01-30 Thread Szehon Ho
 handle them. I am planning to
>> send it after I finish the `bucketV2` spec change, WDYT, Ryan?
>>
>> And BTW, it might introduce additional overhead to support it in the V1
>> spec, I am aiming to support this in V2 by enabling a specific table
>> property.
>>
>> YE  wrote on Mon, Jan 29, 2024 at 11:27:
>>
>>> Thanks for Micah and Ryan's reply.
>>>
>>> As Szehon already pointed out, this change is to allow creation of *new*
>>> multi-arg transforms. I remember there's a discussion in the Google doc
>>> about whether to target this as a `V3` spec change; it turns out that we may
>>> support this as long as we make sure old writers cannot
>>> write to a multi-arg transformed table. So we didn't explicitly state
>>> it's a `V3
>>>
>>> > I think the PR that was merged is missing clarity around the version
>>> of the spec that requires these changes and how to handle them in v1 and v2
>>> tables.
>>>
>>>
>>>
>>>
>>>
>>> Ryan Blue  wrote on Mon, Jan 29, 2024 at 02:36:
>>>
>>>> Thanks for working on this, Szehon and AdvanceXY! I'm glad to see this
>>>> picking up for the v3 work.
>>>>
>>>> I also want to address Micah's comments and suggest how we can do
>>>> better next time. From Micah's suggestion, there are 3 steps: 1. Discuss
>>>> the feature, 2. Build 2 reference implementations, and 3. hold a vote.
>>>> That's very similar to what we typically do. The only difference is that
>>>> for step 2, we typically just build one reference implementation in the
>>>> Java library. We do vote on the large spec updates, but in this case you
>>>> haven't seen one since we haven't built the reference implementation yet.
>>>>
>>>> I think the confusion here comes from updating the spec markdown doc
>>>> prematurely. I think the PR that was merged is missing clarity around the
>>>> version of the spec that requires these changes and how to handle them in
>>>> v1 and v2 tables. It should be clear that this is a v3 feature and that v3
>>>> has not been formally adopted by a vote. We'll clean that up.
>>>>
>>>> While this is a v3 feature and must be supported for v3 compatibility,
>>>> the community usually also has guidelines for using features like this with
>>>> older spec versions. For example, before releasing v2 we allowed snapshot
>>>> ID inheritance in v1 by enabling a table property. That allowed people that
>>>> could ensure their versions supported it or who were okay with errors to
>>>> use the feature before v2 was released. I think we'd want to do the same
>>>> thing here. The reference implementation can read but not write tables that
>>>> have unknown partition transforms. We need to be clear about the details,
>>>> but I think this is generally a good idea -- I'm curious what you think
>>>> about it, Micah.
>>>>
>>>> Ryan
>>>>
>>>> On Sun, Jan 28, 2024 at 8:01 AM Szehon Ho 
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> This would not be retrofitting existing partition transforms, but just
>>>>> allowing for the creation of new multi-arg transforms.  Is the concern 
>>>>> that
>>>>> some implementations are never expecting new transforms to be added?  Old
>>>>> implementations would indeed not be able to read Iceberg tables created
>>>>> with the new transforms (this is the case even today without allowing
>>>>> multi-arg transforms).
>>>>>
>>>>> For the change, there were discussions about this from:
>>>>> 1. Github:  https://github.com/apache/iceberg/issues/8258 (author
>>>>> made an end-to-end reference implementation there for a sample transform)
>>>>> 2. Google Doc Dicussion:
>>>>> https://docs.google.com/document/d/1aDoZqRgvDOOUVAGhvKZbp5vFstjsAMY4EFCyjlxpaaw/edit#heading=h.si1nr6ftu79b
>>>>> 3. August 2023 meetup :
>>>>> https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit#heading=h.5jsm89ozy58
>>>>> (there was a general consensus there)
>>>>>
>>>>> As there seemed to be enough consensus at the design discussion and
>>>>> meetup, we did not think to go for a vote.  Though I realize this will 
>>>>> have
>>>

Re: Spec change for multi-arg transform

2024-01-30 Thread Szehon Ho
Sorry, I may have misunderstood the statement, and maybe this is specific to
multi-arg transforms. In any case, let's get a spec PR in earlier to
discuss/specify behavior for V1/V2 vs. V3.
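
For illustration, the property-gating precedent quoted below (like snapshot
ID inheritance in v1) could look roughly like this on the write path. This
is a minimal sketch under my own assumptions; the property name and guard
are hypothetical, not anything merged:

import java.util.Map;

public class MultiArgTransformGuard {
  // Assumed opt-in property name, for illustration only.
  static final String ENABLE_PROP = "write.multi-arg-transforms.enabled";

  static void checkWriteAllowed(int formatVersion, Map<String, String> props) {
    boolean enabled = Boolean.parseBoolean(props.getOrDefault(ENABLE_PROP, "false"));
    if (formatVersion < 3 && !enabled) {
      throw new UnsupportedOperationException(
          "Multi-arg partition transforms require format v3 (or opt-in via "
              + ENABLE_PROP + ")");
    }
  }
}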

Thanks
Szehon

On Tue, Jan 30, 2024 at 9:23 AM Szehon Ho  wrote:

> Thanks all for the discussion.
>
> For the specific point about any new transform being able to be read in
> current versions but only written in V3 (which I missed as well):
>
> While this is a v3 feature and must be supported for v3 compatibility, the
>> community usually also has guidelines for using features like this with
>> older spec versions. For example, before releasing v2 we allowed snapshot
>> ID inheritance in v1 by enabling a table property. That allowed people that
>> could ensure their versions supported it or who were okay with errors to
>> use the feature before v2 was released. I think we'd want to do the same
>> thing here. The reference implementation can read but not write tables that
>> have unknown partition transforms. We need to be clear about the details,
>> but I think this is generally a good idea -- I'm curious what you think
>> about it, Micah.
>
>
> I feel this implies adding versions to partition transforms (i.e., this
> partition transform is in V1, V2, or V3).  And any new partition transform
> (even a single-arg transform, not just multi-arg) will be in
> VNext.  Should we clarify that in the spec as a general point, even outside
> multi-arg transforms?  Let me know if I misunderstood.
>
> Ryan's right. I will send a new PR to clarify that this change is
>> targeting V3 spec and how V1 and V2 should handle them. I am planning to
>> send it after I finish the `bucketV2` spec change, WDYT, Ryan?
>
>
> Advancedxy, I feel we can prioritize this PR in parallel for review, as it
> seems important to get in soon as well.  Although, currently the PR
> just says that partition transforms may be multi-arg but doesn't add any
> new ones, so I feel it makes sense to clarify it at the same time
> we add bucketV2 to the spec.
>
> Thanks
> Szehon
>
> On Mon, Jan 29, 2024 at 10:41 AM Micah Kornfield 
> wrote:
>
>> Thanks YE, Ryan and Szehon for your thoughts.
>>
>> As was already touched on, my primary concern is it seems like features
>> were being added to V2 that were not forward compatible.  It seems there is
>> consensus that these will be V3 and possibly backported to V2, this
>> makes much more sense.  I've added some more detailed responses below on
>> the process.  My answers are a little verbose, so as a summary:
>>
>> IMO Specification additions have a broad enough impact that I think they
>> warrant a higher degree of formality.  This includes:
>> 1.  Documenting what is required for new spec changes.
>> 2.  Ideally each spec change would have a first class DISCUSS [4] and
>> VOTE thread on the mailing list before it is committed (the exact process
>> is likely something for the community to decide but I think this most
>> closely reflects the intent of the "Apache Way" [2]).  I think this becomes
>> more important as different github repos are spun out for different
>> implementations, as there will likely be some community members who do not
>> keep track for the main repo.
>>
>> For the change, there were discussions about this from:
>>> 1. Github:  https://github.com/apache/iceberg/issues/8258 (author made
>>> an end-to-end reference implementation there for a sample transform)
>>> 2. Google Doc Dicussion:
>>> https://docs.google.com/document/d/1aDoZqRgvDOOUVAGhvKZbp5vFstjsAMY4EFCyjlxpaaw/edit#heading=h.si1nr6ftu79b
>>> 3. August 2023 meetup :
>>> https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit#heading=h.5jsm89ozy58
>>> (there was a general consensus there)
>>
>>
>> Part of "The Apache Way" is "if it didn't happen on the mailing list it
>> didn't happen" [2]. Different Apache projects have different standards in
>> this regard, and with a lot of development becoming Github focused it seems
>> some aspects of "The Apache Way" are interpreted differently.
>>
>> I think the exact process details are less important than having an
>> agreed upon and documented process for new specification features (it
>> sounds like there might be an agreed upon process but I don't think I've
>> seen it documented, apologies if I missed it). There is a separate active
>> thread "Process for creating new Proposals", that is probably more
>> appropriate to further this part of the discussion.

Re: Materialized view integration with REST spec

2024-02-19 Thread Szehon Ho
Hi,

Great to see more discussion on the MV spec.  Actually, Jan's document
"Iceberg Materialized View Spec" has been organized, with a "Design
Questions" section to track these debates,
and it would be nice to centralize the debates there, as Micah mentions.

For Dan's question, I think this debate was tracked in "Design Question 3:
Should the storage table be registered in the catalog?". I think the
general idea there was to not expose it directly via the Catalog, as it
would then be exposed to user modification. If the engine wants to access
anything about the storage table (including audit and storage), it is of
course there via the storage table pointer. I think Walaa's point is also
good: we could expose it the way we expose metadata tables, but I am still
not sure whether some use cases of engine access remain uncovered?
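
To make the 'engine access via the storage table pointer' idea concrete,
here is a rough sketch of how an engine might resolve the storage table
without it being registered under its own catalog name. The "storage-table"
property name and the flow are assumptions for illustration, not part of
any proposal text:

import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.catalog.ViewCatalog;
import org.apache.iceberg.view.View;

public class MaterializedViewAccess {
  public static Table storageTableFor(
      ViewCatalog views, Catalog tables, TableIdentifier mvId) {
    View view = views.loadView(mvId);
    // Hypothetical pointer property; the engine resolves it internally, so
    // end users never drop/rename the storage table directly.
    String pointer = view.properties().get("storage-table");
    return tables.loadTable(TableIdentifier.parse(pointer));
  }
}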

It is true that we unfortunately did not capture Jack's initial question
(Do we really want to go with the MV = view + storage table design approach
for Iceberg MV?) as a "Design Question" in Jan's doc, as it was an implicit
assumption of 'yes', because it is the choice of Hive, Trino, and other
engines, as others have pointed out.

Jack's point about potential evolution of an MV (like adding partitioning)
is an interesting one, but definitely hard to grasp.  I think it makes sense
to add this as a separate Design Question in the doc, along with the options.
This will allow us to flesh out the alternative option(s).  Maybe Micah's
point about modifying the existing proposal to 'embed' the required table
metadata fields in the existing view metadata is one middle-ground
option.  Or we add a totally new MV object spec, separate from the
existing View spec?

Also, as Jack pointed out, it may make sense to have the REST/Catalog
API proposal in the doc to inform the above decision.

Thanks
Szehon

On Mon, Feb 19, 2024 at 4:08 PM Walaa Eldin Moustafa 
wrote:

> I think it would help if we answer the question of whether an MV is a view
> + storage table (and degree of exposing this underlying implementation) in
> the context of the user interfacing with those concepts:
>
> For the end user, interfacing with the engine APIs (e.g., through SQL),
> materialized view APIs should be almost the same as regular view APIs
> (except for operations specific to materialized views like REFRESH command
> etc). Typically, the end user interacts with the (materialized) view object
> as a view, and the engine performs the abstraction over the storage table.
>
> For the engines interfacing with Iceberg, it sounds the correct
> abstraction at this layer is indeed view + storage table, and engines could
> have access to both objects to optimize queries.
>
> So in a sense, the engine will ultimately hide most of the storage detail
> from the end user (except for advanced users who want to explicitly access
> the storage table with a modifier like "db.view.storageTable" -- and they
> can only read it), while Iceberg will expose the storage details to the
> engine catalog to use it in scans if needed. So the storage table is hidden
> or exposed based on the context/the actual users. From Iceberg point of
> view (which interacts with the engines), the storage table is exposed. Note
> that this does not necessarily mean that the storage table is registered in
> the catalog with its own independent name (e.g., where we can drop the view
> but keep the storage table and access it from the catalog). Addressing the
> storage table using a virtual namespace like "db.view.storageTable" sounds
> like a good middle ground. Anyways, end users should not need to directly
> access the storage table in most cases.
>
> Thanks,
> Walaa.
>
> On Mon, Feb 19, 2024 at 3:38 PM Micah Kornfield 
> wrote:
>
>> Hi Jack,
>>
>>
>>> In my mind, the first key point we all need to agree upon to move this
>>> design forward is*: Do we really want to go with the MV = view +
>>> storage table design approach for Iceberg MV?*
>>
>>
>> I think we want this to the extent that we do not want to redefine the
>> same concept with different representations/naming to the greatest degree
>> possible.  This is why borrowing the concepts from the view (e.g. multiple
>> ways of expressing the same view logic in different dialects) and aspects
>> of the materialized data (e.g. partitioning, ordering) feels most natural.
>> IIUC your proposal, I think you are saying maybe two modifications to the
>> existing proposals in the document:
>>
>> 1.  No separate storage table link, instead embed most of the metadata of
>> the materialized table into the MV document (the exception seems to be
>> snapshot history)
>> 2.  For snapshot history, have one unified history specific to the MV.
>>
>> This seems fairly reasonable to me and I think I can solve some
>> challenges with the existing proposal in an elegant way.  If this is
>> correct (or maybe if it isn't quite correct) perhaps you 

Re: Materialized view integration with REST spec

2024-02-21 Thread Szehon Ho
Thanks Jan.  +1 on having just one thread per question for
votes/preferences.  Where do you suggest we have it, on the discussion
question itself?  That would keep the existing threads and move the voting
there.

Also, I think it makes sense to create a Slack channel (for quick
questions and replies), and to discuss unresolved questions in next week's
sync or a separate meeting.

On Wed, Feb 21, 2024 at 12:40 AM Jan Kaul 
wrote:

> Thank you Jack for driving the consensus for the MV spec and thank you all
> for the discussion.
>
> I really like the idea about incremental consensus because we often lose
> sight in detailed discussions. As Jack mentioned, the highest priority
> question currently is:
>
> *Should the Iceberg MV be realized as a view + storage table or do we
> define a new metadata format? *To have one place for the discussion, I
> created another Question (
> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1#heading=h.y70rtfhi9qxi)
> to the Materialized View Spec google document.
>
> To improve the visibility of the arguments I would like to propose a new
> process. It would be great if all relevant information is stored in the
> document itself. Therefore I would suggest to use the comment threads for
> smaller, temporary discussions which can be resolved by adding the points
> to the main document. Please close the threads if the information was added
> to the document. Additionally, I gave you all permissions to edit the
> documents, so you can add missing points yourselves.
>
> Of course we also need threads that express our preferences (voting). I
> would suggest to keep these separate from discussions about single points
> so that they can be persisted in the document.
>
> After a phase of collecting arguments for the different designs I think it
> would make sense to have video call to have a face to face discussion.
>
> What do you think?
>
> Best wishes,
>
> Jan
> On 20.02.24 21:32, Manish Malhotra wrote:
>
> Very excited for MV to be in Iceberg :)
> Keeping it in the same doc would be helpful, to have the trail.
> But also agreed: if there are too many directions/threads, then keep
> closing the old ones when there are no more questions.
> And put down the assumptions for the initial version to move forward.
>
>
> On Tue, Feb 20, 2024 at 12:17 PM Walaa Eldin Moustafa <
> wa.moust...@gmail.com> wrote:
>
>> I would vote to keep a log in the doc with open questions, and keep the
>> doc updated with open questions as they arise/get resolved.
>>
>> On Tue, Feb 20, 2024 at 11:37 AM Jack Ye  wrote:
>>
>>> Thanks for the response from everyone!
>>>
>>> Before proceeding further, I see a few people referring back to the
>>> current design from Jan. I specifically raised this thread based on the
>>> information in the doc and a few latest discussions we had there. Because
>>> there are many threads in the doc, and each thread points further to other
>>> discussion threads in the same doc or other doc, it is now quite hard to
>>> follow and continue discussing all different topics there.
>>>
>>> I hope we can make incremental consensus of the questions in the doc
>>> through devlist, because it provides more visibility, and also a single
>>> thread instead of multiple threads going on at the same time. If we think
>>> this format is not effective, I propose that we create a new mv channel in
>>> Iceberg Slack workspace, and people interested can join and discuss all
>>> these points directly. What do we think?
>>>
>>> Best,
>>> Jack Ye
>>>
>>>
>>>
>>> On Mon, Feb 19, 2024 at 6:03 PM Szehon Ho 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Great to see more discussion on the MV spec.  Actually, Jan's document
>>>> "Iceberg Materialized View Spec"
>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A>
>>>> has been organized, with a "Design Questions" section to track these debates,
>>>> and it would be nice to centralize the debates there, as Micah mentions.
>>>>
>>>> For Dan's question, I think this debate was tracked in "Design Question
>>>> 3: Should the storage table be registered in the catalog?". I think the
>>>> general idea there was to not expose it directly via Catalog as it is then
>>>> exposed to user modification. If the engine wants to access anything about
>>>> the storage table (including audit and sto

Re: Materialized view integration with REST spec

2024-02-22 Thread Szehon Ho
Hi Jan

I agree with Walaa, I think the new Question should be narrow (View = View
+ Materialization, or new MV metadata), with 3 options (Materialization can
be metadata.json or nested object).

We can mention that with the former, we have another decision whether to
register it (and then refer to Question 3 already discussed in the
document).  Otherwise we will have n^2 options here, and it's hard to
understand.

What do you think?
Thanks
Szehon

On Thu, Feb 22, 2024 at 1:52 AM Jan Kaul 
wrote:

> My motivation for the current table is to answer the question:
>
> *Do we use a View + a Storage Table or do we define a new MV metadata
> format? *To be able to provide meaningful arguments about the View +
> Storage Table option, I split it into multiple options. Otherwise arguments
> would always need to include an additional condition like:
>
> The downside of the View + Storage Table design is that two entities have
> to be registered in the catalog, if the storage table metadata is not
> stored as a JSON file or as an internal field.
>
> We can come back to the more granular questions once the aforementioned
> question is answered.
> On 22.02.24 06:04, Walaa Eldin Moustafa wrote:
>
> Thanks Jack! I feel Question 0 is very broad, essentially capturing the
> whole design. Can we start by discussing more granular questions?
>
> On Wed, Feb 21, 2024 at 8:53 PM Jack Ye  wrote:
>
>> Thanks everyone for the help in organizing the thoughts!
>>
>> I have moved the summary of everyone's comments here also to the doc that
>> Jan linked under question 0. We can continue to have more discussions there
>> and cast votes!
>>
>> Best,
>> Jack Ye
>>
>> On Wed, Feb 21, 2024 at 12:14 PM Jan Kaul 
>>  wrote:
>>
>>> Thanks Micah, I think the voting chips are great.
>>>
>>> @Szehon, actually what I had in mind was not to have one thread per
>>> question but rather have smaller threads that can be resolved more easily.
>>> I have the fear that one thread for the current question would lead to a
>>> very long and unmanageable discussion.
>>>
>>> I've added another row to the table where everyone could provide a
>>> summary of their reason for choosing a certain design. This way we could
>>> move some of the content from the comment threads to the main document.
>>> On 21.02.24 19:58, Micah Kornfield wrote:
>>>
>>> Of course we also need threads that express our preferences (voting). I
>>>> would suggest to keep these separate from discussions about single points
>>>> so that they can be persisted in the document.
>>>
>>>
>>> Not sure if it helpful, but I added voting chips Question 0, as maybe an
>>> easier way to keep track of votes.  If it is helpful, I can add them in
>>> other places that still need a vote (I think one needs a paid Google Docs
>>> account to insert them).
>>>
>>> Thanks,
>>> Micah
>>>
>>> On Wed, Feb 21, 2024 at 10:23 AM Szehon Ho 
>>> wrote:
>>>
>>>> Thanks Jan.  +1 on having just one thread per question for
>>>> vote/preference.  Where do you suggest we have it, on the discussion
>>>> question itself?  It would be to keep the existing threads and move it
>>>> there.
>>>>
>>>> Also, I think it makes sense with making a slack channel (for quick
>>>> question, reply) , and also discuss unresolved questions in the next week's
>>>> sync or a separate meeting.
>>>>
>>>> On Wed, Feb 21, 2024 at 12:40 AM Jan Kaul 
>>>>  wrote:
>>>>
>>>>> Thank you Jack for driving the consensus for the MV spec and thank you
>>>>> all for the discussion.
>>>>>
>>>>> I really like the idea about incremental consensus because we often
>>>>> loose sight in detailed discussions. As Jack mentioned, the highest
>>>>> priority question currently is:
>>>>>
>>>>> *Should the Iceberg MV be realized as a view + storage table or do we
>>>>> define a new metadata format? *To have one place for the discussion,
>>>>> I created another Question (
>>>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1#heading=h.y70rtfhi9qxi)
>>>>> to the Materialized View Spec google document.
>>>>>
>>>>> To improve the visibility of the arguments I would like to propose a
>>>>> new process. It would be great if all relevant information is stored in 
>>>

Re: Materialized view integration with REST spec

2024-02-29 Thread Szehon Ho
Hi

Yes I mostly agree with the assessment.  To clarify a few minor points.

> is a materialized view a view and a separate table, a combination of the
> two (i.e. commits are combined), or a new metadata type?


For 'new metadata type', I am mostly considering Jack's initial proposal of
a new Catalog MV object that has two references (ViewMetadata + TableMetadata).

The arguments that I see for a combined materialized view object are:
>
>- Regular views are separate, rather than being tables with SQL and no
>data so it would be inconsistent (“Iceberg view is just a table with no
>data but with representations defined. But we did not do that.”)
>
>
>- Materialized views are different objects in DDL
>
>
>- Tables may be a superset of functionality needed for materialized
>views
>
>
>- Tables are not typically exposed to end users — but this isn’t
>required by the separate view and table option
>
For completeness, there seem to be a few additional ones (mentioned in the
Slack and above messages):

   - Lack of spec change (to ViewMetadata).  But as Jack says, it is a spec
   change (i.e., to catalogs)
   - A single call to get the View's StorageTable (versus two calls)
   - A more natural API, no opportunity for user to call
   Catalog.dropTable() and renameTable() on storage table


*Thoughts:* I think the long discussion sessions we had on Slack
were fruitful for me, as seeing the API clarified some things.

I was initially more in favor of MV being a new metadata type
(TableMetadata + ViewMetadata).  But seeing most of the MV operations end
up being ViewCatalog or Catalog operations, I am starting to think API-wise
that it may not align with the new metadata type (unless we define
MVCatalog and /MV REST endpoints, which then are boilerplate wrappers).
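
To illustrate the boilerplate concern, here is a hypothetical sketch of the
parallel API surface a new metadata type would imply. None of these
interfaces exist in Iceberg; the shapes are my assumptions:

import java.util.Map;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.view.View;

interface MaterializedView {
  View view();                      // the SQL/representations half
  Table storageTable();             // the precomputed-data half
  Map<String, String> properties();
}

interface MaterializedViewCatalog {
  MaterializedView loadMaterializedView(TableIdentifier id);
  boolean dropMaterializedView(TableIdentifier id);
  void renameMaterializedView(TableIdentifier from, TableIdentifier to);
  // ... mostly thin wrappers over existing Catalog and ViewCatalog
  // operations, which is the boilerplate concern above.
}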

Initially, one question I had for the 'view and a separate table' option was
how to make this table reference (metadata.json or a catalog reference).  In
the previous option, we had a precedent of Catalog references to Metadata,
but not pointers between Metadatas.  I initially saw the proposed Catalog
TableIdentifier pointer as 'polluting' ViewMetadata with catalog concerns
(I saw Catalog and ViewCatalog as a layer above TableMetadata and
ViewMetadata).  But I think Dan in the Slack made a fair point that
ViewMetadata is already tightly bound to a Catalog.  In this case, I
think this approach has its merits as well, in aligning the Catalog APIs
with the metadata.

Thanks
Szehon



On Thu, Feb 29, 2024 at 5:45 AM Jan Kaul 
wrote:

> Hi all,
>
> I would like to provide my perspective on the question of what a
> materialized view is and elaborate on Jack's recent proposal to view a
> materialized view as a catalog concept.
>
> Firstly, let's look at the role of the catalog. Every entity in the
> catalog has a *unique identifier*, and the catalog provides methods to
> create, load, and update these entities. An important thing to note is that
> the catalog methods exhibit two different behaviors: the *create and load
> methods deal with the entire entity*, while the *update(commit) method
> only deals with partial changes* to the entities.
>
> In the context of our current discussion, materialized view (MV) metadata
> is a union of view and table metadata. The fact that the update method
> deals only with partial changes, enables us to *reuse the existing
> methods for updating tables and views*. For updates we don't have to
> define what constitutes an entire materialized view. Changes to a
> materialized view targeting the properties related to the view metadata
> could use the update(commit) view method. Similarly, changes targeting the
> properties related to the table metadata could use the update(commit) table
> method. This is great news because we don't have to redefine view and table
> commits (requirements, updates).
> This is shown in the fact that Jack uses the same operation to update the
> storage table for Option 1 and 3:
>
> // REST: POST /namespaces/db1/tables/mv1?materializedView=true
> // non-REST: update JSON files at table_metadata_location
> storageTable.newAppend().appendFile(...).commit();
>
> The open question is *whether the create and load methods should treat
> the properties that constitute the MV metadata as two entities (View +
> Table) or one entity (new MV object)*. This is all part of Jack's
> proposal, where Option 1 proposes a new MV object, and Option 3 proposes
> two separate entities. The advantage of Option 1 is that it doesn't require
> two operations to load the metadata. On the other hand, the advantage of
> Option 3 is that no new operations or catalogs have to be defined.
>
> In my opinion, defining a new representation for materialized views
> (Option 1) is generally the cleaner solution. However, I see a path where
> we could first introduce Option 3 and still have the possibility to
> transition to Option 1 if needed. The great thing about Option 3 is that it
> only requires minor changes t

Re: [VOTE] Release Apache Iceberg 1.5.0 RC4

2024-03-01 Thread Szehon Ho
+1 (binding)

- Verified signature
- Verified checksum
- RAT check
- Compiled
- Manually ran basic queries on Spark 3.5

On Fri, Mar 1, 2024 at 6:13 AM Fokko Driesprong  wrote:

> +1 (binding)
>
> - Checked checksum and signature
> - Ran a modified version of dbt-spark to take advantage of the views, and
> it worked like a charm! 🥳
>
> Cheers, Fokko
>
> Op vr 1 mrt 2024 om 06:43 schreef Ajantha Bhat :
>
>> Gentle reminder.
>>
>> On Wed, Feb 28, 2024 at 8:34 PM Eduard Tudenhoefner 
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> * validated checksum and signature
>>> * checked license docs & ran RAT checks
>>> * ran build and tests with JDK11
>>> * built new docker images and ran through
>>> https://iceberg.apache.org/spark-quickstart/
>>> * tested with Trino & Presto
>>> * tested view support with Spark 3.5 + JDBC/REST catalog
>>> * tested view behavior when creating/reading/dropping views from
>>> Spark/Trino using the diff from
>>> https://github.com/trinodb/trino/pull/19818
>>>
>>> Eduard
>>>
>>> On Wed, Feb 28, 2024 at 1:55 PM Jean-Baptiste Onofré 
>>> wrote:
>>>
 +1 (non binding)

 I checked:
 - Signature and checksum are OK
 - Build is OK on the source distribution
 - ASF headers are present
 - No binary file found in the source distribution
 - Tested on iceland (sample project) + trino and also JDBC Catalog

 Thanks !
 Regards
 JB

 On Tue, Feb 27, 2024 at 1:16 PM Ajantha Bhat 
 wrote:
 >
 > Hi Everyone,
 >
 > I propose that we release the following RC as the official Apache
 Iceberg 1.5.0 release.
 >
 > The commit ID is e39ec185d7879c1a310769d33e0b1b6ad12486a9
 > * This corresponds to the tag: apache-iceberg-1.5.0-rc4
 > * https://github.com/apache/iceberg/commits/apache-iceberg-1.5.0-rc4
 > *
 https://github.com/apache/iceberg/tree/e39ec185d7879c1a310769d33e0b1b6ad12486a9
 >
 > The release tarball, signature, and checksums are here:
 > *
 https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-1.5.0-rc4
 >
 > You can find the KEYS file here:
 > * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
 >
 > Convenience binary artifacts are staged on Nexus. The Maven
 repository URL is:
 > *
 https://repository.apache.org/content/repositories/orgapacheiceberg-1158/
 >
 > Please download, verify, and test.
 >
 > Please vote in the next 72 hours.
 >
 > [ ] +1 Release this as Apache Iceberg 1.5.0
 > [ ] +0
 > [ ] -1 Do not release this because...
 >
 > Only PMC members have binding votes, but other community members are
 encouraged to cast
 > non-binding votes. This vote will pass if there are 3 binding +1
 votes and more binding
 > +1 votes than -1 votes.
 >
 > - Ajantha

>>>


Re: New committer: Bryan Keller

2024-03-05 Thread Szehon Ho
Congratulations Bryan, well deserved, great work on Iceberg !

On Tue, Mar 5, 2024 at 8:14 AM Jack Ye  wrote:

> Congrats Bryan!
>
> -Jack
>
> On Tue, Mar 5, 2024 at 7:33 AM Amogh Jahagirdar  wrote:
>
>> Congratulations Bryan! Very well deserved, thank you for all your
>> contributions!
>>
>> On Tue, Mar 5, 2024 at 7:29 AM Steven Wu  wrote:
>>
>>> Bryan, congratulations and thank you for your many contributions.
>>>
>>> On Tue, Mar 5, 2024 at 5:54 AM Bryan Keller  wrote:
>>>
 Thanks everyone! I really appreciate it, Iceberg has been inspiring to
 me, both the project itself and the people involved, so I’m thankful to
 have been given the opportunity to contribute!

 On Tue, Mar 5, 2024 at 5:28 AM Mehul Batra 
 wrote:

> Congratulations Bryan!
>
> On Tue, Mar 5, 2024 at 1:50 PM Fokko Driesprong 
> wrote:
>
>> Hi everyone,
>>
>> The Project Management Committee (PMC) for Apache Iceberg has invited
>> Bryan Keller to become a committer and we are pleased to announce that he
>> has accepted.
>>
>> Bryan was contributing to Iceberg before it was even open-source, did
>> a lot of work on the topic of metadata generation, and is now leading the
>> effort of migrating the Kafka Connect integration into OSS Iceberg.
>>
>> Being a committer enables easier contribution to the project since
>> there is no need to go via the patch submission process. This should 
>> enable
>> better productivity. A PMC member helps manage and guide the direction of
>> the project.
>>
>> Please join me in congratulating Bryan.
>>
>> Cheers,
>> Fokko
>>
>


Re: [VOTE] Release Apache Iceberg 1.5.0 RC6

2024-03-08 Thread Szehon Ho
+1 (binding)

* Verified signature
* Verified checksum
* RAT check
* built JDK 11
* Ran basic tests on Spark 3.5

Thanks
Szehon

On Fri, Mar 8, 2024 at 5:50 PM Amogh Jahagirdar  wrote:

> +1 non-binding
>
> Verified signatures,checksums,RAT checks, build, and tests with JDK11. I
> also ran ad-hoc tests for views in Trino with the rest catalog.
>
> Thanks,
>
> Amogh Jahagirdar
>
> On Fri, Mar 8, 2024 at 5:04 PM Ryan Blue  wrote:
>
>> +1 (binding)
>>
>> - Normal tarball verification
>> - Read from my broken view successfully
>>
>> On Fri, Mar 8, 2024 at 3:07 PM Daniel Weeks  wrote:
>>
>>> +1 (binding)
>>>
>>> Verified sigs/sums/license/build/tests (Java 17)
>>>
>>> -Dan
>>>
>>> On Thu, Mar 7, 2024 at 2:10 PM Hussein Awala  wrote:
>>>
 +1 (non-binding)
 - checked checksum and signature
 - built from source with jdk11
 - tested read and write with Spark 3.5.1 and Glue catalog

 All looks good

 On Thu, Mar 7, 2024 at 10:49 PM Drew  wrote:

> +1 (non-binding)
>
> - verified signature and checksum
> - verified RAT license check
> - verified build/tests passing with JDK17
> - ran some manual tests on Spark3.5 with GlueCatalog
>
> Drew
>
> On Thu, Mar 7, 2024 at 4:38 AM Ajantha Bhat 
> wrote:
>
>> +1 (non-binding)
>>
>> * validated checksum and signature
>> * checked license docs & ran RAT checks
>> * ran build and tests with JDK11
>> * *verified view support for Nessie catalog with Spark 3.5.*
>> * *verified this RC against Trino
>> (https://github.com/trinodb/trino/pull/20957
>> )*
>>
>> - Ajantha
>>
>>
>> On Wed, Mar 6, 2024 at 7:25 PM Jean-Baptiste Onofré 
>> wrote:
>>
>>> +1 (non binding)
>>>
>>> - checksums and signatures are OK
>>> - ASF headers are present
>>> - No unexpected binary files in the source distribution
>>> - Build OK with JDK11
>>> - JdbcCatalog tested on Trino and Iceland
>>> - No unexpected artifact distributed
>>>
>>> Thanks !
>>>
>>> Regards
>>> JB
>>>
>>> On Wed, Mar 6, 2024 at 12:04 AM Ajantha Bhat 
>>> wrote:
>>> >
>>> > Hi Everyone,
>>> >
>>> > I propose that we release the following RC as the official Apache
>>> Iceberg 1.5.0 release.
>>> >
>>> > The commit ID is 2519ab43d654927802cc02e19c917ce90e8e0265
>>> > * This corresponds to the tag: apache-iceberg-1.5.0-rc6
>>> > *
>>> https://github.com/apache/iceberg/commits/apache-iceberg-1.5.0-rc6
>>> > *
>>> https://github.com/apache/iceberg/tree/2519ab43d654927802cc02e19c917ce90e8e0265
>>> >
>>> > The release tarball, signature, and checksums are here:
>>> > *
>>> https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-1.5.0-rc6
>>> >
>>> > You can find the KEYS file here:
>>> > * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>>> >
>>> > Convenience binary artifacts are staged on Nexus. The Maven
>>> repository URL is:
>>> > *
>>> https://repository.apache.org/content/repositories/orgapacheiceberg-1161/
>>> >
>>> > Please download, verify, and test.
>>> >
>>> > Please vote in the next 72 hours.
>>> >
>>> > [ ] +1 Release this as Apache Iceberg 1.5.0
>>> > [ ] +0
>>> > [ ] -1 Do not release this because...
>>> >
>>> > Only PMC members have binding votes, but other community members
>>> are encouraged to cast
>>> > non-binding votes. This vote will pass if there are 3 binding +1
>>> votes and more binding
>>> > +1 votes than -1 votes.
>>> >
>>> > - Ajantha
>>>
>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>


Re: New committer: Renjie Liu

2024-03-11 Thread Szehon Ho
Congratulations!

On Mon, Mar 11, 2024 at 12:43 PM Jack Ye  wrote:

> Congratulations Renjie!
>
> Best,
> Jack Ye
>
> On Mon, Mar 11, 2024, 8:24 AM Ryan Blue  wrote:
>
>> Congratulations, Renjie! Thanks for all your contributions!
>>
>> On Mon, Mar 11, 2024 at 12:52 AM Eduard Tudenhoefner 
>> wrote:
>>
>>> Congrats Renjie!
>>>
>>> On Mon, Mar 11, 2024 at 7:49 AM Zheng Hu  wrote:
>>>
 Congrats Renjie !

 On Mon, Mar 11, 2024 at 9:39 AM Ajantha Bhat 
 wrote:

> Congrats Renjie.
>
> On Mon, Mar 11, 2024 at 6:44 AM Amogh Jahagirdar 
> wrote:
>
>> Congrats Renjie! Very well deserved, excited to see Rust support grow
>> further!
>>
>> Thanks,
>>
>> Amogh Jahagirdar
>>
>> On Sun, Mar 10, 2024 at 5:57 PM Brian Olsen 
>> wrote:
>>
>>> Renjie,
>>>
>>> I’ve already enjoyed all of our interactions, All I’ve heard in my
>>> first year heavily interacting with the data community is asking about 
>>> Rust
>>> support. I’m looking forward to seeing Iceberg Rust take Iceberg 
>>> adoption
>>> to the top! Well deserved!
>>>
>>> On Sun, Mar 10, 2024 at 7:40 PM Renjie Liu 
>>> wrote:
>>>
 Thanks, everyone!

 On Mon, Mar 11, 2024 at 12:45 AM Jan Kaul
  wrote:

> Congrats!
>
> Am 09.03.2024 22:38 schrieb Micah Kornfield  >:
>
> Congrats
>
> On Saturday, March 9, 2024, Hussein Awala 
> wrote:
>
> Congrats Renjie!
>
> On Sat, Mar 9, 2024 at 8:55 PM Yufei Gu 
> wrote:
>
> Congratulations and thanks for the great work in rust iceberg,
> Renjie!
>
> Yufei
>
>
> On Sat, Mar 9, 2024 at 11:39 AM Steven Wu 
> wrote:
>
> Congrats, Renjie!
>
> On Sat, Mar 9, 2024 at 7:18 AM himadri pal 
> wrote:
>
> Congratulations Renjie.
>
> Regards,
> Himadri Pal
>
>
> On Fri, Mar 8, 2024 at 11:56 PM Fokko Driesprong 
> wrote:
>
> Hi everyone,
>
> The Project Management Committee (PMC) for Apache Iceberg has
> invited Renjie Liu to become a committer and we are pleased to 
> announce
> that he has accepted. We're very excited to have Renjie as a 
> committer as
> he's leading the effort of bringing Iceberg to the Rust world.
>
> Being a committer enables easier contribution to the project since
> there is no need to go via the patch submission process. This should 
> enable
> better productivity. A PMC member helps manage and guide the 
> direction of
> the project.
>
> Please join me in congratulating Renjie.
>
> Cheers,
> Fokko
>
>
>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>


Re: Materialized view integration with REST spec

2024-03-22 Thread Szehon Ho
>>>>>>>> which we can discuss more later after the current topic is
>>>>>>>>>>>>> resolved.
>>>>>>>>>>>>> 2. I removed the considerations for REST integration since
>>>>>>>>>>>>> from the other thread we have clarified that they should be 
>>>>>>>>>>>>> considered
>>>>>>>>>>>>> completely separately.
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Why I come as a proponent of having a new MV object with
>>>>>>>>>>>>> table and view metadata file pointer*
>>>>>>>>>>>>>
>>>>>>>>>>>>> In my sheet, there are 3 options that do not have major
>>>>>>>>>>>>> problems:
>>>>>>>>>>>>> Option 2: Add storage table metadata file pointer in view
>>>>>>>>>>>>> object
>>>>>>>>>>>>> Option 5: New MV object with table and view metadata file
>>>>>>>>>>>>> pointer
>>>>>>>>>>>>> Option 6: New MV spec with table and view metadata
>>>>>>>>>>>>>
>>>>>>>>>>>>> I originally excluded option 2 because I think it does not
>>>>>>>>>>>>> align with the REST spec, but after the other discussion thread 
>>>>>>>>>>>>> about "Inconsistency
>>>>>>>>>>>>> between REST spec and table/view spec", I think my original 
>>>>>>>>>>>>> concern no
>>>>>>>>>>>>> longer holds true so now I put it back. And based on my
>>>>>>>>>>>>> personal preference that MV is an independent object that should 
>>>>>>>>>>>>> be
>>>>>>>>>>>>> separated from view and table, plus the fact that option 5 is 
>>>>>>>>>>>>> probably less
>>>>>>>>>>>>> work than option 6 for implementation, that is how I come as a 
>>>>>>>>>>>>> proponent of
>>>>>>>>>>>>> option 5 at this moment.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Regarding Ryan's evaluation framework *
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think we need to reconcile this sheet with Ryan's evaluation
>>>>>>>>>>>>> framework. That framework categorization puts option 2, 3, 4, 5, 
>>>>>>>>>>>>> 6 all
>>>>>>>>>>>>> under the same category of "A combination of a view and a
>>>>>>>>>>>>> table" and concludes that they don't have any advantage for the 
>>>>>>>>>>>>> same set of
>>>>>>>>>>>>> reasons. But those reasons are not really convincing to me so 
>>>>>>>>>>>>> let's talk
>>>>>>>>>>>>> about them in more detail.
>>>>>>>>>>>>>
>>>>>>>>>>>>> (1) You said "I don’t see a reason why a combined view and
>>>>>>>>>>>>> table is advantageous" as "this would cause unnecessary 
>>>>>>>>>>>>> dependence between
>>>>>>>>>>>>> the view and table in catalogs."  What dependency exactly do you 
>>>>>>>>>>>>> mean here?
>>>>>>>>>>>>> And why is that unnecessary, given there has to be some sort of 
>>>>>>>>>>>>> dependency
>>>>>>>>>>>>> anyway unless we go with option 5 or 6?
>>>>>>>>>>>>>
>>>>>>>>>>>>> (2) You said "I guess there’s an argument that you could load
>>>>>>>>>>>>> both table and view metadata locations at the same time. That 
>>>>>>>>>>

Re: Materialized view integration with REST spec

2024-03-22 Thread Szehon Ho
Sounds good to me, can you start a document then, and we can all contribute
there?

On Fri, Mar 22, 2024 at 10:47 AM Walaa Eldin Moustafa 
wrote:

> Let us list the pros and cons as originally planned. I can help as well if
> needed. We can get started and have Jack chime in when he is back?
>
> On Fri, Mar 22, 2024 at 10:35 AM Szehon Ho 
> wrote:
>
>> Hi
>>
>> My understanding was that last time it was still unresolved, and the action
>> item was on Jack and/or Jan to make a shorter document.  I think the
>> debate has now boiled down to Ryan's three options:
>>
>>1. separate table/view
>>2. combination of table/view tied together via commit
>>3. new metadata type
>>
>>  with probably the first and third being the main contenders. My
>> understanding was we wanted a table of pros/cons between (1) and (3),
>> presumably giving folks a chance to address the cons, before the next
>> meeting.
>>
>> Jack (the main proponent of option (3)) just went on paternity leave, so I'm
>> not sure whether someone from Amazon with context on Jack's thinking can
>> continue that train of thought?  Otherwise maybe Jan can give it
>> a shot?  Else, I will be out and can't make the next Iceberg sync, but can
>> prepare one for the one after that, if needed.
>>
>> Re: the 'new' proposal process, I'm not sure we are ready for a formal one,
>> given the deadlock between the two options, but I'm open to that as well, to
>> make a proposal based on one of the options above.  What do folks think?
>>
>> Thanks,
>> Szehon
>>
>> On Fri, Mar 22, 2024 at 3:15 AM Renjie Liu 
>> wrote:
>>
>>> +1
>>>
>>> On Fri, Mar 22, 2024 at 16:42 Jean-Baptiste Onofré 
>>> wrote:
>>>
>>>> Hi Renjie,
>>>>
>>>> We discussed the MV proposal, without yet reaching any conclusion.
>>>>
>>>> I propose:
>>>> - to use the "new" proposal process in place (creating a GH issue with
>>>> proposal flag, with link to the document)
>>>> - use the document and/or GH issue to add comments
>>>> - finalize the document heading to a vote (to get consensus)
>>>>
>>>> Thoughts ?
>>>>
>>>> NB: I will follow up with "stale PR/proposal" PR to be sure we are
>>>> moving forward ;)
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On Fri, Mar 22, 2024 at 4:29 AM Renjie Liu 
>>>> wrote:
>>>>
>>>>> Hi:
>>>>>
>>>>> Sorry I didn't make it to join the last community sync. Did we reach
>>>>> any conclusion about mv spec?
>>>>>
>>>>> On Tue, Mar 5, 2024 at 11:28 PM himadri pal  wrote:
>>>>>
>>>>>> For me the calendar link did not work on mobile, but I was able to
>>>>>> add the dev Google calendar from
>>>>>> https://iceberg.apache.org/community/#iceberg-community-events by
>>>>>> accessing it from a laptop.
>>>>>>
>>>>>> Regards,
>>>>>> Himadri Pal
>>>>>>
>>>>>>
>>>>>> On Mon, Mar 4, 2024 at 4:43 PM Walaa Eldin Moustafa <
>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Jack! I think the images are stripped from the message, but
>>>>>>> they are there on the doc
>>>>>>> <https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0>
>>>>>>>  if
>>>>>>> someone wants to check them out (I have left some comments while there).
>>>>>>>
>>>>>>> Also I no longer see the community sync calendar
>>>>>>> https://iceberg.apache.org/community/#slack, so it is unclear when
>>>>>>> the meeting is (and we do not have the link).
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Walaa.
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Mar 4, 2024 at 9:58 AM Jack Ye  wrote:
>>>>>>>
>>>>>>>> Thanks Jan! +1 for everyone to take a look before the discussion,
>>>>>>>> and see if there are any missing options or major arguments.
>>>>>>>>
>>>>>>>> I have also added the images regarding all the options, it 

Re: [VOTE] Release Apache Iceberg 1.5.1 RC0

2024-04-22 Thread Szehon Ho
+1 (binding)

* Verify signature
* Verify checksum
* Verify licenses
* Build and run basic test with Spark 3.5

Thanks
Szehon

On Sun, Apr 21, 2024 at 11:45 PM Ajantha Bhat  wrote:

> +1 (non-binding)
>
> * validated checksum and signature
> * checked license docs & ran RAT checks
> * ran build and tests with JDK11
>
> - Ajantha
>
> On Mon, Apr 22, 2024 at 2:49 AM Hussein Awala  wrote:
>
>> +1 (non-binding)
>> - checked signatures, checksums and licences
>> - tested with Spark 3.5.1 and Glue and Hive catalogs
>>
>> On Sunday, April 21, 2024, Jean-Baptiste Onofré  wrote:
>>
>>> +1 (non binding)
>>>
>>> I checked the fixes on the JDBC Catalog.
>>>
>>> Regards
>>> JB
>>>
>>> On Fri, Apr 19, 2024 at 01:07, Amogh Jahagirdar  wrote:
>>>
 Hi Everyone,

 I propose that we release the following RC as the official Apache
 Iceberg 1.5.1 release.

 The commit ID is cbb853073e681b4075d7c8707610dceecbee3a82
 * This corresponds to the tag: apache-iceberg-1.5.1-rc0
 * https://github.com/apache/iceberg/commits/apache-iceberg-1.5.1-rc0
 *
 https://github.com/apache/iceberg/tree/cbb853073e681b4075d7c8707610dceecbee3a82

 The release tarball, signature, and checksums are here:
 *
 https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-1.5.1-rc0

 You can find the KEYS file here:
 * https://dist.apache.org/repos/dist/dev/iceberg/KEYS

 Convenience binary artifacts are staged on Nexus. The Maven repository
 URL is:
 * https://repository.apache.org/content/repositories/orgapacheiceberg-1162/

 Please download, verify, and test.

 Please vote in the next 72 hours.

 [ ] +1 Release this as Apache Iceberg 1.5.1
 [ ] +0
 [ ] -1 Do not release this because...

 Only PMC members have binding votes, but other community members are
 encouraged to cast
 non-binding votes. This vote will pass if there are 3 binding +1 votes
 and more binding
 +1 votes than -1 votes.

>>>


Re: [Proposal] Add support for Materialized Views in Iceberg

2024-04-22 Thread Szehon Ho
+1 for the approach given it reduces the work.  On this, as it exposes
storage tables to the user catalog, I was mainly thinking we should have a
common suffix/naming pattern for storage tables across catalogs.  The Netflix
approach sounds good to me.

Hope we can continue the proposal, as there are still decisions on how to
standardize other metadata, like MV lineage.

Thanks,
Szehon

On Fri, Apr 19, 2024 at 6:17 PM John Zhuge  wrote:

> +1 on separate view and table metadata
>
> I'd like to share our experience of such a design at Netflix for years.
> The changes to the view spec are minimal and there are no changes to the
> Iceberg table metadata other than tracking an additional table property for
> capturing freshness. The storage tables have a specific suffix and a naming
> pattern. It is convenient to use existing toolings on these tables. We have
> not encountered any fundamental issues with this modeling.
>
> On Fri, Apr 19, 2024 at 5:49 AM Renjie Liu 
> wrote:
>
>> +1 for this proposal.
>>
>> On Fri, Apr 19, 2024 at 3:40 PM Ajantha Bhat 
>> wrote:
>>
>>> +1 for the proposal.
>>>
>>> - Ajantha
>>>
>>> On Fri, Apr 19, 2024 at 7:29 AM Benny Chow  wrote:
>>>
 +1 for separate view and table objects.  Walaa's Spark
 implementation demonstrates how little change it takes on the Iceberg APIs
 to start sharing MVs between engines.

 Thanks
 Benny

 On Thu, Apr 18, 2024 at 9:52 AM Walaa Eldin Moustafa <
 wa.moust...@gmail.com> wrote:

> Hi everyone,
>
> I would like to make a proposal for issue [1] to support materialized
> views in Iceberg. The support leverages two separate objects, an Iceberg
> view and an Iceberg table to implement materialized views. Each object
> retains relevant metadata to support the MV operations. An initial design,
> which we can refine, is detailed in the description section of this PR 
> [2].
>
> This proposal is the outcome of extensive community discussions in
> various forums [3, 4, 5, 6, 7].
>
> Please respond with your recommendation:
> +1 if you support moving forward with the two separate objects model.
> 0 if you are neutral.
> -1 if you disagree with the two separate objects model.
>
> Thanks,
> Walaa.
>
> [1] https://github.com/apache/iceberg/issues/10043
> [2] https://github.com/apache/iceberg/pull/9830
> [3]
> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY
> [4] https://github.com/apache/iceberg/issues/6420
> [5]
> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF
> [6] https://lists.apache.org/thread/tb3wcs7czjvjbq9y1qtr87g9s95ky5zh
> [7] https://lists.apache.org/thread/l6cvrp4r1001k08cy2ypybzy2kgxpt1y
>

>
> --
> John Zhuge
>


[Discuss] Geospatial Support

2024-05-01 Thread Szehon Ho
Hi everyone,

We have created a formal proposal for adding Geospatial support to Iceberg.

Please read the following for details.

   - Github Proposal : https://github.com/apache/iceberg/issues/10260
   - Proposal Doc:
   
https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI


Note that this proposal is built on existing extensive research and POC
implementations (Geolake, Havasu).  Special thanks to Jia Yu and Kristin
Cowalcijk from Wherobots/Geolake for extensive consultation and help in
writing this proposal, as well as support from Yuanyuan Zhang from Geolake.

We would love to get more feedback for this proposal from the wider
community and eventually discuss this in a community sync.

Thanks
Szehon


Re: Materialized Views: Next Steps

2024-05-09 Thread Szehon Ho
Thanks Walaa for driving it forward, looking forward to thinking about
implementation of Materialized Views.

I see Jan's point: the PR spec change is similar but does not seem to be
completely aligned with the Draft Spec in the design doc:
https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/
.  I left my comments on the PR for those sections, with links to the
differences.  I think most of the Draft Spec proposal is still applicable
after the decision to have separate Table and View objects.  It would be
interesting to at least drill a bit further into why we did not choose the
approach in the Draft Spec and chose another way.

Thanks
Szehon

On Wed, May 8, 2024 at 4:48 AM Jan Kaul  wrote:

> Well, everybody that actively contributed to the discussion on the
> original google doc was in consensus. That's why I brought up the topic at
> the Community Sync on the 2024-02-14 (https://youtu.be/uAQVGd5zV4I?t=890)
> to raise the awareness of the broader community. After which the discussion
> about the storage model started. I don't think that the discussion about a
> single aspect of a proposal should invalidate all other aspects of the
> proposal.
>
> Regardless, the state of the proposal from the original google doc
> contains a lot of valuable contributions from Micah, Szehon, Jack, Dan,
> yourself and others and it should at least provide the basis for any
> further discussion. I don't think it's effective to start with a completely
> different design because we are bound to have the same discussions all over
> again.
>
> Thanks, Jan
> On 08.05.24 12:11, Walaa Eldin Moustafa wrote:
>
> The only consensus the community had was on the object model through the
> most recent voting thread [1]. This kind of consensus was not present
> during the doc discussions, and this should be evident from the fact the
> last doc state listed 5 alternatives with no particular conclusion. I am
> not quite sure what type of consensus we are referring to here given all
> the follow up discussions, alternatives, etc.
>
> Due to the separate object model, the PR is fundamentally different from
> the doc in the sense it does not propose a new metadata model but rather
> formalizes some new table and view properties related to MVs. That is also
> one reason there are no repeated discussions. That said, if you feel there
> is a repeated discussion (which I do not see so far), it would be best to
> link the relevant discussion from the doc in a comment.
>
> Happy to move the discussion elsewhere if there is sufficient support for
> this idea, but as things stand, I do not see this as an efficient way to
> make progress. It sounds we have been re-emphasizing the same points in the
> last two replies, so I will let others chime in at this point.
>
> [1] https://lists.apache.org/thread/rotmqzmwk5jrcsyxhzjhrvcjs5v3yjcc
>
> Thanks,
> Walaa.
>
>
> On Wed, May 8, 2024 at 2:31 AM Jan Kaul 
>  wrote:
>
>> The original google doc
>> discussed multiple aspects of the Materialized View spec. One was the
>> storage model while others were related to the metadata. After we (Micah,
>> Szehon, you, me) reached consensus in the google doc, Jack raised his
>> concern about the storage model and the long discussion about the storage
>> model started. Now we truly reached consensus about the storage model,
>> which is now also reflected in the google doc. All other aspects from the
>> google doc about the metadata weren't questioned and still represent the
>> consensus.
>>
>> I would like to *avoid repeating the discussions* in your PR that we
>> already had in the google doc. Especially since we reached consensus which
>> took a considerable amount of time.
>>
>> Thanks, Jan
>> On 08.05.24 10:21, Walaa Eldin Moustafa wrote:
>>
>> Thanks Jan. I think we moved on to more alignment steps beyond that doc a
>> while ago. After that doc, we have discussed the topic further in 2 dev
>> list threads and one more doc
>> 
>> (with strictly two options for the storage model to consider). Moreover,
>> the original doc grew to 14 pages long with one section comparing 5 design
>> alternatives, which made things harder to reach consensus. The lack of
>> consensus is what partly led up to the subsequent discussions and call for
>> a more focused approach to reach consensus. If we already have a consensus
>> on the storage model (separate tables and views), I think we should take
>> things further and have continued focused discussions on the specific
>> metadata in the form of a PR. I have included all previous discussions
>> including the original doc and issue as references in the PR description.
>> Please let me know if this works. Happy to hear others' thoughts on the
>> best way to move forward.
>>
>> Thanks,
>> Walaa.
>>
>>
>> On Wed, May 8, 2024 

Re: Materialized Views: Next Steps

2024-05-09 Thread Szehon Ho
Hi Walaa,

I agree, I definitely do not want yet another PR/doc where discussion
happens, as it's already quite spread out :)  But I did want to clarify
some points before we get started on the discussion on your PR.

With reusing the table and view objects, we are not changing the existing
> metadata of either table or view spec but rather introduce new properties
> and formalize them to express materialized views
>

On this point, I am not 100% sure that choosing to represent a
MaterializedView as a separate View + Table object precludes us from adding
to the metadata of the Table or View as the Draft Spec suggested, though.  I
think this point was discussed in Jan's initial PR, with a good point from
Ryan (https://github.com/apache/iceberg/issues/6420#issuecomment-1369280546)
that using Table Properties to track lineage is fairly brittle, and that
having it formalized in the Iceberg metadata is cleaner; that was thus the
direction of the Draft Spec in the design doc.  What do people think?

Thanks
Szehon



On Thu, May 9, 2024 at 5:35 PM Walaa Eldin Moustafa 
wrote:

> Thanks Szehon.
>
> The reason for the difference is that the proposal in the Google doc is
> based on a new MV model, hence, new metadata fields and a new metadata
> model were being introduced (with types, optionality, etc). With reusing
> the table and view objects, we are not changing the existing metadata of
> either table or view spec but rather introduce new properties and formalize
> them to express materialized views. This would be the answer to most of the
> questions you posted on the PR (besides some naming questions, which I
> think should be straightforward).
>
> With that fundamental difference, we cannot lift and shift what is in the
> doc to any PR. Further, having consensus on separate table and view objects
> contradicts with the point being made on having consensus on the doc. We
> might have had agreements on some elements, but definitely not on the whole
> doc, proven by the follow ups (also as a community, not individuals).
>
> Therefore: we need a new space to discuss the separate table and view
> properties.
>
> Is the question whether to:
> 1- Create a new doc
> 2- Create a new PR?
>
> I feel a PR is the most effective way, especially given the fact that we
> discussed the topic a lot by now. If we agree, we can continue the
> discussion on the PR, else, we can create a doc.
>
> Thanks,
> Walaa.
>
>
> On Thu, May 9, 2024 at 4:39 PM Szehon Ho  wrote:
>
>> Thanks Walaa for driving it forward, looking forward to thinking about
>> implementation of Materialized Views.
>>
>> I see Jan's point, the PR spec change is similar but does not seem to be
>> completely aligned with the Draft Spec in the design doc:
>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/
>> .  I left my comments on PR of those sections with the links to the
>> difference.  I think most of those Draft Spec proposal is still applicable
>> after the decision to have separate Table and View objects  It will be
>> interesting to at least see drill a bit further why we did not choose the
>> approach in the Draft Spec and chose another way.
>>
>> Thanks
>> Szehon
>>
>> On Wed, May 8, 2024 at 4:48 AM Jan Kaul 
>> wrote:
>>
>>> Well, everybody that actively contributed to the discussion on the
>>> original google doc was in consensus. That's why I brought up the topic at
>>> the Community Sync on the 2024-02-14 (https://youtu.be/uAQVGd5zV4I?t=890)
>>> to raise the awareness of the broader community. After which the discussion
>>> about the storage model started. I don't think that the discussion about a
>>> single aspect of a proposal should invalidate all other aspects of the
>>> proposal.
>>>
>>> Regardless, the state of the proposal from the original google doc
>>> contains a lot of valuable contributions from Micah, Szehon, Jack, Dan,
>>> yourself and others and it should at least provide the basis for any
>>> further discussion. I don't think it's effective to start with a completely
>>> different design because we are bound to have the same discussions all over
>>> again.
>>>
>>> Thanks, Jan
>>> On 08.05.24 12:11, Walaa Eldin Moustafa wrote:
>>>
>>> The only consensus the community had was on the object model through the
>>> most recent voting thread [1]. This kind of consensus was not present
>>> during the doc discussions, and this should be evident from the fact the
>>> last doc state listed 5 alternatives with no particular conclusion. I am
>>> not quite sure what type of consen

Re: Materialized Views: Next Steps

2024-05-09 Thread Szehon Ho
Hi Walaa

As there may be confusion in the word 'properties', I want to double check
if we are talking about the same thing here.

I am reading your PR as adding lineage metadata as a new key/value pair under
the storage Table's 'properties' field:
https://github.com/apache/iceberg/blob/main/format/spec.md?plain=1#L677

  properties (optional in v1 and v2): A string to string map of table
  properties. This is used to control settings that affect reading and
  writing and is not intended to be used for arbitrary metadata. For example,
  commit.retry.num-retries is used to control the number of commit retries.

and adding the Storage Table pointer as a key/value pair in the View's
'properties' field:
https://github.com/apache/iceberg/blob/main/format/view-spec.md?plain=1#L65

  properties (optional): A string to string map of view properties [2]

Is that correct?

On the other hand, I was talking about adding this metadata as actual
fields, as is described in the Draft Spec of the Design Doc
https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A
and
first PR https://github.com/apache/iceberg/issues/6420 .

Do you mean that the vote means we cannot model new fields like
'materialization' and 'lineage' as was proposed there?  If that is the
interpretation, I am not sure I agree.  I don't fully see how new fields
add more catalog implementation complexity than new key/value properties.
To me, the vote seemed to just rule out using a combined catalog object
(MaterializedView) in favor of re-using the Table and View metadata models,
not to prevent changes to the Table and View models.

Thanks
Szehon


On Thu, May 9, 2024 at 10:05 PM Walaa Eldin Moustafa 
wrote:

> Hi Szehon,
>
> I think choosing separate view + table objects precludes us from adding
> new metadata to table and view metadata. Here is one relevant comment [1]
> from Ryan on the modeling doc, where his point is that we want to avoid
> introducing new APIs since it requires updating every catalog, and
> (quoting) even now, we have few implementations that support views because
> of the problems updating back ends. Therefore, one of the major reasons to
> avoid a new model with new metadata is to avoid adding new metadata, which
> introduces this complexity. Here is another similar comment from Renjie [2]
> on the cons listed for the combined object approach.
>
> Even Ryan's point on the MV issue that you referenced reads to me as he is
> supportive of the property model. Here are some quotes:
>
> > We would still want some MV metadata in table *properties*.
>
> > I recommend instead reusing the existing snapshot metadata structure to
> store what you need as snapshot *properties*.
>
> > First, I think we want to avoid keeping much state information in
> complex table *properties*.
>
> Again, here, he is supportive of table properties, but wants to make sure
> that the information is simple.
>
> > We may want additional metadata as well, like a UUID to ensure we have
> the right view. I don't think we have a UUID in the view spec yet, but we
> could add one.
>
> Here, he is very specific when it comes to new metadata fields, and
> explicitly calls it out. That is the only new metadata field in that reply
> and by now it is already supported. It is also not MV-specific.
>
> Hope this addresses your question on the property vs new metadata model.
>
> [1]
> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABK7e3QB4
> [2]
> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABIonvCGE
>
> Thanks,
> Walaa.
>
>
> On Thu, May 9, 2024 at 5:49 PM Szehon Ho  wrote:
>
>> Hi Walaa,
>>
>> I agree, I definitely do not want yet another pr/doc where discussion
>> happens. as its already quite spread out :)  But did not want to clarify
>> some points before we get started on the discussion on your PR.
>>
>> With reusing the table and view objects, we are not changing the existing
>>> metadata of either table or view spec but rather introduce new properties
>>> and formalize them to express materialized views
>>>
>>
>> On this point, I am not 100% sure that choosing to represent a
>> MaterializedView as a separate View + Table object precludes us from adding
>> to metadata of Table or View as the Draft Spec suggested, though.  I think
>> this point was discussed in Jan's initial PR with a good point from Ryan:
>> https://github.com/apache/iceberg/issues/6420#issuecomment-1369280546 that
>> using Table Properties to track lineage is fairly brittle, and having it
>> formalized in the Iceberg metadata is cleaner, and that was t

Re: Materialized Views: Next Steps

2024-05-10 Thread Szehon Ho
Hi Walaa

OK, thanks for confirming.  I am still not 100% in agreement; my
understanding of the rationale for separate Table/View objects in the
comment that you linked:

I think the biggest problem with this is that we would need to modify every
> catalog to support this combination and that would be really difficult.


is about Java Catalogs / REST Catalog needing to support creating,
persisting, and loading a MaterializedView object, which is much more
complex.  See the HiveView PR for example:
https://github.com/apache/iceberg/pull/9852.  We would have to do the same
exercise for persisting MVs.

In our case though, there's not much complexity regardless of approach
('properties' or new metadata fields), in terms of the Java Catalog/REST
Catalog.  It's mostly pass-through to storage.  Looks like you are
referring to Spark's View model in terms of complexity, which may be a
different story, but I'm not sure it is a good rationale for making Iceberg
use 'properties'.

'properties' is for read/write configuration, not for saving metadata.  To
me, it's also brittle to save important metadata there, as it's not in the
defined schema.

A string to string map of table properties. This is used to control
> settings that affect reading and writing and is not intended to be used for
> arbitrary metadata.  For example, commit.retry.num-retries is used to
> control the number of commit retries.


On the other hand, the Draft Spec suggests saving `lineage` as a modeled
field on the Storage Table's snapshot metadata.  This allows you to 'time
travel', 'branch', and have this metadata's life cycle integrated via normal
snapshot lifecycle operations.

So that's my rationale.  Not sure if we can come to an agreement over email
though, and may need others to chime in as well.
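
To make the contrast concrete, a rough sketch of the two options (the
property key and the accessor below are hypothetical names for illustration,
not what the PR or the Draft Spec actually define):

    import org.apache.iceberg.Table;

    // Option A (properties approach): lineage hidden in a string property on
    // the storage table; callers must find, parse, and validate it
    // themselves, and nothing ties it to a particular snapshot.
    String lineageJson = table.properties().get("materialization.lineage");

    // Option B (Draft Spec approach): lineage as a modeled field on snapshot
    // metadata, so it is validated against the spec and follows time travel
    // and branching.  The accessor is hypothetical:
    // Lineage lineage = table.currentSnapshot().lineage();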

Thanks
Szehon




On Thu, May 9, 2024 at 11:58 PM Walaa Eldin Moustafa 
wrote:

> Hi Szehon,
>
> Yes, you are reading the PR correctly, and interpreting the meaning of
> properties correctly. I think the reply you pasted from Ryan refers to the
> same concept as well.
>
> For the initial Google doc and the issue (by the way it is an issue, not a
> PR), yes both are proposing new metadata fields.
>
> The references I made to the modeling doc [1, 2] are reasons why new APIs
> are not desired. The cons/concerns applicable to new MV metadata apply by
> extension to new table and view metadata fields.
>
> The reason why new metadata adds complexity is that this new metadata
> needs to be propagated to the engine API. For example, here is the ViewInfo
> [3] class in the Spark catalog, which is used in view methods like
> createView. Its fields correspond with the Iceberg metadata. Adding new
> Iceberg fields should be accompanied with new fields in the engine
> catalog/connector APIs, which was a major reason for rejecting the combined
> MV object model as well.
>
> [1]
> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABK7e3QB4
> [2]
> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABIonvCGE
> [3]
> https://github.com/apache/spark/blob/2df494fd4e4e64b9357307fb0c5e8fc1b7491ac3/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/ViewInfo.java#L45
>
> Thanks,
> Walaa.
>
> On Thu, May 9, 2024 at 11:30 PM Szehon Ho  wrote:
>
>> Hi Walaa
>>
>> As there may be confusion in the word 'properties', I want to double
>> check if we are talking about the same thing here.
>>
>> I am reading your PR as adding lineage metadata as new key/value pair
>> under the storage Table's 'properties' field:
>> https://github.com/apache/iceberg/blob/main/format/spec.md?plain=1#L677
>>
>> *optional* *optional* *properties* A string to string map of table
>> properties. This is used to control settings that affect reading and
>> writing and is not intended to be used for arbitrary metadata. For example,
>> commit.retry.num-retries is used to control the number of commit retries.
>> and adding Storage Table pointer as key/value pair in the View's
>> 'properties' field:
>> https://github.com/apache/iceberg/blob/main/format/view-spec.md?plain=1#L65
>>
>> *optional* properties A string to string map of view properties [2]
>> Is that correct?
>>
>> On the other hand, I was talking about adding this metadata as actual
>> fields, as is described in the Draft Spec of the Design Doc
>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A
>>  and
>> first PR https://github.com/apache/iceberg/issues/6420 .
>>
>> Do you mean, the vote means we cannot model new fiel

Re: [Discuss] Heap pressure with RewriteFiles APIs

2024-05-21 Thread Szehon Ho
Hi Naveen

Yes, it sounds like it will help to disable metrics for those columns.
IIRC, by default manifest entries have metrics at the 'truncate(16)' level
for 100 columns, which as you see can be quite memory intensive.  A
potential later improvement is the ability to remove the counts by config,
though I need to confirm whether that is feasible.

Unfortunately, today a new metrics config will only apply to new data
files (you have to rewrite them all, or otherwise phase old data files
out).  I had a patch awhile back to add support for rewriting just the
manifests with a new metrics config, but it was not merged yet; if any
reviewer has time to review, I can work on it again.
https://github.com/apache/iceberg/pull/2608
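
For reference, the knob in question is the per-table metrics mode; a minimal
sketch with the Java API (the catalog variable, table name, and column name
are placeholders), which only affects newly written data files:

    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.TableIdentifier;

    // Keep full metrics only for a commonly filtered column, and drop the
    // truncate(16) defaults for the rest of the wide schema.
    Table table = catalog.loadTable(TableIdentifier.of("db", "wide_table"));
    table.updateProperties()
        .set("write.metadata.metrics.default", "none")
        .set("write.metadata.metrics.column.event_ts", "full")
        .commit();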


Thanks
Szehon

On Tue, May 21, 2024 at 1:43 AM Naveen Kumar  wrote:

> Hi Everyone,
>
> I am looking into RewriteFiles
> APIs and its implementation BaseRewriteFiles.
> Currently this works as following:
>
>1. It accumulates all the files for addition and deletions.
>2. At time of commit, it creates a new snapshot after adding all the
>entries to corresponding manifest files.
>
> It has been observed that if the accumulated file objects are of huge size,
> it takes a lot of memory.
> *eg*: Each dataFile object is of size *1KB*, and the total accumulated
> (additions or deletions) count is *1 million* files.
> Total memory consumed by *RewriteFiles* will then be around *1GB*.
>
> Such a dataset can happen for the following reasons:
>
>1. Table is very wide with, say, 1000 columns.
>2. Most of the columns are of String data type, which can take more
>space to store lower bound and upper bound.
>3. Table has billions of records with millions of data files.
>4. It is running data compaction procedures/jobs for the first time.
>5. Or, the table was un-partitioned and later evolved with new partition
>columns.
>6. Now it is trying to compact the table.
>
> Attaching a heap dump from one of the datasets while using the API
>
>> RewriteFiles rewriteFiles(
>> Set<DataFile> removedDataFiles,
>> Set<DeleteFile> removedDeleteFiles,
>> Set<DataFile> addedDataFiles,
>> Set<DeleteFile> addedDeleteFiles)
>>
>>
> [image: Screenshot 2024-01-11 at 10.01.54 PM.png]
> We do have properties like PARTIAL_PROGRESS_ENABLED_DEFAULT,
> which helps create smaller groups and multiple commits with the
> configuration PARTIAL_PROGRESS_MAX_COMMITS_DEFAULT.
> Currently engines like Spark can follow this strategy. Since Spark is
> running all the compaction jobs concurrently, there is a chance that many
> jobs land on the same machines and accumulate high memory usage.
>
> My question is, can we make these implementations better
> to avoid any heap pressure? Also, has someone encountered similar issues,
> and if so, how did they fix it?
>
> Regards,
> Naveen Kumar
>
>


Re: [Discuss] Geospatial Support

2024-05-29 Thread Szehon Ho
> To be more active in the Iceberg community, we’ve been looking over
> this geospatial proposal. We’re excited geospatial is getting traction, as
> we see a lot of geo usage within Snowflake, and expect that usage to carry
> over to our Iceberg offerings soon. After reviewing the proposal, we have
> some questions we’d like to pose given our experience with geospatial
> support in Snowflake.
>
> We would like to clarify two aspects of the proposal: handling of the
> spherical model and definition of the spatial reference system. Both of
> which have a big impact on the interoperability with Snowflake and other
> query engines and Geo processing systems.
>
>
> Let us first share some context about geospatial types at Snowflake; geo
> experts will certainly be familiar with this context already, but for the
> sake of others we want to err on the side of being explicit and clear.
> Snowflake supports two Geospatial types [1]:
> - Geography – uses a spherical approximation of the earth for all the
> computations. It does not perfectly represent the earth, but allows getting
> accurate results on WGS84 coordinates, used by GPS without any need to
> perform coordinate system reprojections. It is also quite fast for
> end-to-end computations. In general, it has fewer distortions compared to
> the 2D planar model.
> - Geometry – uses planar Euclidean geometry model. Geometric computations
> are simpler, but require transforming the data between coordinate systems
> to minimize the distortion. The Geometry data type allows setting a spatial
> reference system for each row using the SRID. The binary geospatial
> functions are only allowed on the geometries with the same SRID. The only
> function that interprets SRID is ST_TRANFORM that allows conversion between
> different SRSs.
>
> [images: illustrations of the Geography (spherical) and Geometry (planar)
> types]
>
> Given the choice of two types and a set of operations on top of them, the
> majority of Snowflake users select the Geography type to represent their
> geospatial data.
>
> From our perspective, Iceberg users would benefit most from being given
> the flexibility to store and process data using the model that better fits
> their needs and specific use cases.
>
> Therefore, we would like to ask some design clarifying questions,
> important for interoperability:
>
>
> 1. In the first version of the specification (Phase 1), it is described as
> focused on the planar geometry model with the CRS fixed to 4326. In this
> model, Snowflake would not be able to map our Geography type since it is
> based on the spherical Geography model. Given that Snowflake supports both
> edge types, we would like to better understand how to map them to the
> proposed Geometry type and its metadata.
>
>- How is the edge type supposed to be interpreted by the query engine?
>Is it necessary for the system to adhere to the edge model for geospatial
>functions, or can it use the model that it supports, or let the customer
>choose it? Will it affect the bounding box or other row group metadata?
>- Is there any reason why the flexible model has to be postponed to
>further iterations? Would it be more extensible to support a mutable edge
>type from Phase 1, but allow systems to ignore it if they do not
>support the spherical computation model?
>
>
>
> 2. As you mentioned [2] in the proposal, there are difficulties with
> supporting the full PROJJSON specification of the SRS. From our experience,
> most of the use-cases do not require the full definition of the SRS; in
> fact, that definition is only needed when converting between coordinate
> systems. On the other hand, it’s often needed to check whether two geometry
> columns have the same coordinate system, for example when joining two
> columns from different data providers.
>
> To address this, we would like to propose including the option to specify
> the SRS with only a SRID in Phase 1. The query engine may choose to treat
> it as an opaque identifier or make a look-up in the EPSG database of
> supported SRIDs.
>
> Thank you again for driving this effort forward. We look forward to
> hearing your thoughts.
>
> [1]
> https://docs.snowflake.com/en/sql-reference/data-types-geospatial#understanding-the-differences-between-geography-and-geometry
>
> [2]
> https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit#heading=h.oruaqt3nxcaf
>
>
> On 2024/05/02 00:41:52 Szehon Ho wrote:
> > Hi everyone,
> >
> > We have created a formal proposal for adding Geospatial support to
> Iceberg.
> >
> > Please read the following for details.
> >
> >- Github Proposal : https://github.com/apache/iceberg/issues

Re: [Discuss] Geospatial Support

2024-06-05 Thread Szehon Ho
Hi Peter

Yes, the document only concerns predicate pushdown for the geometry column.
Predicate pushdown takes two forms: 1) partition filters and 2) min/max
stats.  The min/max stats are discussed in the doc (Phase 2), depending on
the non-trivial encoding.

The evaluators are always AND'ed together, so I don't see any issue with
partitioning by another key on a table with a geo column.
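
As a small illustration with today's Java expression API (the column name is
hypothetical, and the proposed ST_* predicates are not in the library yet, so
a stand-in is used):

    import org.apache.iceberg.expressions.Expression;
    import org.apache.iceberg.expressions.Expressions;

    // Filters are AND'ed, so the date predicate still prunes date partitions
    // even when the geo predicate cannot be pushed down.
    Expression filter = Expressions.and(
        Expressions.equal("event_date", "2024-06-05"),
        Expressions.alwaysTrue() /* stand-in for a future ST_COVERS */);
    // table.newScan().filter(filter) then prunes on the date partition
    // regardless of the geo column.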

On another note, Jia and I thought that we might have a discussion about
Snowflake geo types in a call, to drill down on some details.  What time
zone are you folks in / what time works better?  I think Jia and I are both
in the Pacific time zone.

Thanks
Szehon

On Wed, Jun 5, 2024 at 1:02 AM Peter Popov 
wrote:

> Hi Szehon, hi Jia,
>
> Thank you for your replies. We now better understand the connection
> between the metadata and partitioning in this proposal. Supporting the
> Mapping 1 is a great starting point, and we would like to work closer with
> you on bringing the support for spherical edges and other coordinate
> systems into Iceberg geometry.
>
> We have some follow-up questions regarding the partitioning (let us know
> if it’s better to comment directly in the document): Does this proposal
> imply that XZ2 partitioning is always required? In the current proposal,
> do you see a possibility of predicate pushdown to rely on x/y min/max
> column metadata instead of a partition key? We see use-cases where a table
> with a geo column can be partitioned by a different key(e.g. date) or
> combination of keys. It would be great to support such use cases from the
> very beginning.
>
> Thanks,
>
> Peter
>
> On Thu, May 30, 2024 at 8:07 AM Jia Yu  wrote:
>
>> Hi Dmytro,
>>
>> Thanks for your email. To add to Szehon's answer,
>>
>> 1. How to represent Snowflake Geometry and Geography type in Iceberg,
>> given the Geo Iceberg Phase 1 design:
>>
>> Answer:
>> Mapping 1 (possible): Snowflake Geometry + SRID: 4326 -> Iceberg
>> Geometry + CRS84 + edges: Planar
>> Mapping 2 (impossible): Snowflake Geography -> Iceberg Geometry + CRS84 +
>> edges: Spherical
>> Mapping 3 (impossible): Snowflake Geometry + SRID:ABCDE-> Iceberg
>> Geometry + SRID:ABCDE + edges: Planar
>>
>> As Szehon mentioned, only Mapping 1 is possible because we need to
>> support spatial query push down in Iceberg. This function relies on the
>> Iceberg partition transform, which requires a 1:1 mapping between a value
>> (point/polygon/linestring) and a partition key. That is: given any
>> precision level, a polygon must produce a single ID; and the covering
>> indicated by this single ID must fully cover the extent of the polygon.
>> Currently, only xz2 can satisfy this requirement. If the theory from
>> Michael Entin can be proven to be correct, then we can support Mapping 2 in
>> Phase 2 of Geo Iceberg.
>>
>> Regarding Mapping 3, this requires Iceberg to be able to understand SRID
>> / PROJJSON such that we will know min max X Y of the CRS (@Szehon, maybe
>> Iceberg can ask the engine to provide this information?). See my answer 2.
>>
>> 2. Why choose projjson instead of SRID?
>>
>> The projjson idea was borrowed from GeoParquet because we'd like to
>> enable possible conversion between Geo Iceberg and GeoParquet. However, I
>> do understand that this is not a good idea for Iceberg since not many libs
>> can parse projjson.
>>
>> @Szehon Is there a way that we can support both SRID and PROJJSON in Geo
>> Iceberg?
>>
>> It is also worth noting that, although there are many libs that can parse
>> SRID and perform look-up in the EPSG database, the license of the EPSG
>> database is NOT compatible with the Apache Software Foundation. That means:
>> Iceberg still cannot parse / understand SRID.
>>
>> Thanks,
>> Jia
>>
>> On Wed, May 29, 2024 at 11:08 AM Szehon Ho 
>> wrote:
>>
>>> Hi Dmytro
>>>
>>> Thank you for looking through the proposal and excited to hear from you
>>> guys!  I am not a 'geo expert' and I will definitely need to pull in Jia Yu
>>> for some of these points.
>>>
>>> Although most calculations are done on the query engine, Iceberg
>>> reference implementations (i.e., Java, Python) do have to support a few
>>> calculations to handle filter push down:
>>>
>>>1. push down of the proposed Geospatial transforms ST_COVERS,
>>>ST_COVERED_BY, and ST_INTERSECTS
>>>2. evaluation of proposed Geospatial partition transform XZ2.  As
>>>you may have seen, this was chosen as its the only standard one today 
>>&

Re: [Discuss] Geospatial Support

2024-06-18 Thread Szehon Ho
Jia and I will sync with the Snowflake folks to see if we can have a
solution, or roadmap to solution, in the proposal.

Thanks JB for the interest!  By the way, I want to schedule a meeting to go
over the proposal; it seems there's good feedback from folks on the geo side
(and even the Parquet community), but not too many eyes/feedback from other
folks/PMC in the Iceberg community.  This might be due to a lack of
familiarity/time to read through it all.  In fact, a lot of the advanced
discussions like this one are for Phase 2 items, and the Phase 1 items are
relatively straightforward, so I wanted to explain that.  As I know it's
summer vacation for some folks, we can do this in a week or in early July;
hope that sounds good to everyone.

Thanks,
Szehon

On Tue, Jun 18, 2024 at 1:54 AM Jean-Baptiste Onofré 
wrote:

> Hi Jia
>
> Thanks for the update. I'm gonna re-read the whole thread and document to
> have a better understanding.
>
> Thanks !
> Regards
> JB
>
> On Mon, Jun 17, 2024 at 7:44 PM Jia Yu  wrote:
>
>> Hi Snowflake folks,
>>
>> Please let me know if you have other questions regarding the proposal. If
>> any, Szehon and I can set up a zoom call with you guys to clarify some
>> details. We are in the Pacific time zone. If you are in Europe, maybe early
>> morning Pacific Time works best for you?
>>
>> Thanks,
>> Jia
>>
>> On Wed, Jun 5, 2024 at 6:28 PM Gang Wu  wrote:
>>
>>> > The min/max stats are discussed in the doc (Phase 2), depending on the
>>> non-trivial encoding.
>>>
>>> Just want to add that min/max stats filtering could be supported by file
>>> format natively. Adding geometry type to parquet spec
>>> is under discussion: https://github.com/apache/parquet-format/pull/240
>>>
>>> Best,
>>> Gang
>>>
>>> On Thu, Jun 6, 2024 at 5:53 AM Szehon Ho 
>>> wrote:
>>>
>>>> Hi Peter
>>>>
>>>> Yes the document only concerns the predicate pushdown of geometric
>>>> column.  Predicate pushdown takes two forms, 1) partition filter and 2)
>>>> min/max stats.  The min/max stats are discussed in the doc (Phase 2),
>>>> depending on the non-trivial encoding.
>>>>
>>>> The evaluators are always AND'ed together, so I dont see any issue of
>>>> partitioning with another key not working on a table with a geo column.
>>>>
>>>> On another note, Jia and I thought that we may have a discussion about
>>>> Snowflake geo types in a call to drill down on some details?  What time
>>>> zone are you folks in/ what time works better ?  I think Jia and I are both
>>>> in Pacific time zone.
>>>>
>>>> Thanks
>>>> Szehon
>>>>
>>>> On Wed, Jun 5, 2024 at 1:02 AM Peter Popov 
>>>> wrote:
>>>>
>>>>> Hi Szehon, hi Jia,
>>>>>
>>>>> Thank you for your replies. We now better understand the connection
>>>>> between the metadata and partitioning in this proposal. Supporting the
>>>>> Mapping 1 is a great starting point, and we would like to work closer with
>>>>> you on bringing the support for spherical edges and other coordinate
>>>>> systems into Iceberg geometry.
>>>>>
>>>>> We have some follow-up questions regarding the partitioning (let us
>>>>> know if it’s better to comment directly in the document): Does this
>>>>> proposal imply that XZ2 partitioning is always required? In the
>>>>> current proposal, do you see a possibility of predicate pushdown to
>>>>> rely on x/y min/max column metadata instead of a partition key? We see
>>>>> use-cases where a table with a geo column can be partitioned by a 
>>>>> different
>>>>> key(e.g. date) or combination of keys. It would be great to support such
>>>>> use cases from the very beginning.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Peter
>>>>>
>>>>> On Thu, May 30, 2024 at 8:07 AM Jia Yu  wrote:
>>>>>
>>>>>> Hi Dmytro,
>>>>>>
>>>>>> Thanks for your email. To add to Szehon's answer,
>>>>>>
>>>>>> 1. How to represent Snowflake Geometry and Geography type in Iceberg,
>>>>>> given the Geo Iceberg Phase 1 design:
>>>>>>
>>>>>> Answer:
>>>>>> Mapping 1 (possible): Snowflake Geometry + SRID: 4326 -> Iceberg
>

Re: Agenda Community Sync 19th June

2024-06-18 Thread Szehon Ho
Hi guys,

The sync is on Juneteenth (a US federal holiday), so I think some folks on
this side may miss it, FYI.

PS (at least from my side) one highlight is that the longstanding 1k-column
bug is finally fixed (at least partially) in
https://github.com/apache/iceberg/pull/10020

Thanks
Szehon

On Tue, Jun 18, 2024 at 7:17 PM Xuanwo  wrote:

> No active response about Renjie's Rust 0.3 release discussion. Let's add
> it as an entry for this community sync.
>
> On Wed, Jun 19, 2024, at 03:18, Fokko Driesprong wrote:
>
> Hey Jan,
>
> Thanks for raising this. Let me jot down the highlights, and feel free to
> add what you'd like to discuss. I'm personally looking forward to an update
> on the materialized views.
>
> Kind regards,
> Fokko
>
> On Tue, Jun 18, 2024 at 20:28, Jan Kaul  wrote:
>
> Hi all,
>
> I was wondering whether there was an agenda for the community sync
> tomorrow. There currently is no entry in the google doc.
>
> Best wishes,
>
> Jan
>
> Xuanwo
>
>


Re: Making the NDV property required for theta sketch blobs in Puffin

2024-06-21 Thread Szehon Ho
It makes sense to me.  Normally, changing optional -> required would probably
require a version bump, but maybe it is ok here as it is a relatively new
format, AFAIK adopted by Trino, which already sets this field; but let's see
if anyone disagrees.
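
For context on why this is cheap for writers: whoever produces the
apache-datasketches-theta-v1 blob already has the sketch in memory, so
deriving the property is a one-liner. A minimal sketch with the DataSketches
library (the properties map is a stand-in for whatever the writer hands to
Puffin):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.datasketches.theta.UpdateSketch;

    // Build the sketch from (stand-in) column values, then record the
    // estimate as the "ndv" blob-metadata property.
    UpdateSketch sketch = UpdateSketch.builder().build();
    for (long value : new long[] {1L, 2L, 2L, 3L}) {
      sketch.update(value);
    }
    Map<String, String> blobProperties = new HashMap<>();
    blobProperties.put("ndv", Long.toString((long) sketch.getEstimate()));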

Thanks
Szehon

On Fri, Jun 21, 2024 at 3:35 PM huaxin gao  wrote:

> +1 for making the ndv blob metadata property required for theta sketches.
>
> On Fri, Jun 21, 2024 at 2:54 PM Amogh Jahagirdar <2am...@gmail.com> wrote:
>
>> Hey all,
>>
>> I wanted to raise this thread to discuss a spec change proposal
>>  for making the ndv blob
>> metadata property required for theta sketches. Currently, the spec is a bit
>> loose stating:
>>
>> The blob metadata for this blob *may* include following properties:
>>
>>- ndv: estimate of number of distinct values, derived from the sketch
>>
>>
>> This came up on this PR, where
>> it was noted that engines like Presto/Trino are using the property as a
>> source of truth and the implementation of the Spark procedure in the PR
>> originally was deriving the NDV from the sketch itself. It's currently
>> unclear what engine integrations should use as a source of truth.
>>
>> The main advantage of having it in the properties is that engines don't
>> have to go and deserialize the sketch/compute the NDV if they just want the
>> NDV (putting aside the intersection/union case where I think engines would
>> have to read the sketch). I think this makes it easier for engine
>> integration. The spec also currently makes it clear that the property must
>> be derived from the sketch so I don't think there's a "source of truth"
>> sync concern. It also should be easy for blob writers to set this property
>> since they'd anyways be populating the sketch in the first place.
>>
>> An alternative is to attempt to read the property and fallback to the
>> sketch (maybe abstract this behind an API) but this loses the advantage of
>> guaranteeing that engines don't have to read the sketch.
>>
>> The spec change to make the property required seems to be the consensus
>> on the PR thread but I wanted to bring it up here in case others had
>> different ideas or if I'm missing any problems with this approach!
>>
>>
>> Thank you,
>>
>> Amogh Jahagirdar
>>
>


Re: Feedback Collection: Bylaws in Iceberg

2024-06-24 Thread Szehon Ho
Hi

Also copying my previous response from the private list.

Hi
> Thanks Jack for taking the time for this doc.  While the Iceberg community
> and PMC so far has been one of the most collaborative, and I have
> personally the utmost respect for those that laid the groundwork without
> which we would not be here, the project is indeed growing faster than many
> anticipated.  And adding bylaw will definitely give more transparency, if
> that is a concern for some users for adoption, so all in all I'm in favor
> of this.
>


Re: emeritus, it seems a common concept in the Hadoop projects, for example
> Hive:  https://cwiki.apache.org/confluence/display/Hive/Bylaws.  It is
> sometimes used voluntarily by inactive PMC's.  It seems useful in cases
> where you need a majority of PMC's (in the long term many PMC's do move on
> and become inactive, and emeritus would mark them out of the vote count
> iiuc).  However, in Hive there is no need to vote to reinstate emeritus
> PMC's, a simple email to private suffices, I think that would be simpler
> for this project too.
> Thanks
> Szehon


This was in response to the discussion on emeritus; it looks like Jack
already took this into account in the latest proposal, so it is ok with me.
I'm still for tracking emeritus status, as in the long run more PMCs
naturally become inactive and it gets harder to pass a majority vote.

In general, I'm still in favor of the bylaws to bring clarity.  It looks
like we now need to drill into which ones folks have problems with; it looks
like 'Remove committer/PMC' may require a callout for special attention as
well, in the initial email's key points.

Thanks
Szehon

On Mon, Jun 24, 2024 at 2:24 PM Fokko Driesprong  wrote:

> Hey everyone,
>
> Thanks Jack for setting this up, and everyone for their feedback so far.
> Sharing my exact response from the private@:
>
> Hey Jack,
>>
>> Thanks for raising this, and favor of having a bylaws where we can
>> formally adopt ways of working that are specific to the Iceberg project.
>> For example, we have the community guidelines
>>  that could
>> be adopted into the bylaws.
>>
>> Putting up my ASF-member hat
>> . I've read through
>> the first iteration of the bylaws, and I would suggest removing a couple of
>> things from there:
>>
>>- The ASF already defines a subset of the votes that
>>are written down in the document. For example, the doc defines a release as
>>a lazy majority, but according to the ASF docs, it
>>is a majority approval. I strongly feel that we should not diverge from
>>this. There are already docs on voting in new committers and PMCs. I would
>>rather reference those instead of copying them.
>>- I do not see the point of the Emeritus status, both for committers
>>and PMC. It seems to me a lot of work to keep track of it, and according
>>to the doc, the emeritus keeps the full privileges.
>>
>> Before publishing the doc on the public mailing list, I would suggest
>> removing at least those two to avoid a lot of noise in the discussion. I
>> would also suggest starting minimal and then adopting things one by one
>> (for example, the committer and PMC criteria would be a good one).
>>
>> I have more comments, but I don't want to go down the rabbit hole right
>> away. Thanks again for raising this, and curious what others think.
>>
>> Kind regards,
>> Fokko
>>
>
> As mentioned in my initial response, I think there is value in the bylaws,
> but I'm a firm believer in people over process (community over code?). I'll
> go over the Google-doc tomorrow morning in detail.
>
> Kind regards,
> Fokko Driesprong
>
>
> On Mon, Jun 24, 2024 at 21:20, Ryan Blue  wrote:
>
>> Here is my original email from the thread on the private list. It echoes
>> Carl's suggestion in point 5, that we should focus on adopting bylaws that
>> solve challenges that we are facing in this community, rather than adopting
>> bylaws en masse or from another community with different concerns.
>>
>> Original email:
>>
>> I have no major objections to adding bylaws and codifying how the
>> community operates. And most of the bylaws that are in the doc seem
>> reasonable enough — if we choose to adopt them.
>>
>> What concerns me is adding a big list of rules and changing how the
>> community operates so abruptly.
>>
>> That concern has two aspects. First, when rules are introduced people
>> tend to focus on them and apply them mechanically; I think that will hinder
>> this community, which has so far run on a high degree of trust and social
>> capital.
>>
>> A concrete example is that while you [Jack] were out on leave, we wanted
>> to make progress on the materialized view spec and I push

Re: [Discuss] Geospatial Support

2024-06-26 Thread Szehon Ho
Hi

It was great to meet in person with Snowflake engineers and we had a good
discussion on the paths forward.

Meeting notes for the Snowflake-Iceberg sync:

   - Iceberg's proposed Geometry type defaults to (edges=planar, crs=CRS84).
   - Snowflake has two types, Geography (spherical) and Geometry (planar,
   with customizable CRS).  The data layout/encoding is the same for both
   types.  Let's see how we can support each in the Iceberg type, especially
   wrt Iceberg partition/file pruning.
   - Geography type support
      - Main concern is the need for a suitable partition transform for
      partition-level filtering; the candidate is Michael Entin's proposal
      <https://docs.google.com/document/d/1tG13UpdNH3i0bVkjFLsE2kXEXCuw1XRpAC2L2qCUox0/edit>.
      - Secondary concern is file and row-group-level filtering.  Gang's
      Parquet proposal
      <https://github.com/apache/parquet-format/pull/240/files> allows
      storage of S2 / H3 IDs in Parquet stats, so we can also leverage that
      in the Iceberg pruning code (the Google and Uber libraries are
      compatible).
   - Geometry type support
      - Main concern is that the partition transform needs to understand the
      CRS, but this can be solved by having the XZ2 transform created with a
      customizable min/max lat/long range (it's all it needs); see the toy
      sketch after this list.
   - Should (CRS, edges) be stored as properties on the Geography type in
   Phase 1?
      - Should be fine to store, with only the defaults allowed in Phase 1.
      - Concern 1: if edges is stored, there will be asks to store other
      properties like (orientation, epoch).  Solution is to punt these
      follow-on properties for later.
      - Concern 2: if crs is stored, what format?  PROJJSON vs SRID.
      Solution is to leave it as a string.
      - Concern 3: if crs is stored as a string, Iceberg cannot read it.
      This should be ok, as we only need it for the XZ2 transform, where the
      user already passes in the info from the CRS (up to the user to make
      sure these align).
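
For illustration only, a toy sketch of the 'customizable range' point above:
this is a plain grid rather than the actual XZ2 algorithm (which also handles
the extents of non-point geometries), and no such class exists in Iceberg; it
just shows that the transform only needs the CRS's min/max value domain, not
the CRS definition itself.

    // Toy grid transform (hypothetical, not Iceberg code).
    final class GridTransformSketch {
      private final int resolution;
      private final double minX, maxX, minY, maxY;

      GridTransformSketch(int resolution, double minX, double maxX,
                          double minY, double maxY) {
        this.resolution = resolution;
        this.minX = minX; this.maxX = maxX;
        this.minY = minY; this.maxY = maxY;
      }

      // Map a point to a coarse cell id by normalizing into the range.
      long cellId(double x, double y) {
        int n = 1 << resolution;  // n x n grid at this resolution
        int ix = (int) Math.min(n - 1, Math.max(0, (x - minX) / (maxX - minX) * n));
        int iy = (int) Math.min(n - 1, Math.max(0, (y - minY) / (maxY - minY) * n));
        return ((long) iy << resolution) | ix;  // toy encoding, no bit interleaving
      }
    }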

Thanks
Szehon

On Tue, Jun 18, 2024 at 12:23 PM Szehon Ho  wrote:

> Jia and I will sync with the Snowflake folks to see if we can have a
> solution, or roadmap to solution, in the proposal.
>
> Thanks JB for the interest!  By the way, I want to schedule a meeting to
> go over the proposal, it seems there's good feedback from folks from geo
> side (and even Parquet community), but not too many eyes/feedback from
> other folks/PMC on Iceberg community.  This might be due to lack of
> familiarity/ time to read through it all.  In fact, a lot of the advanced
> discussions like this one are for Phase 2 items, and Phase 1 items are
> relatively straightforward, so wanted to explain that.  As I know its
> summer vacation for some folks, we can do this in a week or early July,
> hope that sounds good with everyone.
>
> Thanks,
> Szehon
>
> On Tue, Jun 18, 2024 at 1:54 AM Jean-Baptiste Onofré 
> wrote:
>
>> Hi Jia
>>
>> Thanks for the update. I'm gonna re-read the whole thread and document to
>> have a better understanding.
>>
>> Thanks !
>> Regards
>> JB
>>
>> On Mon, Jun 17, 2024 at 7:44 PM Jia Yu  wrote:
>>
>>> Hi Snowflake folks,
>>>
>>> Please let me know if you have other questions regarding the proposal.
>>> If any, Szehon and I can set up a zoom call with you guys to clarify some
>>> details. We are in the Pacific time zone. If you are in Europe, maybe early
>>> morning Pacific Time works best for you?
>>>
>>> Thanks,
>>> Jia
>>>
>>> On Wed, Jun 5, 2024 at 6:28 PM Gang Wu  wrote:
>>>
>>>> > The min/max stats are discussed in the doc (Phase 2), depending on
>>>> the non-trivial encoding.
>>>>
>>>> Just want to add that min/max stats filtering could be supported by
>>>> file format natively. Adding geometry type to parquet spec
>>>> is under discussion: https://github.com/apache/parquet-format/pull/240
>>>>
>>>> Best,
>>>> Gang
>>>>
>>>> On Thu, Jun 6, 2024 at 5:53 AM Szehon Ho 
>>>> wrote:
>>>>
>>>>> Hi Peter
>>>>>
>>>>> Yes the document only concerns the predicate pushdown of geometric
>>>>> column.  Predicate pushdown takes two forms, 1) partition filter and 2)
>>>>> min/max stats.  The min/max stats are discussed in the doc (Phase 2),
>>>>> depending on the non-trivial encoding.
>>>>>
>>>>> The evaluators are always AND'ed together, so I dont see any issue of
>>>>> partitioning with another key not working on a table with a geo column.
>>>>>
>>>>> On another note, Jia and I though

Re: [Proposal] REST Spec: Server-side Metadata Tables

2024-07-03 Thread Szehon Ho
Yes, I was chatting with Yufei about this; at first glance I agree this
would be nice to have.  I always thought that metadata tables are important
enough to spec somewhere, and I think this is a nice place to do it.  There
seems to be some overlap with existing calls (i.e., you can get snapshots
from the table, and files from the proposed Plan API), but it does seem
valuable to get it all in one place.

If we can solve the 'big metadata' issue for the PrePlan/PlanTable APIs, it
sounds like we can re-use the solution for the files metadata tables.  I'd
perhaps leave out the position_deletes one though, as it's mostly used
internally and seems a bit too 'big' even for this.

I wonder if we can even add an optional endpoint for listing 'removed'
snapshots.  I know it sounds weird, but when looking at metadata tables,
the one question that I got a lot but could not answer is how to find when
a data file was added (or a partition was added).  If the snapshot is
expired, then it is no longer possible to trace that history.  Users often
expire snapshots to claw back disk space, but may not necessarily want to
delete the snapshot history.  But I believe the REST catalog has an
opportunity in removeSnapshot to preserve the metadata of the old snapshot
(up to some configured time).  So we can query the snapshot metadata even
after it expires, which I feel will be valuable.
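
For reference, this is roughly what engines do client-side today via the Java
API; the proposal would move the equivalent work behind the REST catalog. A
minimal sketch of the current path (assuming the table is already loaded):

    import org.apache.iceberg.MetadataTableType;
    import org.apache.iceberg.MetadataTableUtils;
    import org.apache.iceberg.Table;

    // Each client materializes metadata tables itself from the table's
    // metadata files; a server-side endpoint would let thin clients skip this.
    Table snapshotsTable =
        MetadataTableUtils.createMetadataTableInstance(table, MetadataTableType.SNAPSHOTS);
    snapshotsTable.newScan().planFiles();  // client-side scan of snapshot metadata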

Thanks
Szehon


On Wed, Jul 3, 2024 at 3:04 PM Jack Ye  wrote:

> Hi Yufei,
>
> Interesting that we are thinking about similar things. I had this item as
> a part of the roadmap discussion items in the catalog sync meeting, and
> then I removed it before the meeting because I felt it's too early to
> discuss.
>
> My main concern for having server-side metadata tables is how we solve the
> "big metadata" issue. The partitions, manifests, files table can easily
> itself become a big table, and the REST server becomes inefficient in
> retrieving results. It's the same old "HMS is too slow in iterating through
> the partitions" problem. Iceberg kind of solves it by having this
> information in Avro and in storage that can be scanned in a distributed way, but
> with server-side metadata tables, we are technically re-introducing the
> problem.
>
> Maybe one potential approach is to run those potentially large metadata
> table scans through the PreplanTable and PlanTable APIs. Just a quick
> thought for now, I need to think a bit more about this.
>
> Best,
> Jack Ye
>
>
>
>
>
> On Wed, Jul 3, 2024 at 1:45 PM Yufei Gu  wrote:
>
>> Hi folks,
>>
>> I'd like to discuss a new proposal to support server-side metadata tables.
>>
>> One of Iceberg's most advantageous features is the ability to inspect a
>> table using metadata tables. For instance, we can query snapshots just like
>> we query data rows using the following command: SELECT * FROM
>> prod.db.table.snapshots;
>>
>> With the REST catalog, we can simplify this process further by providing
>> metadata directly from REST endpoints. Here are several benefits of this
>> approach:
>>
>>- Engine Independence: The metadata tables do not rely on a specific
>>implementation of an engine. The REST server returns the results directly.
>>For example, the Rust Iceberg does not need to implement its own logic to
>>query the snapshot table if it connects to a server with this capability.
>>This reduces the complexity and development effort required for different
>>clients and engines.
>>- Enabled New Use Cases: A catalog UI or Lakehouse UI can present a
>>table's metadata (e.g., snapshot/partition list) without relying on an
>>engine like Trino. This opens up possibilities for lightweight UIs and
>>tools that can directly interact with the REST endpoints to retrieve and
>>display metadata.
>>- Enhanced Performance: With server-side caching, the server-side
>>metadata tables will perform better. Caching reduces the need to 
>> repeatedly
>>compute or retrieve metadata, leading to faster response times and reduced
>>load on the underlying storage systems.
>>
>> Here is the proposal in google doc:
>> https://docs.google.com/document/d/1MVLwyMQtZ-7jewsQ0PuTvtJbpfl4HCoVdbowMqFTmfc/edit?usp=sharing
>>
>> Estimated read time: 5 mins
>>
>> Would really appreciate any feedback on this topic and proposal!
>>
>>
>> Yufei
>>
>


Re: [Proposal] REST Spec: Server-side Metadata Tables

2024-07-03 Thread Szehon Ho
Hi Piotr

Thanks for the reply.  It’s a good point; I was thinking it would be convenient 
in REST, and it could avoid the hassle of a spec change.  But you are right that it 
probably belongs at a lower level if we support this feature generally (like an 
additional boolean on snapshot).

Sorry to hijack the thread of the main topic, will start a proper thread on 
this when I get a chance.

Thanks
Szehon

> On Jul 3, 2024, at 11:26 PM, Piotr Findeisen  
> wrote:
> 
> Hi Szehon,
> 
> re listing 'removed' snapshots
> 
> If I understand what you're saying is the following: Iceberg table format 
> requires users to first delete metadata information about files and only then 
> delete the files, and sometimes users want to order these events differently.
> We can solve this within a REST catalog, because REST catalog is not limited 
> by the Iceberg spec. In particular, it can do copies of metadata and other 
> workarounds.
> However, why wouldn't we choose to solve this within Iceberg format? A naive 
> person could think that it's conceptually trivial to mark a snapshot as 
> 'expired' to allow data file removal without removing all the snapshot 
> information yet.
> Please help me understand the reasoning behind these tradeoffs.
> 
> Best
> PF
> 
> 
> 
> 
> On Thu, 4 Jul 2024 at 02:26, Szehon Ho  <mailto:szehon.apa...@gmail.com>> wrote:
>> Yes, I was chatting with Yufei about this; at first glance I agree this 
>> would be nice to have.  I always thought that metadata tables are important 
>> enough to spec somewhere, and I think this is a nice place to do it.  There 
>> seems to be some overlap with existing calls (i.e., you can get snapshots from 
>> the table, and files from the proposed Plan API), but it does seem valuable to get 
>> it in one place.  
>> 
>> If we can solve the 'big metadata' issue for PrePlan/PlanTable API's, it 
>> sounds like we can re-use the solution for files metadata tables.  I'd 
>> perhaps leave out the position_deletes one though, as it's mostly used 
>> internally and seems a bit too 'big' even for this.
>> 
>> I wonder if we can even add an optional endpoint for listing 'removed' 
>> snapshots.   I know it sounds weird, but when looking at metadata tables, 
>> the one question that I got a lot but could not answer is how to find when a 
>> data file is added (or a partition is added).  If the snapshot is expired 
>> then it is no longer possible to trace that history.  Users often expire 
>> snapshots to claw back disk space, but may not necessarily want to delete the 
>> snapshot history.  But I believe the REST catalog has an 
>> opportunity in removeSnapshot to preserve the metadata of the old snapshot 
>> (up to some configured time).  So we can query the snapshot metadata even 
>> after it expires, which I feel will be valuable.
>> 
>> Thanks
>> Szehon
>> 
>> 
>> On Wed, Jul 3, 2024 at 3:04 PM Jack Ye > <mailto:yezhao...@gmail.com>> wrote:
>>> Hi Yufei,
>>> 
>>> Interesting that we are thinking about similar things. I had this item as a 
>>> part of the roadmap discussion items in the catalog sync meeting, and then 
>>> I removed it before the meeting because I felt it's too early to discuss.
>>> 
>>> My main concern for having server-side metadata tables is how we solve the 
>>> "big metadata" issue. The partitions, manifests, files table can easily 
>>> itself become a big table, and the REST server becomes inefficient in 
>>> retrieving results. It's the same old "HMS is too slow in iterating through 
>>> the partitions" problem. Iceberg kind of solves it by having this 
>>> information in Avro and in storage that can be scanned in a distributed way, but 
>>> with server-side metadata tables, we are technically re-introducing the 
>>> problem.
>>> 
>>> Maybe one potential approach is to run those potentially large metadata 
>>> table scans through the PreplanTable and PlanTable APIs. Just a quick 
>>> thought for now, I need to think a bit more about this.
>>> 
>>> Best,
>>> Jack Ye
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Wed, Jul 3, 2024 at 1:45 PM Yufei Gu >> <mailto:flyrain...@gmail.com>> wrote:
>>>> Hi folks,
>>>> 
>>>> I'd like to discuss a new proposal to support server-side metadata tables.
>>>> 
>>>> One of Iceberg's most advantageous features is the ability to inspect a 
>>>>

[DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-05 Thread Szehon Ho
Hi folks,

I would like to discuss an idea for an optional extension of Iceberg's
Snapshot metadata lifecycle.  Thanks Piotr for replying on the other thread
that this should be a fuller Iceberg format change.

*Proposal Summary*

Currently, ExpireSnapshots(long olderThan) purges metadata and deleted data
of a Snapshot together.  Purging deleted data often requires a smaller
timeline, due to strict requirements to claw back unused disk space,
fulfill data lifecycle compliance, etc.  In many deployments, this means
'olderThan' timestamp is set to just a few days before the current time
(the default is 5 days).

On the other hand, purging metadata could be ideally done on a more relaxed
timeline, such as months or more, to allow for meaningful historical table
analysis.

We should have an optional way to purge Snapshot metadata separately from
purging deleted data.  This would allow us to get the history of the table, and
answer questions like:

   - When was a file/partition added
   - When was a file/partition deleted
   - How much data was added or removed in time X

that are currently only possible for data operations within a few days.
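
For reference, the coupled behavior today through the Java API looks roughly
like this -- a minimal sketch, assuming a loaded Table named 'table' and
java.util.concurrent.TimeUnit imported:

    // A single ExpireSnapshots call purges the snapshot metadata AND deletes
    // the data/manifest files that become unreachable; the two lifecycles
    // cannot currently be separated.
    table.expireSnapshots()
        .expireOlderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(5))
        .commit();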

*Github Proposal*:  https://github.com/apache/iceberg/issues/10646
*Google Design Doc*:
https://docs.google.com/document/d/1m5K_XT7bckGfp8VrTe2093wEmEMslcTUE3kU_ohDn6A/edit


Curious if anyone has thought along these lines and/or sees obvious
issues.  Would appreciate any feedback on the proposal.

Thanks
Szehon


Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-08 Thread Szehon Ho
Thanks for the comments so far.  I also thought previously that this
functionality would be in an external system, like LakeChime, or a custom
catalog extension.  But after doing an initial analysis (please double
check), I thought it's a small enough change that it would be worth putting
in the Iceberg spec/API's for all users:

   - Table Spec, only one optional boolean field (on Snapshot, only set if
   the functionality is used).
   - API, only one boolean parameter (on ExpireSnapshots); see the sketch below.
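
A minimal sketch of that API shape, assuming a loaded Table named 'table' and
a 'cutoffMillis' timestamp; the new method name below is purely hypothetical
and not part of the current Iceberg API:

    table.expireSnapshots()
        .expireOlderThan(cutoffMillis)
        .retainExpiredMetadata(true)  // hypothetical flag: purge data, keep snapshot metadata
        .commit();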

I do wonder, will keeping expired snapshots as is slow down manifest/scan
> planning though (REST catalog approaches could probably mitigate this)?
>

I think it should not slow down manifest/scan planning, because we plan
using the current snapshot (or the one we specify via time travel), and we
wouldn't read expired snapshots in this case.
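
For example, a time travel read pins a concrete snapshot explicitly (standard
Spark SQL since 3.3, assuming a SparkSession named 'spark'):

    // Planning reads either the current snapshot or the one pinned here;
    // expired snapshots are never consulted.
    spark.sql("SELECT * FROM db.t VERSION AS OF 1234567890123").show();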

Thanks
Szehon

On Mon, Jul 8, 2024 at 10:54 AM John Greene  wrote:

> I do agree with the need that this proposal solves, to decouple the
> snapshot history from the data deletion. I do wonder, will keeping expired
> snapshots as is slow down manifest/scan planning though (REST catalog
> approaches could probably mitigate this)?
>
> On Mon, Jul 8, 2024, 5:34 AM Piotr Findeisen 
> wrote:
>
>> Hi Szehon, Walaa
>>
>> Thanks Szehon for bringing this up. And thank you Walaa for providing more
>> context from a similar existing solution to the problem.
>> The choices that LakeChime seems to have made -- to keep information in a
>> separate RDBMS and which particular metadata information to retain -- they
>> indeed look use-case specific, until we observe repeating patterns.
>> The idea to bake lifecycle changes into table format spec was proposed as
>> an alternative to the idea to bake lifecycle changes into REST catalog
>> spec. It was brought into discussion based on the intuition that REST
>> catalog is first-class citizen in Iceberg world, just like other catalogs,
>> and so solutions to table-centric problems do not need to be limited to
>> REST catalog. What is the information we retain, how/whether this is
>> configurable are open question and applicable to both avenues.
>>
>> As a 3rd/another alternative, we could focus on REST catalog *extensions*,
>> without naming snapshot metadata lifecycle, and leave the problem up to
>> REST's implementors. That would mean Iceberg project doesn't address
>> snapshot metadata lifecycle changes topic directly, but instead gives users
>> tools to build solutions around it. At this point I am not trying to judge
>> whether it's a good idea or not. It probably depends on how important it is to
>> solve the problem and have a common solution.
>>
>> Best,
>> Piotr
>>
>>
>>
>>
>> On Sat, 6 Jul 2024 at 09:46, Walaa Eldin Moustafa 
>> wrote:
>>
>>> Hi Szehon,
>>>
>>> Thanks for sharing this proposal. We have thought along the same lines
>>> and implemented an external system (LakeChime [1]) that retains snapshot +
>>> partition metadata for longer (actual internal implementation keeps data
>>> for 13 months, but that can be tuned). For efficient analysis, we have kept
>>> this data in an RDBMS. My opinion is this may be a better fit to an
>>> external system (similar to LakeChime) since it could potentially
>>> complicate the Iceberg spec, APIs, or their implementations. Also, the type
>>> of metadata tracked can differ depending on the use case. For example,
>>> while LakeChime retains partition and operation type metadata, it does not
>>> track file-level metadata as there was no specific use case for that.
>>>
>>> [1]
>>> https://www.linkedin.com/blog/engineering/data-management/lakechime-a-data-trigger-service-for-modern-data-lakes
>>>
>>> Thanks,
>>> Walaa.
>>>
>>> On Fri, Jul 5, 2024 at 11:49 PM Szehon Ho 
>>> wrote:
>>>
>>>> Hi folks,
>>>>
>>>> I would like to discuss an idea for an optional extension of Iceberg's
>>>> Snapshot metadata lifecycle.  Thanks Piotr for replying on the other thread
>>>> that this should be a fuller Iceberg format change.
>>>>
>>>> *Proposal Summary*
>>>>
>>>> Currently, ExpireSnapshots(long olderThan) purges metadata and deleted
>>>> data of a Snapshot together.  Purging deleted data often requires a smaller
>>>> timeline, due to strict requirements to claw back unused disk space,
>>>> fulfill data lifecycle compliance, etc.  In many deployments, this means
>>>> 'olderThan' timestamp is set to just a few days

Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-09 Thread Szehon Ho
Thanks Peter and Yufei.

Yes, in terms of implementation, I noted in the doc we need to add error
checks to prevent time-travel / rollback / cherry-pick operations to
'expired' snapshots.  I'll make it clearer in the doc which operations
we need to check against.

I believe DeleteOrphanFiles may be ok as is, because currently the logic
walks down the reachable graph and marks those metadata files as
'not-orphan', so it should naturally walk these 'expired' snapshots as well.

So, I think the main changes in terms of implementations is going to be
adding error checks in those Table API's, and updating ExpireSnapshots API.

Do we want to consider expiring snapshots in the middle of the history of
> the table?
>
You mean purging expired snapshots in the middle of the history, right?  I
think the current mechanism for this is 'tagging' and 'branching'.  So
interestingly, I was thinking it's related to your other question: if we
don't add error checks to 'tagging' and 'branching' on 'expired' snapshots,
they could be handled just as other snapshots are handled today.
It's one option.  We could support it subsequently as well, after the first
version and if there's some usage of this.

One thing that comes up in this thread and google doc is some question
about the size of preserved metadata.  I had put in the Alternatives
section, that we could potentially make the ExpireSnapshots purge boolean
argument more nuanced like PURGE, PRESERVE_REFS (snapshot refs are
preserved), PRESERVE_METADATA (snapshot refs and all metadata files are
preserved), though I am still debating if it's worth it, as users could
choose not to use this feature.

Thanks
Szehon



On Tue, Jul 9, 2024 at 6:02 PM Yufei Gu  wrote:

> Thank you for the interesting proposal. With a minor specification change,
> it could indeed enable different retention periods for data files and
> metadata files. This differentiation is useful for two reasons:
>
>1. More metadata helps us better understand the table history,
>providing valuable insights.
>2. Users often prioritize data file deletion as it frees up
>significant storage space and removes potentially sensitive data.
>
> However, adding a boolean property to the specification isn't necessarily
> a lightweight solution. As Peter mentioned, implementing this change
> requires modifications in several places. In this context, external systems
> like LakeChime or a REST catalog implementation could offer effective
> solutions to manage extended metadata retention periods, without spec
> changes.
>
> I am neutral on this proposal (+0) and look forward to seeing more input
> from people.
> Yufei
>
>
> On Mon, Jul 8, 2024 at 10:32 PM Péter Váry 
> wrote:
>
>> We need to handle expired snapshots in several places differently in
>> Iceberg core as well.
>> - We need to add checks to prevent scans from reading these snapshots and throw a
>> meaningful error.
>> - We need to add checks to prevent tagging/branching these snapshots
>> - We need to update DeleteOrphanFiles in Spark/Flink to not consider
>> files only referenced by the expired snapshots
>>
>> Some Flink jobs do frequent commits, and in these cases, the size of the
>> metadata file becomes a constraining factor too. In this case, we could
>> just advise not to use this feature, and expire the metadata as we do now,
>> but I thought it was worth mentioning.
>>
>> Do we want to consider expiring snapshots in the middle of the history of
>> the table?
>> When we compact the table, then the compaction commits litter the real
>> history of the table. Consider the following:
>> - S1 writes some data
>> - S2 writes some more data
>> - S3 compacts the previous 2 commits
>> - S4 writes even more data
>> From the query engine user perspective S3 is a commit which does nothing,
>> not initiated by the user, and most probably they don't even want to know
>> about. If one can expire a snapshot from the middle of the history, that would
>> be nice, so users would see only S1/S2/S4. The only downside is that
>> reading S2 is less performant than reading S3, but IMHO this could be
>> acceptable for having only user driven changes in the table history.
>>
>>
>> On Mon, Jul 8, 2024, 20:15 Szehon Ho  wrote:
>>
>>> Thanks for the comments so far.  I also thought previously that this
>>> functionality would be in an external system, like LakeChime, or a custom
>>> catalog extension.  But after doing an initial analysis (please double
>>> check), I thought it's a small enough change that it would be worth putting
>>> in the Iceberg 

Re: allowing configs to be specified in SQLConf for Spark reads/writes

2024-07-09 Thread Szehon Ho
Hi,

Just FYI, good news: this change is merged on the Spark side:
https://github.com/apache/spark/pull/46707 (it's the third effort!).  In
next version of Spark, we will be able to pass read properties via SQL to a
particular Iceberg table such as

SELECT * FROM iceberg.db.table1 WITH (`locality` = `true`)

I will look at write options after this.
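
For comparison, here is a sketch of how the same per-table read property is
passed today through the DataFrame reader (assuming a SparkSession named
'spark' and the usual Dataset/Row imports); the new syntax brings this to SQL:

    // DataFrame-level equivalent of the WITH (...) read option above.
    Dataset<Row> df = spark.read()
        .format("iceberg")
        .option("locality", "true")  // same key as in the SQL example
        .load("iceberg.db.table1");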

There's also progress in supporting DELETE/UPDATE/MERGE from DataFrames as
well; it should also be coming soon in Spark.

Thanks,
Szehon



On Wed, Jul 26, 2023 at 12:46 PM Wing Yew Poon 
wrote:

> We are talking about DELETE/UPDATE/MERGE operations. There is only SQL
> support for these operations. There is no DataFrame API support for them.*
> Therefore write options are not applicable. Thus SQLConf is the only
> available mechanism I can use to override the table property.
> For reference, we currently support setting distribution mode using write
> option, SQLConf and table property. It seems to me that
> https://github.com/apache/iceberg/pull/6838/ is a precedent for what I'd
> like to do.
>
> * It would be of interest to support performing DELETE/UPDATE/MERGE from
> DataFrames, but that is a whole other topic.
>
>
> On Wed, Jul 26, 2023 at 12:04 PM Ryan Blue  wrote:
>
>> I think we should aim to have the same behavior across properties that
>> are set in SQL conf, table config, and write options. Having SQL conf
>> override table config for this doesn't make sense to me. If the need is to
>> override table configuration, then write options are the right way to do it.
>>
>> On Wed, Jul 26, 2023 at 10:10 AM Wing Yew Poon
>>  wrote:
>>
>>> I was on vacation.
>>> Currently, write modes (copy-on-write/merge-on-read) can only be set as
>>> table properties, and default to copy-on-write. We have a customer who
>>> wants to use copy-on-write for certain Spark jobs that write to some
>>> Iceberg table and merge-on-read for other Spark jobs writing to the same
>>> table, because of the write characteristics of those jobs. This seems like
>>> a use case that should be supported. The only way they can do this
>>> currently is to toggle the table property as needed before doing the
>>> writes. This is not a sustainable workaround.
>>> Hence, I think it would be useful to be able to configure the write mode
>>> as a SQLConf. I also disagree that the table property should always win. If
>>> this is the case, there is no way to override it. The existing behavior in
>>> SparkConfParser is to use the option if set, else use the session conf if
>>> set, else use the table property. This applies across the board.
>>> - Wing Yew
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sun, Jul 16, 2023 at 4:48 PM Ryan Blue  wrote:
>>>
 Yes, I agree that there is value for administrators from having some
 things exposed as Spark SQL configuration. That gets much harder when you
 want to use the SQLConf for table-level settings, though. For example, the
 target split size is something that was an engine setting in the Hadoop
 world, even though it makes no sense to use the same setting across vastly
 different tables --- think about joining a fact table with a dimension
 table.

 Settings like write mode are table-level settings. It matters what is
 downstream of the table. You may want to set a *default* write mode, but
 the table-level setting should always win. Currently, there are limits to
 overriding the write mode in SQL. That's why we should add hints. For
 anything beyond that, I think we need to discuss what you're trying to do.
 If it's to override a table-level setting with a SQL global, then we should
 understand the use case better.

 On Fri, Jul 14, 2023 at 6:09 PM Wing Yew Poon
  wrote:

> Also, in the case of write mode (I mean write.delete.mode,
> write.update.mode, write.merge.mode), these cannot be set as options
> currently; they are only settable as table properties.
>
> On Fri, Jul 14, 2023 at 5:58 PM Wing Yew Poon 
> wrote:
>
>> I think that different use cases benefit from or even require
>> different solutions. I think enabling options in Spark SQL is helpful, 
>> but
>> allowing some configurations to be done in SQLConf is also helpful.
>> For Cheng Pan's use case (to disable locality), I think providing a
>> conf (which can be added to spark-defaults.conf by a cluster admin) is
>> useful.
>> For my customer's use case (
>> https://github.com/apache/iceberg/pull/7790), being able to set the
>> write mode per Spark job (where right now it can only be set as a table
>> property) is useful. Allowing this to be done in the SQL with an
>> option/hint could also work, but as I understand it, Szehon's PR (
>> https://github.com/apache/spark/pull/416830) is only applicable to
>> reads, not writes.
>>
>> - Wing Yew
>>
>>
>> On Thu, Jul 13, 2023 at 1:04 AM Cheng Pan  wrote:
>>
>>> Ryan, I understand th

Re: allowing configs to be specified in SQLConf for Spark reads/writes

2024-07-09 Thread Szehon Ho
Sure, the PRs are https://github.com/apache/spark/pull/44119 (merge) and
https://github.com/apache/spark/pull/47233 (update); delete is in progress.

Thanks
Szehon

On Tue, Jul 9, 2024 at 10:27 PM Wing Yew Poon 
wrote:

> Hi Szehon,
> Thanks for the update.
> Can you please point me to the work on supporting DELETE/UPDATE/MERGE in
> the DataFrame API?
> Thanks,
> Wing Yew
>
>
> On Tue, Jul 9, 2024 at 10:05 PM Szehon Ho  wrote:
>
>> Hi,
>>
>> Just FYI, good news: this change is merged on the Spark side:
>> https://github.com/apache/spark/pull/46707 (it's the third effort!).  In
>> next version of Spark, we will be able to pass read properties via SQL to a
>> particular Iceberg table such as
>>
>> SELECT * FROM iceberg.db.table1 WITH (`locality` = `true`)
>>
>> I will look at write options after this.
>>
>> There's also progress in supporting DELETE/UPDATE/MERGE from DataFrames
>> as well; it should also be coming soon in Spark.
>>
>> Thanks,
>> Szehon
>>
>>
>>
>> On Wed, Jul 26, 2023 at 12:46 PM Wing Yew Poon
>>  wrote:
>>
>>> We are talking about DELETE/UPDATE/MERGE operations. There is only SQL
>>> support for these operations. There is no DataFrame API support for them.*
>>> Therefore write options are not applicable. Thus SQLConf is the only
>>> available mechanism I can use to override the table property.
>>> For reference, we currently support setting distribution mode using
>>> write option, SQLConf and table property. It seems to me that
>>> https://github.com/apache/iceberg/pull/6838/ is a precedent for what
>>> I'd like to do.
>>>
>>> * It would be of interest to support performing DELETE/UPDATE/MERGE from
>>> DataFrames, but that is a whole other topic.
>>>
>>>
>>> On Wed, Jul 26, 2023 at 12:04 PM Ryan Blue  wrote:
>>>
>>>> I think we should aim to have the same behavior across properties that
>>>> are set in SQL conf, table config, and write options. Having SQL conf
>>>> override table config for this doesn't make sense to me. If the need is to
>>>> override table configuration, then write options are the right way to do 
>>>> it.
>>>>
>>>> On Wed, Jul 26, 2023 at 10:10 AM Wing Yew Poon
>>>>  wrote:
>>>>
>>>>> I was on vacation.
>>>>> Currently, write modes (copy-on-write/merge-on-read) can only be set
>>>>> as table properties, and default to copy-on-write. We have a customer who
>>>>> wants to use copy-on-write for certain Spark jobs that write to some
>>>>> Iceberg table and merge-on-read for other Spark jobs writing to the same
>>>>> table, because of the write characteristics of those jobs. This seems like
>>>>> a use case that should be supported. The only way they can do this
>>>>> currently is to toggle the table property as needed before doing the
>>>>> writes. This is not a sustainable workaround.
>>>>> Hence, I think it would be useful to be able to configure the write
>>>>> mode as a SQLConf. I also disagree that the table property should always
>>>>> win. If this is the case, there is no way to override it. The existing
>>>>> behavior in SparkConfParser is to use the option if set, else use the
>>>>> session conf if set, else use the table property. This applies across the
>>>>> board.
>>>>> - Wing Yew
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Jul 16, 2023 at 4:48 PM Ryan Blue  wrote:
>>>>>
>>>>>> Yes, I agree that there is value for administrators from having some
>>>>>> things exposed as Spark SQL configuration. That gets much harder when you
>>>>>> want to use the SQLConf for table-level settings, though. For example, 
>>>>>> the
>>>>>> target split size is something that was an engine setting in the Hadoop
>>>>>> world, even though it makes no sense to use the same setting across 
>>>>>> vastly
>>>>>> different tables --- think about joining a fact table with a dimension
>>>>>> table.
>>>>>>
>>>>>> Settings like write mode are table-level settings. It matters what is
>>>>>> downstream of the table. You may want to set a *default* write mode, but
>>>>>> the table-level set

Re: [VOTE] spec: remove the JSON spec for content file and file scan task sections

2024-07-11 Thread Szehon Ho
+1

Thanks
Szehon

On Thu, Jul 11, 2024 at 11:02 AM Daniel Weeks  wrote:

> +1 (binding)
>
> On Thu, Jul 11, 2024 at 10:54 AM Anurag Mantripragada
>  wrote:
>
>> +1 (non-binding). Thanks Steve
>>
>>
>> Anurag Mantripragada
>>
>> On Jul 11, 2024, at 10:27 AM, Yufei Gu  wrote:
>>
>> +1 (binding) Thanks for doing this, Steven.
>> Yufei
>>
>>
>> On Thu, Jul 11, 2024 at 10:16 AM Amogh Jahagirdar <2am...@gmail.com>
>> wrote:
>>
>>> + 1 (non-binding).
>>>
>>> Thanks,
>>>
>>> Amogh Jahagirdar
>>>
>>> On Thu, Jul 11, 2024 at 10:25 AM Péter Váry 
>>> wrote:
>>>
 +1 (non-binding)

 On Thu, Jul 11, 2024, 17:31 Jack Ye  wrote:

> +1 (binding)
>
> On Thu, Jul 11, 2024 at 3:37 AM Piotr Findeisen <
> piotr.findei...@gmail.com> wrote:
>
>> it looks like it's part of the spec that's not connected to the other
>> parts of the spec (like "dead code")
>>
>> +1 (non binding)
>>
>>
>> On Thu, 11 Jul 2024 at 08:30, Eduard Tudenhöfner <
>> etudenhoef...@apache.org> wrote:
>>
>>> +1 (non-binding)
>>>
>>> On Thu, Jul 11, 2024 at 8:29 AM Ajantha Bhat 
>>> wrote:
>>>
 +1 (non-binding)

 - Ajantha

 On Thu, Jul 11, 2024 at 11:02 AM Jean-Baptiste Onofré <
 j...@nanthrax.net> wrote:

> +1 (non binding)
>
> Regards
> JB
>
> On Thu, Jul 11, 2024 at 12:50 AM Steven Wu 
> wrote:
> >
> > Following the latest community guidelines, I would like to start
> a voting thread on removing the JSON spec for content file and file 
> scan
> task. Here is the PR for the spec change [1]
> >
> > This was previously discussed in the dev mailing list [2]. While
> it is good to add the JSON serializer in iceberg-core for ContentFile 
> and
> FileScanTask, their JSON formats don't need to be added to the core 
> table
> spec.
> >
> > Please vote in the next 72 hours.
> >
> > Thanks,
> > Steven
> >
> > [1] https://github.com/apache/iceberg/pull/9771
> > [2]
> https://lists.apache.org/thread/2ty27yx4q0zlqd5h71cyyhb5k47yf9bv
> >
>

>>


Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-16 Thread Szehon Ho
Hi,

Thanks for reading through the proposal and the good feedback. I was
thinking about the mentioned concerns:

   - The motivation for the change
   - Too much additional metadata (storage overhead, namenode pressure on
   HDFS)
   - Performance impact for reading/writing TableMetadata
   - Some impact to existing Table APIs and maintenance procedures, which
   have to check for these snapshots

I chatted a bit offline with Yufei to brainstorm, and I wrote a V2 of the
proposal at the same link:
https://docs.google.com/document/d/1m5K_XT7bckGfp8VrTe2093wEmEMslcTUE3kU_ohDn6A/edit.
I also tried to clarify the motivation in the doc with actual metadata
table queries that would be possible.

This version now simply adds an optional 'expired-snapshots-path' that
contains the metadata of expired Snapshots.  I think this should address
the above concerns:

   - Minimal storage overhead for just snapshot references (capped).  I
   no longer propose to keep old snapshot manifest-list/manifest files;
   the snapshot reference to the expired snapshot should be a good start.
   - Minimal perf overhead for reading/writing TableMetadata.  The additional
   file is only written by ExpireSnapshots if the feature is enabled, and only
   read on demand (via a metadata table query, for example).
   - No impact to other Table APIs or maintenance procedures (as these don't
   show up in the regular table.snapshots() list anymore).
   - Only an additive, optional spec change (backwards compatible).

Of course, again, this feature is possible outside Iceberg, but the
advantage of doing it in Iceberg is that it could be integrated into
ExpireSnapshots and Metadata Table frameworks.
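
If this lands, the preserved references could naturally surface through the
metadata table framework -- a purely hypothetical query shape, assuming a
SparkSession named 'spark' (no 'expired_snapshots' metadata table exists today):

    // Hypothetical: only illustrates how preserved expired-snapshot
    // references might be consumed.
    spark.sql("SELECT snapshot_id, committed_at, operation "
        + "FROM db.t.expired_snapshots").show();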

Curious what people think?

Thanks
Szehon

On Wed, Jul 10, 2024 at 1:44 AM Péter Váry 
wrote:

> > I believe DeleteOrphanFiles may be ok as is, because currently the logic
> walks down the reachable graph and marks those metadata files as
> 'not-orphan', so it should naturally walk these 'expired' snapshots as well.
>
> We need to keep the metadata files, but remove data files if they are not
> removed for whatever reason. Doable, but it's a logic change.
>
> > You mean purging expired snapshots in the middle of the history, right?
> I think the current mechanism for this is 'tagging' and 'branching'.
>
> I think for most users the compaction commits are technical details which
> they would like to avoid / don't want to see. The real table history is
> only the changes initiated by the user, and it would be good to hide the
> technical/compaction commits.
>
>
> On Wed, Jul 10, 2024, 08:52 himadri pal  wrote:
>
>> Hi Szehon,
>>
>> This is a good idea considering the use case it intends to solve. I added a
>> few questions and comments in the design doc.
>>
>> IMO, the alternate options specified in the design doc look
>> cleaner to me.
>>
>> I think it might add to the maintenance burden, since we now need to remember
>> to remove these metadata-only snapshots.
>>
>> Also, I wonder whether some of the use cases it intends to address are
>> solvable by metadata alone - e.g. how much data was added in a given time
>> range? Maybe to answer these kinds of questions users would prefer to
>> create KPIs using columns in the dataset.
>>
>>
>> Regards,
>> Himadri Pal
>>
>>
>> On Tue, Jul 9, 2024 at 11:10 PM Steven Wu  wrote:
>>
>>> I am not totally convinced of the motivation yet.
>>>
>>> I thought the snapshot retention window is primarily meant for time
>>> travel and troubleshooting table changes that happened recently (like a few
>>> days or weeks).
>>>
>>> Is it valuable enough to keep expired snapshots for as long as months or
>>> years? While metadata files are typically smaller than data files in total
>>> size, it can still be significant considering the default amount of column
>>> stats written today (especially for wide tables with many columns).
>>>
>>> How long are we going to keep the expired snapshot references by
>>> default? If it is months/years, it can have major implications on the query
>>> performance of metadata tables (like snapshots, all_*).
>>>
>>> I assume it will also have some performance impact on table loading as a
>>> lot more expired snapshots are still referenced.
>>>
>>>
>>>
>>>
>>> On Tue, Jul 9, 2024 at 6:36 PM Szehon Ho 
>>> wrote:
>>>
>>>> Thanks Peter and Yufei.
>>>>
>>>> Yes, in terms of implementation, I noted in the doc we need to add
>>>> error checks to prevent time-travel / rollback / cherry-pick operations to
>>>

Re: [DISCUSS] DROP PARTITION in Spark

2024-07-17 Thread Szehon Ho
Hi Gabor

I'm neutral on this, but can be convinced.  My initial thought is that
there would be no way to have ADD PARTITION (I assume old Hive workloads
would rely on this), and these are not ANSI SQL standard statements as
Spark moves in that direction.

The second point, guaranteeing a metadata-only operation, is interesting;
an alternative would be to have a flag that fails unless the query can be
answered by metadata.
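
For context, a partition-aligned DELETE is already metadata-only in Iceberg --
a sketch assuming a SparkSession named 'spark' and a hypothetical partition
column 'dt':

    // When the predicate matches whole partitions/files, Iceberg commits
    // this as a metadata-only delete and writes no delete files.
    spark.sql("DELETE FROM iceberg.db.t WHERE dt = DATE '2024-07-01'");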

Thanks
Szehon



On Wed, Jul 17, 2024 at 2:12 AM Gabor Kaszab  wrote:

> Hey Community,
>
> I learned recently that Spark doesn't support DROP PARTITION for Iceberg
> tables. I understand this is because the DROP PARTITION is something being
> used for Hive tables and Iceberg's model for hidden partitioning makes it
> unnatural to have commands like this.
>
> However, I think that DROP PARTITION would still have some value for
> users. In fact in Impala we implemented this even for Iceberg tables.
> Benefits could be:
>  - Users with workloads on Hive tables could keep using those workloads
> after migrating their tables to Iceberg.
>  - As opposed to DELETE FROM, DROP PARTITION has a guarantee that this is
> going to be a metadata-only operation and no delete files are going to be
> written.
>
> I'm curious what the community thinks of this.
> Gabor
>
>


Re: Building with JDK 21

2024-07-22 Thread Szehon Ho
Thanks Piotr for driving this, late +1 to add JDK 21 support and your plan
for spotless.

It seems ok to me too to bite the bullet and move to newer spotless
(disabling spotless for JDK8 builds) post 1.6, but looks like the
discussion happened and I'm fine either way.

Thanks!
Szehon

On Mon, Jul 22, 2024 at 6:38 AM Piotr Findeisen 
wrote:

> Thanks Fokko.
> I like the idea. Started a new "Dropping JDK 8 support" thread to ensure
> transparency.
>
> Best
> Piotr
>
>
> On Mon, 22 Jul 2024 at 15:24, Fokko Driesprong  wrote:
>
>> Thanks for summarizing this, Piotr.
>>
>> I believe having a separate thread on dropping Java 8 is the right thing
>> to do. We want to be as transparent about these changes as possible.
>>
>> Kind regards,
>> Fokko Driesprong
>>
>> Op ma 22 jul 2024 om 14:37 schreef Piotr Findeisen <
>> piotr.findei...@gmail.com>:
>>
>>> Thanks for this lively discussion, it is great to see so many great
>>> people involved!
>>>
>>> We have unanimous agreement that we add support for JDK 21.
>>> Partial support (without spotless) will be added after 1.6.0 release is
>>> out (just not to mess up with the release).
>>> Full support (with spotless) will be added as soon as JDK 8 is dropped.
>>> For 21 we have clarity and we don't have preconditions, this is great.
>>> And this is a non-destructive operation too.
>>>
>>> We do not have agreement for dropping Hive module. This will be
>>> discussed separately on a new thread.
>>>
>>> We also seem to have unanimous agreement for dropping JDK 8.
>>> As to the timeline, it was proposed to do this in 2.0 release, so let's
>>> roll with this, unless there are new objections
>>> Since dropping support for something can be seen as a destructive
>>> operation, does it require a formal vote?
>>> Or do we treat +1 and -1 on this thread as votes already cast?
>>>
>>> Best,
>>> Piotr
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sat, 20 Jul 2024 at 00:12, Jack Ye  wrote:
>>>
 +1 for dropping JDK8 support and adding JDK21.

 > What does dropping Java 8 support mean to companies that are still
 using Java 8 for Iceberg in production?

 From the AWS side, AWS Corretto JDK8 end of life is July 2026, see:
 https://aws.amazon.com/corretto/faqs/#support_calendar. I would
 suggest at least migrate before that time.

 -Jack



 On Fri, Jul 19, 2024 at 3:02 PM John Zhuge  wrote:

> +1 adding java 21 support
> +1 removing java 8 support
>
> On Fri, Jul 19, 2024 at 1:33 PM Daniel Weeks 
> wrote:
>
>> I'm also in favor of removing Java 8 support.  Hive docs state Hive
>> 3 requires java 8
>>  and in
>> prior cases there were potential correctness issues when running with 
>> newer
>> Java versions (these may have been addressed).
>>
>> As long as we're not updating the target version, I think we should
>> be ok as they can still run in Java 8 if that remains a requirement.
>>
>> +1 to removing Java 8 support
>> +1 to adding Java 21 support.
>>
>> -Dan
>>
>>
>>
>> On Fri, Jul 19, 2024 at 1:04 PM Ryan Blue 
>> wrote:
>>
>>> I agree that if we can separate the discussion about how to support
>>> Hive, then we should do that.
>>>
>>> +1 to removing Java 8 support
>>> +1 to adding Java 21 support.
>>>
>>> On Fri, Jul 19, 2024 at 12:58 PM huaxin gao 
>>> wrote:
>>>
 +1 in favor of adding java 21 support
 +1 in favor of removing java 8 support

 I am currently working on Spark 4.0 / Iceberg integration
 . Spark 4.0 runs on
 Java 17/21.

 On Fri, Jul 19, 2024 at 4:58 AM Piotr Findeisen <
 piotr.findei...@gmail.com> wrote:

> Hi,
>
> We recently started to test Hive3 with Java 11 and 17
>  and the tests pass.
> So dropping Java 8 doesn't technically require removing the Hive 3
> related modules, unless users cannot do anything useful with them 
> (because
> e.g. they can only run Hive runtime with Java 8 for some reason).
> Peter, can you please confirm this is not the case?
> Then it seems we could proceed with JDK 8 drop and discuss what to
> do with Hive modules *separately*.
>
> re original question of adding JDK 21 support -- we seem to have
> strong consensus to add it.
> Eduard plans to merge the PR once 1.6.0 is out. So I think we no
> longer need to debate this topic, unless there are any new objections 
> to be
> raised.
>
>
> Best
> Piotr
>
>
>
>
>
>
>
> On Fri, 19 Jul 2024 at 13:49, Péter Váry <
> peter.vary.apa...@gmail.com> w

Re: Dropping JDK 8 support

2024-07-22 Thread Szehon Ho
+1 for dropping JDK 8 in Iceberg 2.0.  I also wonder the same thing as
Huaxin (sorry if I missed a previous thread on Iceberg 2.0 plan).

Also, as Huaxin has discovered in the Spark 4.0 Support PR
, it looks like we may have to
drop Java 8 first in the Spark 4.0 module, since Spark 4.0 itself drops it,
before Iceberg 2.0.

Thanks
Szehon

On Mon, Jul 22, 2024 at 6:31 PM huaxin gao  wrote:

> +1 (non-binding)
>
> I have a question about Iceberg versioning. After the 1.6 release, will
> there be versions 1.7, 1.8 and 1.9, or will it go straight to 2.0?
>
> On Mon, Jul 22, 2024 at 5:32 PM Manu Zhang 
> wrote:
>
>> If JDK 8 support is dropped in 2.0, will we continue to fix critical
>> issues in 1.6+?
>>
>> On Tue, Jul 23, 2024 at 1:35 AM Jack Ye  wrote:
>>
>>> +1 (binding), I did not expect this to be a vote thread, but overall +1
>>> for dropping JDK8 support.
>>>
>>> -Jack
>>>
>>> On Mon, Jul 22, 2024 at 10:30 AM Yufei Gu  wrote:
>>>
 +1 (binding); as much as I want to drop JDK 8, I still encourage everyone
 to speak out about any concerns.
 Yufei


 On Mon, Jul 22, 2024 at 10:24 AM Steven Wu 
 wrote:

> +1 (binding)
>
> On Mon, Jul 22, 2024 at 6:37 AM Piotr Findeisen <
> piotr.findei...@gmail.com> wrote:
>
>> Hi,
>>
>> in the "Building with JDK 21" email thread we discussed adding JDK 21
>> support and also dropping JDK 8 support, as these things were initially
>> related.
>> A lot of people expressed acceptance for dropping JDK 8 support, and
>> release 2.0 was proposed as a timeline.
>> There were also concerned raised, as some people still use JDK 8.
>>
>> Let me start this new thread for a discussion and perhaps formal vote
>> for dropping JDK 8 support in Iceberg 2.0 release.
>>
>> Best
>> Piotr
>>
>>


Re: [VOTE] Drop Java 8 support in Iceberg 1.7.0

2024-07-26 Thread Szehon Ho
+1 (binding)

Thanks
Szehon

On Fri, Jul 26, 2024 at 8:55 AM Steven Wu  wrote:

> +1 (binding)
>
> I would also suggest keeping the vote open for 7 days for a larger
> decision like this.
>
>
> On Fri, Jul 26, 2024 at 8:50 AM Ryan Blue 
> wrote:
>
>> +1
>>
>> On Fri, Jul 26, 2024 at 8:42 AM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>>> +1 (bind)
>>>
>>> On Fri, Jul 26, 2024 at 8:34 AM Péter Váry 
>>> wrote:
>>>
 +1 (non-binding)

 Ajantha Bhat  ezt írta (időpont: 2024. júl.
 26., P, 14:51):

> +1
>
> On Fri, Jul 26, 2024 at 5:16 PM Eduard Tudenhöfner <
> etudenhoef...@apache.org> wrote:
>
>> +1 (non-binding) for dropping JDK8 support with Iceberg 1.7.0
>>
>> On Fri, Jul 26, 2024 at 1:29 PM Piotr Findeisen <
>> piotr.findei...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Dropping support for building and running on Java 8 was discussed
>>> previously on "Dropping JDK 8 support" and "Building with JDK 21" mail
>>> threads.
>>>
>>> As JB kindly pointed out, for a vote we need a "VOTE" thread, so
>>> here we go.
>>> Question: Should we drop Java 8 support in Iceberg 1.7.0?
>>>
>>> Best,
>>> Piotr
>>>
>>> PS
>>> +1 (non-binding) from me
>>>
>>
>>
>> --
>> Ryan Blue
>> Databricks
>>
>


Re: [DISCUSS] Guidelines for committing PRs

2024-07-29 Thread Szehon Ho
Hi,

Also if I read it correctly, I think this proposal imposes the following
workflows in "spec" folders:

   1. Large and functional changes.  These redirect to Iceberg improvement
   proposals, which end in a code-modification vote
   2. bug fixes or clarifications, which are specified to require a
   code-modification vote
   3. grammar, spelling, or minor formatting fixes, not covered here (I guess
   these go through normal code review?)

To me, (2) is a bit new here and in the grey area for interpretation.  I
was thinking about this while reviewing
https://github.com/apache/iceberg/pull/10793, which could be a category (2)
and non-functional change but would need a full code-modification vote as
per [Iceberg improvement proposal](#apache-iceberg-improvement-proposals).
I can see both sides: to avoid a potential dispute/misunderstanding over
the clarification, it would be nice to have a vote on the dev list.  But it
may also be yet another burden, when something can be more easily decided
on the GitHub discussion itself via approval by the relevant parties.  So I
think I would agree with Ryan in mentioning that a significant (would maybe
add "functional") spec change needs a vote on the dev list.

Thanks
Szehon

On Mon, Jul 29, 2024 at 1:16 PM Ryan Blue 
wrote:

> I think the proposed doc looks good, but I'm not sure that it is better to
> add this to our guidelines.
>
> On one hand the doc describes how ASF communities work in general:
> committers review and commit PRs and are expected to use good judgement,
> ask one another for help when necessary, and broaden the set of people in
> the discussion when there's a disagreement. I really appreciate that Micah
> called out that this is intentionally vague to emphasize committer
> judgement.
>
> The problem I'm worried about is the tendency to misuse docs like this and
> become focused on it as a rule. People tend to apply written rules
> mechanically and I worry about people substituting a reading of this text
> for judgement. For example, a strict reading of "encouraged to ask another
> committer" means that it is optional.
>
> Given that the majority of the content here is stating how ASF communities
> work and the only Iceberg-specific parts are the proposal process and
> calling out that we vote on spec changes, I would probably just have a
> description of how to handle proposals (which is already there) and a note
> that significant spec changes should use a vote on the dev list.
>
> On Sun, Jul 28, 2024 at 11:15 PM Jean-Baptiste Onofré 
> wrote:
>
>> Hi Micah
>>
>> Thanks ! It looks good to me now you have included comments from everyone.
>>
>> Regards
>> JB
>>
>> On Fri, Jul 26, 2024 at 1:15 AM Micah Kornfield 
>> wrote:
>> >
>> > As part of the bylaws discussions that have been happening, we are
>> trying to make small focused proposals to move things forward.  As a first
>> step towards this I created a proposal for guidelines on committing pull
>> requests [1].  Feedback is appreciated.
>> >
>> > Given the level of interest in the discussions so far, it seems that
>> the best path forward is to hold an official vote before merging.  I intend
>> to do this once we appear to have consensus but if the people prefer we can
>> try to avoid the overhead.
>> >
>> > Thanks,
>> > Micah
>> >
>> >
>> > [1] https://github.com/apache/iceberg/pull/10780
>>
>
>
> --
> Ryan Blue
> Databricks
>


Re: [DISCUSS] Guidelines for committing PRs

2024-07-29 Thread Szehon Ho
Typo, wrong link:
(2) requiring full code-modification vote as per [Iceberg improvement
proposal](#apache-iceberg-improvement-proposals) => full code-modification
vote as per [code modification](
https://www.apache.org/foundation/voting.html#votes-on-code-modification)

On Mon, Jul 29, 2024 at 1:53 PM Szehon Ho  wrote:

> Hi,
>
> Also if I read it correctly, I think this proposal imposes the following
> workflows in "spec" folders :
>
>1. Large and functional changes.  These redirect to Iceberg
>    improvement proposals, which end in a code-modification vote
>    2. bug fixes or clarifications, which are specified to require a
>    code-modification vote
>    3. grammar, spelling, or minor formatting fixes, not covered here (I guess
>    these go through normal code review?)
>
> To me, (2) is a bit new here and in the grey area for interpretation.  I
> was thinking about this while reviewing
> https://github.com/apache/iceberg/pull/10793, which could be a category
> (2) and non-functional change but would need a full code-modification vote
> as per [Iceberg improvement
> proposal](#apache-iceberg-improvement-proposals).  I can see both sides: to
> avoid a potential dispute/misunderstanding over the clarification, it would
> be nice to have a vote on the dev list.  But it may also be yet another
> burden, when something can be more easily decided on the GitHub discussion
> itself via approval by the relevant parties.  So I think I would agree with
> Ryan in mentioning that a significant (would maybe add "functional") spec
> change needs a vote on the dev list.
>
> Thanks
> Szehon
>
> On Mon, Jul 29, 2024 at 1:16 PM Ryan Blue 
> wrote:
>
>> I think the proposed doc looks good, but I'm not sure that it is better
>> to add this to our guidelines.
>>
>> On one hand the doc describes how ASF communities work in general:
>> committers review and commit PRs and are expected to use good judgement,
>> ask one another for help when necessary, and broaden the set of people in
>> the discussion when there's a disagreement. I really appreciate that Micah
>> called out that this is intentionally vague to emphasize committer
>> judgement.
>>
>> The problem I'm worried about is the tendency to misuse docs like this
>> and become focused on it as a rule. People tend to apply written rules
>> mechanically and I worry about people substituting a reading of this text
>> for judgement. For example, a strict reading of "encouraged to ask another
>> committer" means that it is optional.
>>
>> Given that the majority of the content here is stating how ASF
>> communities work and the only Iceberg-specific parts are the proposal
>> process and calling out that we vote on spec changes, I would probably just
>> have a description of how to handle proposals (which is already there) and
>> a note that significant spec changes should use a vote on the dev list.
>>
>> On Sun, Jul 28, 2024 at 11:15 PM Jean-Baptiste Onofré 
>> wrote:
>>
>>> Hi Micah
>>>
>>> Thanks ! It looks good to me now you have included comments from
>>> everyone.
>>>
>>> Regards
>>> JB
>>>
>>> On Fri, Jul 26, 2024 at 1:15 AM Micah Kornfield 
>>> wrote:
>>> >
>>> > As part of the bylaws discussions that have been happening, we are
>>> trying to make small focused proposals to move things forward.  As a first
>>> step towards this I created a proposal for guidelines on committing pull
>>> requests [1].  Feedback is appreciated.
>>> >
>>> > Given the level of interest in the discussions so far, it seems that
>>> the best path forward is to hold an official vote before merging.  I intend
>>> to do this once we appear to have consensus but if the people prefer we can
>>> try to avoid the overhead.
>>> >
>>> > Thanks,
>>> > Micah
>>> >
>>> >
>>> > [1] https://github.com/apache/iceberg/pull/10780
>>>
>>
>>
>> --
>> Ryan Blue
>> Databricks
>>
>


Re: [DISCUSS] adoption of format version 3

2024-07-31 Thread Szehon Ho
Sorry I missed the sync this morning (sick), I'd like to push for geo too.

I think on this front as per the last sync, Ryan recommended to wait for
Parquet support to land, to avoid having two versions on Iceberg side
(Iceberg-native vs Parquet-native).  Parquet support is being actively
worked on iiuc: https://github.com/apache/parquet-format/pull/240 .  But it
would bind V3 to the parquet-format release timeline, unless we start with
iceberg-native support first and move later (as we originally proposed).

Thanks,
Szehon

On Wed, Jul 31, 2024 at 10:58 AM Walaa Eldin Moustafa 
wrote:

> Another feature that was planned for V3 is support for default values.
> Spec doc update was already merged a while ago [1]. Implementation is
> ongoing in this PR [2].
>
> [1] https://iceberg.apache.org/spec/#default-values
> [2] https://github.com/apache/iceberg/pull/9502
>
> Thanks,
> Walaa.
>
> On Wed, Jul 31, 2024 at 10:52 AM Russell Spitzer
>  wrote:
> >
> > Thanks for bringing this up, I would say that from my perspective I have
> time to really push through hopefully two things
> >
> > Variant Type and
> > Row Lineage (which I will have a proposal for on the mailing list next
> week)
> >
> > I'm using the Project to try to track logistics and minutiae required for
> the new spec version but I would like to bring other work in there as well
> so we can get a clear picture of what is actually being actively worked on.
> >
> > On Wed, Jul 31, 2024 at 12:27 PM Jacob Marble <
> jacobmar...@influxdata.com> wrote:
> >>
> >> Good morning,
> >>
> >> To continue the community sync today when format version 3 was
> discussed.
> >>
> >> Questions answered by consensus:
> >> - Format version releases should _not_ be tied to Iceberg version
> releases.
> >> - Several planned features will require format version releases; the
> process shouldn't be onerous.
> >>
> >> Unanswered questions:
> >> - What will be included in format version 3?
> >>   - What is a reasonable target date?
> >>   - How to track progress? Today, there are two public lists:
> >> - GH milestone: https://github.com/apache/iceberg/milestone/42
> >> - GH project: https://github.com/orgs/apache/projects/377
> >> - What is required of a feature in order to be included in any adopted
> format version?
> >>   - At least one complete reference implementation should exist.
> >> - Java is the reference implementation by convention; that's OK,
> but not perfect. Should Java be the reference implementation by mandate?
> >>
> >> Have I missed anything?
> >>
> >> --
> >> Jacob Marble
>


Re: [DISCUSS] adoption of format version 3

2024-08-06 Thread Szehon Ho
>>>>> Would it make sense to start releasing the
>>>>> table specification on a regular cadence (e.g. quarterly, every 6 months 
>>>>> or
>>>>> yearly)?
>>>>>
>>>>> I have been a big advocate for releasing all the Iceberg specs
>>>>> regularly, and just follow a normal product release cycle with major and
>>>>> minor releases. I touched a bit of the reasoning in the thread for fixing
>>>>> stats fields in REST spec [1]. This helps a lot with engines that do not
>>>>> use any Iceberg open source library and just look at a spec and implement
>>>>> it. With a regular release, they can have a stable version to look into,
>>>>> rather than a spec that is changing all the time within the same version.
>>>>>
>>>>> It is important to note that minor spec versions will not be leveraged
>>>>> in implementations like how we have logic right now for switching
>>>>> behaviors depending on major versions. It is purely for the purpose of
>>>>> making more incremental progress on the spec, and providing stable spec
>>>>> versions for other reference implementations. Otherwise, the branches in
>>>>> the codebase to handle different versions easily get out of control.
>>>>>
>>>>> I think Fokko brought up a point that "this will introduce a process
>>>>> that will slow the evolution down", which is true because you need to 
>>>>> spend
>>>>> additional effort and release it. And without a reference implementation,
>>>>> it is hard to say if the spec is mature enough to be released, which again
>>>>> makes it potentially tied to the release cycle of at least the Java 
>>>>> library.
>>>>>
>>>>> Curious what people think.
>>>>>
>>>>> Best,
>>>>> Jack Ye
>>>>>
>>>>> [1] https://lists.apache.org/thread/v6x772v9sgo0xhpwmh4br756zhbgomtf
>>>>>
>>>>> On Wed, Jul 31, 2024 at 10:19 PM Micah Kornfield <
>>>>> emkornfi...@gmail.com> wrote:
>>>>>
>>>>>> It sounds like most of the opinions so far are waiting for the scope
>>>>>> of work to finish before finalizing the specification.
>>>>>>
>>>>>> An alternative view: Would it make sense to start releasing the table
>>>>>> specification on a regular cadence (e.g. quarterly, every 6 months or
>>>>>> yearly)?  I think the problem with waiting for features to get in is that
>>>>>> priorities change and things take longer than expected, thus leaving the
>>>>>> actual finalization of the specification in limbo and probably adds to
>>>>>> project management overhead.   If the specification is released regularly
>>>>>> then it means features can always be included in the next release without
>>>>>> too much delay hopefully.  The main downside I can think of in this
>>>>>> approach is having to have more branches in code to handle different
>>>>>> versions.
>>>>>>
>>>>>> One corollary to this approach is spec changes shouldn't be merged
>>>>>> before their implementations are ready.
>>>>>>
>>>>>>   - At least one complete reference implementation should exist.
>>>>>>
>>>>>>
>>>>>> For more complicated features I think at some point soon it might be
>>>>>> worth considering two implementations (or at least 1 full implementation
>>>>>> and 1 read only implementation) to make sure there aren't compatibility
>>>>>> issues/misunderstandings in the specification (e.g. I think Variant and
>>>>>> Geography fall into this category).
>>>>>>
>>>>>> Cheers,
>>>>>> Micah
>>>>>>
>>>>>> On Wed, Jul 31, 2024 at 12:47 PM Russell Spitzer <
>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>
>>>>>>> I think this all sounds good, the real question is whether or not we
>>>>>>> have someone to actively work on the proposals. I think for things like
>>>>>>> Default Values and Geo Types we have folks actively working on them so 
>>>>>>> it's
>>>>>>> not a big deal.
>>>>>>>

Re: Welcome Péter, Amogh and Eduard to the Apache Iceberg PMC

2024-08-13 Thread Szehon Ho
Congratulations all, very well deserved!

Thanks
Szehon

On Tue, Aug 13, 2024 at 10:25 PM Russell Spitzer 
wrote:

> Hi Y'all,
>
> It is my pleasure to let everyone know that the Iceberg PMC has voted to
> have several talented individuals join us.
>
> So without further ado, please welcome Péter Váry, Amogh Jahagirdar and
> Eduard Tudenhoefner to the Apache Iceberg PMC.
>
> As usual I am excited about the future of this community and thankful for
> the hard work and stewardship of its members.
>
> Thank you for your time,
> Russell Spitzer
>


Re: [Discuss] Geospatial Support

2024-08-20 Thread Szehon Ho
Hi all

Please take a look at the proposed spec change to support Geo type for V3
in: https://github.com/apache/iceberg/pull/10981, and comment or otherwise
let me know your thoughts.
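
As a purely illustrative sketch of how the type could eventually surface in
DDL (assuming a SparkSession named 'spark'; the GEOMETRY type name and engine
support here are hypothetical, not something the PR or any released engine
provides):

    // Hypothetical DDL shape once engines wire up the proposed geometry type.
    spark.sql("CREATE TABLE iceberg.db.places (id BIGINT, geom GEOMETRY) USING iceberg");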

Just as an FYI it incorporated the feedback from our last meeting (with
Snowflake and Wherobots engineers).

Thanks,
Szehon

On Wed, Jun 26, 2024 at 7:29 PM Szehon Ho  wrote:

> Hi
>
> It was great to meet in person with Snowflake engineers and we had a good
> discussion on the paths forward.
>
> Meeting notes for Snowflake- Iceberg sync.
>
>    - Iceberg's proposed Geometry type defaults to (edges=planar,
>crs=CRS84).
>- Snowflake has two types Geography (spherical) and Geometry (planar,
>with customizable CRS).  The data layout/encoding is the same for both
>    types.  Let's see how we can support each with an Iceberg type, especially wrt
>Iceberg partition/file pruning
>- Geography type support
>- Main concern is the need for a suitable partition transform for
>   partition-level filter, the candidate is Michael Entin's proposal
>   
> <https://docs.google.com/document/d/1tG13UpdNH3i0bVkjFLsE2kXEXCuw1XRpAC2L2qCUox0/edit>
>   .
>   - Secondary concern is file and RG-level filtering.  Gang's Parquet
>   proposal <https://github.com/apache/parquet-format/pull/240/files> allows
>   storage of S2 / H3 ID's in Parquet stats, and so we can also leverage 
> that
>   in Iceberg pruning code (Google and Uber libraries are compatible)
>- Geometry type support
>   -  Main concern is that the partition transform needs to understand CRS,
>   but this can be solved by having the XZ2 transform created with a
>   customizable min/max lat/long range (it's all it needs)
>- Should (CRS, edges) be stored properties on Geography type in Phase
>1?
>   - Should be fine to store, with only allowing defaults in Phase 1.
>   - Concern 1: If edges is stored, there will be ask to store other
>   properties like (orientation, epoch).  Solution is to punt these 
> follow-on
>   properties for later.
>   - Concern 2: if crs is stored, what format?  PROJJSON vs SRID.
>   Solution is to leave it as a string
>   - Concern 3: if crs is stored as a string, Iceberg cannot read it.
>   This should be ok, as we only need this for XZ2 transform, where the 
> user
>   already passes in the info from CRS (up to user to make sure these 
> align).
>
> Thanks
> Szehon
>
> On Tue, Jun 18, 2024 at 12:23 PM Szehon Ho 
> wrote:
>
>> Jia and I will sync with the Snowflake folks to see if we can have a
>> solution, or roadmap to solution, in the proposal.
>>
>> Thanks JB for the interest!  By the way, I want to schedule a meeting to
>> go over the proposal; it seems there's good feedback from folks on the geo
>> side (and even the Parquet community), but not too many eyes/feedback from
>> other folks/PMC in the Iceberg community.  This might be due to lack of
>> familiarity/time to read through it all.  In fact, a lot of the advanced
>> discussions like this one are for Phase 2 items, and Phase 1 items are
>> relatively straightforward, so I wanted to explain that.  As I know it's
>> summer vacation for some folks, we can do this in a week or early July;
>> hope that sounds good to everyone.
>>
>> Thanks,
>> Szehon
>>
>> On Tue, Jun 18, 2024 at 1:54 AM Jean-Baptiste Onofré 
>> wrote:
>>
>>> Hi Jia
>>>
>>> Thanks for the update. I'm gonna re-read the whole thread and document
>>> to have a better understanding.
>>>
>>> Thanks !
>>> Regards
>>> JB
>>>
>>> On Mon, Jun 17, 2024 at 7:44 PM Jia Yu  wrote:
>>>
>>>> Hi Snowflake folks,
>>>>
>>>> Please let me know if you have other questions regarding the proposal.
>>>> If any, Szehon and I can set up a zoom call with you guys to clarify some
>>>> details. We are in the Pacific time zone. If you are in Europe, maybe early
>>>> morning Pacific Time works best for you?
>>>>
>>>> Thanks,
>>>> Jia
>>>>
>>>> On Wed, Jun 5, 2024 at 6:28 PM Gang Wu  wrote:
>>>>
>>>>> > The min/max stats are discussed in the doc (Phase 2), depending on
>>>>> the non-trivial encoding.
>>>>>
>>>>> Just want to add that min/max stats filtering could be supported by
>>>>> file format natively. Adding geometry type to parquet spec
>>>>> is under discussion: https://github.com/apache/parquet-format/pull/240
>>>>>
>>>>> Best

Re: Welcoming Yan Yan as a new committer!

2021-03-24 Thread Szehon Ho
Nice, congratulations!

> On 24 Mar 2021, at 11:37, Marton Bod  wrote:
> 
> Congratulations, well done!
> 
> On Wed, 24 Mar 2021 at 11:32, Peter Vary  wrote:
> Congratulations Yan!
> 
>> On Mar 24, 2021, at 05:43, Yufei Gu wrote:
>> 
>> Congratulations, Yan!
>> 
>> Best,
>> 
>> Yufei
>> 
>> `This is not a contribution`
>> 
>> 
>> On Tue, Mar 23, 2021 at 8:44 PM Russell Spitzer wrote:
>> Congratulations!
>> 
>>> On Mar 23, 2021, at 9:35 PM, OpenInx wrote:
>>> 
>>> Congrats Yan!  You deserve it.
>>> 
>>> On Wed, Mar 24, 2021 at 7:18 AM Miao Wang wrote:
>>> Congrats @Yan Yan!
>>> 
>>> Miao
>>> 
>>> From: Ryan Blue 
>>> Reply-To: dev@iceberg.apache.org
>>> Date: Tuesday, March 23, 2021 at 3:43 PM
>>> To: Iceberg Dev List
>>> Subject: Welcoming Yan Yan as a new committer!
>>> 
>>>  
>>> 
>>> Hi everyone,
>>> 
>>> I'd like to welcome Yan Yan as a new Iceberg committer.
>>> 
>>> Thanks for all your contributions, Yan!
>>> 
>>> rb
>>> 
>>>  
>>> 
>>> --
>>> 
>>> Ryan Blue
>>> 
>> 
> 



Re: Welcoming Ryan Murray as a new committer!

2021-03-29 Thread Szehon Ho
That’s awesome, great work Ryan. 

Szehon

> On 29 Mar 2021, at 18:08, Anton Okolnychyi  
> wrote:
> 
> Hey folks,
> 
> I’d like to welcome Ryan Murray as a new committer to the project!
> 
> Thanks for all the hard work, Ryan!
> 
> - Anton



Re: Welcoming Russell Spitzer as a new committer

2021-03-29 Thread Szehon Ho
Awesome, well-deserved, Russell!

Szehon

> On 29 Mar 2021, at 18:10, Holden Karau  wrote:
> 
> Congratulations Russel!
> 
> On Mon, Mar 29, 2021 at 9:10 AM Anton Okolnychyi 
>  wrote:
> Hey folks,
> 
> I’d like to welcome Russell Spitzer as a new committer to the project!
> 
> Thanks for all your contributions, Russell!
> 
> - Anton
> -- 
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Spark configuration on hive catalog

2021-04-22 Thread Szehon Ho
Hi Huadong, nice to see you again :).  The syntax in spark-sql is ‘insert into
<catalog>.<db>.<table> …’; here you defined your db as a catalog.

You just need to define one catalog and use it when referring to your table.
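
For example, a minimal setup could look like the sketch below ("my_catalog"
and the metastore URI are placeholders here, not something from your
environment):

spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.1 \
  --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.my_catalog.type=hive \
  --conf spark.sql.catalog.my_catalog.uri=thrift://your-metastore-host:9083

spark-sql> INSERT INTO my_catalog.my_db.my_table VALUES ("111", timestamp 'today', 1);
spark-sql> SELECT * FROM my_catalog.my_db.my_table;

Your second invocation likely fails because "my_db" was registered as a
catalog, so Spark resolves "my_db.my_table" as catalog "my_db" plus table
"my_table" with no database in between.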



> On 22 Apr 2021, at 07:34, Huadong Liu  wrote:
> 
> Hello Iceberg Dev,
> 
> I am not sure I follow the discussion on Spark configurations on hive 
> catalogs. I 
> created an Iceberg table with the Hive catalog.
> Configuration conf = new Configuration();
> conf.set("hive.metastore.uris", args[0]);
> conf.set("hive.metastore.warehouse.dir", args[1]);
> 
> HiveCatalog catalog = new HiveCatalog(conf);
> ImmutableMap<String, String> meta = ImmutableMap.of(...);
> Schema schema = new Schema(...);
> PartitionSpec spec = PartitionSpec.builderFor(schema)...build();
> 
> TableIdentifier name = TableIdentifier.of("my_db", "my_table");
> Table table = catalog.createTable(name, schema, spec);
> On a box with hive.metastore.uris set correctly in hive-site.xml, spark-sql 
> runs fine with 
> 
> spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.1
> --conf 
> spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
> --conf spark.sql.catalog.spark_catalog.type=hive
> spark-sql> INSERT INTO my_db.my_table VALUES ("111", timestamp 'today', 1), 
> ("333", timestamp 'today', 3);
> spark-sql> SELECT * FROM my_db.my_table ;
> 
> However, if I follow the Spark hive configuration above to add a table 
> catalog,
> 
> spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.1
> --conf 
> spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
> --conf spark.sql.catalog.spark_catalog.type=hive 
> --conf spark.sql.catalog.my_db=org.apache.iceberg.spark.SparkCatalog 
> --conf spark.sql.catalog.my_db.type=hive
> spark-sql> INSERT INTO my_db.my_table VALUES ("111", timestamp 'today', 1), 
> ("333", timestamp 'today', 3);
> Error in query: Table not found: my_db.my_table;
> 
> https://iceberg.apache.org/spark/#reading-an-iceberg-table states that "To 
> use Iceberg in Spark, first configure Spark catalogs." Did I misunderstand 
> anything? Do I have to configure catalog/namespace? Thanks for your time on 
> this.
> 
> --
> Huadong



Re: Welcoming OpenInx as a new PMC member!

2021-06-29 Thread Szehon Ho
Congrats Zheng!

> On 29 Jun 2021, at 14:02, Anton Okolnychyi  
> wrote:
> 
> Well deserved! Congrats!
> 
>> On 29 Jun 2021, at 13:56, Jack Ye wrote:
>> 
>> Congratulations!!!
>> 
>> On Tue, Jun 29, 2021 at 1:55 PM Ryan Murray wrote:
>> Congrats!!
>> 
>> On Tue, Jun 29, 2021 at 10:53 PM Russell Spitzer wrote:
>> Congratulations!
>> 
>> > On Jun 29, 2021, at 3:52 PM, Ryan Blue wrote:
>> > 
>> > Hi everyone,
>> > 
>> > I'd like to welcome OpenInx (Zheng Hu) as a new Iceberg PMC member.
>> > 
>> > Thanks for all your contributions and commitment to the project, OpenInx!
>> > 
>> > Ryan
>> > 
>> > -- 
>> > Ryan Blue
>> 
> 



Re: Welcoming Jack Ye as a new committer!

2021-07-05 Thread Szehon Ho
Congratulations Jack!

> On 5 Jul 2021, at 16:53, Jun H.  wrote:
> 
> Congratulations!
> 
> 
>> On Jul 5, 2021, at 4:14 PM, Russell Spitzer  
>> wrote:
>> 
>> 
>> Congratulations!
>> 
>> On Mon, Jul 5, 2021 at 3:21 PM karuppayya wrote:
>> Congratulations Jack!
>> 
>> On Mon, Jul 5, 2021 at 1:14 PM Yufei Gu wrote:
>> Congratulations, Jack! Thanks for the contribution!
>> 
>> Best,
>> 
>> Yufei
>> 
>> 
>> On Mon, Jul 5, 2021 at 1:09 PM John Zhuge wrote:
>> Congratulations Jack!
>> 
>> On Mon, Jul 5, 2021 at 12:57 PM Marton Bod  wrote:
>> Congrats Jack!
>> 
>> On Mon, Jul 5, 2021 at 9:54 PM Wing Yew Poon  
>> wrote:
>> Congratulations Jack!
>> 
>> 
>> On Mon, Jul 5, 2021 at 11:35 AM Ryan Blue wrote:
>> Hi everyone,
>> 
>> I'd like to welcome Jack Ye as a new Iceberg committer.
>> 
>> Thanks for all your contributions, Jack!
>> 
>> Ryan
>> 
>> -- 
>> Ryan Blue
>> -- 
>> John Zhuge



Re: Iceberg 0.12.0 Release Plan

2021-07-19 Thread Szehon Ho
Hi Carl,

For the Issue: https://github.com/apache/iceberg/issues/2783

The status is: I gave it a bit of a try but couldn’t find an easy fix, so I’m
hoping someone more knowledgeable about this code has cycles to take a look
at it.

It would be great to fix it for 0.12 as it seems to block more metadata
queries than before, but for timing purposes I’m not sure if it’s feasible.

Thanks
Szehon

On Mon, Jul 19, 2021 at 2:19 PM Carl Steinbach  wrote:

> Hi Everyone,
>
> Currently, there are three issues blocking the release of 0.12.0:
>
>
>1. #2308 Handle the case that RewriteFiles and RowDelta commit the
>transaction at the same time
><https://github.com/apache/iceberg/issues/2308>
>2. #2783 Metadata Table Empty Projection - Unknown type for int field.
>Type name: java.lang.string
><https://github.com/apache/iceberg/issues/2783>
>3. #2284 Core: reassign the partition field IDs and reuse any existing
>ID <https://github.com/apache/iceberg/pull/2284>s
>
> #2284 is in review.
>
> Ryan said he would take a look at #2308.
>
> @Szehon Ho , can you please confirm whether or not
> you're working on #2783?
>
> Thanks.
>
> - Carl
>
>
>
> On Mon, Jul 19, 2021 at 12:31 PM Jack Ye  wrote:
>
>> I haven't heard any news for the 0.12.0 release since then, are we still
>> planning for the release?
>>
>> Please let me know if there is anything we can do to help speed up the
>> process. (I just saw the release board, will try to at least review those
>> PRs)
>>
>> Best,
>> Jack Ye
>>
>> On Mon, Jul 12, 2021 at 5:41 PM Sreeram Garlapati <
>> gsreeramku...@gmail.com> wrote:
>>
>>> Great, thanks Ryan.
>>>
>>> On Mon, Jul 12, 2021 at 5:17 PM Ryan Blue  wrote:
>>>
>>>> Sreeram, I was just waiting for tests to pass on that PR. I just merged
>>>> it.
>>>>
>>>> On Mon, Jul 12, 2021 at 4:41 PM Sreeram Garlapati <
>>>> gsreeramku...@gmail.com> wrote:
>>>>
>>>>> Hi Carl,
>>>>>
>>>>> Thanks a lot for managing 0.12.0 release.
>>>>>
>>>>> Can you also pl. add this PR:
>>>>> https://github.com/apache/iceberg/pull/2752 - which adds the option "
>>>>> streaming-skip-delete-snapshots" - to Spark3 micro_batch reader.
>>>>> Without this, streaming reads will fail if a snapshot of type delete or
>>>>> replace is encountered, & is pretty much unusable. This PR is already
>>>>> approved by multiple Committers - Ryan and Russell.
>>>>>
>>>>> PS: I am unsure if new PRs will be merged apart from the list proposed
>>>>> on the project board - into the *0.12.0* release, and hence,
>>>>> proposing this. If this PR will be merged - no action is needed. pl. 
>>>>> pardon
>>>>> my ignorance.
>>>>>
>>>>> Best regards,
>>>>> Sreeram
>>>>>
>>>>> On Mon, Jul 12, 2021 at 4:14 PM Carl Steinbach  wrote:
>>>>>
>>>>>> Hi Grant,
>>>>>>
>>>>>> Good catch! I added PR-1648
>>>>>> <https://github.com/apache/iceberg/pull/1648> to the 0.12.0 project
>>>>>> board.
>>>>>>
>>>>>> - Carl
>>>>>>
>>>>>> On Mon, Jul 12, 2021 at 1:16 PM Grant Nicholas 
>>>>>> wrote:
>>>>>>
>>>>>>> Howdy! Any updates on PR-1648
>>>>>>> <https://github.com/apache/iceberg/pull/1648> which upgrades the
>>>>>>> avro version used in iceberg? I do not see it in the "To-Do" section 
>>>>>>> linked
>>>>>>> above, but the older avro version has caused major problems described in
>>>>>>> this issue and it would be nice to get in 0.12.0.
>>>>>>> https://github.com/apache/iceberg/issues/1654
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jul 12, 2021 at 2:39 PM Carl Steinbach 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Everyone,
>>>>>>>>
>>>>>>>>
>>>>>>>> I volunteered to be the release manager for the 0.12.0 release. My
>>>>>>>> goal is to start cutting release candidates this Friday, 7/17 at 24:00 
>>>>>>>> PST.
>>>>&

Serializable isolation for insert overwrites?

2021-07-20 Thread Szehon Ho
Hi,

Does anyone know if it's feasible to consider making Spark's "insert
overwrite" implement a serializable transaction, like delete, update, merge?

Maybe at least for "overwrite by filter", then it can narrow down the
conflict checks needed on the commitWithSerializableTransaction side.  I
don't have the full context on the Spark side on whether it's feasible to do
the rewrite as Delete/Merge/Update does, to use this mechanism.

It's for a use case like "insert overwrite into table foo partition
(date=...) select ... from foo", which I understand is not the common use
case for insert overwrite, as it's usually a select from another table.

Thanks in advance,
Szehon


Re: Serializable isolation for insert overwrites?

2021-07-20 Thread Szehon Ho
Thanks Ryan for the confirmation, I'm definitely interested to take a
look.  if it can be done, the serializable isolation level could probably
be an option as for the other operations.  I will look a bit and ping you
when I get a chance.
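
For the archives, a rough sketch of the direction (hypothetical only: the
validateAddedDataFiles-style checks live on the internal
MergingSnapshotProducer base class today, so the validation hooks below would
first need to be exposed on the public ReplacePartitions API; "catalog" and
"newDataFiles" are placeholders):

import org.apache.iceberg.DataFile;
import org.apache.iceberg.ReplacePartitions;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;

Table table = catalog.loadTable(TableIdentifier.of("db", "foo"));
long baseSnapshotId = table.currentSnapshot().snapshotId();

ReplacePartitions overwrite = table.newReplacePartitions();
for (DataFile file : newDataFiles) {
  overwrite.addFile(file);
}
// Fail the commit if another transaction added files to the replaced
// partitions after the base snapshot, instead of silently clobbering them.
overwrite.validateFromSnapshot(baseSnapshotId);  // hypothetical hook
overwrite.validateNoConflictingData();           // hypothetical hook
overwrite.commit();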

Szehon

On Tue, Jul 20, 2021 at 5:11 PM Ryan Blue  wrote:

> Szehon,
>
> We implemented the current behavior because that’s what was expected for 
> INSERT
> OVERWRITE. But the ReplacePartitions operation uses the same base class
> as the expression overwrite, so you could add more validation, including
> the conflict checks that you’re talking about by calling the
> validateAddedDataFiles helper method
> <https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java#L249-L250>
> on the base class.
>
> If you want to implement this, ping me on Slack and I can point you in the
> right direction.
>
> Ryan
>
> On Tue, Jul 20, 2021 at 4:20 PM Szehon Ho  wrote:
>
>> Hi,
>>
>> Does anyone know if it's feasible to consider making Spark's "insert
>> overwrite" implement a serializable transaction, like delete, update, merge?
>>
>> Maybe at least for "overwrite by filter", then it can narrow down the
>> conflict checks needed on the commitWithSerializableTransaction side.  I
>> don't have the full context on the Spark side on whether it's feasible to do
>> the rewrite as Delete/Merge/Update does, to use this mechanism.
>>
>> It's for a use case like "insert overwrite into table foo partition
>> (date=...) select ... from foo", which I understand is not the common use
>> case for insert overwrite, as it's usually a select from another table.
>>
>> Thanks in advance,
>> Szehon
>>
>>
>>
>
> --
> Ryan Blue
> Tabular
>


Re: [VOTE] Release Apache Iceberg 0.12.0 RC2

2021-08-05 Thread Szehon Ho
+1 (non-binding)

* Verify Signature Keys
* Verify Checksum
* dev/check-license
* Build
* Run tests (though some timeout failures, on Hive MR test..)
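
For anyone new to verifying release candidates, the checks above correspond
roughly to the following (file names are for this RC; the sha512 check and
extracted directory name may differ slightly by platform):

curl -O https://dist.apache.org/repos/dist/dev/iceberg/KEYS
curl -O https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.12.0-rc2/apache-iceberg-0.12.0.tar.gz
curl -O https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.12.0-rc2/apache-iceberg-0.12.0.tar.gz.asc
curl -O https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.12.0-rc2/apache-iceberg-0.12.0.tar.gz.sha512

gpg --import KEYS
gpg --verify apache-iceberg-0.12.0.tar.gz.asc
shasum -a 512 -c apache-iceberg-0.12.0.tar.gz.sha512

tar xzf apache-iceberg-0.12.0.tar.gz && cd apache-iceberg-0.12.0
dev/check-license
./gradlew build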

Thanks
Szehon

On Thu, Aug 5, 2021 at 2:23 PM Daniel Weeks  wrote:

> +1 (binding)
>
> I verified sigs/sums, license, build, and test
>
> -Dan
>
> On Wed, Aug 4, 2021 at 2:53 PM Ryan Murray  wrote:
>
>> After some wrestling w/ Spark I discovered that the problem was with my
>> test. Some SparkSession apis changed. so all good here now.
>>
>> +1 (non-binding)
>>
>> On Wed, Aug 4, 2021 at 11:29 PM Ryan Murray  wrote:
>>
>>> Thanks for the help Carl, got it sorted out. The gpg check now works.
>>> For those who were interested I used a canned wget command in my history
>>> and it pulled the RC0 :-)
>>>
>>> Will have a PR to fix the Nessie Catalog soon.
>>>
>>> Best,
>>> Ryan
>>>
>>> On Wed, Aug 4, 2021 at 9:21 PM Carl Steinbach 
>>> wrote:
>>>
 Hi Ryan,

 Can you please run the following command to see which keys in your
 public keyring are associated with my UID?

 % gpg  --list-keys c...@apache.org
 pub   rsa4096/5A5C7F6EB9542945 2021-07-01 [SC]
   160F51BE45616B94103ED24D5A5C7F6EB9542945
 uid [ultimate] Carl W. Steinbach (CODE SIGNING KEY) <
 c...@apache.org>
 sub   rsa4096/4158EB8A4F03D2AA 2021-07-01 [E]

 Thanks.

 - Carl

 On Wed, Aug 4, 2021 at 11:12 AM Ryan Murray  wrote:

> Hi all,
>
> Unfortunately I have to give -1
>
> I had trouble w/ the keys:
>
> gpg: assuming signed data in 'apache-iceberg-0.12.0.tar.gz'
> gpg: Signature made Mon 02 Aug 2021 03:36:30 CEST
> gpg:using RSA key
> FAFEB6EAA60C95E2BB5E26F01FF0803CB78D539F
> gpg: Can't check signature: No public key
>
> And I have discovered a bug in NessieCatalog. It is unclear what is
> wrong but the NessieCatalog doesn't play nice w/ Spark3.1. I will raise a
> patch ASAP to fix it. Very sorry for the inconvenience.
>
> Best,
> Ryan
>
> On Wed, Aug 4, 2021 at 3:20 AM Carl Steinbach  wrote:
>
>> Hi everyone,
>>
>> I propose that we release RC2 as the official Apache Iceberg 0.12.0
>> release. Please note that RC0 and RC1 were DOA.
>>
>> The commit id for RC2 is 7c2fcfd893ab71bee41242b46e894e6187340070
>> * This corresponds to the tag: apache-iceberg-0.12.0-rc2
>> * https://github.com/apache/iceberg/commits/apache-iceberg-0.12.0-rc2
>> *
>> https://github.com/apache/iceberg/tree/7c2fcfd893ab71bee41242b46e894e6187340070
>>
>> The release tarball, signature, and checksums are here:
>> *
>> https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.12.0-rc2/
>>
>> You can find the KEYS file here:
>> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>>
>> Convenience binary artifacts are staged in Nexus. The Maven
>> repository URL is:
>> *
>> https://repository.apache.org/content/repositories/orgapacheiceberg-1017/
>>
>> Please download, verify, and test.
>>
>> Please vote in the next 72 hours.
>>
>> [ ] +1 Release this as Apache Iceberg 0.12.0
>> [ ] +0
>> [ ] -1 Do not release this because...
>>
>


Re: [VOTE] Release Apache Iceberg 0.12.0 RC2

2021-08-09 Thread Szehon Ho
If it’s easy, would it make sense to include Russell’s fix as well for the 
metadata tables query, as it affects Spark 3.1 (a regression from Spark 3.0)?  
https://github.com/apache/iceberg/pull/2877/files

The issue: https://github.com/apache/iceberg/issues/2783 was at some point 
marked for 0.12 release.  I had mentioned it’s ok to remove, if it takes too 
long to fix, and now it is indeed fixed.

Thanks,
Szehon

 

> On 9 Aug 2021, at 11:36, Ryan Blue  wrote:
> 
> Thanks for pointing that one out, Jack! That would be good to get in as well.
> 
> On Mon, Aug 9, 2021 at 11:02 AM Jack Ye wrote:
> If we are considering recutting the branch, please also include this PR 
> https://github.com/apache/iceberg/pull/2943 which fixes the validation when 
> creating a schema with identifier fields, thank you!
> 
> -Jack Ye
> 
> On Mon, Aug 9, 2021 at 9:08 AM Wing Yew Poon  
> wrote:
> Ryan,
> Thanks for the review. Let me look into implementing your refactoring 
> suggestion.
> - Wing Yew
> 
> 
> On Mon, Aug 9, 2021 at 8:41 AM Ryan Blue wrote:
> Yeah, I agree. We should fix this for the 0.12.0 release. That said, I plan 
> to continue testing this RC because it won't change that much since this 
> affects the Spark extensions in 3.1. Other engines and Spark 3.0 or older 
> should be fine.
> 
> I left a comment on the PR. I think it looks good, but we should try to 
> refactor to make sure we don't have more issues like this. I think when we 
> update our extensions to be compatible with multiple Spark versions, we 
> should introduce a factory method to create the Catalyst plan node and use 
> that everywhere. That will hopefully cut down on the number of times this 
> happens.
> 
> Thank you, Wing Yew!
> 
> On Sun, Aug 8, 2021 at 2:52 PM Carl Steinbach wrote:
> Hi Wing Yew,
> 
> I will create a new RC once this patch is committed.
> 
> Thanks.
> 
> - Carl
> 
> On Sat, Aug 7, 2021 at 4:29 PM Wing Yew Poon  
> wrote:
> Sorry to bring this up so late, but this just came up: there is a Spark 3.1 
> (runtime) compatibility issue (not found by existing tests), which I have a 
> fix for in https://github.com/apache/iceberg/pull/2954 
> <https://github.com/apache/iceberg/pull/2954>. I think it would be really 
> helpful if it can go into 0.12.0.
> - Wing Yew
> 
> 
> On Fri, Aug 6, 2021 at 11:36 AM Jack Ye wrote:
> +1 (non-binding)
> 
> Verified release test and AWS integration test, issue found in test but not 
> blocking for release (https://github.com/apache/iceberg/pull/2948)
> 
> Verified Spark 3.1 and 3.0 operations and new SQL extensions and procedures 
> on EMR.
> 
> Thanks,
> Jack Ye
> 
> On Fri, Aug 6, 2021 at 1:19 AM Kyle Bendickson wrote:
> +1 (binding)
> 
> I verified:
>  - KEYS signature & checksum
>  - ./gradlew clean build (tests, etc) 
>  - Ran Spark jobs on Kubernetes after building from the tarball at  
> https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.12.0-rc2/
>  - Spark 3.1.1 batch jobs against both Hadoop and Hive tables, using HMS 
> for Hive catalog
>  - Verified default FileIO and S3FileIO
>  - Basic read and writes
>  - Jobs using Spark procedures (remove unreachable files)
>  - Special mention: verified that Spark catalogs can override hadoop 
> configurations using configs prefixed with 
> "spark.sql.catalog.(catalog-name).hadoop."
>  - one of my contributions to this release that has been asked about by 
> several customers internally
>  - tested using `spark.sql.catalog.(catalog-name).hadoop.fs.s3a.impl` for 
> two catalogs, both values respected as opposed to the default globally 
> configured value
> 
> Thank you Carl!
> 
> - Kyle, Data OSS Dev @ Apple =)
> 
> On Thu, Aug 5, 2021 at 11:49 PM Szehon Ho wrote:
> +1 (non-binding)
> 
> * Verify Signature Keys
> * Verify Checksum
> * dev/check-license
> * Build
> * Run tests (though some timeout failures, on Hive MR test..) 
> 
> Thanks
> Szehon
> 
> On Thu, Aug 5, 2021 at 2:23 PM Daniel Weeks wrote:
> +1 (binding)
> 
> I verified sigs/sums, license, build, and test
> 
> -Dan
> 
> On Wed, Aug 4, 2021 at 2:53 PM Ryan Murray wrote:
> Aft

Re: [VOTE] Release Apache Iceberg 0.12.0 RC2

2021-08-09 Thread Szehon Ho
Got it, I somehow thought changes were manually cherry-picked. Thanks for the
clarification.

Thanks
Szehon

> On 9 Aug 2021, at 13:34, Ryan Blue  wrote:
> 
> Szehon, I think that should make it because the RC will come from master.
> 
> On Mon, Aug 9, 2021 at 12:56 PM Szehon Ho  wrote:
> If it’s easy, would it make sense to include Russell’s fix as well for 
> Metadata tables query , as it affects Spark 3.1 (a regression from Spark 
> 3.0)?  https://github.com/apache/iceberg/pull/2877/files 
> 3.0)?  https://github.com/apache/iceberg/pull/2877/files
> The issue : https://github.com/apache/iceberg/issues/2783 
> <https://github.com/apache/iceberg/issues/2783> was at some point marked for 
> 0.12 release.  I had mentioned it’s ok to remove, if it takes too long to 
> fix, and now it is indeed fixed.
> 
> Thanks,
> Szehon
> 
>  
> 
>> On 9 Aug 2021, at 11:36, Ryan Blue wrote:
>> 
>> Thanks for pointing that one out, Jack! That would be good to get in as well.
>> 
>> On Mon, Aug 9, 2021 at 11:02 AM Jack Ye wrote:
>> If we are considering recutting the branch, please also include this PR 
>> https://github.com/apache/iceberg/pull/2943 which fixes the validation 
>> when creating a schema with identifier fields, thank you!
>> 
>> -Jack Ye
>> 
>> On Mon, Aug 9, 2021 at 9:08 AM Wing Yew Poon wrote:
>> Ryan,
>> Thanks for the review. Let me look into implementing your refactoring 
>> suggestion.
>> - Wing Yew
>> 
>> 
>> On Mon, Aug 9, 2021 at 8:41 AM Ryan Blue wrote:
>> Yeah, I agree. We should fix this for the 0.12.0 release. That said, I plan 
>> to continue testing this RC because it won't change that much since this 
>> affects the Spark extensions in 3.1. Other engines and Spark 3.0 or older 
>> should be fine.
>> 
>> I left a comment on the PR. I think it looks good, but we should try to 
>> refactor to make sure we don't have more issues like this. I think when we 
>> update our extensions to be compatible with multiple Spark versions, we 
>> should introduce a factory method to create the Catalyst plan node and use 
>> that everywhere. That will hopefully cut down on the number of times this 
>> happens.
>> 
>> Thank you, Wing Yew!
>> 
>> On Sun, Aug 8, 2021 at 2:52 PM Carl Steinbach wrote:
>> Hi Wing Yew,
>> 
>> I will create a new RC once this patch is committed.
>> 
>> Thanks.
>> 
>> - Carl
>> 
>> On Sat, Aug 7, 2021 at 4:29 PM Wing Yew Poon wrote:
>> Sorry to bring this up so late, but this just came up: there is a Spark 3.1 
>> (runtime) compatibility issue (not found by existing tests), which I have a 
>> fix for in https://github.com/apache/iceberg/pull/2954. I think it would be really 
>> helpful if it can go into 0.12.0.
>> - Wing Yew
>> 
>> 
>> On Fri, Aug 6, 2021 at 11:36 AM Jack Ye wrote:
>> +1 (non-binding)
>> 
>> Verified release test and AWS integration test, issue found in test but not 
>> blocking for release (https://github.com/apache/iceberg/pull/2948)
>> 
>> Verified Spark 3.1 and 3.0 operations and new SQL extensions and procedures 
>> on EMR.
>> 
>> Thanks,
>> Jack Ye
>> 
>> On Fri, Aug 6, 2021 at 1:19 AM Kyle Bendickson > <mailto:kjbendick...@gmail.com>> wrote:
>> +1 (binding)
>> 
>> I verified:
>>  - KEYS signature & checksum
>>  - ./gradlew clean build (tests, etc) 
>>  - Ran Spark jobs on Kubernetes after building from the tarball at  
>> https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.12.0-rc2/
>>  - Spark 3.1.1 batch jobs against both Hadoop and Hive tables, using HMS 
>> for Hive catalog
>>  - Verified default FileIO and S3FileIO
>>  - Basic read and writes
>>  - Jobs using Spark procedures (remove unreachable files)
>>  - Special mention: verified that Spark catalogs can override hadoop 
>> configurations using configs prefixed with 
>> "spark.sql.catalog.(catalog-name).hadoop."
>>  - o

Re: Subject: [VOTE] Release Apache Iceberg 0.12.0 RC3

2021-08-10 Thread Szehon Ho
+1 (non binding)

* Checked Signature Keys
* Verified Checksum
* Rat checks
* Built and ran tests; most functionality passes (also timeout errors on
Hive MR)

Thanks
Szehon

On Tue, Aug 10, 2021 at 1:40 AM Ryan Murray  wrote:

> +1 (non-binding)
>
> * Verify Signature Keys
> * Verify Checksum
> * dev/check-license
> * Build
> * Run tests (though some timeout failures, on Hive MR test..)
> * ran with Nessie in spark 3.1 and 3.0
>
> On Tue, Aug 10, 2021 at 4:21 AM Carl Steinbach  wrote:
>
>> Hi Everyone,
>>
>> I propose the following RC to be released as the official Apache Iceberg
>> 0.12.0 release.
>>
>> The commit ID is 7ca1044655694dbbab660d02cef360ac1925f1c2
>> * This corresponds to the tag: apache-iceberg-0.12.0-rc3
>> * https://github.com/apache/iceberg/commits/apache-iceberg-0.12.0-rc3
>> *
>> https://github.com/apache/iceberg/tree/7ca1044655694dbbab660d02cef360ac1925f1c2
>>
>> The release tarball, signature, and checksums are here:
>> *
>> https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.12.0-rc3/
>>
>> You can find the KEYS file here:
>> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>>
>> Convenience binary artifacts are staged in Nexus. The Maven repository
>> URL is:
>> *
>> https://repository.apache.org/content/repositories/orgapacheiceberg-1018/
>>
>> Please download, verify, and test.
>>
>> Please vote in the next 72 hours.
>>
>> [ ] +1 Release this as Apache Iceberg 0.12.0
>> [ ] +0
>> [ ] -1 Do not release this because...
>>
>


Re: Iceberg python library sync

2021-08-12 Thread Szehon Ho
+1, would love to listen in as well

Thanks,
Szehon

> On 12 Aug 2021, at 12:48, Arthur Wiedmer  
> wrote:
> 
> Hi Jun,
> 
> Please add me as well!
> 
> Best,
> Arthur
> 
> 
> 
> On Thu, Aug 12, 2021 at 12:19 AM Jun H. wrote:
> Hi everyone,
> 
> Since early this year, we have started working on the iceberg python library 
> to bring it up to date and support the new V2 spec. Here is a summary of the 
> current feature plan. We have a lot of interesting work to do.
> 
> To keep the community in sync, we plan to set up a recurring iceberg python 
> library sync meeting. Please let me know if you are interested or have any 
> questions.
> 
> Thanks.
> 
> Jun
> 



Re: [DISCUSS] Iceberg roadmap

2021-09-10 Thread Szehon Ho
Hi

I also missed the last sync, and wanted to add two things if possible.

Thanks,
Szehon

Priority 2:

   - Core: Predicate pushdown for remaining Metadata tables [medium]
   - Core/Spark: Support serializable isolation for ReplacePartitions /
   Insert Overwrite [medium]


On Fri, Sep 10, 2021 at 4:40 PM Steven Wu  wrote:

> I would like to add an item
>
> Priority 2:
> Flink: FLIP-27 based Iceberg source [large]
>
> On Fri, Sep 10, 2021 at 2:38 PM Ryan Blue  wrote:
>
>> Hi everyone,
>>
>> At the last sync meeting, we brought up publishing a community roadmap
>> and brainstormed the many features and initiatives that the community is
>> working on. In this thread, I want to make sure that we have a good list of
>> what people are thinking about and I think we should try to categorize the
>> projects by size and general priority. When we reach a rough agreement,
>> I’ll write this up and post it on the ASF site along with links to some
>> projects in Github.
>>
>> My rationale for attempting to prioritize projects is that if we try to
>> do too many things, it will be slower progress across everything rather
>> than getting a few important items done. I know that priorities don’t align
>> very cleanly in practice, but it is hopefully worth trying. To come up with
>> a priority, I’m trying to keep top priority items to a minimum by including
>> only one from each group (Spark, Flink, Python, etc.). The remaining items
>> are split between priority 2 and 3. Priority 3 is not urgent, including
>> things that can be plugged in (like other IO libraries), docs, etc.
>> Everything else is priority 2.
>>
>> That something isn’t priority 1 doesn’t mean it isn’t important or
>> progressing, just that it isn’t the current focus. I think of it this way:
>> if someone has extra time to review something, what should be next? That’s
>> top priority.
>>
>> Here’s my rough categorization. If you disagree, please speak up:
>>
>>- If you think that something should be top priority, what gets moved
>>to priority 2?
>>- Should the priority for a project in 2 or 3 change?
>>- Is the S/M/L size of a project wrong?
>>
>> Top priority, 1:
>>
>>- API: Iceberg 1.0 [medium]
>>- Spark: Merge-on-read plans [large]
>>- Maintenance: Delete file compaction [medium]
>>-
>>
>>Flink: Upgrade to 1.13.2 (document compatibility) [medium]
>>-
>>
>>Python: Pythonic refactor [medium]
>>
>> Priority 2:
>>
>>- ORC: Support delete files stored as ORC [small]
>>- Spark: DSv2 streaming improvements [small]
>>- Flink: Inline file compaction [small]
>>- Flink: Support UPSERT [small]
>>- Views: Spec [medium]
>>- Spec: Z-ordering / Space-filling curves [medium]
>>- Spec: Snapshot tagging and branching [small]
>>- Spec: Secondary indexes [large]
>>- Spec v3: Encryption [large]
>>-
>>
>>Spec v3: Relative paths [large]
>>-
>>
>>Spec v3: Default field values [medium]
>>
>> Priority 3:
>>
>>- Docs: versioned docs [medium]
>>- IO: Support Aliyun OSS/DLF [medium]
>>- IO: Support Dell ECS [medium]
>>
>> External:
>>
>>- Trino: Bucketed joins [small]
>>- Trino: Row-level delete support [medium]
>>- Trino: Merge-on-read plans [medium]
>>- Trino: Multi-catalog support [small]
>>
>> --
>> Ryan Blue
>> Tabular
>>
>


Re: Welcome new PMC members!

2021-11-18 Thread Szehon Ho
Awesome, congratulations Jack and Russell!

> On 18 Nov 2021, at 09:30, Ryan Murray  wrote:
> 
> Congratulations both! Well deserved!
> 
> On Thu, 18 Nov 2021, 09:19, Omar Al-Safi wrote:
> Congrats both of you!
> 
> On Thu, Nov 18, 2021 at 8:31 AM Eduard Tudenhoefner wrote:
> Congrats Jack and Russell! Very well deserved.
> 
> On Thu, Nov 18, 2021, 01:12 Ryan Blue wrote:
> Hi everyone, I want to welcome Jack Ye and Russell Spitzer to the Iceberg 
> PMC. They've both been amazing at reviewing and helping people in the 
> community and the PMC has decided to invite them to join. Congratulations, 
> Jack and Russell! Thank you for all your hard work and support for the 
> project.
> 
> Ryan
> 
> -- 
> Ryan Blue



Re: Number of entries in manifest-list

2022-01-07 Thread Szehon Ho
Hi,

The manifest entries are one per data file or delete file, so it depends on
how many data files/delete files your table has.  The number of files is
controlled mostly by the parallelism of the job that writes the table, though
there are Iceberg RewriteDataFiles utilities that can compact as well (as in
your link).
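
Depending on your version, compaction is also exposed as a Spark stored
procedure; here is a sketch with placeholder catalog/table names:

CALL my_catalog.system.rewrite_data_files(table => 'db.my_table')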

The number of manifest files is another topic, controlled by
"commit.manifest.target-size-bytes"
(but should not affect the number of total manifest entries).

Hope that helps,
Szehon

On Fri, Jan 7, 2022 at 9:39 AM g. g. grey  wrote:

> Hi folks,
>
> I am just getting started with Iceberg and I'm trying to build up some
> intuition for how large the metadata will become for large, active tables.
> Specifically, what is the order of magnitude of manifest entries that I
> should reasonably expect in a manifest-list file? Is there a particular
> range that is ideal and aimed for when cleaning up/maintaining a table?
>
> I found the maintenance page,
> but I'm hoping to find rules-of-thumb based on peoples' experience with
> using iceberg.
>
> Thanks! If I've missed the info somewhere, a simple pointer would be great.
> ggg
>


Re: Number of entries in manifest-list

2022-01-07 Thread Szehon Ho
Sure, I guessed you were asking about the number of manifest files rather
than entries.  There's always a tradeoff, some aspects being:

   - More manifest files => better predicate pushdown (skip more manifest
   files during query), and less chance of a concurrency conflict (which is two
   transactions trying to modify the same manifest file, which leads to retry).
   - Fewer manifest files => metadata queries (like show partitions) can be
   faster.

Each of these is a large topic itself that might be too big to go into here
:)

For us, we find the benefit of more manifest files is not as important as
making metadata queries fast for our users.  So we have tuned
commit.manifest.target-size-bytes to be a few times the default.  We
try to keep the manifest file count to tens or hundreds for any table;
we find if there are thousands, then a 'show partitions' query takes a long
time.

We do need to run periodic RewriteManifests to keep the table in this shape
(as we have too many commits), and we also use
'commit.manifest.min-count-to-merge' and 'commit.manifest-merge.enabled' to
merge manifests on commit.
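
As a concrete sketch of that tuning (the numbers are illustrative only, not a
recommendation; catalog/table names are placeholders):

ALTER TABLE my_catalog.db.my_table SET TBLPROPERTIES (
  'commit.manifest.target-size-bytes' = '33554432',  -- roughly 4x the 8 MB default
  'commit.manifest-merge.enabled' = 'true',
  'commit.manifest.min-count-to-merge' = '100'
);

-- and periodically, via the RewriteManifests action or (in newer releases)
-- the Spark procedure:
CALL my_catalog.system.rewrite_manifests('db.my_table');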

Hope that helps,
Szehon

On Fri, Jan 7, 2022 at 1:10 PM g. g. grey  wrote:

> Hi Szehon,
>
> Thanks. My apologies; I was too loose in my wording. I'll try to use the
> terms from the spec.
>
> I was asking about the number of total manifest files, specifically the
> number of `manifest_file` structs that are found in the manifest-list file.
>
> It sounds like the "commit.manifest.target-size-bytes" controls the target
> size when we merge small manifest files, which is great to know we can
> configure, as it will clearly have an impact on the number of
> `manifest_file` structs.
>
> Is there a general order-of-magnitude target number of `manifest_file`
> structs? Presumably that would dictate when one would want to merge
> manifest files and/or data files.
>
> Thanks again!
> ggg
>
>
> On Fri, Jan 7, 2022 at 11:41 AM Szehon Ho  wrote:
>
>> Hi,
>>
>> The manifest entries are one per data file or delete file, so it depends on
>> how many data files/delete files your table has.  The number of files is controlled
>> mostly by the parallelism of the job that writes the table, though there
>> are Iceberg RewriteDataFile utilities that can compact as well (as in your
>> link).
>>
>> The number of manifest files is another topic, controlled by 
>> "commit.manifest.target-size-bytes"
>> (but should not affect the number of total manifest entries).
>>
>> Hope that helps,
>> Szehon
>>
>> On Fri, Jan 7, 2022 at 9:39 AM g. g. grey  wrote:
>>
>>> Hi folks,
>>>
>>> I am just getting started with Iceberg and I'm trying to build up some
>>> intuition for how large the metadata will become for large, active tables.
>>> Specifically, what is the order of magnitude of manifest entries that I
>>> should reasonably expect in a manifest-list file? Is there a particular
>>> range that is ideal and aimed for when cleaning up/maintaining a table?
>>>
>>> I found the maintenance page <https://iceberg.apache.org/#maintenance/>,
>>> but I'm hoping to find rules-of-thumb based on peoples' experience with
>>> using iceberg.
>>>
>>> Thanks! If I've missed the info somewhere, a simple pointer would be
>>> great.
>>> ggg
>>>
>>


Re: [VOTE] Release Apache Iceberg 0.13.0 RC2

2022-01-30 Thread Szehon Ho
+1 (non-binding)

Verified signature
Verified checksum
Rat check
Built and ran tests, all succeeded, after some temporary local HMS timeouts
Tested relevant jar with Spark 3.2, created various tables and ran queries

Thanks
Szehon

On Fri, Jan 28, 2022 at 12:19 PM Russell Spitzer 
wrote:

> +1
> All tests passed for me and signatures, checksum and license all were good
> to go
>
> On Fri, Jan 28, 2022 at 12:32 PM John Zhuge  wrote:
>
>> +1 (non-binding)
>>
>> Checked signature, checksum, and license.
>> Ran build and test with OpenJDK 1.8.0_312-b07.
>>
>> Ignoring the mr test failures. Maybe my env is not set up correctly?
>>
>>- 19 in TestHiveIcebergStorageHandlerWithEngine
>>- 3 in TestHiveIcebergStorageHandlerWithMultipleCatalogs
>>
>>
>> On Fri, Jan 28, 2022 at 8:41 AM Jack Ye  wrote:
>>
>>> Hi Everyone,
>>>
>>> I propose that we release the following RC as the official Apache
>>> Iceberg 0.13.0 release.
>>>
>>> The commit ID is 72237429ba164c054480dcfbdb9fe1c86c04dcda
>>> * This corresponds to the tag: apache-iceberg-0.13.0-rc2
>>> * https://github.com/apache/iceberg/commits/apache-iceberg-0.13.0-rc2
>>> *
>>> https://github.com/apache/iceberg/tree/72237429ba164c054480dcfbdb9fe1c86c04dcda
>>>
>>> The release tarball, signature, and checksums are here:
>>> *
>>> https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.13.0-rc2
>>>
>>> You can find the KEYS file here:
>>> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>>>
>>> Convenience binary artifacts are staged on Nexus. The Maven repository
>>> URL is:
>>> *
>>> https://repository.apache.org/content/repositories/orgapacheiceberg-1080/
>>>
>>> Please download, verify, and test.
>>>
>>> Please vote in the next 72 hours.
>>>
>>> [ ] +1 Release this as Apache Iceberg 0.13.0
>>> [ ] +0
>>> [ ] -1 Do not release this because...
>>>
>>
>>
>> --
>> John Zhuge
>>
>


Re: Getting last modified timestamp/other stats per partition

2022-02-23 Thread Szehon Ho
Hi

Probably the metadata tables can help with this.

For the size/num_rows of partitions, you can query the files table,
https://iceberg.apache.org/docs/latest/spark-queries/#files.  (Because
Iceberg keeps stats for files, and not necessarily partitions).

SELECT partition, sum(file_size_in_bytes), sum(record_count) from
$my_table.files f GROUP BY f.partition

This will be compressed size (again Iceberg keeps file-level stats and so
not sure if there are any stats for uncompressed sizes.)

For the last modified time, it will be slightly harder.  The file's
physical modified time is not good enough because it's not exactly when it
is 'committed' into Iceberg.   You may have to try a more advanced query on
the snapshots table and manifest-entries table:
https://iceberg.apache.org/docs/latest/spark-queries/#snapshots

SELECT MAX(s.committed_at), e.data_file.partition FROM $my_table.snapshots s
JOIN $my_table.entries e ON s.snapshot_id = e.snapshot_id GROUP BY
e.data_file.partition

Hope that helps,
Szehon

On Wed, Feb 23, 2022 at 8:50 AM Mayur Srivastava <
mayur.srivast...@twosigma.com> wrote:

> Hi,
>
>
>
> In Iceberg, is there a way to get the last modified timestamp and other
> stats (e.g. num rows, uncompressed size, compressed size) of the data per
> partition?
>
>
>
> Thanks,
>
> Mayur
>
>
>


Re: Getting last modified timestamp/other stats per partition

2022-03-07 Thread Szehon Ho
>
> 2.   How can we distinguish between snapshots where new data was
> added vs snapshots where compaction was done?
>

Yea, to answer the second question, I forgot to mention there is a field on
the Manifest Entries table called 'status' that you can filter on.  It might
not be documented as it's a bit more advanced/internal, but the values are
listed here:
https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/ManifestEntry.java#L30

So you would want to add the filter (e.status = 1) to the query, which is
ADDED, if you only care about when the file is first added.
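
Putting it together with the earlier query, something like this sketch:

SELECT MAX(s.committed_at), e.data_file.partition
FROM $my_table.snapshots s
JOIN $my_table.entries e ON s.snapshot_id = e.snapshot_id
WHERE e.status = 1  -- 1 = ADDED (0 = EXISTING, 2 = DELETED)
GROUP BY e.data_file.partition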

Thanks
Szehon

On Mon, Mar 7, 2022 at 9:33 AM Ryan Blue  wrote:

> Mayur,
>
> This is one of the reasons why we want to introduce tagging in the format.
> That will allow you to tag snapshots that you want to keep and expire
> intermediate versions.
>
> In general, there is some cost to keeping thousands of snapshots. Those
> are held in the metadata file that gets written each commit, so you end up
> writing a fairly large file. If your commits are infrequent it doesn't
> generally make a difference. But if you have commits every minute or so it
> can get in the way.
>
> Tagging will reduce the problem, and moving to change-based commits with
> the REST catalog should also help in the long term.
>
> Ryan
>
> On Mon, Mar 7, 2022 at 8:18 AM Mayur Srivastava <
> mayur.srivast...@twosigma.com> wrote:
>
>> A few follow-up questions for getting last modified time for each
>> partition:
>>
>>
>>
>> 1.   If we want to use snapshots, does this mean we will have to
>> maintain full history of snapshots? E.g. if we partition by method=‘day’
>> and write once a day for a few years, we will end up in maintaining 1000s
>> of snapshots. How does a long history of snapshots affect metadata size,
>> commit performance, etc.? We intend to experiment with this but I’m curious
>> to know if there’s already some recommendation on the amount of history for
>> snapshots.
>>
>> 2.   How can we distinguish between snapshots where new data was
>> added vs snapshots where compaction was done?
>>
>>
>>
>> Thanks,
>>
>> Mayur
>>
>>
>>
>> *From:* Mayur Srivastava 
>> *Sent:* Thursday, February 24, 2022 7:27 AM
>> *To:* dev@iceberg.apache.org
>> *Subject:* RE: Getting last modified timestamp/other stats per partition
>>
>>
>>
>> Thanks Szehon. I’ll give this a try.
>>
>>
>>
>> *From:* Szehon Ho 
>> *Sent:* Wednesday, February 23, 2022 1:38 PM
>> *To:* Iceberg Dev List 
>> *Subject:* Re: Getting last modified timestamp/other stats per partition
>>
>>
>>
>> Hi
>>
>>
>>
>> Probably the metadata tables can help with this.
>>
>>
>>
>> For the size/num_rows of partitions, you can query the files table,
>> https://iceberg.apache.org/docs/latest/spark-queries/#files.  (Because
>> Iceberg keeps stats for files, and not necessarily partitions).
>>
>>
>>
>> SELECT partition, sum(file_size_in_bytes), sum(record_count) from
>> $my_table.files f GROUP BY f.partition
>>
>>
>>
>> This will be compressed size (again Iceberg keeps file-level stats and so
>> not sure if there are any stats for uncompressed sizes.)
>>
>>
>>
>> For the last modified time, it will be slightly harder.  The file's
>> physical modified time is not good enough because it's not exactly when it
>> is 'committed' into Iceberg.   You may have to try a more advanced query on
>> the snapshots table and manifest-entries table:
>> https://iceberg.apache.org/docs/latest/spark-queries/#snapshots
>>
>>
>>
>> SELECT MAX(s.committed_at), e.data_file.partition FROM $my_table.snapshots
>> s JOIN $my_table.entries e ON s.snapshot_id = e.snapshot_id GROUP BY
>> e.data_file.partition
>>
>>
>>
>> Hope that helps,
>>
>> Szehon
>>
>>
>>
>> On Wed, Feb 23, 2022 at 8:50 AM Mayur Srivastava <
>> mayur.srivast...@twosigma.com> wrote:
>>
>> Hi,
>>
>>
>>
>> In Iceberg, is there a way to get the last modified timestamp and other
>> stats (e.g. num rows, uncompressed size, compressed size) of the data per
>> partition?
>>
>>
>>
>> Thanks,
>>
>> Mayur
>>
>>
>>
>>
>
> --
> Ryan Blue
> Tabular
>


Re: Welcome Szehon Ho as a committer!

2022-03-11 Thread Szehon Ho
Thanks everyone, I’m very honoured.  It’s so great to see Iceberg growing with 
so much excitement and activity from everybody.

Szehon

> On 11 Mar 2022, at 16:36, Micah Kornfield  wrote:
> 
> Congrats!
> 
> On Friday, March 11, 2022, liwei li wrote:
> Congratulations Szehon!
> 
> On Sat, Mar 12, 2022 at 08:10, Kyle Bendickson wrote:
> Congratulations, Szehon!
> 
> Well deserved!
> 
> On Fri, Mar 11, 2022 at 4:06 PM Steven Wu wrote:
> Congrat, Szehon!
> 
> On Fri, Mar 11, 2022 at 4:05 PM Chao Sun wrote:
> Congratulations Szehon!
> 
> On Fri, Mar 11, 2022 at 4:01 PM OpenInx wrote:
> >
> > Congrats Szehon!
> >
> > On Sat, Mar 12, 2022 at 7:55 AM Steve Zhang wrote:
> >>
> >> Congratulations Szehon, Well done!
> >>
> >> Thanks,
> >> Steve Zhang
> >>
> >>
> >>
> >> On Mar 11, 2022, at 3:51 PM, Jack Ye wrote:
> >>
> >> Congratulations Szehon!!
> >>
> >> -Jack
> >>
> >> On Fri, Mar 11, 2022 at 3:45 PM Wing Yew Poon wrote:
> >>>
> >>> Congratulations Szehon!
> >>>
> >>>
> >>> On Fri, Mar 11, 2022 at 3:42 PM Sam Redai wrote:
> >>>>
> >>>> Congrats Szehon!
> >>>>
> >>>> On Fri, Mar 11, 2022 at 6:41 PM Yufei Gu wrote:
> >>>>>
> >>>>> Congratulations Szehon!
> >>>>> Best,
> >>>>>
> >>>>> Yufei
> >>>>>
> >>>>> `This is not a contribution`
> >>>>>
> >>>>>
> >>>>> On Fri, Mar 11, 2022 at 3:36 PM Ryan Blue wrote:
> >>>>>>
> >>>>>> Congratulations Szehon!
> >>>>>>
> >>>>>> Sorry I accidentally preempted this announcement with the board report!
> >>>>>>
> >>>>>> On Fri, Mar 11, 2022 at 3:32 PM Anton Okolnychyi wrote:
> >>>>>>>
> >>>>>>> Hey everyone,
> >>>>>>>
> >>>>>>> I would like to welcome Szehon Ho as a new committer to the project!
> >>>>>>>
> >>>>>>> Thanks for all your work, Szehon!
> >>>>>>>
> >>>>>>> - Anton
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Ryan Blue
> >>>>>> Tabular
> >>
> >>



Re: [VOTE] Release Apache Iceberg 0.13.2 RC0

2022-05-28 Thread Szehon Ho
Hi

When I gpg-verify against KEYS I get:
gpg: Can't check signature: No public key

I imported the latest KEYS file and do see a key for:
uid   Russell Spitzer (CODE SIGNING KEY) 
sub   rsa4096 2022-05-26 [E]

but maybe no public key?  Maybe I am missing something obvious.

Also wanted to ask, can we get this one in as well:
https://github.com/apache/iceberg/pull/4720 : users cannot use projection on
the partitions table.

If so, I will make a backport PR.  Sorry, it's a regression in 0.13.2, and I
should have noticed and marked it for backport earlier.

Thanks,
Szehon


On Thu, May 26, 2022 at 12:11 AM Eduard Tudenhoefner 
wrote:

> Hi Everyone,
>
> I propose that we release the following RC as the official Apache Iceberg
> 0.13.2 release.
> The commit ID is *f7fd013645823911da116770362463d9df1a54ae*
>
> * This corresponds to the tag: *apache-iceberg-0.13.2-rc0*
> * https://github.com/apache/iceberg/commits/apache-iceberg-0.13.2-rc0
> *
> https://github.com/apache/iceberg/tree/f7fd013645823911da116770362463d9df1a54ae
>
> The release tarball, signature, and checksums are here:
> * https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.13.2-rc0
>
> You can find the KEYS file here:
> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>
> Convenience binary artifacts are staged on Nexus. The Maven repository URL
> is:
> *
> https://repository.apache.org/content/repositories/orgapacheiceberg-1085/
>
> Please download, verify, and test.
>
> Please vote in the next 72 hours.
>
> [ ] +1 Release this as Apache Iceberg 
> [ ] +0
> [ ] -1 Do not release this because...
>


Re: [VOTE] Release Apache Iceberg 0.13.2 RC0

2022-05-29 Thread Szehon Ho
Yea it does, I ran:

curl -O https://dist.apache.org/repos/dist/dev/iceberg/KEYS
gpg --import KEYS
gpg --verify apache-iceberg-0.13.2.tar.gz.asc

>> gpg: assuming signed data in 'apache-iceberg-0.13.2.tar.gz'
>> gpg: Signature made Wed May 25 18:11:35 2022 PDT
>> gpg:using EDDSA key
0E8C186954C8D6EDB0E769414AF56D520D263914
>> gpg: Can't check signature: No public key

I wonder if I missed something.
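
In case anyone else hits this, one way to rule out a stale local keyring is
to pull the signing key directly from a keyserver and re-verify (the
keyserver choice here is arbitrary):

gpg --keyserver hkps://keyserver.ubuntu.com --recv-keys 0E8C186954C8D6EDB0E769414AF56D520D263914
gpg --verify apache-iceberg-0.13.2.tar.gz.asc apache-iceberg-0.13.2.tar.gz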




On Sun, May 29, 2022 at 11:25 AM Russell Spitzer 
wrote:

> Do you see this at the bottom of your KEY download?
>
> pub   rsa4096 2022-05-26 [SC]
>   E2D007FD1D743D900417C433F1EB837ECD365E04
> uid   [ultimate] Russell Spitzer (CODE SIGNING KEY) 
> 
> sig 3F1EB837ECD365E04 2022-05-26  Russell Spitzer (CODE SIGNING KEY) 
> 
> sub   rsa4096 2022-05-26 [E]
> sig  F1EB837ECD365E04 2022-05-26  Russell Spitzer (CODE SIGNING KEY) 
> 
>
> -BEGIN PGP PUBLIC KEY BLOCK-
>
> mQINBGKOyGoBEADL2+IijVhSlCliDOSFTmsQqQ+pIVFjdZj1IVr9WtCXtOl+6/EX
> m7VIbzFl4DAQ9C0/BT7SiLXmnO0RMIRv+nVPm3i7zAB9UnYuEw8aZR9KQzPySt7b
> E7ZvAB453bOuJGOsNn4fO66OL/r1Oq6S7sJ0xXeVv1uZdWdoeVFIbNdRIIyx8BOx
> BqtqQnC3ORygDw6Pquy7EX3KkrV6xxiXaK4mFo1ETDKzyg17IOmKnmQ9WkCXEwmr
> iVSbYCYuIwgXYAE6274o9Eol+H8xacvYY6hg/S/oU36ttgawv8clQ/AcqEUvS9C5
> 3cVdrwN+DxtfuqiKmnVodsdpjDEB3wbdGN7lEikYxzFeVaKYdJHRqTuOp/RCERdb
> g9vGMgR6Smo1qXVYvf76Kczf6F+kmqRd7BRy7xWZPhcMJ7yEgHzz0pes2TomA+67
> 7/enlwPxDO186OZBKuUlS4A6i1964IpCgwToNYFouzSbU4BirhglzDXzaeGwye4H
> OeuvlEFNdI20fmSF8XDyQ7WmKVwktrNTHaiLRXtYWUFRxjeAvDRbf9Tqq2AHTrQ7
> fGI7vDgEKHMZqW7WUJ/mHdklZLtopPKRtzGdYJjVt1qdJMFfYnAffXsmTc6FoDOr
> F7q+E1pwrRlKASvdaCcGShG0EvIL+dWl9a1Z1/YvA3WH+NxNflstnSXbtQARAQAB
> tD5SdXNzZWxsIFNwaXR6ZXIgKENPREUgU0lHTklORyBLRVkpIDxydXNzZWxsc3Bp
> dHplckBhcGFjaGUub3JnPokCUQQTAQgAOxYhBOLQB/0ddD2QBBfEM/Hrg37NNl4E
> BQJijshqAhsDBQsJCAcCAiICBhUKCQgLAgQWAgMBAh4HAheAAAoJEPHrg37NNl4E
> ipUP/02iLEbQDLdRCosheyvFP27Xi05BV37HO7xjG5FXMRrERhD5WG597YshHNxy
> IOPvyc+tdhRdLpFb+hgm4RRnGGm0Zjc0fT13d1nnQ0y+JnG8Dy/uVI76/vdJppNs
> NADuv99g0ckR+61Nj25+nP86zBDiIKy/iIdNj6WgTIibB0kuAavbj7ffYJ10Cj+P
> kb6Y09qzn/8lM/ScmDy90I8TbPFnAkg5wBGG+Qn8T/x4WQaSyBSb6stYWIB3O4Vu
> 5GVxRQmYNErKtnAcOzzWYMQAXJno3rKt6cYi8NPllzsf4h5uLgR6B95lVL3Ib1aA
> 2gITVGvK0nDpevJSXEcmQvWNaBYSbcNrLZHR1OsY7XgaHXndbPGdBZBb59huLC0i
> VFXkXXOZFNl/qiEo8w4/AoB0H1JinUy7Z0DZwpELt3854nWLn3PN34T6ePJccR6G
> AZMML3R6O0qv6dea41b/GNBEgiO4A4FR2lvX7NmcjltsDvquh/5S4CBJb3HF5EvW
> zVSi2Tnda+qesPimHCybykyb0e129io4uYfW11OKB1cey5e68j/XZQGwHjggqThO
> 0ww4A4jP8T06iiysNFuKt3oUNLq57Dcstg4+1Ax+/7jUtKDXF/bMXRKk167wZ+ma
> wDr/8B4z/u8c8mHxyN1O5/FnK5weJwjfzz8fq95f0YVbbnCSuQINBGKOyGoBEAC6
> q0Fy2oF8x1huRcHS/BiddbOLHGo9VEmD/MGAVVgC586t14zhIEb+CvGYTVVjvpm3
> /AX8pyEXRe1g1ZarrKLAKyvArISy+loqaPnAgF01LEUzwMa1rgEpzg8994I4R/+c
> wKv2eznH2702HjItbKVsKXtU7fwLA917DBuJqL/8HwcoTB/jQnsQq/R9UjSYnBcb
> 2l5SdcdySZla6Qp2toeic96vfI3u9Jal1cm4wYYM3gpAaX31ObNOnN9PTiVkL7UW
> le9qAwCEZao8jxzjqJvQeqQDgd+xokKwlgMGkcf9hhsyX47qVEGnpvmI/T3TlNqb
> nRMM/o0nayAfbBvOaswLpf+UM4uxpmmBkJIHrsv6ae5u2OBp5ehC+myeaLo4gZZh
> gXlE6kBoGQi/saZN0L4PFeNS5gmegLn4fvQqYv1vJitXyEPgfrjRNTZr4E1QTjwy
> vwCf+UjLqk7h3QjHwqzFbEq4MPA6qrmq6xAQsYGg0+UzCaOqiDspn0N8fELBC695
> CQ6qniOoAQe2RIv8mRtVHEBdCtASX+O7C0J0dVJvPXH+CVe2R/FqJxsDvMNySCbz
> LXupgSOfzsftPB+6qd4s3751g4sfQfOch31Rwk1K2N3SqpjypNO/6bYKnKehc4/1
> kjBOQlDLsv6A2nejxk8RD4oQNn1unm+rwHFzA6vC1QARAQABiQI2BBgBCAAgFiEE
> 4tAH/R10PZAEF8Qz8euDfs02XgQFAmKOyGoCGwwACgkQ8euDfs02XgSkMw//WdZ+
> ozufoPOdLsCWkvNZu0goC8ukeKFTwTnX6SdIsRNS2wqbcYnHsDlZcNNwGXOirqmM
> 1gYt6+rUC/IXJ/v6ApGf6UYup/c+w0iqzTnyZNS49sPFgKmegQrijKVi9We+hA4T
> o8omjozR2i5J0mZy0wgZ0PN31M/NekSwEYI1AepjsExcMSIuvYpgXToK4wDh8qZG
> YLNrd1yq3UhIwkMLVr8myg/yt1TjE24upyLE3NvTMbW88tm+UjVu6J3r0rIFVpn+
> 7xt0T6U0XhM+c44F6LR+GDLiRY7l1OfVlvARohICWgk5zR1cPwKXAeUxYA36zETC
> G23daNm7HmaaEzbRDVcvShn+WyZLfh/iSAu2cz+GwYMrXyBx4K2mmE5mGofJPjpZ
> W1Kz2F9oYm2Ev6HleOgWOaebL09eSwKderQrYKRht/xrnuaDsL60LttiS4UuR4xY
> zdd+xI6Vx9XFjmbDL24k7fQ8wNy3rhmRyrQRFYYMNLNoH29eRMO3CkcbMB3MNvME
> mk43A19XoEEHQ52Tv7+aTVzrS5QjzKvcQ+62eEE15k0XH39/ZCYPikR8XEqs0YkO
> wdFeyrBN22jtT48jMJ4IFw4odabqOqBn6Wazx3tBg0ZMTxn/i2H4tHpe78RIj/7Z
> 7eLhkMY0meA64TMBCc0aS3ffCnJzetWOSpgjv9o=
> =gy3b
> -END PGP PUBLIC KEY BLOCK-
>
>
>
> On May 28, 2022, at 2:04 PM, Szehon Ho  wrote:
>
> Hi
>
> When I gpg-verify against KEYS I get:
> gpg: Can't check signature: No public key
>
> I imported latest keys and do see key for :
> uid   Russell Spitzer (CODE SIGNING KEY) <
> russellspit...@apache.org>
> sub   rsa4096 2022-05-26 [E]
>
> but maybe no public key?  Maybe I am missing something obvious.
>
> Also wanted to ask, can we get this one in as well:
> https://github.com/apache/iceberg/pull/4720 : users cannot use projection
> on the partitions table.
>
> If so

Re: [VOTE] Release Apache Iceberg 0.13.2 RC0

2022-05-29 Thread Szehon Ho
On the other topic, the pr for 0.13 branch is merged:
https://github.com/apache/iceberg/pull/4890, my preference will be to
include this in new RC to solve the aforementioned issue :
https://github.com/apache/iceberg/issues/4718.

Thanks,
Szehon

On Sun, May 29, 2022 at 2:59 PM Szehon Ho  wrote:

> Yea it does, I ran:
>
> curl -O https://dist.apache.org/repos/dist/dev/iceberg/KEYS
> gpg --import KEYS
> gpg --verify apache-iceberg-0.13.2.tar.gz.asc
>
> >> gpg: assuming signed data in 'apache-iceberg-0.13.2.tar.gz'
> >> gpg: Signature made Wed May 25 18:11:35 2022 PDT
> >> gpg:using EDDSA key
> 0E8C186954C8D6EDB0E769414AF56D520D263914
> >> gpg: Can't check signature: No public key
>
> I wonder if I missed something.
>
>
>
>
> On Sun, May 29, 2022 at 11:25 AM Russell Spitzer <
> russell.spit...@gmail.com> wrote:
>
>> Do you see this at the bottom of your KEY download?
>>
>> pub   rsa4096 2022-05-26 [SC]
>>   E2D007FD1D743D900417C433F1EB837ECD365E04
>> uid   [ultimate] Russell Spitzer (CODE SIGNING KEY) 
>> 
>> sig 3F1EB837ECD365E04 2022-05-26  Russell Spitzer (CODE SIGNING KEY) 
>> 
>> sub   rsa4096 2022-05-26 [E]
>> sig  F1EB837ECD365E04 2022-05-26  Russell Spitzer (CODE SIGNING KEY) 
>> 
>>
>> -BEGIN PGP PUBLIC KEY BLOCK-
>>
>> mQINBGKOyGoBEADL2+IijVhSlCliDOSFTmsQqQ+pIVFjdZj1IVr9WtCXtOl+6/EX
>> m7VIbzFl4DAQ9C0/BT7SiLXmnO0RMIRv+nVPm3i7zAB9UnYuEw8aZR9KQzPySt7b
>> E7ZvAB453bOuJGOsNn4fO66OL/r1Oq6S7sJ0xXeVv1uZdWdoeVFIbNdRIIyx8BOx
>> BqtqQnC3ORygDw6Pquy7EX3KkrV6xxiXaK4mFo1ETDKzyg17IOmKnmQ9WkCXEwmr
>> iVSbYCYuIwgXYAE6274o9Eol+H8xacvYY6hg/S/oU36ttgawv8clQ/AcqEUvS9C5
>> 3cVdrwN+DxtfuqiKmnVodsdpjDEB3wbdGN7lEikYxzFeVaKYdJHRqTuOp/RCERdb
>> g9vGMgR6Smo1qXVYvf76Kczf6F+kmqRd7BRy7xWZPhcMJ7yEgHzz0pes2TomA+67
>> 7/enlwPxDO186OZBKuUlS4A6i1964IpCgwToNYFouzSbU4BirhglzDXzaeGwye4H
>> OeuvlEFNdI20fmSF8XDyQ7WmKVwktrNTHaiLRXtYWUFRxjeAvDRbf9Tqq2AHTrQ7
>> fGI7vDgEKHMZqW7WUJ/mHdklZLtopPKRtzGdYJjVt1qdJMFfYnAffXsmTc6FoDOr
>> F7q+E1pwrRlKASvdaCcGShG0EvIL+dWl9a1Z1/YvA3WH+NxNflstnSXbtQARAQAB
>> tD5SdXNzZWxsIFNwaXR6ZXIgKENPREUgU0lHTklORyBLRVkpIDxydXNzZWxsc3Bp
>> dHplckBhcGFjaGUub3JnPokCUQQTAQgAOxYhBOLQB/0ddD2QBBfEM/Hrg37NNl4E
>> BQJijshqAhsDBQsJCAcCAiICBhUKCQgLAgQWAgMBAh4HAheAAAoJEPHrg37NNl4E
>> ipUP/02iLEbQDLdRCosheyvFP27Xi05BV37HO7xjG5FXMRrERhD5WG597YshHNxy
>> IOPvyc+tdhRdLpFb+hgm4RRnGGm0Zjc0fT13d1nnQ0y+JnG8Dy/uVI76/vdJppNs
>> NADuv99g0ckR+61Nj25+nP86zBDiIKy/iIdNj6WgTIibB0kuAavbj7ffYJ10Cj+P
>> kb6Y09qzn/8lM/ScmDy90I8TbPFnAkg5wBGG+Qn8T/x4WQaSyBSb6stYWIB3O4Vu
>> 5GVxRQmYNErKtnAcOzzWYMQAXJno3rKt6cYi8NPllzsf4h5uLgR6B95lVL3Ib1aA
>> 2gITVGvK0nDpevJSXEcmQvWNaBYSbcNrLZHR1OsY7XgaHXndbPGdBZBb59huLC0i
>> VFXkXXOZFNl/qiEo8w4/AoB0H1JinUy7Z0DZwpELt3854nWLn3PN34T6ePJccR6G
>> AZMML3R6O0qv6dea41b/GNBEgiO4A4FR2lvX7NmcjltsDvquh/5S4CBJb3HF5EvW
>> zVSi2Tnda+qesPimHCybykyb0e129io4uYfW11OKB1cey5e68j/XZQGwHjggqThO
>> 0ww4A4jP8T06iiysNFuKt3oUNLq57Dcstg4+1Ax+/7jUtKDXF/bMXRKk167wZ+ma
>> wDr/8B4z/u8c8mHxyN1O5/FnK5weJwjfzz8fq95f0YVbbnCSuQINBGKOyGoBEAC6
>> q0Fy2oF8x1huRcHS/BiddbOLHGo9VEmD/MGAVVgC586t14zhIEb+CvGYTVVjvpm3
>> /AX8pyEXRe1g1ZarrKLAKyvArISy+loqaPnAgF01LEUzwMa1rgEpzg8994I4R/+c
>> wKv2eznH2702HjItbKVsKXtU7fwLA917DBuJqL/8HwcoTB/jQnsQq/R9UjSYnBcb
>> 2l5SdcdySZla6Qp2toeic96vfI3u9Jal1cm4wYYM3gpAaX31ObNOnN9PTiVkL7UW
>> le9qAwCEZao8jxzjqJvQeqQDgd+xokKwlgMGkcf9hhsyX47qVEGnpvmI/T3TlNqb
>> nRMM/o0nayAfbBvOaswLpf+UM4uxpmmBkJIHrsv6ae5u2OBp5ehC+myeaLo4gZZh
>> gXlE6kBoGQi/saZN0L4PFeNS5gmegLn4fvQqYv1vJitXyEPgfrjRNTZr4E1QTjwy
>> vwCf+UjLqk7h3QjHwqzFbEq4MPA6qrmq6xAQsYGg0+UzCaOqiDspn0N8fELBC695
>> CQ6qniOoAQe2RIv8mRtVHEBdCtASX+O7C0J0dVJvPXH+CVe2R/FqJxsDvMNySCbz
>> LXupgSOfzsftPB+6qd4s3751g4sfQfOch31Rwk1K2N3SqpjypNO/6bYKnKehc4/1
>> kjBOQlDLsv6A2nejxk8RD4oQNn1unm+rwHFzA6vC1QARAQABiQI2BBgBCAAgFiEE
>> 4tAH/R10PZAEF8Qz8euDfs02XgQFAmKOyGoCGwwACgkQ8euDfs02XgSkMw//WdZ+
>> ozufoPOdLsCWkvNZu0goC8ukeKFTwTnX6SdIsRNS2wqbcYnHsDlZcNNwGXOirqmM
>> 1gYt6+rUC/IXJ/v6ApGf6UYup/c+w0iqzTnyZNS49sPFgKmegQrijKVi9We+hA4T
>> o8omjozR2i5J0mZy0wgZ0PN31M/NekSwEYI1AepjsExcMSIuvYpgXToK4wDh8qZG
>> YLNrd1yq3UhIwkMLVr8myg/yt1TjE24upyLE3NvTMbW88tm+UjVu6J3r0rIFVpn+
>> 7xt0T6U0XhM+c44F6LR+GDLiRY7l1OfVlvARohICWgk5zR1cPwKXAeUxYA36zETC
>> G23daNm7HmaaEzbRDVcvShn+WyZLfh/iSAu2cz+GwYMrXyBx4K2mmE5mGofJPjpZ
>> W1Kz2F9oYm2Ev6HleOgWOaebL09eSwKderQrYKRht/xrnuaDsL60LttiS4UuR4xY
>> zdd+xI6Vx9XFjmbDL24k7fQ8wNy3rhmRyrQRFYYMNLNoH29eRMO3CkcbMB3MNvME
>> mk43A19XoEEHQ52Tv7+aTVzrS5QjzKvcQ+62eEE15k0XH39/ZCYPikR8XEqs0YkO
>> wdFeyrBN22jtT48jMJ4IFw4odabqOqBn6Wazx3tBg0ZMTxn/i2H4tHpe78RIj/7Z
>> 7eLhkMY0meA64TMBCc0aS3ffC

Re: [VOTE] Release Apache Iceberg 0.13.2 RC1

2022-06-06 Thread Szehon Ho
+1 (non-binding)


   1. Verified signatures
   2. Verified checksums
   3. RAT checks
   4. Build and test
   5. Tested with Spark 3.2, create a table and run a few queries
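
For anyone reproducing the checks above, a rough, untested sketch of the
usual flow; the RC path is from this thread, while the file names, the
shasum invocation, and the dev/check-license helper are assumptions:

```
# Rough sketch of RC verification; adjust names as needed.
BASE=https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.13.2-rc1
curl -O "$BASE/apache-iceberg-0.13.2.tar.gz"
curl -O "$BASE/apache-iceberg-0.13.2.tar.gz.asc"
curl -O "$BASE/apache-iceberg-0.13.2.tar.gz.sha512"
gpg --verify apache-iceberg-0.13.2.tar.gz.asc          # 1. signature
shasum -a 512 -c apache-iceberg-0.13.2.tar.gz.sha512   # 2. checksum
tar xzf apache-iceberg-0.13.2.tar.gz && cd apache-iceberg-0.13.2
dev/check-license                                      # 3. RAT check (assumed helper script)
./gradlew build                                        # 4. build and test
```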

Thanks
Szehon

On Mon, Jun 6, 2022 at 10:46 AM Daniel Weeks 
wrote:

> +1 (binding)
>
> verified sigs/sums/license/build/tests
>
> As for the detached commit, I believe I commented on this in a prior
> release: the parent commit is the head of the 0.13.x branch, and the
> detached commit is just the version bump. So I'm OK with it, but it sure
> would be nice if it weren't detached.
>
> -Dan
>
> On Sun, Jun 5, 2022 at 10:27 PM Kyle Bendickson  wrote:
>
>> Update:
>>
>> Running the test suite that was (and is) failing consistently via the
>> CLI in IntelliJ instead, the issue seems to be resolved.
>> So I do think it is indeed a local JVM setup issue.
>>
>> Investigating the differences now, but the class in question is
>> *org.apache.iceberg.mr.hive.TestHiveIcebergStorageHandlerWithEngine*
>>
>> It seems to be caused by a NoClassDefFoundError, specifically for
>> org.xerial.snappy.Snappy. It also happens for ORC, but not for Parquet.
>>
>> Included is a sample output:
>> ```
>> java.lang.NoClassDefFoundError: Could not initialize class
>> org.xerial.snappy.Snappy
>> at
>> org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:99)
>> ~[snappy-java-1.1.8.jar:1.1.8]
>> at
>> org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:91)
>> ~[snappy-java-1.1.8.jar:1.1.8]
>> at
>> org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:81)
>> ~[snappy-java-1.1.8.jar:1.1.8]
>> at
>> org.apache.tez.common.TezUtils.createByteStringFromConf(TezUtils.java:81)
>> ~[tez-api-0.10.1.jar:0.10.1]
>> ```
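A quick way to test whether snappy-java can load its native library on the
local JVM in isolation is sketched below; it is untested, the jar path is
hypothetical, and a failure here (e.g. from a noexec or unwritable
java.io.tmpdir) would point at the local setup rather than at the RC:

```
# Untested sketch: exercise snappy-java's native-library loading directly.
# "Could not initialize class" usually means the static native load failed.
jshell -q --class-path snappy-java-1.1.8.jar <<'EOF'
System.out.println(org.xerial.snappy.Snappy.compress("hello".getBytes()).length);
/exit
EOF
```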
>>
>> Apologies for speaking too soon. *I'm now +0 [non-binding]*, provided we
>> fix the 0.13.x branch and the associated commit ID so it isn't in a
>> detached state. The tag *apache-iceberg-0.13.2-rc1* works just fine, but
>> the 0.13.x branch doesn't have the commit ID in question. Not sure if
>> that's a major concern or not.
>>
>> Cheers,
>> Kyle
>>
>> On Sun, Jun 5, 2022 at 11:51 AM Kyle Bendickson  wrote:
>>
>>> Thanks Eduard!
>>>
>>> I have:
>>> - verified the signature
>>> - verified the checksum in the file given as well as of the artifact
>>> - ran all unit tests on Java 11, all passed
>>> - ran all unit tests on Java 8; some hive-3 tests consistently fail (I
>>> do notice they passed on GitHub, but the same tests fail consistently
>>> despite giving the JVM more memory and checking for OOM)
>>> - ran a simple smoke test suite of CRUD on namespaces and v1 and v2
>>> tables with Spark (3.2, 3.1) and Flink (1.13 and 1.14); see the sketch
>>> after this list
>>> - ran some upsert-related tests on Flink 1.13 and 1.14 (1.12 now
>>> carries a deprecation notice)
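
A minimal sketch of the kind of Spark smoke test described above, against
the staged Spark 3.2 runtime jar; the jar name, catalog name, and warehouse
path are assumptions, not commands taken from this thread:

```
# Untested sketch of a Spark SQL smoke test with a local Hadoop catalog.
spark-sql \
  --jars iceberg-spark-runtime-3.2_2.12-0.13.2.jar \
  --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.local.type=hadoop \
  --conf spark.sql.catalog.local.warehouse=/tmp/iceberg-smoke \
  -e "CREATE TABLE local.db.t (id bigint) USING iceberg;
      INSERT INTO local.db.t VALUES (1);
      SELECT * FROM local.db.t;"
```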
>>>
>>> *Problems:*
>>> I did notice that the *given commit ID is detached (and I wasn't able
>>> to check it out).* I am running my tests by using the provided JAR with
>>> engines and then running unit tests locally for the commit just prior
>>> (with commit ID
>>> *fae977a9f0a79266a04647b0df2ab540cf0dcff4*).
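
For what it's worth, a sketch of how an RC commit that sits on no branch can
usually still be reached, via the release tag; the tag name is from this
thread, and 'origin' pointing at apache/iceberg is an assumption:

```
# Sketch: fetch and check out the RC commit through its tag.
git fetch origin tag apache-iceberg-0.13.2-rc1
git checkout apache-iceberg-0.13.2-rc1   # detached HEAD is expected for an RC tag
git log -1 --format=%H                   # should print the commit ID from the vote email
```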
>>>
>>> Not sure if this is a huge issue, but outside of this detached commit,
>>> my only concern is the failing `iceberg-hive3` tests; as they passed in
>>> CI, it's possibly an issue with my local setup.
>>>
>>> Running the hive-3 test suite alone, the same tests failed multiple
>>> times, but again this might be something to do with my computer / JVM
>>> configuration.
>>>
>>> *I am -1 (non-binding)*, primarily based on the detached commit (as I
>>> had quite a bit of trouble trying to fetch it through my normal
>>> processes) as well as the failing hive3 tests (though that's not exactly
>>> within my area of expertise).
>>>
>>> If the hive3 test failures are something that occurs only for me, then
>>> I'd be +1 once we fix the "Add version.txt" commit in branch 0.13.x such
>>> that it's present when I fetch the branch. Unfortunately, I can't clean
>>> up the release branch myself, but I'm happy to advise somebody else on
>>> doing it (if desired).
>>>
>>> The hive3 test failures for me seem to be OOM related, but I raised my
>>>
>>> Find attached a picture of the detached commit ID,
>>> *0784d64a659abd4fdaa82cdb599a250a7514facf*, per Github.
>>>
>>> [image: image.png]
>>>
>>> Example test failures
>>> org.apache.iceberg.mr.hive.TestHiveIcebergStorageHandlerWithEngine >
>>> testCBOWithSelectedColumnsOverlapJoin[fileFormat=AVRO, engine=tez,
>>> catalog=HIVE_CATALOG, isVectorized=false] FAILED
>>> java.lang.IllegalArgumentException: Failed to execute Hive query
>>> 'SELECT c.first_name, o.order_id FROM default.orders o JOIN
>>> default.customers c ON o.customer_id = c.customer_id ORDER BY o.order_id
>>> DESC': Error while processing statement: FAILED: Execution Error, return
>>> code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask
>>> at
>>> org.apache.iceberg.mr.hive.TestHiveShell.executeStatement(TestHiveShell.java:152)
>>> at
>>> 
