Re: Several flink pull requests need to get merged before the next release 0.10.0

2020-10-29 Thread Kyle Bendickson
I will go through and re-review all of these PRs over the next two days,
Zheng, to help get these merged asap.

- Kyle
@kbendick

On Tue, Oct 27, 2020 at 1:30 AM OpenInx  wrote:

> Hi Ryan
>
> Will it be the right time to release once we get PR 1477 merged ?  Do we
> have any other blockers for the coming release 0.10.0 ?
>
> Thanks.
>
> On Wed, Oct 21, 2020 at 9:13 AM Ryan Blue 
> wrote:
>
>> Hey, thanks for bringing these up. I'm planning on spending some time
>> reviewing tomorrow and I can take a look at the first two.
>>
>> I just merged the first one since it was small, thanks for the fix! Feel
>> free to ping me or other committers to review these. I do think it is
>> important to have a committer review, even if the community also has
>> positive reviews.
>>
>> rb
>>
>> On Mon, Oct 19, 2020 at 7:15 PM OpenInx  wrote:
>>
>>> Hi
>>>
>>> As we know, the next release 0.10.0 is coming. In my mind, there are
>>> several issues which should be merged as soon as possible:
>>>
>>> 1. https://github.com/apache/iceberg/pull/1477
>>>
>>> It will change the flink state design to maintain the complete data
>>> files in a manifest before the checkpoint finishes, which is good for
>>> minimizing the flink state size and improving state compatibility.
>>> (Before this change we serialized the DataFile into the flink state
>>> backend, but since the DataFile class depends on some java-serializable
>>> classes, changing those dependency classes could make the state fail to
>>> deserialize.)  Currently, I have a +1 from Steven Zhen Wu; thanks for his
>>> patient reviewing.  According to the apache rules, I need another +1 from
>>> iceberg committers. Does anyone have time to finish the review?
>>>
>>> 2. https://github.com/apache/iceberg/pull/1586
>>>
>>> This will introduce options to load an external hive-site.xml for the
>>> flink hive catalog, which is really helpful in production environments;
>>> it is not a hard change, but it still needs a review from iceberg
>>> members.  Thanks.
>>>
>>> 3. https://github.com/apache/iceberg/pull/1619
>>>
>>> We introduced another write-parallelism option for iceberg flink stream
>>> writers. Thanks to kbendick and Stevenzwu for the reviewing; it has two
>>> +1s now.  Should I merge this ?
>>>
>>>
>>> Besides the flink PRs, it would be very helpful to raise any other
>>> related issues that are blocking the release 0.10.0.  I am happy to help
>>> resolve these issues.
>>>
>>> Thanks.
>>>
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>


PR to switch to Github Actions and fix the PR Labeler

2020-10-29 Thread Kyle Bendickson
The PR Labeler has been dead for a few weeks now.

I have an open PR to address that by moving it to a github action:
https://github.com/apache/iceberg/pull/1686

I've tested in my own fork, and assuming that ASF Infra has enabled github
actions across all of the `apache` github organization, this workflow
should just work. I've tested at least 4 or 5 different cases in my
personal fork.
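
For readers unfamiliar with the setup, a minimal labeler workflow of the kind described above might look like the following. This is a generic sketch based on the public `actions/labeler` action, not the actual contents of PR #1686; the version tag, trigger, and config path are assumptions:

```yaml
# Hypothetical sketch of a PR-labeler GitHub Actions workflow using the
# public actions/labeler action; see PR #1686 for the real change.
name: "Pull Request Labeler"
on: pull_request_target

jobs:
  label:
    runs-on: ubuntu-latest
    steps:
      # Applies labels based on changed-file path rules configured
      # in .github/labeler.yml
      - uses: actions/labeler@v3
        with:
          repo-token: "${{ secrets.GITHUB_TOKEN }}"
```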

Any reviews would be greatly appreciated, so that I know whether I need to
reach out to ASF Infra.

Also, if the labeler doesn't cover something you want / that was previously
there, again, I would love reviews; or, once it's been merged, an issue can
be opened.

It would be great to have PRs labeled again. I've also opened an issue in
the Spark JIRA and on the dev list; if somebody wants to assign it to me,
I'll be sure it gets done in the next few days:
https://issues.apache.org/jira/browse/SPARK-33282

Thanks
Kyle Bendickson
Github: @kbendick


Re: Several flink pull requests need to get merged before the next release 0.10.0

2020-10-29 Thread Kyle Bendickson
Oops, spoke too soon. Looks like they've all been merged, and I approved the
last one /shrug. Sorry for the late night email response, everyone.

- Kyle

On Thu, Oct 29, 2020 at 12:22 AM Kyle Bendickson 
wrote:

> I will go through and re-review all of these PRs over the next two days,
> Zheng, to help get these merged asap.
>
> - Kyle
> @kbendick
>
> On Tue, Oct 27, 2020 at 1:30 AM OpenInx  wrote:
>
>> Hi Ryan
>>
>> Will it be the right time to release once we get PR 1477 merged ?  Do we
>> have any other blockers for the coming release 0.10.0 ?
>>
>> Thanks.
>>
>> On Wed, Oct 21, 2020 at 9:13 AM Ryan Blue 
>> wrote:
>>
>>> Hey, thanks for bringing these up. I'm planning on spending some time
>>> reviewing tomorrow and I can take a look at the first two.
>>>
>>> I just merged the first one since it was small, thanks for the fix! Feel
>>> free to ping me or other committers to review these. I do think it is
>>> important to have a committer review, even if the community also has
>>> positive reviews.
>>>
>>> rb
>>>
>>> On Mon, Oct 19, 2020 at 7:15 PM OpenInx  wrote:
>>>
>>>> Hi
>>>>
>>>> As we know, the next release 0.10.0 is coming. In my mind, there are
>>>> several issues which should be merged as soon as possible:
>>>>
>>>> 1. https://github.com/apache/iceberg/pull/1477
>>>>
>>>> It will change the flink state design to maintain the complete data
>>>> files in a manifest before the checkpoint finishes, which is good for
>>>> minimizing the flink state size and improving state compatibility.
>>>> (Before this change we serialized the DataFile into the flink state
>>>> backend, but since the DataFile class depends on some java-serializable
>>>> classes, changing those dependency classes could make the state fail to
>>>> deserialize.)  Currently, I have a +1 from Steven Zhen Wu; thanks for his
>>>> patient reviewing.  According to the apache rules, I need another +1 from
>>>> iceberg committers. Does anyone have time to finish the review?
>>>>
>>>> 2. https://github.com/apache/iceberg/pull/1586
>>>>
>>>> This will introduce options to load an external hive-site.xml for the
>>>> flink hive catalog, which is really helpful in production environments;
>>>> it is not a hard change, but it still needs a review from iceberg
>>>> members.  Thanks.
>>>>
>>>> 3. https://github.com/apache/iceberg/pull/1619
>>>>
>>>> We introduced another write-parallelism option for iceberg flink stream
>>>> writers. Thanks to kbendick and Stevenzwu for the reviewing; it has two
>>>> +1s now.  Should I merge this ?
>>>>
>>>>
>>>> Besides the flink PRs, it would be very helpful to raise any other
>>>> related issues that are blocking the release 0.10.0.  I am happy to help
>>>> resolve these issues.
>>>>
>>>> Thanks.
>>>>
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>


Re: Next community sync

2021-05-25 Thread Kyle Bendickson
Hi Ryan,

Can you please add my new work email to the community sync? kbendickson [at]
apple [dot] com

Thanks,
Kyle!


Kyle Bendickson
Software Engineer
Apple
ACS Data
One Apple Park Way,
Cupertino, CA 95014, USA
kbendick...@apple.com
This email and any attachments may be privileged and may contain confidential 
information intended only for the recipient(s) named above. Any other 
distribution, forwarding, copying or disclosure of this message is strictly 
prohibited. If you have received this email in error, please notify me 
immediately by telephone or return email, and delete this message from your 
system.


> On May 24, 2021, at 4:56 PM, Ryan Blue <b...@apache.org> wrote:
> 
> Hi everyone,
> 
> When I was out on paternity leave, I let the community syncs lapse. Sorry 
> about the gap! I've scheduled a new one for next Wednesday and we can resume 
> the old schedule (alternating time zone every 3 weeks) from there. Please add 
> topics to the agenda/notes doc!



Re: Iceberg 0.12 release ETA

2021-06-16 Thread Kyle Bendickson
Hi Justin,

Unfortunately, the sync has not been brought back to the mailing list yet…
until now!

https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit

The above is a running doc with what is discussed. I think it should work
for view / comment access (not a committer - yet :) - but I believe I have
the ability to share this link).

Notably, we've decided going forward to start using the wiki available to
us.

I believe that somebody has volunteered to spearhead the initial collection
of the wiki contents, including design docs, ongoing and past design
proposals, and other notes of importance (such as the sync meeting notes).

I have offered my assistance in helping with this endeavor, so I will be
sure that we send out the sync notes as well.

Perhaps, given that they take place online, a recording could be offered?
I'm not sure what the policies are around that, but I'm happy to help where
I can to ensure the project is meeting its goals and requirements while
remaining compliant however required.

All the best,
Kyle Bendickson, @kbendick on GitHub
OSS Developer at Apple
kjbendick...@gmail.com / kbendickson[at]
apple[dot]com

On Tue, Jun 15, 2021 at 6:14 PM Justin Mclean  wrote:

> Hi,
>
> > We haven't set a date for the 0.12 release yet, but we're going to be
> > discussing this in the Iceberg sync tomorrow if you'd like to join.
>
> Was what was discussed at the sync brought back to the mailing list?
> Remember not everyone in your community can attend synchronous meetings and
> may miss out on information discussed there.
>
> Kind Regards,
> Justin
>


Re: Iceberg 0.12 release ETA

2021-06-16 Thread Kyle Bendickson
Additionally, I don’t believe a firm date for the 0.12 release has been
brought up, as there is ongoing work on the v2 format.

Somebody please correct me if I'm wrong!

- Kyle

On Wed, Jun 16, 2021 at 2:23 AM Kyle Bendickson 
wrote:

> Hi Justin,
>
> Unfortunately, the sync has not been brought back to the mailing list yet…
> until now!
>
>
> https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit
>
> The above is a running doc with what is discussed. I think it should work
> for view / comment access (not a committer - yet :) - but I believe I have
> the ability to share this link).
>
> Notably, we've decided going forward to start using the wiki available to
> us.
>
> I believe that somebody has volunteered to spearhead the initial
> collection of the wiki contents, to include design docs, on-going and past
> design proposals, and other notes of importance (such as the sync meeting
> notes).
>
> I have offered my assistance in helping with this endeavor, so I will be
> sure that we send out the sync notes as well.
>
> Perhaps given that they take place online, a recording could be offered?
> Not sure what the policies are around that but happy to help where I can to
> ensure the project is meeting its goals and requirements while remaining
> compliant however required.
>
> All the best,
> Kyle Bendickson, @kbendick on GitHub
> OSS Developer at Apple
> kjbendick...@gmail.com / kbendickson[at]
> apple[dot]com
>
> On Tue, Jun 15, 2021 at 6:14 PM Justin Mclean  wrote:
>
>> Hi,
>>
>> > We haven't set a date for the 0.12 release yet, but we're going to be
>> > discussing this in the Iceberg sync tomorrow if you'd like to join.
>>
>> Was what was discussed at the sync brought back to the mailing list?
>> Remember not everyone in your community can attend synchronous meetings and
>> may miss out on information discussed there.
>>
>> Kind Regards,
>> Justin
>>
>


Re: iceberg code style

2021-06-16 Thread Kyle Bendickson
This would be good content for the wiki.

I believe I have some notes / screenshots for setting this up in IntelliJ
that I would be happy to share.

I’ll reach out to Carl to see how that might be possible.

I will also look for these notes on my laptop tomorrow and send them in the
thread if need be.

- Kyle

On Wed, Jun 16, 2021 at 1:37 AM Eduard Tudenhoefner 
wrote:

> Hello,
>
> There is a code formatter that can be used for IntelliJ in the
> https://github.com/apache/iceberg/tree/master/.baseline/idea folder that
> you can import.
>
>
>
> On Wed, Jun 16, 2021 at 10:23 AM 1  wrote:
>
>> Hi, all:
>>
>>   I want to work on iceberg with IDEA, so is there a code style ?
>>
>> thx
>>
>> liubo07199
>>
>>
>> 
>>
>>


Re: Iceberg 0.12 release ETA

2021-06-16 Thread Kyle Bendickson
Actually, according to the link I just sent, the ETA is currently early July
for 0.12.

- Apologies, Kyle

On Wed, Jun 16, 2021 at 2:24 AM Kyle Bendickson 
wrote:

> Additionally, I don’t believe a firm date for the 0.12 release has been
> brought up, as there is ongoing work on the v2 format.
>
> Somebody please correct me if I'm wrong!
>
> - Kyle
>
> On Wed, Jun 16, 2021 at 2:23 AM Kyle Bendickson 
> wrote:
>
>> Hi Justin,
>>
>> Unfortunately, the sync has not been brought back to the mailing list
>> yet… until now!
>>
>>
>> https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit
>>
>> The above is a running doc with what is discussed. I think it should work
>> for view / comment access (not a committer - yet :) - but I believe I have
>> the ability to share this link).
>>
>> Notably, we've decided going forward to start using the wiki available to
>> us.
>>
>> I believe that somebody has volunteered to spearhead the initial
>> collection of the wiki contents, to include design docs, on-going and past
>> design proposals, and other notes of importance (such as the sync meeting
>> notes).
>>
>> I have offered my assistance in helping with this endeavor, so I will be
>> sure that we send out the sync notes as well.
>>
>> Perhaps given that they take place online, a recording could be offered?
>> Not sure what the policies are around that but happy to help where I can to
>> ensure the project is meeting its goals and requirements while remaining
>> compliant however required.
>>
>> All the best,
>> Kyle Bendickson, @kbendick on GitHub
>> OSS Developer at Apple
>> kjbendick...@gmail.com / kbendickson[at]
>> apple[dot]com
>>
>> On Tue, Jun 15, 2021 at 6:14 PM Justin Mclean  wrote:
>>
>>> Hi,
>>>
>>> > We haven't set a date for the 0.12 release yet, but we're going to be
>>> > discussing this in the Iceberg sync tomorrow if you'd like to join.
>>>
>>> Was what was discussed at the sync brought back to the mailing list?
>>> Remember not everyone in your community can attend synchronous meetings and
>>> may miss out on information discussed there.
>>>
>>> Kind Regards,
>>> Justin
>>>
>>


Re: [VOTE] Adopt the v2 spec changes

2021-07-28 Thread Kyle Bendickson
+1 (non-binding)


Kyle Bendickson
Software Engineer
Apple
ACS Data
One Apple Park Way,
Cupertino, CA 95014, USA
kbendick...@apple.com

This email and any attachments may be privileged and may contain confidential 
information intended only for the recipient(s) named above. Any other 
distribution, forwarding, copying or disclosure of this message is strictly 
prohibited. If you have received this email in error, please notify me 
immediately by telephone or return email, and delete this message from your 
system.


> On Jul 27, 2021, at 9:58 AM, Ryan Blue  wrote:
> 
> I’d like to propose that we adopt the pending v2 spec changes as the 
> supported v2 spec. The full list of changes is documented in the v2 summary 
> section of the spec <https://iceberg.apache.org/spec/#version-2>.
> 
> The major breaking change is the addition of delete files and metadata to 
> track delete files. In addition, there are a few other minor breaking 
> changes. For example, v2 drops the block_size_in_bytes field in manifests 
> that was previously required and also omits fields in table metadata that are 
> now tracked by lists; schema is no longer written in favor of schemas. Other 
> changes are forward compatible, mostly tightening field requirements where 
> possible (e.g., schemas and current-schema-id are now required).
> 
> Adopting the changes will signal that the community intends to support the 
> current set of changes and will guarantee forward-compatibility for v2 tables 
> that implement the current v2 spec. Any new breaking changes would go into v3.
> 
> Please vote on adopting the v2 changes in the next 72 hours.
> 
> [ ] +1 Adopt the changes as v2
> [ ] +0
> [ ] -1 Do not adopt the changes, because…
> 
> -- 
> Ryan Blue



Re: [VOTE] Release Apache Iceberg 0.12.0 RC2

2021-08-06 Thread Kyle Bendickson
+1 (binding)

I verified:
 - KEYS signature & checksum
 - ./gradlew clean build (tests, etc)
 - Ran Spark jobs on Kubernetes after building from the tarball at
https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.12.0-rc2/
 - Spark 3.1.1 batch jobs against both Hadoop and Hive tables, using
HMS for Hive catalog
 - Verified default FileIO and S3FileIO
 - Basic read and writes
 - Jobs using Spark procedures (remove unreachable files)
 - Special mention: verified that Spark catalogs can override hadoop
configurations using configs prefixed with
"spark.sql.catalog.(catalog-name).hadoop."
 - one of my contributions to this release that has been asked about by
several customers internally
 - tested using `spark.sql.catalog.(catalog-name).hadoop.fs.s3a.impl`
for two catalogs, both values respected as opposed to the default globally
configured value
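
The catalog-level Hadoop override verified above can be illustrated with a spark-defaults.conf fragment like the following. The catalog names `mycat`/`othercat` and the property values are hypothetical; only the `spark.sql.catalog.(catalog-name).hadoop.` prefix is the behavior under test:

```
# Hypothetical spark-defaults.conf fragment: two Iceberg catalogs, each with
# its own fs.s3a.impl, instead of one globally configured Hadoop value.
spark.sql.catalog.mycat                        org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.mycat.type                   hive
spark.sql.catalog.mycat.hadoop.fs.s3a.impl     org.apache.hadoop.fs.s3a.S3AFileSystem
spark.sql.catalog.othercat                     org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.othercat.type                hadoop
spark.sql.catalog.othercat.warehouse           s3a://bucket/warehouse
spark.sql.catalog.othercat.hadoop.fs.s3a.impl  com.example.CustomS3AFileSystem
```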

Thank you Carl!

- Kyle, Data OSS Dev @ Apple =)

On Thu, Aug 5, 2021 at 11:49 PM Szehon Ho  wrote:

> +1 (non-binding)
>
> * Verify Signature Keys
> * Verify Checksum
> * dev/check-license
> * Build
> * Run tests (though with some timeout failures on the Hive MR tests)
>
> Thanks
> Szehon
>
> On Thu, Aug 5, 2021 at 2:23 PM Daniel Weeks  wrote:
>
>> +1 (binding)
>>
>> I verified sigs/sums, license, build, and test
>>
>> -Dan
>>
>> On Wed, Aug 4, 2021 at 2:53 PM Ryan Murray  wrote:
>>
>>> After some wrestling w/ Spark, I discovered that the problem was with my
>>> test: some SparkSession APIs changed. So all good here now.
>>>
>>> +1 (non-binding)
>>>
>>> On Wed, Aug 4, 2021 at 11:29 PM Ryan Murray  wrote:
>>>
 Thanks for the help Carl, got it sorted out. The gpg check now works.
 For those who were interested: I had used a canned wget command from my
 history, and it pulled the RC0 :-)

 Will have a PR to fix the Nessie Catalog soon.

 Best,
 Ryan

 On Wed, Aug 4, 2021 at 9:21 PM Carl Steinbach 
 wrote:

> Hi Ryan,
>
> Can you please run the following command to see which keys in your
> public keyring are associated with my UID?
>
> % gpg  --list-keys c...@apache.org
> pub   rsa4096/5A5C7F6EB9542945 2021-07-01 [SC]
>   160F51BE45616B94103ED24D5A5C7F6EB9542945
> uid [ultimate] Carl W. Steinbach (CODE SIGNING KEY) <
> c...@apache.org>
> sub   rsa4096/4158EB8A4F03D2AA 2021-07-01 [E]
>
> Thanks.
>
> - Carl
>
> On Wed, Aug 4, 2021 at 11:12 AM Ryan Murray  wrote:
>
>> Hi all,
>>
>> Unfortunately I have to give -1
>>
>> I had trouble w/ the keys:
>>
>> gpg: assuming signed data in 'apache-iceberg-0.12.0.tar.gz'
>> gpg: Signature made Mon 02 Aug 2021 03:36:30 CEST
>> gpg: using RSA key
>> FAFEB6EAA60C95E2BB5E26F01FF0803CB78D539F
>> gpg: Can't check signature: No public key
>>
>> And I have discovered a bug in NessieCatalog. It is unclear what is
>> wrong, but the NessieCatalog doesn't play nice w/ Spark 3.1. I will raise a
>> patch ASAP to fix it. Very sorry for the inconvenience.
>>
>> Best,
>> Ryan
>>
>> On Wed, Aug 4, 2021 at 3:20 AM Carl Steinbach  wrote:
>>
>>> Hi everyone,
>>>
>>> I propose that we release RC2 as the official Apache Iceberg 0.12.0
>>> release. Please note that RC0 and RC1 were DOA.
>>>
>>> The commit id for RC2 is 7c2fcfd893ab71bee41242b46e894e6187340070
>>> * This corresponds to the tag: apache-iceberg-0.12.0-rc2
>>> *
>>> https://github.com/apache/iceberg/commits/apache-iceberg-0.12.0-rc2
>>> *
>>> https://github.com/apache/iceberg/tree/7c2fcfd893ab71bee41242b46e894e6187340070
>>>
>>> The release tarball, signature, and checksums are here:
>>> *
>>> https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.12.0-rc2/
>>>
>>> You can find the KEYS file here:
>>> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>>>
>>> Convenience binary artifacts are staged in Nexus. The Maven
>>> repository URL is:
>>> *
>>> https://repository.apache.org/content/repositories/orgapacheiceberg-1017/
>>>
>>> Please download, verify, and test.
>>>
>>> Please vote in the next 72 hours.
>>>
>>> [ ] +1 Release this as Apache Iceberg 0.12.0
>>> [ ] +0
>>> [ ] -1 Do not release this because...
>>>
>>

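As a reference for other verifiers, the download-and-verify steps requested in the vote email can be sketched roughly as follows. The `.asc` and `.sha512` file names and the extracted directory name are assumed from the standard ASF release layout; adjust to the files actually present in the dist directory:

```shell
# Rough sketch of verifying the 0.12.0 RC2 artifacts; file names assumed
# from the standard ASF dist layout.
BASE=https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.12.0-rc2
curl -LO "$BASE/apache-iceberg-0.12.0.tar.gz"
curl -LO "$BASE/apache-iceberg-0.12.0.tar.gz.asc"
curl -LO "$BASE/apache-iceberg-0.12.0.tar.gz.sha512"

# Import the release managers' keys, then check signature and checksum.
curl -L https://dist.apache.org/repos/dist/dev/iceberg/KEYS | gpg --import
gpg --verify apache-iceberg-0.12.0.tar.gz.asc apache-iceberg-0.12.0.tar.gz
shasum -a 512 -c apache-iceberg-0.12.0.tar.gz.sha512

# Build and run the tests from the source tarball
# (extracted directory name assumed).
tar xzf apache-iceberg-0.12.0.tar.gz
cd apache-iceberg-0.12.0 && ./gradlew clean build
```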

Re: [DISCUSS] Iceberg roadmap

2021-09-17 Thread Kyle Bendickson
This list looks overall pretty good to me. +1

For the Flink 1.13 upgrade, I suggest we consider starting another thread for it. 
There are some open PRs, but they have outstanding questions, specifically 
whether or not to drop support for Flink 1.12. I think we can upgrade without dropping 
support for Flink 1.12, but we wouldn’t get some of the proposed benefits of 
1.13 (though that can be a follow-up task).

I’m not presently involved in the Flink Community enough to say with certainty, 
but I believe the FLIP-27 (Using the new source interface) and the Flink 1.13.2 
upgrade are orthogonal to each other and can both progress independently. But I 
would defer to Steven or anybody else who works with Flink much more often than 
I do currently.

- Kyle Bendickson

> On Sep 15, 2021, at 4:06 PM, Ryan Blue  wrote:
> 
> That sounds great, thanks for taking that on Jack!
> 
> On Wed, Sep 15, 2021 at 3:51 PM Jack Ye <yezhao...@gmail.com> wrote:
> For external Trino and PrestoDB tasks, I am thinking about creating one 
> Github project for Trino and another one for PrestoDB to manage all tasks 
> under them, adding links of issues and PRs in the other communities to track 
> progress. This is mostly to improve visibility so that people who are 
> interested can see what is going on in those 2 places.
> 
> -Jack Ye
> 
> On Wed, Sep 15, 2021 at 2:14 PM Ryan Blue <b...@tabular.io> wrote:
> Gidon, I think that the v3 part of encryption is actually documenting how it 
> works and adding it to the spec. Right now we have hooks for building some 
> encryption around it, but almost no requirements in the spec for how to use 
> it across implementations. This is fine while we're working on defining 
> encryption, but we eventually want to update the spec.
> 
> Jack, I'm happy to add the external PrestoDB items to the roadmap. I'm just 
> not quite sure what to do here since we aren't tracking them in the Iceberg 
> community ourselves. I listed those as external so that we can publish links 
> to where those are tracked in other communities. We can add as many of these 
> as we want.
> 
> Anton, I agree. The goal here is to identify the top priority items to help 
> direct review effort. We want everything to continue progressing, but I think 
> it's good to identify where we as a community want to focus review time.
> 
> Sounds like one area of uncertainty is FLIP-27 vs Flink 1.13.2. Can someone 
> summarize the status of Flink and what we need? I don't think I understand it 
> well enough to suggest which one takes priority.
> 
> Ryan
> 
> On Mon, Sep 13, 2021 at 7:54 PM Anton Okolnychyi 
>  wrote:
> The discussed roadmap makes sense to me. I think it is important to agree on 
> what we should do first as the review pool is limited. There are more and 
> more large items that are half done or half discussed. I think we better 
> focus on finishing them quickly and then move to something else as opposed to 
> making very minor progress on a number of issues.
> 
> To be clear, it is not like other things are not important or we should stop 
> their development. It is more about making sure certain high-priority 
> features for most folks in the community get enough attention.
> 
> - Anton
> 
>> On 13 Sep 2021, at 12:19, Jack Ye <yezhao...@gmail.com> wrote:
>> 
>> I'd like to also propose adding the following in the external section:
>> 1. the PrestoDB equivalent for each item listed for Trino. I am not sure 
>> what's the best way to track them, but I feel it's better to list and track 
>> them separately. I have talked with related people currently maintaining the 
>> PrestoDB Iceberg connector (mostly in Twitter), and they would like to take 
>> a different route from Trino to fully remove Hive dependencies in the 
>> connector. This means the 2 connectors will likely diverge in implementation 
>> in the near future.
>> 2. adding a medium item for Trino and PrestoDB Avro support
>> 3. adding a small item for Trino and PrestoDB full system table support (the 
>> system table schema in them are diverging from core, and missing a few 
>> latest system tables)
>> 
>> For the items listed with "Spec" and "Spec v3", what are the key 
>> differences? I thought we are treating any new spec changes after the format 
>> v2 vote as v3.
>> 
>> Best,
>> Jack Ye
>> 
>> On Mon, Sep 13, 2021 at 7:13 AM Gidon Gershinsky <gg5...@gmail.com> wrote:
>> Hi Ryan,
>> 
>> I just wonder if the encryption should be a Spec v3 category. We have the 
>> key_metadata fields in both data_file and manifest_f

Re: Snapshot tagging, branching and retention

2021-10-18 Thread Kyle Bendickson
Thanks for collecting these notes as well as for the proposal, Jack. Have
been traveling today so I couldn't attend.

Will be looking out for the new PR.

Best,
Kyle Bendickson (@kbendick)


On Mon, Oct 18, 2021 at 9:58 AM Jack Ye  wrote:

> Thanks to everyone who came to the meeting for the discussion. Here is the
> meeting note:
> https://docs.google.com/document/d/1yVxvgQfGDUdKsr6j60jL54LKZSUBvLy9QEQstVNrWYQ/edit#
>
> As the next step, I will proceed with implementation in the current open
> PR https://github.com/apache/iceberg/pull/3104, and also publish a new PR
> to document all the spec changes.
>
> Best,
> Jack Ye
>
> On Wed, Oct 13, 2021 at 8:23 PM Jack Ye  wrote:
>
>> Sure, I will take note and publish it to this thread.
>> -Jack
>>
>> On Wed, Oct 13, 2021 at 7:18 PM OpenInx  wrote:
>>
>>> Is it possible to maintain a meeting note for this and publish it to the
>>> mail list because I don't think everybody could attend this meeting ?
>>>
>>> Thanks.
>>>
>>> On Thu, Oct 14, 2021 at 2:00 AM Jack Ye  wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> Based on some offline discussions with different people around
>>>> availability, we will hold the meeting on Monday 10/18 9am PDT.
>>>>
>>>> Here is the meeting link: meet.google.com/ubj-kvfm-ehg
>>>>
>>>> I have added all the people in this thread to the invite. Feel free to
>>>> also forward the meeting to anyone else interested.
>>>>
>>>> Best,
>>>> Jack Ye
>>>>
>>>> On Mon, Oct 11, 2021 at 8:53 AM Eduard Tudenhoefner 
>>>> wrote:
>>>>
>>>>> Hey Jack,
>>>>>
>>>>> would this week on Wednesday work for you from 9 to 10am PDT?
>>>>>
>>>>> On Thu, Oct 7, 2021 at 7:41 PM Jack Ye  wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> We have had a few iterations of the design doc with various people,
>>>>>> thanks for all the feedback. I am thinking about a meeting to finalize 
>>>>>> the
>>>>>> design and move forward with implementation.
>>>>>>
>>>>>> Considering the various time zones, I propose we choose any time from
>>>>>> Tuesday (10/12) to Friday (10/15), 8-10am PDT, 1 hour meeting slot.
>>>>>>
>>>>>> If anyone is interested in joining, please let me know the preferred
>>>>>> time slot.
>>>>>>
>>>>>> Best,
>>>>>> Jack Ye
>>>>>>
>>>>>>
>>>>>> On Wed, Sep 15, 2021 at 11:29 PM Eduard Tudenhoefner <
>>>>>> edu...@dremio.com> wrote:
>>>>>>
>>>>>>> Nice work Jack, the proposal looks really good.
>>>>>>>
>>>>>>> On Sun, Aug 29, 2021 at 9:20 AM Jack Ye  wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> Recently I have published PR 2961 - add snapshot tags interface (
>>>>>>>> https://github.com/apache/iceberg/pull/2961) and received a lot of
>>>>>>>> great feedback. I have summarized everything in the discussions and 
>>>>>>>> put up
>>>>>>>> a design to discuss the path forward around snapshot tagging, 
>>>>>>>> branching and
>>>>>>>> retention:
>>>>>>>>
>>>>>>>>
>>>>>>>> https://docs.google.com/document/d/1PvxK_0ebEoX3s7nS6-LOJJZdBYr_olTWH9oepNUfJ-A/edit?usp=sharing
>>>>>>>>
>>>>>>>> Any feedback around the doc would be much appreciated!
>>>>>>>>
>>>>>>>> Also, to facilitate future changes in Iceberg spec, it would be
>>>>>>>> very helpful to take a look at 2597 - Core: introduce 
>>>>>>>> TableMetadataBuilder (
>>>>>>>> https://github.com/apache/iceberg/pull/2957) which would make
>>>>>>>> changing TableMetadata much simpler.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Jack Ye
>>>>>>>>
>>>>>>>


Iceberg 0.12.1 Patch Release - Call for Bug Fixes and Patches

2021-10-20 Thread Kyle Bendickson
As mentioned in today's community sync up, we're planning on releasing a
new point version of Iceberg - Apache Iceberg 0.12.1.

If there are any outstanding bugs you'd like to include fixes for or other
minor patches, please respond to this email thread letting us know.

The current list of patches to be included can be found in the milestone on
Github: https://github.com/apache/iceberg/milestone/15?closed=1

As new items are added, they will be included in the milestone.

Best,
Kyle Bendickson [ Github: @kbendick ]


Re: Iceberg 0.12.1 Patch Release - Call for Bug Fixes and Patches

2021-10-21 Thread Kyle Bendickson
Thank you everybody for the additional PRs brought up so far.

I’ve volunteered to be release manager, so I will be doing my best to go
through and ensure these are prioritized for consideration (if some are
truly new features they might need to wait for 0.13.0, but as I’m just the
release manager that will be more up to the community).

If any committers or contributors have free cycles and are willing to
review some of these PRs, that would be greatly appreciated!

- Kyle Bendickson [@kbendick]

On Thu, Oct 21, 2021 at 11:19 AM Peter Vary 
wrote:

> Just to make this clear: https://github.com/apache/iceberg/pull/3338 fixes
> the issue caused by https://github.com/apache/iceberg/pull/2565. The fix
> will make Catalogs.loadCatalog consistent with Catalogs.hiveCatalog, and
> fixes create table issues when no catalog is set in the config.
>
> On 2021. Oct 21., at 16:59, Peter Vary  wrote:
>
> I would like to have this in 0.12.1:
> https://github.com/apache/iceberg/pull/3338
>
> This breaks Hive queries if no catalog is set, but the fix still needs to be
> reviewed before merge.
>
> Thanks, Peter
>
>
> On Thu, 21 Oct 2021, 07:12 Rajarshi Sarkar,  wrote:
>
>> Hope this can get in: https://github.com/apache/iceberg/pull/3175
>>
>> Regards,
>> Rajarshi Sarkar,
>>
>>
>> On Thu, Oct 21, 2021 at 9:08 AM Cheng Pan  wrote:
>>
>>> Hope this can get in.
>>> https://github.com/apache/iceberg/pull/3203
>>>
>>> Thanks,
>>> Cheng Pan
>>>
>>>
>>> On Thu, Oct 21, 2021 at 11:34 AM Reo Lei  wrote:
>>>
>>>> Thanks Kyle for syncing this!
>>>>
>>>> I think PR#3240 should be included in this release, because in our
>>>> Dingding group we have received feedback from many flink users who
>>>> encountered this problem. I think this PR is very important and we need to
>>>> fix this problem ASAP.
>>>>
>>>> link: https://github.com/apache/iceberg/pull/3240
>>>>
>>>> BR,
>>>> Reo LEI
>>>>
>>>> Kyle Bendickson  于2021年10月21日周四 上午2:52写道:
>>>>
>>>>> As mentioned in today's community sync up, we're planning on releasing
>>>>> a new point version of Iceberg - Apache Iceberg 0.12.1.
>>>>>
>>>>> If there are any outstanding bugs you'd like to include fixes for or
>>>>> other minor patches, please respond to this email thread letting us know.
>>>>>
>>>>> The current list of patches to be included can be found in the
>>>>> milestone on Github:
>>>>> https://github.com/apache/iceberg/milestone/15?closed=1
>>>>>
>>>>> As new items are added, they will be included in the milestone.
>>>>>
>>>>> Best,
>>>>> Kyle Bendickson [ Github: @kbendick ]
>>>>>
>>>>
>


Re: Help improve Iceberg community meeting experience

2021-10-23 Thread Kyle Bendickson
+1 for the suggestion Jack. The time limit has definitely been a point of
pain at times. Also, if somebody takes a week or two off of work, it can
be really easy to miss things.

+1 for volunteering to help make this happen Sam! Please let me know if I
can help in any way!

I wonder if we can also centralize a list of upcoming meetings somewhere
for easier checking / access? It can be easy to miss something if your
inbox gets full rather quickly.

- Kyle (@kbendick)

On Fri, Oct 22, 2021 at 2:37 PM Wing Yew Poon 
wrote:

> I have no concerns with Tabular hosting and recording the meetings. I'm in
> favor of having the meetings recorded and the recordings available.
> - Wing Yew
>
>
> On Fri, Oct 22, 2021 at 1:59 PM John Zhuge  wrote:
>
>> +1
>>
>> It will be great to catch up on the meetings missed.
>>
>> On Fri, Oct 22, 2021 at 12:16 PM Yufei Gu  wrote:
>>
>>> +1 for recording the meetings. It's especially valuable for the design
>>> discussions. I'd suggest adding something like "this meeting will be
>>> recorded" to the event when people send out the invitation.
>>>
>>> Best,
>>>
>>> Yufei
>>>
>>> `This is not a contribution`
>>>
>>>
>>> On Fri, Oct 22, 2021 at 12:05 PM Sam Redai  wrote:
>>>
 Thanks for raising this Jack! If it's ok with everyone, we can host and
 record the google meets via our Tabular account. I'll volunteer to set up
 and maintain this as well as uploading the recordings.

 -Sam

 On Fri, Oct 22, 2021 at 11:54 AM Jack Ye  wrote:

> Hi everyone,
>
> Recently we have been hosting an increasing number of meetings for
> design discussions and community syncs as more people are getting
> interested in Iceberg and start to contribute exciting features. Right now
> we are relying on individuals to send out meeting invites using free apps,
> but we are restricted by the limitations of those apps.
>
> I wonder if there is a way for us to leverage a paid remote meeting
> service through Apache foundation or any other organization who is willing
> to sponsor such community meetings. This would allow us to have the
> following benefits:
>
> 1. We are no longer restricted by the 1 hour time limit for Google
> Meet (40 min for Zoom).
>
> 2. We can record the entire meeting and publish it to sites like
> YouTube, so that people who cannot join the meeting can review the entire
> content instead of just meeting note summary.
>
> This would be very beneficial given the fact that we have pretty big
> communities in US, Europe and Asia time zones, and most meetings can only
> satisfy 2 time zones at best.
>
> I have asked AWS internally but we can only offer free use of AWS
> Chime, which is not a very popular choice and would probably result in
> fewer people joining the meetings.
>
> Any thoughts around this area?
>
> Best,
> Jack Ye
>
>
>
>
>>
>> --
>> John Zhuge
>>
>


Re: Iceberg 0.12.1 Patch Release - Call for Bug Fixes and Patches

2021-10-27 Thread Kyle Bendickson
or this Hive fix?
>
> On Wed, Oct 27, 2021 at 3:17 AM OpenInx  wrote:
>
>> I think we will need to fix this critical iceberg bug before we release
>> the 0.12.1: https://github.com/apache/iceberg/issues/3393 . Let's mark
>> it as a blocker for the 0.12.1.
>>
>> On Fri, Oct 22, 2021 at 3:22 AM Kyle Bendickson  wrote:
>>
>>> Thank you everybody for the additional PRs brought up so far.
>>>
>>> I’ve volunteered to be release manager, so I will be doing my best to go
>>> through and ensure these are prioritized for consideration (if some are
>>> truly new features they might need to wait for 0.13.0, but as I’m just the
>>> release manager, that will be more up to the community).
>>>
>>> If any committers or contributors have free cycles and are willing to
>>> review some of these PRs, that would be greatly appreciated!
>>>
>>> - Kyle Bendickson [@kbendick]
>>>
>>> On Thu, Oct 21, 2021 at 11:19 AM Peter Vary 
>>> wrote:
>>>
>>>> Just to make this clear: https://github.com/apache/iceberg/pull/3338 fixes
>>>> the issue caused by https://github.com/apache/iceberg/pull/2565. The fix
>>>> will make Catalogs.loadCatalog consistent with Catalogs.hiveCatalog, and
>>>> fixes CREATE TABLE issues when no catalog is set in the config.
>>>>
>>>> On 2021. Oct 21., at 16:59, Peter Vary  wrote:
>>>>
>>>> I would like to have this in 0.12.1:
>>>> https://github.com/apache/iceberg/pull/3338
>>>>
>>>> The bug breaks Hive queries if no catalog is set, but the fix still
>>>> needs to be reviewed before merge.
>>>>
>>>> Thanks, Peter
>>>>
>>>>
>>>> On Thu, 21 Oct 2021, 07:12 Rajarshi Sarkar, 
>>>> wrote:
>>>>
>>>>> Hope this can get in: https://github.com/apache/iceberg/pull/3175
>>>>>
>>>>> Regards,
>>>>> Rajarshi Sarkar,
>>>>>
>>>>>
>>>>> On Thu, Oct 21, 2021 at 9:08 AM Cheng Pan  wrote:
>>>>>
>>>>>> Hope this can get in.
>>>>>> https://github.com/apache/iceberg/pull/3203
>>>>>>
>>>>>> Thanks,
>>>>>> Cheng Pan
>>>>>>
>>>>>>
>>>>>> On Thu, Oct 21, 2021 at 11:34 AM Reo Lei  wrote:
>>>>>>
>>>>>>> Thanks Kyle for syncing this!
>>>>>>>
>>>>>>> I think PR#3240 should be included in this release, because in our
>>>>>>> Dingding group we have received feedback from many Flink users who have
>>>>>>> encountered this problem. I think this PR is very important and we need
>>>>>>> to fix this problem ASAP.
>>>>>>>
>>>>>>> link: https://github.com/apache/iceberg/pull/3240
>>>>>>>
>>>>>>> BR,
>>>>>>> Reo LEI
>>>>>>>
>>>>>>> Kyle Bendickson  于2021年10月21日周四 上午2:52写道:
>>>>>>>
>>>>>>>> As mentioned in today's community sync up, we're planning on
>>>>>>>> releasing a new point version of Iceberg - Apache Iceberg 0.12.1.
>>>>>>>>
>>>>>>>> If there are any outstanding bugs you'd like to include fixes for
>>>>>>>> or other minor patches, please respond to this email thread letting us 
>>>>>>>> know.
>>>>>>>>
>>>>>>>> The current list of patches to be included can be found in the
>>>>>>>> milestone on Github:
>>>>>>>> https://github.com/apache/iceberg/milestone/15?closed=1
>>>>>>>>
>>>>>>>> As new items are added, they will be included in the milestone.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Kyle Bendickson [ Github: @kbendick ]
>>>>>>>>
>>>>>>>
>>>>
>
> --
> Ryan Blue
> Tabular
>


[VOTE] Release Apache Iceberg 0.12.1 RC0

2021-11-02 Thread Kyle Bendickson
Hi everyone,


I propose the following RC to be released as the official Apache Iceberg
0.12.1 release.


The commit id is d4052a73f14b63e1f519aaa722971dc74f8c9796

* This corresponds to the tag: apache-iceberg-0.12.1-rc0

* https://github.com/apache/iceberg/commits/apache-iceberg-0.12.1-rc0

*
https://github.com/apache/iceberg/tree/d4052a73f14b63e1f519aaa722971dc74f8c9796


The release tarball, signature, and checksums are here:

* https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.12.1-rc0/


You can find the KEYS file here:

* https://dist.apache.org/repos/dist/dev/iceberg/KEYS


Convenience binary artifacts are staged in Nexus. The Maven repository URL
is:

* https://repository.apache.org/content/repositories/orgapacheiceberg-1019/


This release includes the following changes:

https://github.com/apache/iceberg/compare/apache-iceberg-0.12.0...apache-iceberg-0.12.1-rc0


Please download, verify, and test.


Please vote in the next 72 hours.


[ ] +1 Release this as Apache Iceberg 

[ ] +0

[ ] -1 Do not release this because...

-- 
Best,
Kyle Bendickson
Github: @kbendick
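For anyone verifying a release candidate for the first time, the checksum step above can be sketched locally. This is a minimal demonstration against a dummy file (so it needs no network access); real verification targets the tarball and `.sha512` file downloaded from the dist.apache.org URL above.

```shell
# Sketch of the "verify the checksum" step of RC verification; a dummy
# artifact stands in for the real release tarball.
printf 'dummy artifact contents\n' > apache-iceberg-0.12.1.tar.gz

# The release manager publishes a .sha512 digest file alongside the tarball:
sha512sum apache-iceberg-0.12.1.tar.gz > apache-iceberg-0.12.1.tar.gz.sha512

# Voters re-compute the digest and compare; a mismatch is a reason to vote -1.
sha512sum -c apache-iceberg-0.12.1.tar.gz.sha512
# prints "apache-iceberg-0.12.1.tar.gz: OK"
```

Signature verification follows the same shape: `gpg --import` the KEYS file linked above, then `gpg --verify` the `.asc` file against the tarball.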


Standard practices around PRs against multiple Spark versions

2021-11-03 Thread Kyle Bendickson
I submitted a PR to fix a Spark bug today, applying the same changes to all
eligible Spark versions.

Jack mentioned that he thought the practice going forward was to fix /
apply changes on the latest Spark version in one PR, and then open a second
PR to backport the fixes (presumably to minimize review overhead).

Do we have a standard / preference on that? Jack mentioned he wasn't
certain, so I thought I'd ask here.

Seems like a good practice but hoping to get some clarification :)

-- 
Best,
Kyle Bendickson
Github: @kbendick


Re: [VOTE] Release Apache Iceberg 0.12.1 RC0

2021-11-04 Thread Kyle Bendickson
+1 (binding)

- Validated checksums, signatures, and licenses
- Ran all of the unit tests
- Imported Files from Orc tables via Spark stored procedure, with floating
point type columns and inspected the metrics afterwards
- Registered and used bucketed UDFs for various types such as integer and
byte
- Created and dropped tables
- Ran MERGE INTO queries using Spark DDL
- Verified ability to read tables with parquet files with nested map type
schema from various versions (both before and after Parquet 1.11.0 ->
1.11.1 upgrade)
- Tried to set a tblproperty to null (received error as expected)
- Full unit test suite
- Ran several Flink queries, both batch and streaming.
- Tested against a custom catalog

My spark configuration was very similar to Ryan’s. I used Flink 1.12.1 on a
docker-compose setup via the Flink SQL client with 2 task managers.

In addition to testing with a custom catalog, I also tested with HMS / Hive
catalog with HDFS as storage as well as Hadoop Catalog with data on (local)
HDFS.

I’ve not gotten the Hive3 errors despite running unit tests several times.

- Kyle (@kbendick)


On Thu, Nov 4, 2021 at 9:57 PM Daniel Weeks  wrote:

> +1 (binding)
>
> Verified sigs, sums, license, build and test.
>
> -Dan
>
> On Thu, Nov 4, 2021 at 4:30 PM Ryan Blue  wrote:
>
>> +1 (binding)
>>
>>- Validated checksums, checked signature, ran tests (still a couple
>>failing in Hive3)
>>- Staged binaries from the release tarball
>>- Tested Spark metadata tables
>>- Used rewrite_manifests stored procedure in Spark
>>- Updated to v2 using SET TBLPROPERTIES
>>- Dropped and added partition fields
>>- Replaced a table with itself using INSERT OVERWRITE
>>- Tested custom catalogs
>>
>> Here’s my Spark config script in case anyone else wants to validate:
>>
>> /home/blue/Apps/spark-3.1.1-bin-hadoop3.2/bin/spark-shell \
>> --conf 
>> spark.jars.repositories=https://repository.apache.org/content/repositories/orgapacheiceberg-1019/
>>  \
>> --packages org.apache.iceberg:iceberg-spark3-runtime:0.12.1 \
>> --conf 
>> spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
>>  \
>> --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
>> --conf spark.sql.catalog.local.type=hadoop \
>> --conf spark.sql.catalog.local.warehouse=/home/blue/tmp/hadoop-warehouse 
>> \
>> --conf spark.sql.catalog.local.default-namespace=default \
>> --conf spark.sql.catalog.prodhive=org.apache.iceberg.spark.SparkCatalog \
>> --conf spark.sql.catalog.prodhive.type=hive \
>> --conf 
>> spark.sql.catalog.prodhive.warehouse=/home/blue/tmp/prod-warehouse \
>> --conf spark.sql.catalog.prodhive.default-namespace=default \
>> --conf spark.sql.defaultCatalog=local
>>
>>
>> On Thu, Nov 4, 2021 at 1:02 PM Jack Ye  wrote:
>>
>>> +1, non-binding
>>>
>>> ran checksum, build, unit tests, AWS integration tests and verified
>>> fixes in EMR 6.4.0.
>>>
>>> Best,
>>> Jack Ye
>>>
>>> On Tue, Nov 2, 2021 at 7:16 PM Kyle Bendickson  wrote:
>>>
>>>> Hi everyone,
>>>>
>>>>
>>>> I propose the following RC to be released as the official Apache
>>>> Iceberg 0.12.1 release.
>>>>
>>>>
>>>> The commit id is d4052a73f14b63e1f519aaa722971dc74f8c9796
>>>>
>>>> * This corresponds to the tag: apache-iceberg-0.12.1-rc0
>>>>
>>>> * https://github.com/apache/iceberg/commits/apache-iceberg-0.12.1-rc0
>>>>
>>>> *
>>>> https://github.com/apache/iceberg/tree/d4052a73f14b63e1f519aaa722971dc74f8c9796
>>>>
>>>>
>>>> The release tarball, signature, and checksums are here:
>>>>
>>>> *
>>>> https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.12.1-rc0/
>>>>
>>>>
>>>> You can find the KEYS file here:
>>>>
>>>> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>>>>
>>>>
>>>> Convenience binary artifacts are staged in Nexus. The Maven repository
>>>> URL is:
>>>>
>>>> *
>>>> https://repository.apache.org/content/repositories/orgapacheiceberg-1019/
>>>>
>>>>
>>>> This release includes the following changes:
>>>>
>>>>
>>>> https://github.com/apache/iceberg/compare/apache-iceberg-0.12.0...apache-iceberg-0.12.1-rc0
>>>>
>>>>
>>>> Please download, verify, and test.
>>>>
>>>>
>>>> Please vote in the next 72 hours.
>>>>
>>>>
>>>> [ ] +1 Release this as Apache Iceberg 
>>>>
>>>> [ ] +0
>>>>
>>>> [ ] -1 Do not release this because...
>>>>
>>>> --
>>>> Best,
>>>> Kyle Bendickson
>>>> Github: @kbendick
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>


Re: Standard practices around PRs against multiple Spark versions

2021-11-05 Thread Kyle Bendickson
I like Yufei's idea of adding a template.

As for the concern that making changes separately could let things fall
through the cracks, I share it.

On a related note, I've already observed that a few times with simple
changes that people were perhaps working on before the split. Those cases
will go away with time, though they could recur whenever we add a new
version.

I agree a PR template would go a long way in ensuring that people are at
least aware.

It would be great if more folks who maintain forks could chime in as well,
as this has ramifications for them too.

But overall, especially for larger PRs, I think it's a great idea.

- Kyle

On Fri, Nov 5, 2021 at 10:34 AM Yufei Gu  wrote:

> It is a good practice to create a template for Github Issues and PRs
> <https://docs.github.com/en/communities/using-templates-to-encourage-useful-issues-and-pull-requests/creating-a-pull-request-template-for-your-repository>.
>  It
> will make a PR or issue easier to read overall. We can also enforce/suggest
> properties like affected versions and fixed versions.
>
> Best,
> Yufei
>
>
> On Wed, Nov 3, 2021 at 6:02 PM Wing Yew Poon 
> wrote:
>
>> I wasn't aware that we were standardizing on such a practice. I don't
>> have a strong opinion on making changes one Spark version at a time or all
>> at once. I think committers who do reviews regularly should decide. My only
>> concern with making changes one version at a time is follow-through on the
>> part of the contributor. We want to ensure that a change/fix applicable to
>> multiple versions gets into all of them. Reviewer bandwidth is incurred
>> either way. (For simple changes/fixes, perhaps all at once does save
>> reviewer bandwidth, so we may want to be flexible.)
>>
>> On Wed, Nov 3, 2021 at 5:52 PM Jack Ye  wrote:
>>
>>> Thanks for bringing this up Kyle!
>>>
>>> My personal view is the following:
>>> 1. For new features, it should be very clear that we always implement
>>> them against the latest version. At the same time, I suggest we create an
>>> issue to track backport, so that if anyone is interested in backport he/she
>>> can work on it separately. We can tag these issues based on title names
>>> (e.g. "Backport: xxx" as title), and these are also good issues for new
>>> contributors to work on because there is already reference implementation
>>> in a newer version.
>>> 2. For bug fixes, I understand sometimes it's just a one line fix and
>>> people will try to just fix across versions. My take is that we should try
>>> to advocate for fixing 1 version and open an issue for other versions
>>> although it does not really need to be enforced as strictly. Sometimes even
>>> a 1 line fix has serious implications for different versions and might
>>> break stuff unintentionally. it's better that people that have production
>>> dependency on the specific version carefully review and test changes before
>>> merging.
>>>
>>> About enforcement strategy, I suggest we start to create a template for
>>> Github Issues and PRs
>>> <https://docs.github.com/en/communities/using-templates-to-encourage-useful-issues-and-pull-requests/creating-a-pull-request-template-for-your-repository>,
>>> where we state the guidelines related to engine versions, as well as
>>> Iceberg's preferred code style, naming convention, title convention, etc.
>>> to make it a bit easier for new contributors to submit changes without too
>>> much rewriting. Currently I observe that every time there is a new
>>> contributor, we need to state all the guidelines through PR review, which
>>> leads to quite a lot of time spent rewriting code and also reduces the
>>> motivation for people to continue working on the PR.
>>>
>>> Best,
>>> Jack Ye
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Nov 3, 2021 at 4:13 PM Kyle Bendickson  wrote:
>>>
>>>> I submitted a PR to fix a Spark bug today, applying the same changes to
>>>> all eligible Spark versions.
>>>>
>>>> Jack mentioned that he thought the practice going forward was to fix /
>>>> apply changes on the latest Spark version in one PR, and then open a second
>>>> PR to backport the fixes (presumably to minimize review overhead).
>>>>
>>>> Do we have a standard / preference on that? Jack mentioned he wasn't
>>>> certain, so I thought I'd ask here.
>>>>
>>>> Seems like a good practice but hoping to get some clarification :)
>>>>
>>>> --
>>>> Best,
>>>> Kyle Bendickson
>>>> Github: @kbendick
>>>>
>>>


Re: Standard practices around PRs against multiple Spark versions

2021-11-06 Thread Kyle Bendickson
+1 as well. I too have found that in most places with a big PR template,
its utility goes down as the template gets larger.

And smaller patches are typically how we get new contributors. Or they want
to submit one patch in an area that some of the regular contributors might
not work with much, but they don’t otherwise have that much time to be
active contributors.

The template should be friendly for new people who possibly just want to
fix one thing that affects them - often those patches are very valuable.

One thing that might be nice in the future would be a bot that opens
issues for the backport, possibly via slash commands. The committer who
merged the PR could then comment `/icebergbot backport issue spark 3.1, 3.0`.

I know that’s likely more work than we have bandwidth for at present, but
I’m putting the idea out as we work through the issues around splitting our
tooling across many versions.
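To make the idea concrete, here is a minimal sketch of how such a bot might parse the comment. The `/icebergbot` command syntax and the issue titles are purely hypothetical (no such bot exists); the "Backport: ..." title style follows Jack's suggestion earlier in the thread.

```shell
# Hypothetical slash command a committer might leave on a merged PR:
comment='/icebergbot backport issue spark 3.1, 3.0'

# Strip the command prefix, then split out the engine and version list:
rest=${comment#/icebergbot backport issue }   # -> "spark 3.1, 3.0"
engine=${rest%% *}                            # -> "spark"
versions=${rest#* }                           # -> "3.1, 3.0"

# Print the backport issue the bot would open for each version:
echo "$versions" | tr ',' '\n' | while read -r v; do
  echo "open issue: Backport to ${engine} ${v}"
done
```

Running this prints one line per target version ("open issue: Backport to spark 3.1", then 3.0); a real bot would create the GitHub issues via the API instead of echoing.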

- Kyle

On Fri, Nov 5, 2021 at 2:59 PM Jack Ye  wrote:

> definitely +1 for having documentations to point to instead of laying
> everything out in the PR template!
>
> -Jack
>
> On Fri, Nov 5, 2021 at 2:55 PM Ryan Blue  wrote:
>
>> I agree with Jack about keeping changes targeted at one Spark version.
>> That makes it possible to revert changes to one Spark version rather than
>> manually rolling back. And it also keeps PRs smaller and more focused. I
>> think it wastes more contributor time to make updates to a PR in 3 or 4
>> places and keep them in sync.
>>
>> For the template for issues and PRs, I'm okay with that, but let's not
>> get carried away. Let's document style, standards, and practices and link
>> to that doc rather than starting off with a wall of text. The templates for
>> other projects, like Parquet, have so much content that isn't useful that I
>> think they're encouraging people with small commits to abandon the effort.
>>
>> Ryan
>>
>> On Fri, Nov 5, 2021 at 1:06 PM Kyle Bendickson  wrote:
>>
>>> I like Yufei's idea of adding a template.
>>>
>>> As for the concern that making changes separately could let things fall
>>> through the cracks, I share it.
>>>
>>> On a related note, I've already observed that a few times with simple
>>> changes that people were perhaps working on before the split. Those cases
>>> will go away with time, though they could recur whenever we add a new
>>> version.
>>>
>>> I agree a PR template would go a long way in ensuring that people are at
>>> least aware.
>>>
>>> It would be great if more folks who maintain forks could chime in as
>>> well, as this has ramifications for them too.
>>>
>>> But overall, especially for larger PRs, I think it's a great idea.
>>>
>>> - Kyle
>>>
>>> On Fri, Nov 5, 2021 at 10:34 AM Yufei Gu  wrote:
>>>
>>>> It is a good practice to create a template for Github Issues and PRs
>>>> <https://docs.github.com/en/communities/using-templates-to-encourage-useful-issues-and-pull-requests/creating-a-pull-request-template-for-your-repository>.
>>>>  It
>>>> will make a PR or issue easier to read overall. We can also enforce/suggest
>>>> properties like affected versions and fixed versions.
>>>>
>>>> Best,
>>>> Yufei
>>>>
>>>>
>>>> On Wed, Nov 3, 2021 at 6:02 PM Wing Yew Poon
>>>>  wrote:
>>>>
>>>>> I wasn't aware that we were standardizing on such a practice. I don't
>>>>> have a strong opinion on making changes one Spark version at a time or all
>>>>> at once. I think committers who do reviews regularly should decide. My 
>>>>> only
>>>>> concern with making changes one version at a time is follow-through on the
>>>>> part of the contributor. We want to ensure that a change/fix applicable to
>>>>> multiple versions gets into all of them. Reviewer bandwidth is incurred
>>>>> either way. (For simple changes/fixes, perhaps all at once does save
>>>>> reviewer bandwidth, so we may want to be flexible.)
>>>>>
>>>>> On Wed, Nov 3, 2021 at 5:52 PM Jack Ye  wrote:
>>>>>
>>>>>> Thanks for bringing this up Kyle!
>>>>>>
>>>>>> My personal view is the following:
>>>>>> 1. For new features, it should be very clear that we always implement
>>>>>> them against the latest version. At the same time, I suggest we create an
>>>>>> issue t

Re: [DISCUSS] Iceberg roadmap

2021-11-07 Thread Kyle Bendickson
+1 around concerns with the Elastic license.

Also, more importantly, how important is integration with either of these
tools to the Iceberg community and contributors?

The Elastic license makes a bit more sense for Elasticsearch, as it had been
an existing project for quite some time. I won’t reiterate the details of that
situation, but it’s odd to see a fork of a new, active project using the
Elastic license in my opinion.

StarRocks acknowledges that at least 40% of its code comes from the Apache
Doris project.

That said, StarRocks claims to not require other dependencies. It seems
StarRocks supports query federation with a few tools so as not to have to
import the data and query those systems directly. So I’m not sure what
Iceberg support would look like beyond additional query federation. What
benefit does this provide?

If we determined that integration with one of these tools was something the
community valued, could a connector be built to target the Apache Doris
project and then StarRocks could fork that code if they liked?

- Kyle Bendickson
GitHub @kbendick



On Sun, Nov 7, 2021 at 9:24 PM Reo Lei  wrote:

> +1, I have the same concern for the incompatible license.
>
> Jacques Nadeau  于2021年11月8日周一 上午11:48写道:
>
>> A few additional observations about StarRocks...
>>
>> - As far as I can tell, StarRocks has an ASF incompatible license
>> (Elastic License 2.0).
>> - It appears to be a hard fork of Apache Doris, a project still in the
>> incubator (and looks like it probably is destructive to the Doris project)
>> - The project has only existed for ~2 months.
>>
>>
>>
>>
>>
>> On Sun, Nov 7, 2021 at 7:34 PM OpenInx  wrote:
>>
>>> Any thoughts for adding StarRocks integration to the roadmap ?
>>>
>>> I think the guys from StarRocks community can provide more background
>>> and inputs.
>>>
>>> On Thu, Nov 4, 2021 at 5:59 PM OpenInx  wrote:
>>>
>>>> Update:
>>>>
>>>> StarRocks[1] is a next-gen sub-second MPP database for full analysis
>>>> scenarios, including multi-dimensional analytics, real-time analytics and
>>>> ad-hoc query.  Their team is planning to integrate iceberg tables as
>>>> StarRocks external tables in the next month [2], so that people could
>>>> connect the data lake and StarRocks warehouse in the same engine.
>>>> The excellent performance of StarRocks will also help accelerate the
>>>> analysis and access of the iceberg table, I think this is a great thing for
>>>> both the iceberg community and the StarRocks community.   I think we can
>>>> add an extra project about StarRocks integration work in the apache iceberg
>>>> roadmap [3] ?
>>>>
>>>> [1].  https://github.com/StarRocks/starrocks
>>>> [2].  https://github.com/StarRocks/starrocks/issues/1030
>>>> [3].  https://github.com/apache/iceberg/projects
>>>>
>>>> On Mon, Nov 1, 2021 at 11:52 PM Ryan Blue  wrote:
>>>>
>>>>> I closed the upgrade project and marked the FLIP-27 project priority
>>>>> 1. Thanks for all the work to get this done!
>>>>>
>>>>> On Sun, Oct 31, 2021 at 8:10 PM OpenInx  wrote:
>>>>>
>>>>>> Update:
>>>>>>
>>>>>> I think the project  [Flink: Upgrade to 1.13.2][1] in RoadMap can be
>>>>>> closed now, because all of the issues have been addressed.
>>>>>>
>>>>>> [1]. https://github.com/apache/iceberg/projects/12
>>>>>>
>>>>>> On Tue, Sep 21, 2021 at 6:17 PM Eduard Tudenhoefner <
>>>>>> edu...@dremio.com> wrote:
>>>>>>
>>>>>>> I created a Roadmap section in
>>>>>>> https://github.com/apache/iceberg/pull/3163 that links to the
>>>>>>> planning boards that Jack created. I figured it makes sense if we link
>>>>>>> available Design Docs directly on those Boards (as was already done),
>>>>>>> because then the Design docs are closer to the set of related issues.
>>>>>>>
>>>>>>> On Mon, Sep 20, 2021 at 10:02 PM Ryan Blue  wrote:
>>>>>>>
>>>>>>>> Thanks, Jack!
>>>>>>>>
>>>>>>>> Eduard, I think that's a good idea. We should have a roadmap page
>>>>>>>> as well that links to the projects that Jack just created.
>>>>>>>>
>>&g

[RESULT] [VOTE] Release Apache Iceberg 0.12.1

2021-11-08 Thread Kyle Bendickson
With 8 +1 votes and no +0 or -1 votes, this passes.

Thanks everyone for looking into the release candidate and taking the time
to vote.

And thank you very much to everyone that contributes to the project!
Whether it be through new features, bug fixes, documentation, tests, or
contributing to the community in some other way. All of that collective
work is highly valued, and we couldn't have the wonderful project and
inviting community we do without all of that.

I will work on getting some PRs up for the release's changelog as well as
ensuring the artifacts are published to their final destination.

Kyle [Github @kbendick]

On Sun, Nov 7, 2021 at 6:00 PM Steven Wu  wrote:

> +1 (non-binding) verified signature and checksum, build and test passed.
>
> On Fri, Nov 5, 2021 at 6:10 PM Sam Redai  wrote:
>
>> +1 (non-binding) signature, checksum, license, build and test
>>
>> On Fri, Nov 5, 2021 at 12:36 AM OpenInx  wrote:
>>
>>> +1  (binding)
>>>
>>> 1. Download the source tarball, signature (.asc), and checksum
>>> (.sha512):   OK
>>> 2. Import gpg keys: download KEYS and run gpg --import
>>> /path/to/downloaded/KEYS (optional if this hasn’t changed) :  OK
>>> 3. Verify the signature by running: gpg --verify
>>> apache-iceberg-xx-incubating.tar.gz.asc:  OK
>>> 4. Verify the checksum by running: shasum -a 256 -c
>>> apache-iceberg-0.12.1.tar.gz.sha512 apache-iceberg-0.12.1.tar.gz :  OK
>>> 5. Untar the archive and go into the source directory: tar xzf
>>> apache-iceberg-xx-incubating.tar.gz && cd apache-iceberg-xx-incubating:  OK
>>> 6. Run RAT checks to validate license headers: dev/check-license: OK
>>> 7. Build and test the project: ./gradlew build (use Java 8) :   OK
>>> 8. Check the flink works fine by the following command line:
>>>
>>> ./bin/sql-client.sh embedded -j
>>> /Users/openinx/Downloads/apache-iceberg-0.12.1/flink-runtime/build/libs/iceberg-flink-runtime-0.12.1.jar
>>> shell
>>>
>>> CREATE CATALOG hadoop_prod WITH (
>>> 'type'='iceberg',
>>> 'catalog-type'='hadoop',
>>> 'warehouse'='file:///Users/openinx/test/iceberg-warehouse'
>>> );
>>>
>>> CREATE TABLE `hadoop_prod`.`default`.`flink_table` (
>>> id BIGINT,
>>> data STRING
>>> );
>>>
>>> INSERT INTO `hadoop_prod`.`default`.`flink_table` VALUES (1, 'AAA');
>>> SELECT * FROM `hadoop_prod`.`default`.`flink_table`;
>>> ++--+
>>> | id | data |
>>> ++--+
>>> | 1 | AAA |
>>> ++--+
>>> 1 row in set
>>>
>>> Thanks all for the work.
>>>
>>> On Fri, Nov 5, 2021 at 2:20 PM Cheng Pan  wrote:
>>>
>>>> +1 (non-binding)
>>>>
>>>> The integration test based on the master branch of Apache Kyuubi
>>>> (Incubating) passed.
>>>>
>>>> https://github.com/apache/incubator-kyuubi/pull/1338
>>>>
>>>> Thanks,
>>>> Cheng Pan
>>>>
>>>> On Fri, Nov 5, 2021 at 1:19 PM Kyle Bendickson  wrote:
>>>> >
>>>> >
>>>> > +1 (binding)
>>>> >
>>>> > - Validated checksums, signatures, and licenses
>>>> > -  Ran all of the unit tests
>>>> > - Imported Files from Orc tables via Spark stored procedure, with
>>>> floating point type columns and inspected the metrics afterwards
>>>> > - Registered and used bucketed UDFs for various types such as integer
>>>> and byte
>>>> > - Created and dropped tables
>>>> > - Ran MERGE INTO queries using Spark DDL
>>>> > - Verified ability to read tables with parquet files with nested map
>>>> type schema from various versions (both before and after Parquet 1.11.0 ->
>>>> 1.11.1 upgrade)
>>>> > - Tried to set a tblproperty to null (received error as expected)
>>>> > - Full unit test suite
>>>> > - Ran several Flink queries, both batch and streaming.
>>>> > - Tested against a custom catalog
>>>> >
>>>> > My spark configuration was very similar to Ryan’s. I used Flink
>>>> 1.12.1 on a docker-compose setup via the Flink SQL client with 2 task
>>>> managers.
>>>> >
>>>> > In addition to testing with a custom catalog, I also tested with HMS
>>>> / Hive catalog with HDFS as storage as well as Hadoop Catalog with data on
>>

[ANNOUNCE] Apache Iceberg release 0.12.1

2021-11-10 Thread Kyle Bendickson
I'm pleased to announce the release of Apache Iceberg 0.12.1!

Apache Iceberg is an open table format for huge analytic datasets. Iceberg
delivers high query performance for tables with tens of petabytes of data,
along with atomic commits, concurrent writes, and SQL-compatible table
evolution.

This release can be downloaded from
https://www.apache.org/dyn/closer.cgi/iceberg/apache-iceberg-0.12.1/apache-iceberg-0.12.1.tar.gz

Java artifacts are available from Maven Central.

Thanks to everyone for contributing!

-- 
Best,
Kyle Bendickson
Github: @kbendick


Re: Welcome new PMC members!

2021-11-17 Thread Kyle Bendickson
Congratulations to both Jack and Russell!

Very well deserved indeed :)

On Wed, Nov 17, 2021 at 4:12 PM Ryan Blue  wrote:

> Hi everyone, I want to welcome Jack Ye and Russell Spitzer to the Iceberg
> PMC. They've both been amazing at reviewing and helping people in the
> community and the PMC has decided to invite them to join. Congratulations,
> Jack and Russell! Thank you for all your hard work and support for the
> project.
>
> Ryan
>
> --
> Ryan Blue
>


Re: Proposal: Switch docs site from mkdocs to hugo and relocate to a separate iceberg-docs repo

2021-11-29 Thread Kyle Bendickson
Wow, the prototype looks great, Sam!

I'd like to add a little bit about possible avenues for hosting to explore
and other corner areas.

I only have one thing to add:

1) For the latest docs, can we consider including a warning message on the
page noting that it documents the master version?

Apache Flink has this, and on several occasions it has helped me out. Their
doc string reads: "This documentation is for an unreleased version of
Apache Flink. We recommend you use the latest stable version."

Overall the site looks great. Thank you, Sam!

On Sun, Nov 28, 2021 at 11:03 PM Sam Redai  wrote:

> Thanks Jack! To your questions:
>
> 1. In addition to Hugo, I tried out Pelican and Gatsby. (“Tried out”
> meaning spent an afternoon fooling around with it)
>
> Pelican felt easy to use but doing anything custom like a landing page
> required a lot of theme and site config customizations. The live reloading
> also felt sluggish once I added in all of the content.
>
> Gatsby seems really flexible and powerful but it requires some knowledge
> of React that could discourage some community contributions in the future.
>
> Hugo on the other hand, in a little over an hour I was able to get the
> site together with the landing page, and another hour the next day I had
> asciinema added and the versioned docs working via GitHub (all with no
> prior experience with the framework). I definitely could have either of the
> other frameworks misunderstood. Some other frameworks out there that I
> haven’t looked deeply into are Jekyll, Hexo, and Nuxt. If anyone has strong
> preferences for a particular framework, let me know and I can explore it
> further.
>
> 2. The “latest” site is a branch itself. We can actually create as many
> branches as we’d like and each would be deployed as a separate site. We
> would just have to update the releases section to include the relevant
> hrefs. One thing I forgot to mention is that PRs are also deployed and we
> could do something clever here, like including a link in the PR template
> that shows how the PR changes look fully deployed.
>
> 3. I was thinking we would keep a copy of the docs in the main iceberg
> repo where the main commits occur. As part of the iceberg version release
> process, the docs would be copied over to the iceberg-docs repo in a branch
> named after the release version. Hotfixes or typo corrections for previous
> versions could be done via pull requests directly to that branch in the
> iceberg-docs repo. That being said, I believe it’s possible to keep the
> docs in the same repo but it would require some magic that may feel
> somewhat fragile. For example, branch names such as 0.12.x wouldn't work
> well if we want to have a different docs site for 0.12.0 and 0.12.1, we
> could probably work around this by adding some kind of regex to the deploy
> workflow and maybe use tags (
> https://docs.github.com/en/actions/learn-github-actions/workflow-syntax-for-github-actions#filter-pattern-cheat-sheet
> ).
>
> -Sam
>
> On Sat, Nov 27, 2021 at 5:59 PM Jack Ye  wrote:
>
>> The website looks amazing, thanks for the work!!!
>>
>> Some questions I have:
>> 1. you mentioned that you compared a few different static site
>> frameworks. Just for bookkeeping purposes, could you list what frameworks
>> you have compared so that people can have more clarity over the decision
>> for hugo?
>> 2. In the website, I see the latest doc points to 0.12.1. Is it possible
>> to have a version named "Next" that shows the latest doc in the master
>> branch?
>> 3. For the separation of docs to another repo, I remember we discussed
>> the topic in the past and we decided to not do it because many people
>> expressed that it's valuable for docs to be in the same repo so that they
>> can easily view and edit it. Given that we now have the iceberg-docs repo,
>> do we plan to run a sync job to copy the docs to that repo, or are you
>> thinking about revisiting the decision to fully move the docs to
>> iceberg-docs? It would be helpful if you can provide more details in this
>> area.
>>
>> Best,
>> Jack Ye
>>
>> On Sat, Nov 27, 2021 at 8:52 AM Sam Redai  wrote:
>>
>>> Hey everyone,
>>>
>>> I wanted to bring to everyone's attention an issue that I opened today
>>> that's a proposal for switching to using hugo for the iceberg documentation
>>> site. https://github.com/apache/iceberg/issues/3616
>>>
>>> I've deployed a prototype of what the site would look like and how it
>>> achieves some things still left desired for the current docs site (landing
>>> page, branch based versioning, etc). Please check it out when you have a
>>> chance and let me know what you all think!
>>> https://samredai.github.io/iceberg-docs-prototype/latest/
>>>
>>> -Sam
>>>
>>


Re: Iceberg event notification support

2021-11-30 Thread Kyle Bendickson
I think this is a great idea, Jack. Thank you for bringing this up! +1

There have been several people interested in having more observability (for
example for table design patterns akin to how folks might monitor Hive) and
events would be a big win for that and something users could use with a lot
of their existing infra (Kafka, REST services, AWS or other cloud provider
queue types).

Spark has an existing interface, ExternalCatalogWithListener, which emits
events we might hook into. I won't go into too much detail here. And while
these Spark "ExternalCatalogEvents" shouldn't be how we define our own
events, which should have their own type system, it could be a beneficial
source of event hooks from within Spark. It also provides us table level
query data we don't currently otherwise get. It's worth investigating if we
haven't already, though we might choose to forgo its complexity.

I agree conceptually that most events should be registered at the table
level, though I'd be open to having events of differing granularities.
Especially if this helps support cross-table patterns. But table level data
should be prioritized first.
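To sketch what table-level registration might look like (purely a hypothetical toy model, not Iceberg's actual `org.apache.iceberg.events.Listeners`, which registers listeners globally in the JVM):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of per-table event registration, loosely mirroring the
// shape of Iceberg's org.apache.iceberg.events.Listener interface. This is
// illustration only, not the actual Iceberg implementation.
public class TableListeners {
  // Minimal stand-in for an Iceberg event type.
  public record ScanEvent(String tableName, long snapshotId) {}

  public interface Listener<E> {
    void notify(E event);
  }

  // Listeners keyed by table name instead of registered globally in the JVM.
  private final Map<String, List<Listener<ScanEvent>>> byTable = new HashMap<>();

  public void register(String tableName, Listener<ScanEvent> listener) {
    byTable.computeIfAbsent(tableName, t -> new ArrayList<>()).add(listener);
  }

  // Returns the number of listeners notified, for easy verification.
  public int fire(ScanEvent event) {
    List<Listener<ScanEvent>> listeners =
        byTable.getOrDefault(event.tableName(), List.of());
    listeners.forEach(l -> l.notify(event));
    return listeners.size();
  }

  public static void main(String[] args) {
    TableListeners registry = new TableListeners();
    registry.register("db.events", e -> System.out.println("scan of " + e.tableName()));
    // Only listeners registered for the scanned table are notified.
    System.out.println(registry.fire(new ScanEvent("db.events", 1L))); // 1
    System.out.println(registry.fire(new ScanEvent("db.other", 2L)));  // 0
  }
}
```

The point of the sketch is that the registry is keyed by table, so notifications from tables in different catalogs can be routed to different listeners rather than one global set.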

If you have something to share or would like to make time to discuss,
please count me in. This is an area I've been thinking about a bit lately
as I've had quite some interest in observability and possible event-driven
patterns.

Best,
Kyle (GitHub @kbendick)

On Tue, Nov 30, 2021 at 9:50 PM Neelesh Salian 
wrote:

> +1 to this effort.
> There is value in adding support for Events - general bookkeeping and
> helping replay actions in the event of recovery.
> At the minimum, we should aim to track the following across all catalogs:
> 1. Create actions
> 2. Alter actions
> 3. Delete actions
> across all tables, properties and namespaces.
>
>
>
> On Tue, Nov 30, 2021 at 9:12 PM Jack Ye  wrote:
>
>> Hi everyone,
>>
>> I would like to start some initial discussions around Iceberg event
>> notification support, because we might have some engineering resources to
>> work on Iceberg notification integration with AWS services such as SNS,
>> SQS, CloudWatch.
>>
>> As of today, we have a Listener interface and 3 events ScanEvent,
>> IncrementalScanEvent, CreateSnapshotEvent. There is a static registry
>> called Listeners that registers the event listeners in the JVM.
>>
>> However, when I read the related code paths, my thought is that it might
>> be better to register listeners per-table, based on the following
>> observations:
>> 1. Iceberg events are all table or sub-table level events. For any
>> catalog or global level events, the catalog service can provide
>> notifications, and Iceberg can be out of the picture.
>> 2. A user might have multiple Iceberg catalogs defined, pointing to
>> different catalog services. (e.g. one to AWS Glue, one to a Hive
>> metastore). The notifications from tables of these different catalogs
>> should be directed to different listeners at least per catalog, instead of
>> the same set of listeners that are registered globally.
>> 3. Event listener configurations are usually static. It makes more sense
>> to me to define it once and then repeatedly use it, instead of
>> re-registering it every time I start an application.
>>
>> If we register the listeners at table level, we can add a hook in
>> TableOperations to get a set of listeners to emit specific events. The
>> listeners could be defined and serialized as a part of the table
>> properties, or maybe even a part of the Iceberg spec.
>>
>> This is really just my brainstorming. Maybe it's a bit overkill, maybe I
>> am missing the correct way to use the Listeners static registry. It would
>> be great if anyone could provide more contexts or thoughts around this
>> topic.
>>
>> Best,
>> Jack Ye
>>
>>
>
> --
> Regards,
> Neelesh S. Salian
>
>


Re: Single multi-process commit

2021-12-03 Thread Kyle Bendickson
This could also be achieved using the Write-Audit-Publish (WAP) feature, I
believe, where you audit a set of writes and then choose to publish them. I'm
not as familiar with that feature, but you might look into it as well.
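For anyone curious, the rough shape of the idea is stage-then-publish: writes commit snapshots that aren't referenced by the table's main state until an explicit publish (cherry-pick) step. A toy model of that flow (hypothetical, not Iceberg's actual WAP API):

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual sketch only (not Iceberg's implementation): writes land as staged
// snapshots that are not visible in the main table state until an explicit
// publish (cherry-pick) step, which is roughly the write-audit-publish idea.
public class WapSketch {
  private final List<Long> published = new ArrayList<>();
  private final List<Long> staged = new ArrayList<>();

  public long stageWrite(long snapshotId) {
    staged.add(snapshotId); // write is committed but not visible to readers
    return snapshotId;
  }

  // The "audit" happens out of band; publish makes the snapshot visible.
  public boolean publish(long snapshotId) {
    if (staged.remove(snapshotId)) {
      published.add(snapshotId);
      return true;
    }
    return false;
  }

  public List<Long> visibleSnapshots() {
    return List.copyOf(published);
  }
}
```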

Thanks,
Kyle Bendickson


Re: REST catalog proposal

2021-12-14 Thread Kyle Bendickson
Hi Ryan,

Sorry for the late response.

I feel Jack and Ryan have summed up things very well.

I will also answer the questions from my perspective, as you did ask, and I
do have a few thoughts outside of what was shared.

For starters, this is an additional catalog. The other catalogs, as well as
the ability to write your own catalog (which I hear about quite a lot), are
in no way affected by this. If catalogs choose to evolve to work similarly,
that’s ok but by no means expected. This is just another catalog.

As for the word specification: I think there’s been some confusion around
this. The current PR is labeled a specification because it’s an OpenAPI
spec. Neither the present PR nor, really, the work on the REST catalog is
intended as a catalog specification, especially not in the way that table
spec v1 and table spec v2 are.

I think in the long term, it would be great to be able to point to
something as a specification for Catalogs. However, I’m hesitant to say
that it would be the REST catalog. I’m not personally uncomfortable with
the interfaces as the source of truth, as they have been for some time.
Plus, catalogs change from engine to engine a good bit.

As for the individual questions, they’ve mostly been answered I feel. I’ll
address the ones that seem unanswered in more detail and answer them fully
for the record.

* Is this a spec at the level that the table spec exists or is this an
informative PR to agree on the REST api of _a_ catalog?


Informative PR to agree on the REST api of _a_ catalog.


* Is it meant to enshrine the `Catalog` interface into a spec? This came up
on a python sync also


Answered above, but no. Someday that might be good, and it might be
beneficial for non-Java devs to be able to see curl examples with
representative JSON as a form of documentation via the REST catalog's docs,
but no.


* Will there be both server and client modules in the iceberg codebase? I
would expect that at least a reference implementation of a server would be
a good thing but this would be the first part of the codebase that runs as
a server instead of as client code in an engine. On the other hand, an OpenAPI
spec and a client impl w/o a server sounds like it's missing something.


Answered by others, but I don’t personally believe in just mocking for
tests. So there will likely be some minimum implementation in that regard.
I also think with time we’ll see more intricate things open sourced (even
like we’ve seen with the Aliyun catalog and their tests).


But I don’t think that a client impl and an OpenAPI spec are missing
something in this case, because the idea is very much to decouple the
server logic from the catalog. And to allow users to use their own set of
tools, just like many people have rather complicated HMS shims. But with
time I like to think we’ll see something open sourced, and at least for
testing we will need a minimum implementation.


* It may be early to say for sure but does a server implementation imply
authn/z, database backends, deployment artifacts and all the other fun
things that go into a server side component?


I think it’s too early to say. For now, we’re just trying to get the very
fundamentals of a REST catalog in order. I don’t know about publishing
artifacts ourselves, especially given that ASF rules could somehow come into
play (similar to how we haven’t officially had a Python client). Outside of
that, I think it’s too early to say. We’ll see as time
goes on.


- Kyle (GitHub @kbendick)


On Mon, Dec 13, 2021 at 7:34 PM Ryan Blue  wrote:

> I think Jack does a great job of pointing out a lot of the advantages. I
> agree with him, but I’ll add my perspective as well. I suggested the REST
> catalog a couple (few?) months ago when we were talking about the DynamoDB
> catalog and it stuck with me as the solution to quite a few problems.
>
> First, although you can plug a catalog implementation into the classpath
> for clients, that’s not always a good idea. JDBC is a good example, where
> you probably don’t want a ton of connections going directly to a database.
> An intermediate service is a great way to scale such a metastore. As Jack
> noted, it’s also nice to implement catalogs like JDBC with a service so
> that you get the exact same behavior across languages without implementing
> it twice with different DB APIs.
>
> Along the same lines, many hosted processing engines aren’t going to
> support customers plugging arbitrary code into processing engines. When I
> was at Netflix, we used a custom metastore to track tables. That worked
> great, but it meant that our platform was incompatible with things like AWS
> Athena because we’d either have to plug in a Jar or get them to implement a
> bespoke REST protocol just for us. Right now, catalog customization is only
> available if you use the Hive thrift API, which is not a fun way to go if
> you just want to try out a hosted processing engine. By building a common
> protocol and client, we can hopefully g

Re: Time-sliced incremental scan

2022-01-08 Thread Kyle Bendickson
Thank you Ryan for summarizing that so well.

I'm in agreement that it's too convenient to simply ignore despite those
caveats, though they are admittedly potentially large ones.

However, some people don't interact with their table that way, and I often
see discussion around ways to implement incremental scans. I agree this
syntax is reasonable and that it's useful enough to implement despite the
challenges.

I believe there's been some previous work on this. Without looking at the
other thread again, I do think we should either resurrect that previous
work or try to prioritize this.

Especially with certain other convenience mechanisms such as snapshots and
branches on the horizon, I would hope to get this in sooner rather than
later so that its existence can be considered as these new features are
developed.
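As a toy illustration of the translation step discussed in the quoted thread (hypothetical code, not Iceberg's implementation): resolving a timestamp range to the snapshots committed in that window, which could then drive a snapshot-ID-based incremental scan such as `appendsBetween(fromSnapshotId, toSnapshotId)`:

```java
import java.util.List;
import java.util.stream.Collectors;

// Toy illustration (not Iceberg's implementation) of resolving a timestamp
// range to a snapshot range: pick the snapshots whose commit time falls in
// (fromMillis, toMillis].
public class TimeSliceResolver {
  public record SnapshotEntry(long snapshotId, long timestampMillis) {}

  public static List<Long> snapshotsBetween(
      List<SnapshotEntry> log, long fromMillis, long toMillis) {
    // Assumes the snapshot log is ordered by commit time; as discussed in the
    // thread, commit timestamps are not guaranteed to be linear in general.
    return log.stream()
        .filter(s -> s.timestampMillis() > fromMillis && s.timestampMillis() <= toMillis)
        .map(SnapshotEntry::snapshotId)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<SnapshotEntry> log = List.of(
        new SnapshotEntry(1L, 1000L),
        new SnapshotEntry(2L, 2000L),
        new SnapshotEntry(3L, 3000L));
    System.out.println(snapshotsBetween(log, 1000L, 3000L)); // [2, 3]
  }
}
```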

- Kyle (kbendick)


On Fri, Jan 7, 2022 at 4:35 PM Ryan Blue  wrote:

> Walaa,
>
> Supporting syntax for VERSIONS BETWEEN SYSTEM TIME ... AND ... seems
> reasonable to me. I think it’s often really nice to be able to select the
> changes between two points in time for debugging. It would also be nice to
> be able to do the same for snapshot IDs, so you could reliably use similar
> syntax for incremental consumption.
>
> There are some challenges with implementing VERSIONS BETWEEN ... that I
> want to highlight, though. First, the FOR SYSTEM_TIME syntax produces the
> result that would have been returned if you ran the query at that time.
> That uses the table history instead of snapshot creation timestamps. If you
> update the table and then roll back to a previous snapshot, FOR
> SYSTEM_TIME will read the rolled back snapshot if the timestamp is in the
> interval when it was the table’s current state.
>
> We will need to decide whether VERSIONS BETWEEN ... uses table history or
> snapshot creation timestamps. If it uses table history, there may be
> intervals that don’t have a linear history and the query would fail. If it
> uses snapshot creation timestamps, then you’d be able to select intervals
> that never really existed, like between commits in a transaction.
>
> So there are major issues with timestamps. We would want to make it clear
> that timestamps are for convenience, and not for incremental consumption
> (because of the issue from that thread) and may not reflect the actual
> table state (or may fail). That said, it’s a really convenient feature and
> I would support adding both VERSIONS BETWEEN with timestamp and snapshot
> ID.
>
> Ryan
>
> On Thu, Jan 6, 2022 at 9:52 PM Walaa Eldin Moustafa 
> wrote:
>
>> Hi Iceberg devs,
>>
>> We have been considering the problem of Time-sliced incremental scan
>> (i.e., reading data that is committed between two timestamps), and I ran
>> into this thread [1] in the Iceberg dev mailing list. The summary of the
>> thread is that incremental scan should leverage snapshot IDs as opposed to
>> timestamps since there is no guarantee that commit timestamps are linear.
>>
>> I wanted to follow up on that discussion to see if folks are open to
>> still supporting time-sliced incremental scan APIs with the caveat above.
>> The reasons are many fold:
>>
>> * Time-slice APIs are more human friendly and simplify the need for a
>> state store to track last read snapshot IDs. While some sort of state
>> recording may be required in some cases, it is not always the case,
>> especially if data consumption happens at regular intervals.
>>
>> * I understand the caveat discussed in the thread still applies to the
>> existing "TIMESTAMP AS OF" API, yet the API is supported. By extension, I
>> think it is fair to extend the support to time-sliced incremental scan.
>>
>> * Iceberg already provides a deterministic function to translate a
>> timestamp to a snapshot ID [2]. Since a table can be scanned by snapshot
>> range, I think it is reasonable to allow scanning by timestamp range, since
>> the translation mechanism already exists.
>>
>> * Incremental scan using timestamp range is part of the SQL Standard
>> (e.g. "VERSIONS BETWEEN SYSTEM TIME ... AND ...") and is supported by some
>> existing engines. See [3] for SQL Server support.
>>
>> * Conceptually, it is possible to implement the same query semantics at
>> the SQL level using existing APIs and operators such as "TIMESTAMP AS OF"
>> and "EXCEPT" (e.g., by selecting: (T as of timestamp1) EXCEPT (T as of
>> timestamp2)). It sounds that incremental scan is an optimization to push
>> the differencing operation to the data source as opposed to letting the
>> engine deal with it, so it is better to implement the SQL shorthand, and
>> its corresponding data source optimization.
>>
>> Therefore, I would like to ask if we can proceed with supporting an API
>> along the lines of "VERSIONS BETWEEN SYSTEM TIME ... AND ...", and consider
>> the discussion in [1] as a caveat that folks using this API (or existing
>> timestamp APIs) have to keep in mind, until the implementation guarantees
>> linearity at some point in

Re: Iceberg engine version maintenance lifecycle

2022-01-08 Thread Kyle Bendickson
Thank you Jack for your thoughts.

I'm very much in agreement with you.

I'd like to discuss the beta version further.

Ideally, to me, the beta version is the minimum change set needed to work
as-is with that version of the engine. We would ideally create a beta that
ignores new features, optimizations, etc. where possible, while allowing for
code changes where APIs have changed (e.g., a method signature changed). For
example, when the new version's folder is added, we'd defer PRs that take
advantage of new features unless the old pathways have been removed.

This seems like it would help determine where breaking changes are
introduced. Given that two systems are changing at once (Iceberg and the
engine), there's no guarantee that the new version of the engine isn't
causing some problem, but it would still give us a reproducible build to
try things out on and better ability to point to a changeset / point in
time or test different versions to determine where the problem came from.

Also, I know several people that now test with the SNAPSHOT version,
especially so they can test newer engine support (as well as for upgrade
preparedness). Many of these people have submitted valuable issues and
making it easier for them to test as early as possible, when willing, could
be advantageous.

If we wanted to automate this somewhat, we might be able to create a GitHub
Action that pushes the beta to a SNAPSHOT repo on the creation of a new tag,
or possibly of a new branch.

If you'd like to sync about this, I'd be happy to help contribute a GitHub
Action and scripts to help automate some of the creation of the beta
versions.

Let me know if you'd like to sync on it - particularly whether we could
automate based off the creation of a tag or of a branch, etc. Then I could
make a PR for the GitHub Action. The natural creation event seems to be the
new folder / introduction of the new Gradle project, but tagging or branching
would be easier to integrate into GH Actions and would give us additional
control over what we determine to be the beta version.
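A rough sketch of what such a trigger could look like (hypothetical names and steps; the publish step assumes Gradle publishing is already configured for a snapshot repository):

```yaml
# Hypothetical sketch: publish a beta SNAPSHOT whenever a tag ending in
# "-beta" is pushed. Names and steps are illustrative only.
on:
  push:
    tags:
      - '*-beta'

jobs:
  publish-beta-snapshot:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      # Assumes credentials and a snapshot repository are configured elsewhere.
      - run: ./gradlew publish
```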

Thanks,
Kyle (kbendick)

On Fri, Jan 7, 2022 at 4:37 PM Ryan Blue  wrote:

> Sorry for the late reply here!
>
> These look reasonable to me. I think that this will help us reason about
> trade-offs next time we have a release issue like the current one. We
> should simply mark the 3.2 support as beta and get the release out next
> time. I also think that we should not create situations with regressions
> like this again. We should probably have copied the old MERGE/UPDATE/DELETE
> plans into 3.2 to avoid a regression and then update them to the new
> implementations later, without affecting releases.
>
> Thanks for writing this up, Jack!
>
> Ryan
>
> On Wed, Dec 15, 2021 at 7:18 PM Jack Ye  wrote:
>
>> Hi everyone,
>>
>> As a part of the ongoing 0.13.0 release, we are starting to formally
>> support multiple engine versions for Spark, Flink and Hive. I think it is
>> worth defining a formal process for us to add a new supported version,
>> maintain existing versions and deprecate old versions. We briefly touched
>> this topic when doing the refactoring, but I think now is a good time to
>> formalize it and place it as a part of the Iceberg public documentation. As
>> a starter for brainstorming, here is the process I think:
>>
>> Each engine has the following lifecycle states:
>>
>> 1. *Beta*: an engine supported is added, but still in the experimental
>> stage. Maybe the engine version itself is still in preview (e.g. Spark
>> 3.0.0-preview), or the engine does not yet have full feature
>> compatibility compared to old versions yet. This state allows us to
>> release an engine version support without the need to wait for feature
>> parity, shortening the release time.
>>
>> 2. *Maintained*: an engine version is being actively maintained by the
>> community. Users can expect feature parity for most features across all the
>> maintained versions. If a feature has to leverage some new engine
>> functionalities that older versions don't have, then feature parity is not
>> required. For code contributors,
>> - New features should always be prioritized first in the latest version
>> (the latest version could be a maintained or beta version)
>> - For features that could be backported, the contributor is encouraged to
>> either also perform backports in separated PRs, or at least create some
>> issues to track the backport.
>> - If the change is small enough like a few lines, updating all versions
>> at once is good enough. Otherwise, using separated PRs for each version is
>> recommended.
>>
>> 3. *Deprecating*: an engine version is no longer actively maintained.
>> People who are still interested in the version can backport any necessary
>> feature or bug fix from newer versions, but the community will not spend
>> effort in achieving feature parity. We recommend users to move towards a
>> newer version, and we expect contributions to the specific version to
>> diminish over time, and eventually no change is added to the v

Re: [VOTE] Release Apache Iceberg 0.13.0 RC1

2022-01-25 Thread Kyle Bendickson
Thank you, Jack!

Quick announcement when testing: *the runtime jars / artifacts for Spark &
Flink have changed naming format* to include the corresponding Spark /
Flink version. The Spark jars also have the Scala version appended at the
end.

*Spark:*
You can test the 0.13.0-rc1, fetching it from the staging maven repository,
with the following command line flags for Spark 3.2: `--packages
'org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.0' --repositories
https://repository.apache.org/content/repositories/orgapacheiceberg-1079/`

For other Spark versions than 3.2, use the artifactIds below (in place of
`iceberg-spark-runtime-3.2_2.12` above).

*iceberg-spark-runtime artifact names as of 0.13.0:*
Spark 3.0: `iceberg-spark3-runtime:0.13.0`
Spark 3.1: `iceberg-spark-runtime-3.1_2.12:0.13.0`
Spark 3.2: `iceberg-spark-runtime-3.2_2.12:0.13.0`

The complete artifact name now depends on your Spark version.
`iceberg-spark3-runtime` should only be used for Spark 3.0.

*Flink:*
*iceberg-flink-runtime artifact names as of 0.13.0:*
1.12: iceberg-flink-runtime-1.12
1.13: iceberg-flink-runtime-1.13
1.14: iceberg-flink-runtime-1.14

Thank you and happy testing!
- Kyle



On Tue, Jan 25, 2022 at 9:09 AM Jack Ye  wrote:

> Hi Everyone,
>
> I propose that we release the following RC as the official Apache Iceberg
> 0.13.0 release.
>
> The commit ID is ca8bb7d0821f35bbcfa79a39841be8fb630ac3e5
> * This corresponds to the tag: apache-iceberg-0.13.0-rc1
> * https://github.com/apache/iceberg/commits/apache-iceberg-0.13.0-rc1
> *
> https://github.com/apache/iceberg/tree/ca8bb7d0821f35bbcfa79a39841be8fb630ac3e5
>
> The release tarball, signature, and checksums are here:
> * https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.13.0-rc1
>
> You can find the KEYS file here:
> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>
> Convenience binary artifacts are staged on Nexus. The Maven repository URL
> is:
> *
> https://repository.apache.org/content/repositories/orgapacheiceberg-1079/
>
> Please download, verify, and test.
>
> Please vote in the next 72 hours.
>
> [ ] +1 Release this as Apache Iceberg 0.13.0
> [ ] +0
> [ ] -1 Do not release this because...
>


Re: [VOTE] Release Apache Iceberg 0.13.0 RC2

2022-01-30 Thread Kyle Bendickson
+1 (non-binding)

Verified signature, checksum, rat check, build and ran tests, and tested
the relevant JAR on both Spark 3.1 and 3.2.

- Kyle

On Sun, Jan 30, 2022 at 12:45 AM Szehon Ho  wrote:

> +1 (non-binding)
>
> Verified signature
> Verified checksum
> Rat check
> Built and ran test, all succeed, after some temporary local HMS timeout
> Tested relevant jar with Spark 3.2, created various tables and ran queries
>
> Thanks
> Szehon
>
> On Fri, Jan 28, 2022 at 12:19 PM Russell Spitzer <
> russell.spit...@gmail.com> wrote:
>
>> +1
>> All tests passed for me and signatures, checksum and license all were
>> good to go
>>
>> On Fri, Jan 28, 2022 at 12:32 PM John Zhuge  wrote:
>>
>>> +1 (non-binding)
>>>
>>> Checked signature, checksum, and license.
>>> Ran build and test with OpenJDK 1.8.0_312-b07.
>>>
>>> Ignoring the mr test failures. Maybe my env is not set up correctly?
>>>
>>>- 19 in TestHiveIcebergStorageHandlerWithEngine
>>>- 3 in TestHiveIcebergStorageHandlerWithMultipleCatalogs
>>>
>>>
>>> On Fri, Jan 28, 2022 at 8:41 AM Jack Ye  wrote:
>>>
 Hi Everyone,

 I propose that we release the following RC as the official Apache
 Iceberg 0.13.0 release.

 The commit ID is 72237429ba164c054480dcfbdb9fe1c86c04dcda
 * This corresponds to the tag: apache-iceberg-0.13.0-rc2
 * https://github.com/apache/iceberg/commits/apache-iceberg-0.13.0-rc2
 *
 https://github.com/apache/iceberg/tree/72237429ba164c054480dcfbdb9fe1c86c04dcda

 The release tarball, signature, and checksums are here:
 *
 https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.13.0-rc2

 You can find the KEYS file here:
 * https://dist.apache.org/repos/dist/dev/iceberg/KEYS

 Convenience binary artifacts are staged on Nexus. The Maven repository
 URL is:
 *
 https://repository.apache.org/content/repositories/orgapacheiceberg-1080/

 Please download, verify, and test.

 Please vote in the next 72 hours.

 [ ] +1 Release this as Apache Iceberg 0.13.0
 [ ] +0
 [ ] -1 Do not release this because...

>>>
>>>
>>> --
>>> John Zhuge
>>>
>>


Re: New Versioned Iceberg Documentation Site

2022-02-01 Thread Kyle Bendickson
+1 from me. This looks great. Thank you for all your hard work, Sam!

On Tue, Feb 1, 2022 at 10:33 AM Jack Ye  wrote:

> +1, amazing website! And now the website repo is separated we can continue
> to iterate and deploy quickly without affecting the main repo, so no need
> to be 100% perfect as of now.
>
> I will update the 0.13 release note against the new website and we can
> announce them together.
>
> Best,
> Jack Ye
>
> On Tue, Feb 1, 2022 at 8:26 AM Ryan Blue  wrote:
>
>> Good catch. Looks like the anchor links on the spec page are broken.
>> We'll have to get those fixed.
>>
>> I think we should move forward with the update and fix these as we come
>> across them. It's inevitable that we'll have some broken things in a big
>> change, but we don't want to block this improvement on being 100% perfect.
>>
>> On Tue, Feb 1, 2022 at 1:21 AM Ajantha Bhat 
>> wrote:
>>
>>> Nice looking website.
>>>
>>> Is the shared link the final version ? I couldn't see the markdown
>>> anchor tag inside https://iceberg.redai.dev/spec/
>>> It will be useful to have that for sharing specific parts of the spec.
>>>
>>> Also some pages are in light theme and some are in dark theme. Better to
>>> have a unified theme.
>>>
>>> +1 for versioning and overall work.
>>>
>>> Thanks,
>>> Ajantha
>>>
>>> On Tue, Feb 1, 2022 at 1:47 PM Eduard Tudenhoefner 
>>> wrote:
>>>
 +1 on the procedure and the new site looks amazing

 On Tue, Feb 1, 2022 at 3:38 AM Ryan Blue  wrote:

> +1 from me. I think the new site looks great and it is a big
> improvement to have version-specific docs. Thanks for all your work on
> this, Sam!
>
> On Mon, Jan 31, 2022 at 5:48 PM Sam Redai  wrote:
>
>> Hey Everyone,
>>
>> With 0.13.0's approval for release, I think this would be a good time
>> to have a discussion around the proposed versioned documentation site,
>> powered by Hugo. The site is ready to be released and the source code for
>> the site can be found in the apache/iceberg-docs repository:
>> https://github.com/apache/iceberg-docs.
>>
>> In order for everyone to see a dev version of the site live, I've
>> deployed it temporarily to: https://iceberg.redai.dev
>>
>> The markdown files will remain in the apache/iceberg repository and
>> will represent the latest unreleased documentation. PRs for changes to
>> documentation will be made against the apache/iceberg repository. During 
>> a
>> release, the current version of the docs will be copied from
>> apache/iceberg, to apache/iceberg-docs, where a new version can then be
>> deployed by creating a version branch. With the current configuration, a
>> new version of the documentation site is deployed for each branch name, 
>> for
>> example creating an `0.13.0` branch will deploy an `0.13.0` version of 
>> the
>> site. A particular version can be reached at /docs/. We will 
>> also
>> maintain a `latest` branch which will be a clone of the latest version
>> branch. All links at the top level (such as from the landing-page) will
>> link to the `docs/latest` site.
>>
>> If everyone is ok with this, I'll reach out to the ASF infra team to
>> begin the process.
>>
>> Thanks!
>> -Sam
>>
>
>
> --
> Ryan Blue
> Tabular
>

>>
>> --
>> Ryan Blue
>> Tabular
>>
>


Re: [VOTE] Release Apache Iceberg 0.13.1 RC0

2022-02-14 Thread Kyle Bendickson
+1 (non-binding)

License checks, various smoke tests for create table, update, merge into,
deletes, etc against Java 11 and Spark 3.2 and 3.1.

- Kyle Bendickson

On Mon, Feb 14, 2022 at 12:32 PM Ryan Blue  wrote:

> +1 (binding)
>
> * Ran license checks, verified checksum and signature
> * Built the project
>
> Thanks, Amogh and Jack for managing this release!
>
> On Sun, Feb 13, 2022 at 10:22 PM Jack Ye  wrote:
>
>> +1 (binding)
>>
>> verified signature, checksum, license. The checksum was generated using
>> the old buggy release script because it was executed in the 0.13.x branch
>> so it still used the full file path. I have updated it to use the relative
>> file path. In case anyone sees checksum failure, please re-download the
>> checksum file and verify again.
>>
>> Ran unit tests for all engine versions and JDK versions, AWS Integration
>> tests. For the Spark flaky test, given #4033 fixes the issue and it was not
>> a bug of the source code, I think we can continue without re-cut a
>> candidate.
>>
>> Tested basic operations, copy-on-write delete, update and rewrite data
>> files on AWS EMR Spark 3.1 Flink 1.14 and verified fixes #3986 and #4024.
>>
>> I did some basic tests for #4023 (the predicate pushdown fix) but I don't
>> have a large Spark 3.2 installation to further verify the performance. It
>> would be great if anyone else could do some additional verifications.
>>
>> Best,
>> Jack Ye
>>
>> On Fri, Feb 11, 2022 at 8:24 PM Manong Karl  wrote:
>>
>>> It's  flaky. This exception is only found in one agent of TeamCity.
>>> Changing agents will resolve the issue.
>>>
>>> Ryan Blue  于2022年2月12日周六 08:57写道:
>>>
>>>> Does that exception fail consistently, or is it a flaky test? We
>>>> recently fixed another Spark test that was flaky because of sampling and
>>>> sort order: https://github.com/apache/iceberg/pull/4033
>>>>
>>>> On Thu, Feb 10, 2022 at 7:12 PM Manong Karl 
>>>> wrote:
>>>>
>>>>> I got an issue failed on spark 3.2
>>>>> TestMergeOnReadDelete.testDeleteWithSerializableIsolation[catalogName =
>>>>> testhive, implementation = org.apache.iceberg.spark.SparkCatalog, config =
>>>>> {type=hive, default-namespace=default}, format = orc, vectorized = true,
>>>>> distributionMode = none] · Issue #4090 · apache/iceberg (github.com)
>>>>> <https://github.com/apache/iceberg/issues/4090>.
>>>>> Is it just my exception?
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>
>
> --
> Ryan Blue
> Tabular
>


Re: [DISCUSS] Align the spark runtime artifact names among spark2.4, spark3.0, spark3.1 and spark3.2

2022-02-20 Thread Kyle Bendickson
Thanks for bringing this up Jeff!

Normally I'd agree that it's not good practice to change an artifact name.
However, in this case, the artifact has effectively changed already. The
“spark3-runtime” artifact used to cover all versions of Spark 3 (at the time,
Spark 3.0 and 3.1). It no longer does, as it's only tested / used with Spark 3.0.

I encounter many users who have upgraded to newer versions of Spark but
have not switched to the new Spark-versioned artifact names, since
“spark3-runtime” sounds like it encompasses all Spark 3 versions. They then
hit subtle bugs, and solving an upgrade that way is not a great user
experience.

These users are, however, updating the Iceberg artifact to the new versions.

So I think in this case, breaking the naming has benefits. When users go to
upgrade after a new Iceberg version is released and their dependency is not
found, they will hopefully check Maven and see the new naming convention /
artifacts.

So I support option 2 also, with names that include the Spark and Scala versions.
Otherwise, we continue to see people using the old “spark3-runtime” as they
upgrade Spark versions and encounter subtle errors (class not found, wrong
type signatures due to version mismatch).
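
To make the change concrete, here is a hedged sketch of what the dependency
swap would look like in a Gradle build (the renamed Spark 3.0 coordinate is an
assumed example following the 3.1/3.2 pattern, not a published artifact):

```groovy
// Old, ambiguous coordinate -- nothing in the name says which Spark 3.x it targets:
dependencies {
    implementation "org.apache.iceberg:iceberg-spark3-runtime:0.13.1"
}

// Proposed convention -- Spark and Scala versions are explicit in the artifact name
// (the 3.0 coordinate below is an assumption for illustration):
dependencies {
    implementation "org.apache.iceberg:iceberg-spark-runtime-3.0_2.12:0.13.1"
}
```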

Users eventually have to update their pom if / when they upgrade Spark,
due to incompatibility. This way at least the breakage will be loud, since
there won't be a new Iceberg version published under the old artifact name.

Is it possible to mark the old spark3-runtime / spark-runtime artifacts as
deprecated or otherwise point to the new artifacts in Maven?

- Kyle

On Sun, Feb 20, 2022 at 9:41 PM Jeff Zhang  wrote:

> I don't think it is best practice to just change the artifact name of
> published jars. Unless we publish a new version with the new naming
> convention.
>
> On Mon, Feb 21, 2022 at 12:36 PM Jack Ye  wrote:
>
>> I think option 2 is ideal, but I don't know if there is any hard
>> requirement from ASF/Maven Central side for us to keep backwards
>> compatibility of package names published in maven. If there is a
>> requirement then we cannot change it.
>>
>> As a mitigation, I stated in
>> https://iceberg.apache.org/multi-engine-support that Spark 2.4 and 3.0
>> jar names do not follow the naming convention of newer versions for
>> backwards compatibility.
>>
>> Best,
>> Jack Ye
>>
>> On Sun, Feb 20, 2022 at 7:03 PM OpenInx  wrote:
>>
>>> Hi everyone
>>>
>>> The current spark2.4, spark3.0 have the following unaligned runtime
>>> artifact names:
>>>
>>> # Spark 2.4
>>> iceberg-spark-runtime-0.13.1.jar
>>> # Spark 3.0
>>> iceberg-spark3-runtime-0.13.1.jar
>>> # Spark 3.1
>>> iceberg-spark-runtime-3.1_2.12-0.13.1.jar
>>> # Spark 3.2
>>> iceberg-spark-runtime-3.2_2.12-0.13.1.jar
>>>
>>> From the spark 3.1 and spark 3.2's runtime artifact names, we can easily
>>> recognize:
>>> 1. What's the spark major version that the runtime jar is attached to
>>> 2. What's the spark scala version that the runtime jar is compiled with
>>>
>>> But for spark 3.0 and spark 2.4,  it's not easy to infer the above
>>> information.  I think we kept those legacy names because they were
>>> introduced in older iceberg releases and we wanted to avoid changing the
>>> modules that users depend on, so we opted not to rename; but they do
>>> cause confusion for new community users.
>>>
>>> In general,   we have two options:
>>>
>>> Option#1:  keep the current artifact names, meaning spark 2.4 & spark
>>> 3.0 will always use iceberg-spark-runtime-<version>.jar and
>>> iceberg-spark3-runtime-<version>.jar until they get retired in the
>>> apache iceberg official repo.
>>> Option#2:  change the spark2.4 & spark3.0 artifact names to the generic
>>> name format:
>>> iceberg-spark-runtime-<sparkVersion>_<scalaVersion>-<version>.jar.
>>> This gives a consistent name format across all the spark versions.
>>>
>>> Personally, I'd prefer option#2 because that looks more friendly for new
>>> community users (although it will require the old users to change their
>>> pom.xml to the new version).
>>>
>>> What is your preference ?
>>>
>>> Reference:
>>> 1.  Created a PR to change the artifact names and we had few discussions
>>> there. https://github.com/apache/iceberg/pull/4158
>>> 2.  https://github.com/apache/iceberg-docs/pull/27#discussion_r800297155
>>>
>>
>
> --
> Best Regards
>
> Jeff Zhang
>


Re: [DISCUSS] The correct approach to estimate the byte size for an unclosed ORC writer.

2022-03-03 Thread Kyle Bendickson
Hi Openinx.

Thanks for bringing this to our attention. And many thanks to hiliwei for
their willingness to tackle big problems and little problems.

I wanted to say that almost anything reasonably close would likely be
better than the current situation (where the feature is disabled entirely).

Thank you for your succinct summary of the situation. I tagged Dongjoon
Hyun, one of the ORC VPs, in the PR and will reach out to him as well.

I am inclined to agree that we need to consider the width of the types, as
fields like binary or even string can be potentially quite wide compared to
int.

I like your suggestion to use an “average width” together with the batch
size, though subtracting the batch size from the average width seems
slightly off… I would think the average width needs to be multiplied or
divided by the batch size. Possibly I’m not understanding fully.

How would you propose to get an “average width”, for use with the data
that’s not been flushed to disk yet? And would it be an average width based
on the actually observed data or just on the types?
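
To make the question concrete, here is one possible shape of an
observed-data approach — purely a sketch with hypothetical names, not a
proposal for Iceberg's actual API:

```java
// Hypothetical sketch: track the average width of rows actually observed,
// then size the unflushed rows as avgWidth * pendingRowCount.
public class AvgRowWidthSketch {
    private long totalBytes = 0;
    private long rows = 0;

    // Called once per row with that row's (approximate) serialized width.
    public void record(long rowBytes) {
        totalBytes += rowBytes;
        rows++;
    }

    // Estimated bytes for rows buffered in the batch but not yet encoded.
    public long estimatePendingBytes(long pendingRows) {
        long avgWidth = rows == 0 ? 16 : totalBytes / rows; // 16 is an arbitrary default
        return avgWidth * pendingRows;
    }

    public static void main(String[] args) {
        AvgRowWidthSketch s = new AvgRowWidthSketch();
        s.record(10);
        s.record(30);
        System.out.println(s.estimatePendingBytes(5)); // avg width 20 -> 100
    }
}
```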

Again, I think that any approach is better than none, and we can iterate on
the statistics collection. But I am inclined to agree, points (1) and (2)
seem ok. And it would be beneficial to consider the points raised regarding
(3).

Thanks for bringing this to the dev list.

And many thanks to hiliwei for their work so far!

- Kyle

On Thu, Mar 3, 2022 at 8:01 PM OpenInx  wrote:

> Hi Iceberg dev
>
> As we all know,  in our current apache iceberg write path,  the ORC file
> writer cannot just roll over to a new file once its byte size reaches the
> expected threshold.  The core reason we didn't support this before is the
> lack of a correct approach to estimate the byte size of an unclosed
> ORC writer.
>
> In this PR: https://github.com/apache/iceberg/pull/3784,  hiliwei is
> trying to propose an estimate approach to fix this fundamentally (Also
> enabled all those ORC writer unit tests that we disabled intentionally
> before).
>
> The approach is:  If a file is still unclosed, we estimate its size in
> three steps ( PR:
> https://github.com/apache/iceberg/pull/3784/files#diff-e7fcc622bb5551f5158e35bd0e929e6eeec73717d1a01465eaa691ed098af3c0R107
> )
>
> 1. Size of data that has been written to stripes. The value is obtained by
> summing the offset and length of the last stripe of the writer.
> 2. Size of data that has been submitted to the writer but has not been
> written to the stripe. When creating OrcFileAppender, treeWriter is
> obtained through reflection, and uses its estimateMemory to estimate how
> much memory is being used.
> 3. Data that has not been submitted to the writer, that is, the size of
> the buffer. The maximum default value of the buffer is used here.
>
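The three steps above reduce to a simple sum; a minimal sketch (the
parameter names are illustrative, not Iceberg's or ORC's actual API):

```java
// Hypothetical sketch of the three-part size estimate for an unclosed ORC file.
public class OrcSizeEstimateSketch {
    public static long estimateLength(
            long lastStripeOffset,      // part 1: offset of the last written stripe
            long lastStripeLength,      // part 1: length of the last written stripe
            long treeWriterMemory,      // part 2: TreeWriter#estimateMemory() result
            long maxBatchBufferBytes) { // part 3: upper bound for the unsubmitted batch
        long persisted = lastStripeOffset + lastStripeLength;
        return persisted + treeWriterMemory + maxBatchBufferBytes;
    }

    public static void main(String[] args) {
        // e.g. 4096 + 1024 persisted, 512 encoded in memory, 256 buffered
        System.out.println(estimateLength(4096, 1024, 512, 256)); // 5888
    }
}
```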
> My feeling is:
>
> For the file-persisted bytes, I think using the last stripe's offset plus
> its length should be correct. For the memory encoded batch vector , the
> TreeWriter#estimateMemory should be okay.
> But for the batch vector whose rows did not flush to encoded memory, using
> the batch.size shouldn't be correct. Because the rows can be any data type,
> such as Integer, Long, Timestamp, String etc. As their widths are not the
> same, I think we may need to use an average width minus the batch.size
> (which is row count actually).
>
> Another thing is about the `TreeWriter#estimateMemory` method,  The
> current `org.apache.orc.Writer`  don't expose the `TreeWriter` field or
> `estimateMemory` method to public,  I will suggest to publish a PR to
> apache ORC project to expose those interfaces in `org.apache.orc.Writer` (
> see: https://github.com/apache/iceberg/pull/3784/files#r819238427 )
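Since `org.apache.orc.Writer` doesn't expose `TreeWriter` or
`estimateMemory`, the reflective access looks roughly like the following —
demonstrated here against stand-in classes, since ORC's real `WriterImpl`
internals aren't modeled; the field and method names mirror the assumed ORC
internals, which is exactly why this is fragile and why exposing the method
publicly in ORC would be better:

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;

// Demonstrates the reflection pattern described above. FakeWriter and
// FakeTreeWriter stand in for ORC's WriterImpl and TreeWriter.
public class ReflectionEstimateSketch {
    static class FakeTreeWriter {
        long estimateMemory() { return 2048L; }
    }

    static class FakeWriter {
        private final FakeTreeWriter treeWriter = new FakeTreeWriter();
    }

    static long estimateMemory(Object writer) {
        try {
            // Assumed field name "treeWriter", per ORC's internal layout.
            Field f = writer.getClass().getDeclaredField("treeWriter");
            f.setAccessible(true);
            Object treeWriter = f.get(writer);
            Method m = treeWriter.getClass().getDeclaredMethod("estimateMemory");
            m.setAccessible(true);
            return (Long) m.invoke(treeWriter);
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(estimateMemory(new FakeWriter())); // 2048
    }
}
```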
>
> I'd like to invite the iceberg dev to evaluate the current approach.  Is
> there any other concern from the ORC experts' side ?
>
> Thanks.
>


Re: Welcome Szehon Ho as a committer!

2022-03-11 Thread Kyle Bendickson
Congratulations. Szehon!

Well deserved!

On Fri, Mar 11, 2022 at 4:06 PM Steven Wu  wrote:

> Congrat, Szehon!
>
> On Fri, Mar 11, 2022 at 4:05 PM Chao Sun  wrote:
>
>> Congratulations Szehon!
>>
>> On Fri, Mar 11, 2022 at 4:01 PM OpenInx  wrote:
>> >
>> > Congrats Szehon!
>> >
>> > On Sat, Mar 12, 2022 at 7:55 AM Steve Zhang 
>> > 
>> wrote:
>> >>
>> >> Congratulations Szehon, Well done!
>> >>
>> >> Thanks,
>> >> Steve Zhang
>> >>
>> >>
>> >>
>> >> On Mar 11, 2022, at 3:51 PM, Jack Ye  wrote:
>> >>
>> >> Congratulations Szehon!!
>> >>
>> >> -Jack
>> >>
>> >> On Fri, Mar 11, 2022 at 3:45 PM Wing Yew Poon
>>  wrote:
>> >>>
>> >>> Congratulations Szehon!
>> >>>
>> >>>
>> >>> On Fri, Mar 11, 2022 at 3:42 PM Sam Redai  wrote:
>> 
>>  Congrats Szehon!
>> 
>>  On Fri, Mar 11, 2022 at 6:41 PM Yufei Gu 
>> wrote:
>> >
>> > Congratulations Szehon!
>> > Best,
>> >
>> > Yufei
>> >
>> > `This is not a contribution`
>> >
>> >
>> > On Fri, Mar 11, 2022 at 3:36 PM Ryan Blue  wrote:
>> >>
>> >> Congratulations Szehon!
>> >>
>> >> Sorry I accidentally preempted this announcement with the board
>> report!
>> >>
>> >> On Fri, Mar 11, 2022 at 3:32 PM Anton Okolnychyi
>>  wrote:
>> >>>
>> >>> Hey everyone,
>> >>>
>> >>> I would like to welcome Szehon Ho as a new committer to the
>> project!
>> >>>
>> >>> Thanks for all your work, Szehon!
>> >>>
>> >>> - Anton
>> >>
>> >>
>> >>
>> >> --
>> >> Ryan Blue
>> >> Tabular
>> >>
>> >>
>>
>


Re: [discuss] keep the commit history when adding a new engine version

2022-05-13 Thread Kyle Bendickson
I agree this is a good point.

The git history is not retained when we port the way we currently do.

So +1. As I understand it, the latest version will generally be the one
that carries the most git commit history, and we'd only occasionally need
to look back for changes that occurred in some other version.
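
The rename-then-copy flow can be sketched as follows (the spark/vX.Y
directory layout is illustrative; the demo runs in a scratch repo so it's
self-contained):

```shell
set -e
# Scratch repo standing in for the iceberg checkout.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email dev@example.com
git config user.name dev
mkdir -p spark/v3.2
echo 'spark 3.2 build' > spark/v3.2/build.gradle
git add spark
git commit -qm 'spark 3.2 support'

# 1. Rename v3.2 -> v3.3 so the commit history follows the newest version
#    (git log --follow on spark/v3.3 files traces back through v3.2).
git mv spark/v3.2 spark/v3.3
# 2. Re-create v3.2 as a plain copy of v3.3.
cp -r spark/v3.3 spark/v3.2
git add spark/v3.2
# 3. Now edit the files under spark/v3.3 to compile against Spark 3.3.

git status --short
```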

Thanks Liwei!

- Kyle

On Fri, May 13, 2022 at 12:10 AM Rajarshi Sarkar 
wrote:

> +1, good point.
>
> Regards,
> Rajarshi Sarkar
>
>
> On Fri, May 13, 2022 at 9:03 AM Reo Lei  wrote:
>
>> That is great ! +1 for this.
>>
>> liwei li  于2022年5月13日周五 11:12写道:
>>
>>> Correct a clerical error:
>>>
>>> 3. Modify the v3.3 files to make them work for spark 3.3 correctly.
>>>
>>> Liwei Li
>>> --
>>> *From:* Steven Wu 
>>> *Sent:* Friday, May 13, 2022 11:06:26 AM
>>> *To:* dev@iceberg.apache.org 
>>> *Subject:* Re: [discuss] keep the commit history when adding a new
>>> engine version
>>>
>>> This is a good point. +1 for the proposal.
>>>
>>> On Thu, May 12, 2022 at 7:46 PM liwei li  wrote:
>>>
>>> Hi, guys
>>> When we want to add support for a new version of an engine, simply
>>> copying files from an old version to a new directory will cause git commit
>>> history to be lost, making it difficult to find file change records: we
>>> can only go look for changes in the old path, but we don't know which
>>> one that is, and the old path may have been deleted. Is there a better
>>> way to keep the history?
>>> I recommend that we first rename the old version to the new one, and
>>> then make a new copy as the old version.
>>> For example, if we want to add Spark 3.3, we can do the following:
>>> 1. Change the path of version 3.2 from v3.2 to v3.3
>>> 2. Create a copy of v3.2 from v3.3
>>> 3. Modify the v3.2 file to make it work for spark 3.3 correct.
>>> What do you think of the above? Or is it necessary? Or if there is
>>> another better way?
>>> Thank you.
>>>
>>> Liwei Li
>>> hilili...@gmail.com
>>>
>>>


Re: [VOTE] Release Apache Iceberg 0.13.2 RC1

2022-06-05 Thread Kyle Bendickson
Thanks Eduard!

I have:
- verified the signature
- verified the checksum in the file given as well as of the artifact
- ran all unit tests on Java 11, all passed
- ran all unit tests on Java 8, some hive-3 tests consistently fail (I do
notice they passed on Github - but the tests which fail are consistent
despite giving the JVM more memory and checking for OOM)
- ran a simple smoke test suite of CRUD on namespaces and v1 and v2 tables
with Spark (3.2, 3.1) and Flink (1.13 and 1.14).
- ran some upsert related tests on Flink 1.13 and 1.14 (1.12 is provided a
deprecation notice)

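For anyone reproducing the checksum step, it boils down to the following
(illustrated with a local stand-in file — the real .tar.gz and .sha512 come
from the dist.apache.org URL in the vote email):

```shell
set -e
cd "$(mktemp -d)"
# Stand-in for the downloaded release tarball.
echo 'release contents' > apache-iceberg-0.13.2.tar.gz
# The project publishes a .sha512 next to the tarball; recreate one locally.
sha512sum apache-iceberg-0.13.2.tar.gz > apache-iceberg-0.13.2.tar.gz.sha512
# -c reads the checksum file and reports OK / FAILED per entry.
sha512sum -c apache-iceberg-0.13.2.tar.gz.sha512
```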
*Problems:*
I did notice that the *given commit ID is in a detached state (and I
wasn't able to check it out).* I am running my tests by using the provided
JAR with engines and then running unit tests locally for the commit just
prior (with commit ID *fae977a9f0a79266a04647b0df2ab540cf0dcff4*).

Not sure if this is a huge issue, but outside of this detached commit, my
only concern is the `iceberg-hive3` failing tests, though as they passed in
CI it's possibly an issue with my local setup.

Running the hive-3 test suite alone, the same tests failed multiple times,
but again that might be something to do with my computer / JVM configuration.

*I am -1 (non-binding)*, primarily based on the detached commit (as I had
quite a good bit of trouble trying to fetch it through my normal processes)
as well as the failing hive3 tests (though that's not exactly within my
area of expertise).

If the hive3 test failures are only something that occurs for me, then if
we fix the "Add version.txt" commit in branch 0.13.x such that when I fetch
branch 0.13.x it's present, I'd be +1. Unfortunately, I can't help with
cleaning up the release branch beyond advising somebody else (if
desired), but I'm happy to help with that.

The hive3 test failures for me seem to be OOM related, but I raised my

Find attached a picture of the detached commit ID,
*0784d64a659abd4fdaa82cdb599a250a7514facf*, per Github.

[image: image.png]

Example test failures
org.apache.iceberg.mr.hive.TestHiveIcebergStorageHandlerWithEngine >
testCBOWithSelectedColumnsOverlapJoin[fileFormat=AVRO, engine=tez,
catalog=HIVE_CATALOG, isVectorized=false] FAILED
java.lang.IllegalArgumentException: Failed to execute Hive query
'SELECT c.first_name, o.order_id FROM default.orders o JOIN
default.customers c ON o.customer_id = c.customer_id ORDER BY o.order_id
DESC': Error while processing statement: FAILED: Execution Error, return
code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask
at
org.apache.iceberg.mr.hive.TestHiveShell.executeStatement(TestHiveShell.java:152)
at
org.apache.iceberg.mr.hive.TestHiveIcebergStorageHandlerWithEngine.testCBOWithSelectedColumnsOverlapJoin(TestHiveIcebergStorageHandlerWithEngine.java:236)

Caused by:
org.apache.hive.service.cli.HiveSQLException: Error while
processing statement: FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.tez.TezTask
at
org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:335)
at
org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:226)
at
org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:263)
at
org.apache.hive.service.cli.operation.Operation.run(Operation.java:247)
at
org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:541)
at
org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:510)
at
org.apache.hive.service.cli.CLIService.executeStatement(CLIService.java:267)
at
org.apache.iceberg.mr.hive.TestHiveShell.executeStatement(TestHiveShell.java:139)
... 1 more

Thank you for working on this,
Kyle




On Wed, Jun 1, 2022 at 11:12 PM Eduard Tudenhoefner 
wrote:

> Hi Everyone,
>
> I propose that we release the following RC as the official Apache Iceberg
> 0.13.2 release.
>
> The commit ID is *0784d64a659abd4fdaa82cdb599a250a7514facf*
>
>
>- This corresponds to the tag: *apache-iceberg-0.13.2-rc1*
>- https://github.com/apache/iceberg/commits/apache-iceberg-0.13.2-rc1
>-
>
> https://github.com/apache/iceberg/tree/0784d64a659abd4fdaa82cdb599a250a7514facf
>
>
> The release tarball, signature, and checksums are here:
>
>-
>https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.13.2-rc1
>
>
> You can find the KEYS file here:
>
>- https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>
>
> Convenience binary artifacts are staged on Nexus. The Maven repository URL
> is:
>
>-
>https://repository.apache.org/content/repositories/orgapacheiceberg-1088/
>
>
> Please download, verify, and test.
>
> Please vote in the next 72 hours.
>
> [ ] +1 Release this as Apache Iceberg 
> [ ] +0
> [ ] -1 Do not release this because...
>


Re: [VOTE] Release Apache Iceberg 0.13.2 RC1

2022-06-05 Thread Kyle Bendickson
Update:

Running in IntelliJ the test suite that was (and still is) consistently
failing via the CLI, the issue seems to be resolved.
So I do think it is indeed a local JVM setup issue.

Investigating the differences now, but the class in question is
*org.apache.iceberg.mr.hive.TestHiveIcebergStorageHandlerWithEngine*

It seems to be caused by a NoClassDefFoundError, specifically for
org.xerial.snappy.Snappy. It also happens for ORC, but not for parquet.

Included is a sample output:
```
java.lang.NoClassDefFoundError: Could not initialize class
org.xerial.snappy.Snappy
at org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:99)
~[snappy-java-1.1.8.jar:1.1.8]
at org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:91)
~[snappy-java-1.1.8.jar:1.1.8]
at org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:81)
~[snappy-java-1.1.8.jar:1.1.8]
at
org.apache.tez.common.TezUtils.createByteStringFromConf(TezUtils.java:81)
~[tez-api-0.10.1.jar:0.10.1]
```

Apologies for speaking too soon. *I'm now +0 [non-binding]* provided we fix
the 0.13.x branch and associated commitId to not be in a detached state.
The tag *apache-iceberg-0.13.2-rc1 *works just fine, but the 0.13.x branch
doesn't have the commit ID in question. Not sure if that's a major concern
or not.

Cheers,
Kyle

On Sun, Jun 5, 2022 at 11:51 AM Kyle Bendickson  wrote:

> Thanks Eduard!
>
> I have:
> - verified the signature
> - verified the checksum in the file given as well as of the artifact
> - ran all unit tests on Java 11, all passed
> - ran all unit tests on Java 8, some hive-3 tests consistently fail (I do
> notice they passed on Github - but the tests which fail are consistent
> despite giving the JVM more memory and checking for OOM)
> - ran a simple smoke test suite of CRUD on namespaces and v1 and v2 tables
> with Spark (3.2, 3.1) and Flink (1.13 and 1.14).
> - ran some upsert related tests on Flink 1.13 and 1.14 (1.12 is provided a
> deprecation notice)
>
> *Problems:*
> I did notice that the *given commit ID is considered unattached (and I
> wasn't able to check it out).* I am running my tests by using the
> provided JAR with engines and then running unit tests locally for the
> commit just prior (with commit ID
> *fae977a9f0a79266a04647b0df2ab540cf0dcff4*).
>
> Not sure if this is a huge issue, but outside of this unattached commit,
> my only concern is the `iceberg-hive3` failing tests, but as they passed in
> CI it's possibly an issue with my local setup locally.
>
> Running hive-3 test suite alone, the same tests failed multiple times but
> again might be something to do with my computer / JVM configuration.
>
> *I am -1 (non-binding)*, primarily based on the detached commit (as I had
> quite a good bit of trouble trying to fetch it through my normal processes)
> as well as the failing hive3 tests (though that's not exactly within my
> area of expertise).
>
> If the hive3 test failures are only something that occurs for me, then if
> we fix the "Add version.txt commit" in branch 0.3.x such that when I fetch
> branch 0.3.x it's present, I'd be +1. Unfortunately, I can't help with
> cleaning up with the release branch outside of advising somebody else (if
> desired), but I'm happy to help with that.
>
> The hive3 test failures for me seem to be OOM related, but I raised my
>
> Find attached a picture of the detached commit ID,
> *0784d64a659abd4fdaa82cdb599a250a7514facf*, per Github.
>
> [image: image.png]
>
> Example test failures
> org.apache.iceberg.mr.hive.TestHiveIcebergStorageHandlerWithEngine >
> testCBOWithSelectedColumnsOverlapJoin[fileFormat=AVRO, engine=tez,
> catalog=HIVE_CATALOG, isVectorized=false] FAILED
> java.lang.IllegalArgumentException: Failed to execute Hive query
> 'SELECT c.first_name, o.order_id FROM default.orders o JOIN
> default.customers c ON o.customer_id = c.customer_id ORDER BY o.order_id
> DESC': Error while processing statement: FAILED: Execution Error, return
> code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask
> at
> org.apache.iceberg.mr.hive.TestHiveShell.executeStatement(TestHiveShell.java:152)
> at
> org.apache.iceberg.mr.hive.TestHiveIcebergStorageHandlerWithEngine.testCBOWithSelectedColumnsOverlapJoin(TestHiveIcebergStorageHandlerWithEngine.java:236)
>
> Caused by:
> org.apache.hive.service.cli.HiveSQLException: Error while
> processing statement: FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.tez.TezTask
> at
> org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:335)
> at
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:226)
>  

Re: 【Feature】Request support for c++ sdk

2022-06-05 Thread Kyle Bendickson
Hi caneGuy,

I personally don’t dislike this idea. I understand the performance benefits.

But this would be a huge undertaking for the community. We’d need to ensure
we had sufficient developer support for reviews (likely one of the biggest
issues), as well as a number of other things. Particularly dependencies,
package management, etc. We’d also need to scope support down to specific
OS / compilers etc.

We’d also need to be sure we had adequate developer support from a wide
enough range of the community to support the project long term. One issue
in open source is that developers will work on something tangential to
their project in another repository, but nobody is available to maintain it.

There’s also the question of how useful this would be in practice given the
complexity of using C++ (or Rust etc) within some of the major frameworks.

Again, I’m not opposed to the idea but just trying to be realistic about
the realities of such an undertaking. It would need full community support
(or at least support from enough community members to be sustainable).

If you wanted to make a design doc, the milestones tab in the Iceberg
project has some that you might use as reference.

*I highly suggest you come to the next community sync and bring this up to
the community then.*

If you’re not already on the invite list for the monthly community sync,
you can get on it by joining the Google group. You'll receive invites when
they go out:
https://groups.google.com/g/iceberg-sync

Looking forward to seeing you at the next community sync.

A design document and/or any prior art would be very helpful as the
community sync does discuss many topics (possibly there is existing C++
support in StarRocks for Iceberg V1?).

Thank you,
Kyle Bendickson
GitHub: kbendick

On Sun, Jun 5, 2022 at 10:44 PM Sam Redai  wrote:

> Currently there is no existing effort to develop a C++ package. That being
> said I think it would be awesome to have one! If anyone is willing to start
> that development effort, I can help with some of the ground work to
> kickstart it.
>
> I would say the first step would be for someone to prepare a high-level
> proposal.
>
> -Sam
>
> On Sun, Jun 5, 2022 at 11:02 PM 周康  wrote:
>
>> Hi team
>> I am a dev from StarRocks community, and we have supported iceberg v1
>> format.
>> We are also planning to support v2 format. If there is a C++ package, it
>> will be very convenient for our implementation.
>> At the same time, other c++ computing engines support v2 format will also
>> be faster.
>>
>> Do we have plans to support c++ version sdk?
>> --
>> caneGuy
>>
> --
>
> Sam Redai 
>
> Developer Advocate  |  Tabular <https://tabular.io/>
>
> c (267) 226-8606
>


Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes

2022-06-12 Thread Kyle Bendickson
+1 [non-binding]

Thank you Piotr for all of the work you’ve put into this.

This should greatly benefit not only Iceberg on Trino, but hopefully can be
used in many novel ways due to its well thought out generic design and
incorporation of the ability to extend with new sketches.

Looking forward to the improvements this will bring.

- Kyle

On Fri, Jun 10, 2022 at 1:47 PM Alexander Jo 
wrote:

> +1, let's do it!
>
> On Fri, Jun 10, 2022 at 2:47 PM John Zhuge  wrote:
>
>> +1  Looking forward to the features it enables.
>>
>> On Fri, Jun 10, 2022 at 10:11 AM Yufei Gu  wrote:
>>
>>> +1. Looking forward to the partition stats.
>>> Best,
>>>
>>> Yufei
>>>
>>>
>>> On Thu, Jun 9, 2022 at 6:32 PM Daniel Weeks  wrote:
>>>
 +1 as well.  Excited about the progress here.

 -Dan

 On Thu, Jun 9, 2022, 6:25 PM Junjie Chen 
 wrote:

> +1, really nice! Indexes are coming!
>
> On Fri, Jun 10, 2022 at 8:04 AM Szehon Ho 
> wrote:
>
>> +1, it's an exciting step for Iceberg, look forward to all the new
>> statistics and secondary indices it will allow.
>>
>> Had a few questions of what the reference to Puffin file(s) will be
>> in the Iceberg spec, but it's orthogonal to Puffin file format itself.
>>
>> Thanks,
>> Szehon
>>
>> On Thu, Jun 9, 2022 at 3:32 PM Ryan Blue  wrote:
>>
>>> +1 from me!
>>>
>>> There may also be people that haven't followed the design
>>> discussions and we can start a DISCUSS thread if needed. But if 
>>> everyone is
>>> comfortable with the design and implementation, I think it's ready for a
>>> vote as well.
>>>
>>> Huge thanks to Piotr for getting this ready! I think the format is
>>> going to be really useful for both stats and indexes in Iceberg.
>>>
>>> On Thu, Jun 9, 2022 at 3:35 AM Piotr Findeisen <
>>> pi...@starburstdata.com> wrote:
>>>
 Hi Everyone,

 I propose that we adopt Puffin file format as a file format for
 statistics and indexes in Iceberg tables.

 Puffin file format specification:
 https://github.com/apache/iceberg/blob/master/format/puffin-spec.md
 (previous discussions:  https://github.com/apache/iceberg/pull/4944
 , https://github.com/apache/iceberg-docs/pull/69)

 Intend use:
 * statistics in Iceberg tables (see
 https://github.com/apache/iceberg/pull/4945 and associated
 proposed implementation https://github.com/apache/iceberg/pull/4741
 )
 * in the future: storage for secondary indexes

 Puffin file reader and writer implementation:
 https://github.com/apache/iceberg/pull/4537

 Thanks,
 PF


>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>
> --
> Best Regards
>

>>
>> --
>> John Zhuge
>>
>


Re: 【Feature】Request support for c++ sdk

2022-06-12 Thread Kyle Bendickson
My preferred approach is to make
>> a
>> > > clean room Avro parser.  But I agree this is a non-trivial effort to
>> get
>> > > underway.
>> > >
>> > > Another area to consider is compatibility testing.  I think before a
>> third
>> > > officially supported community library is introduced it would be good
>> to
>> > > have a compatibility framework in place to make sure implementations
>> are
>> > > all interpreting the specification correctly.  If there isn't already
>> an
>> > > effort here, I'd like to start contributing something (probably will
>> have
>> > > bandwidth sometime place in Q3).
>> > >
>> > > Thanks,
>> > > -Micah
>> > >
>> > >
>> > > [1] https://arrow.apache.org/docs/cpp/dataset.html
>> > >
>> > > On Sun, Jun 5, 2022 at 11:07 PM Kyle Bendickson 
>> wrote:
>> > >
>> > >> Hi caneGuy,
>> > >>
>> > >> I personally don’t dislike this idea. I understand the performance
>> > >> benefits.
>> > >>
>> > >> But this would be a huge undertaking for the community. We’d need to
>> > >> ensure we had sufficient developer support for reviews (likely one
>> of the
>> > >> biggest issues), as well as a number of other things. Particularly
>> > >> dependencies, package management, etc. We’d also need to scope
>> support down
>> > >> to specific OS / compilers etc.
>> > >>
>> > >> We’d also need to be sure we had adequate developer support from a
>> wide
>> > >> enough range of the community to support the project long term. One
>> issue
>> > >> in open source is that developers will work on something tangential
>> to
>> > >> their project in another repository, but nobody is available to
>> maintain it.
>> > >>
>> > >> There’s also the question of how useful this would be in practice
>> given
>> > >> the complexity of using C++ (or Rust etc) within some of the major
>> > >> frameworks.
>> > >>
>> > >> Again, I’m not opposed to the idea but just trying to be realistic
>> about
>> > >> the realities of such an undertaking. It would need full community
>> support
>> > >> (or at least support from enough community members to be
>> sustainable).
>> > >>
>> > >> If you wanted to make a design doc, the milestones tab in the Iceberg
>> > >> project has some that you might use as reference.
>> > >>
>> > >> *I highly suggest you come to the next community sync and bring this
>> up
>> > >> to the community then.*
>> > >>
>> > >> If you’re not already on the invite list for the monthly community
>> sync,
>> > >> you can get on it by joining the Google group. You’ll receive
>> invites when
>> > >> they go out:
>> > >> https://groups.google.com/g/iceberg-sync
>> > >>
>> > >> Looking forward to seeing you at the next community sync.
>> > >>
>> > >> A design document and/or any prior art would be very helpful as the
>> > >> community sync does discuss many topics (possibly there is existing
>> C++
>> > >> support in StarRocks for Iceberg V1?).
>> > >>
>> > >> Thank you,
>> > >> Kyle Bendickson
>> > >> GitHub: kbendick
>> > >>
>> > >> On Sun, Jun 5, 2022 at 10:44 PM Sam Redai  wrote:
>> > >>
>> > >>> Currently there is no existing effort to develop a C++ package. That
>> > >>> being said I think it would be awesome to have one! If anyone is
>> willing to
>> > >>> start that development effort, I can help with some of the ground
>> work to
>> > >>> kickstart it.
>> > >>>
>> > >>> I would say the first step would be for someone to prepare a
>> high-level
>> > >>> proposal.
>> > >>>
>> > >>> -Sam
>> > >>>
>> > >>> On Sun, Jun 5, 2022 at 11:02 PM 周康 
>> wrote:
>> > >>>
>> > >>>> Hi team
>> > >>>> I am a dev from StarRocks community, and we have supported iceberg
>> v1
>> > >>>> format.
>> > >>>> We are also planning to support v2 format. If there is a C++
>> package,
>> > >>>> it will be very convenient for our implementation.
>> > >>>> At the same time, other c++ computing engines support v2 format
>> will
>> > >>>> also be faster.
>> > >>>>
>> > >>>> Do we have plans to support c++ version sdk?
>> > >>>> --
>> > >>>> caneGuy
>> > >>>>
>> > >>> --
>> > >>>
>> > >>> Sam Redai 
>> > >>>
>> > >>> Developer Advocate  |  Tabular <https://tabular.io/>
>> > >>>
>> > >>> c (267) 226-8606
>> > >>>
>> > >>
>> >
>>
>

-- 

Kyle Bendickson

OSS Developer  |  Tabular <https://tabular.io/>

k...@tabular.io


Re: [ANNOUNCE] Apache Iceberg release 0.13.2

2022-06-15 Thread Kyle Bendickson
Exciting achievement! Congratulations to the hard-working Iceberg community
for reporting and fixing bugs, as well as the ongoing work on the upcoming
major release!

Many thanks to Eduard for being release manager and to Russell for
providing his keys and support!

On Wed, Jun 15, 2022 at 2:44 PM Steven Wu  wrote:

> Congrats! Thanks Eduard for being the release manager!
>
> On Wed, Jun 15, 2022 at 11:42 AM Eduard Tudenhoefner 
> wrote:
>
>> I'm pleased to announce the release of Apache Iceberg *0.13.2*!
>>
>> Apache Iceberg is an open table format for huge analytic datasets. Iceberg
>> delivers high query performance for tables with tens of petabytes of data,
>> along with atomic commits, concurrent writes, and SQL-compatible table
>> evolution.
>>
>> This release can be downloaded from:
>> https://www.apache.org/dyn/closer.cgi/iceberg/apache-iceberg-0.13.2/apache-iceberg-0.13.2.tar.gz
>>
>> Java artifacts are available from Maven Central.
>>
>> Thanks to everyone for contributing!
>>
>

-- 

Kyle Bendickson

OSS Developer  |  Tabular <https://tabular.io/>

k...@tabular.io


Proposal - Improving Github Issues Experience via Templates and Pruning Stale Issues

2022-07-12 Thread Kyle Bendickson
We receive positive feedback for being a project that does a lot of its
work on Github, but there are tools we could use to make certain processes
more efficient for both people reporting issues and those who respond to
issues.

For GitHub issues, we don't have a template. I often find we ask people who
report issues the same questions, such as "what version of Iceberg?",
"which engine?", and "what's your configuration?"

When I discussed the problem with @Fokko Driesprong, he
proposed that we use GitHub Issue Templates
<https://docs.github.com/en/communities/using-templates-to-encourage-useful-issues-and-pull-requests/about-issue-and-pull-request-templates>
as a way to direct users to submit a structured report for their specific
needs, such as reporting a bug or requesting help.

This would also automatically apply tags to help search through issues.
We'd still have the option of a blank issue, just like we have now.

The Apache Airflow project makes heavy use of issue templates, and opening
an issue there is quite a nice experience.
<https://github.com/apache/airflow/issues/new/choose>

Fokko has opened a PR with an initial set of templates based on his
experience in Apache Airflow: https://github.com/apache/iceberg/pull/4867
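
To give a rough sense of what an issue form looks like (this is an illustrative sketch, not necessarily what Fokko's PR contains — the file name, field ids, and labels here are hypothetical):

```yaml
# .github/ISSUE_TEMPLATE/iceberg_bug_report.yml (hypothetical example)
name: Bug report
description: Report a problem with Apache Iceberg
labels: ["bug"]                   # applied automatically, which helps searching
body:
  - type: input
    id: iceberg-version
    attributes:
      label: Iceberg version
      placeholder: e.g. 0.13.2
    validations:
      required: true               # answers the "what version?" question up front
  - type: dropdown
    id: engine
    attributes:
      label: Query engine
      options: [Spark, Flink, Hive, Trino, Other]
  - type: textarea
    id: description
    attributes:
      label: What happened, and what did you expect?
    validations:
      required: true
```

Each form shows up as its own choice on the "New issue" page, so the reporter picks "Bug report" or "Question" and gets the fields relevant to that case.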

--

We also have a number of somewhat older issues. This is not uncommon in
open source, as sometimes issues are abandoned or people don't close them
once they're resolved.

As a way to keep issues relevant, I'm proposing a stale issues bot.

The bot would monitor for issues that have not had any interaction for some
configurable period. Issues can be exempted from the bot via a tag, and
commenting on an issue restarts the timer before it is considered stale.

Closing stale issues would help us focus on new issues as they arise and be
more diligent in supporting the community, while helping ensure we don't
miss any bug reports.

I've opened a PR to add Stale Bot for issues only:
https://github.com/apache/iceberg/pull/4949
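
For reference, a minimal configuration along these lines using the `actions/stale` GitHub Action might look like the sketch below; the actual mechanism and values in the PR may differ:

```yaml
# .github/workflows/stale.yml (illustrative sketch, not the actual PR contents)
name: Close stale issues
on:
  schedule:
    - cron: "0 0 * * *"            # run once a day
jobs:
  stale:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/stale@v5
        with:
          days-before-issue-stale: 180   # N days with no activity before tagging
          days-before-issue-close: 14    # X additional days before closing
          stale-issue-label: stale
          exempt-issue-labels: not-stale # label that exempts an issue from the bot
          stale-issue-message: >
            This issue has been automatically marked as stale because it has
            had no recent activity. It will be closed if no further activity
            occurs; any comment resets the timer.
          days-before-pr-stale: -1       # leave pull requests alone
          days-before-pr-close: -1
```

Any comment on a tagged issue removes it from the stale queue, so active discussions are never closed out from under people.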

*To summarize:*
To improve user reporting through Github Issues, I propose that we:
1. Add Github Issue templates to collect the necessary information for
common cases, such as bug reports:
https://github.com/apache/iceberg/pull/4867
2. Add a Stale Bot for GitHub Issues that, after N days without activity
(currently 180), tags an issue as "stale" and comments that it will be
closed if there is no further activity. After an additional X days
(currently 14) with no activity, the issue is closed:
https://github.com/apache/iceberg/pull/4949

Let me know what you think!
- Kyle


-- 

Kyle Bendickson

OSS Developer  |  Tabular <https://tabular.io/>

k...@tabular.io


Re: Welcome Fokko Driesprong as a committer!

2022-08-21 Thread Kyle Bendickson
Congratulations Fokko! This is indeed very well deserved! 🥳

It’s a pleasure to work with you!

On Sun, Aug 21, 2022 at 12:57 PM Sam Redai  wrote:

> Huge congrats Fokko, well deserved! 🎉
>
> On Sun, Aug 21, 2022 at 3:55 PM Ryan Blue  wrote:
>
>> Hi everyone,
>>
>> I would like to welcome Fokko Driesprong as a new committer to the
>> project!
>>
>> Thanks for all your contributions, Fokko!
>>
>>
>> Ryan
>>
>> --
>> Ryan Blue
>> Tabular
>>
> --
>
> Sam Redai 
>
> Developer Advocate  |  Tabular <https://tabular.io/>
>
-- 

Kyle Bendickson

OSS Developer  |  Tabular <https://tabular.io/>

k...@tabular.io


Re: Welcome Yufei Gu as a committer

2022-08-25 Thread Kyle Bendickson
Congrats Yufei!

It’s a pleasure working with you and this is very well deserved.

On Thu, Aug 25, 2022 at 8:55 PM Steven Wu  wrote:

> Congrats, Yufei!
>
> On Thu, Aug 25, 2022 at 8:03 PM Daniel Weeks  wrote:
>
>> Congrats!
>>
>> On Thu, Aug 25, 2022, 7:41 PM Reo Lei  wrote:
>>
>>> Congratulations~ 🎊🎊🎊
>>>
>>> Russell Spitzer  于2022年8月26日周五 09:01写道:
>>>
>>>> Congrats!
>>>>
>>>> Sent from my iPad
>>>>
>>>> > On Aug 25, 2022, at 6:20 PM, Anton Okolnychyi
>>>>  wrote:
>>>> >
>>>> > I’d like to welcome Yufei Gu as a committer to the project.
>>>> >
>>>> > Thanks for all your hard work, Yufei!
>>>> >
>>>> > - Anton
>>>>
>>> --

Kyle Bendickson

OSS Developer  |  Tabular <https://tabular.io/>

k...@tabular.io