[DISCUSS] Dropping Spark 2.4 support

2023-04-13 Thread Fokko Driesprong
Hi all,

I'm working on moving to Hadoop 3.x
<https://github.com/apache/iceberg/pull/7114>, and one thing is that it
seems to be incompatible with Spark 2.4. I wanted to ask if people are
still on Spark 2.4 and what we think of dropping the support. The last
release, Spark 2.4.8, was on 2021-05-17 and it also looks like the 2.4
branch on the Spark GitHub repository is stale, so I don't expect any
further releases.

Before creating a PR I would like to check on the mailing list if anyone has
any objections. If so, please let us know.

Thanks,
Fokko Driesprong


Re: Re: C++/Rust SDK sync

2023-04-19 Thread Fokko Driesprong
Thanks again Jan for setting up the session!

One thing that popped into my mind after the call: for Python, we
generated a lot of the code from the OpenAPI spec
<https://github.com/apache/iceberg/blob/master/open-api/rest-catalog-open-api.yaml>.
This will give you all the data classes for the schema, types and calls for
the REST API. It can only be used to bootstrap the code because there are
some complex cases that cannot be described in OpenAPI. For example, the
fixed length binary type is represented as `fixed[22]`, but you need to
provide code to extract the 22 (If you're interested, details can be found
here <https://github.com/apache/iceberg/issues/6798>). Looking forward to
the first PR, and feel free to reach out any time in case of questions.
Happy to provide context or test anything.
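
To make the `fixed[22]` case concrete, here is a minimal sketch of the kind of hand-written code that has to accompany the generated classes. The function name and the regular expression are illustrative assumptions, not the actual PyIceberg implementation:

```python
import re

# Assumed string form of the fixed-length binary type, e.g. "fixed[22]".
_FIXED_PATTERN = re.compile(r"fixed\[(\d+)\]")


def parse_fixed_length(type_str: str) -> int:
    """Return the byte length encoded in a type string such as 'fixed[22]'."""
    match = _FIXED_PATTERN.fullmatch(type_str)
    if match is None:
        raise ValueError(f"Not a fixed type: {type_str!r}")
    return int(match.group(1))


print(parse_fixed_length("fixed[22]"))  # prints 22
```

A generated data class would only see the raw string, so a helper like this would run as a custom validation step on top of it.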

Kind regards,
Fokko


On Thu, 13 Apr 2023 at 18:41, Steven Wu wrote:

> Thanks Jan for starting the discussion. Cool to see pretty strong
> interests in this area. I won't be able to attend due to conflict, but I
> will surely watch the recording.
>
>
> On Wed, Apr 12, 2023 at 10:12 AM  wrote:
>
>> I’d love to join this discussion as well and contribute to the effort,
>> any of the times work for me.
>>
>> Abid
>>
>> On 2023/04/12 02:17:06 Chao Sun wrote:
>> > We are also interested in this discussion. Internally, we have been
>> > working on something similar in Rust, so it'd be great if we can
>> > combine the efforts.
>> >
>> > 19th works for me too.
>> >
>> > Chao
>> >
>> > On Tue, Apr 11, 2023 at 6:15 PM Nan Zhu  wrote:
>> > >
>> > > Thanks! yeah, I'd like to join the meeting, April 19 works for me
>> best!
>> > >
>> > > On Tue, Apr 11, 2023 at 1:41 PM Samrose Ahmed 
>> wrote:
>> > >>
>> > >> I'd love to join this discussion, all those times work for me.
>> > >>
>> > >> On Tue, Apr 11, 2023 at 9:35 AM Ryan Blue  wrote:
>> > >>>
>> > >>> The 19th works best for me.
>> > >>>
>> > >>> On Mon, Apr 10, 2023 at 11:27 PM Jack Ye  wrote:
>> > >>>>
>> > >>>> Hi Jan,
>> > >>>>
>> > >>>> Sorry for the late reply, I am currently on vacation. As I said in
>> the community sync, we have folks in AWS that are maintaining an internal
>> C++ implementation, and it would be great to join this meeting to develop a
>> shared community solution together.
>> > >>>>
>> > >>>> I can go with all the time slots, I will loop in related people
>> and see what time slots work for them. I will reply with our preferred time
>> slots and a list of people with emails to invite later this week.
>> > >>>>
>> > >>>> Cheers,
>> > >>>> Jack Ye
>> > >>>>
>> > >>>> On Fri, Apr 7, 2023 at 5:47 AM Driesprong, Fokko 
>> wrote:
>> > >>>>>
>> > >>>>> Hi Jan,
>> > >>>>>
>> > >>>>> Thanks for raising this, and I'd love to join the sync. I did
>> quite a bit of work on the Python implementation, and I'm happy to help
>> with the Rust/C++ SDK as well. I'm neither a Rust nor C++ programmer (did
>> some C++ in the past), but happy to help with the implementation by
>> providing context. For me, all of the abovementioned slots work.
>> > >>>>>
>> > >>>>> Kind regards,
>> > >>>>> Fokko Driesprong
>> > >>>>>
>> > >>>>> On Fri, 7 Apr 2023 at 14:10, Jan Kaul wrote:
>> > >>>>>>
>> > >>>>>> Hi iceberg community,
>> > >>>>>>
>> > >>>>>> Like discussed in the last Iceberg Sync, it would be great to
>> have
>> > >>>>>> another meeting to discuss how to combine our efforts for a C++
>> and/or
>> > >>>>>> Rust SDK.
>> > >>>>>>
>> > >>>>>> Here are three possible dates for the Sync:
>> > >>>>>>
>> > >>>>>> 1. 18.04.23 16:00 UTC
>> > >>>>>>
>> > >>>>>> 2. 19.04.23 16:00 UTC
>> > >>>>>>
>> > >>>>>> 3. 20.04.23 16:00 UTC
>> > >>>>>>
>> > >>>>>> For those who want to join the meeting, it would be great if you
>> could
>> > >>>>>> answer this email with the dates that you are available. I will
>> then
>> > >>>>>> create an online meeting for the date where most people can join.
>> > >>>>>>
>> > >>>>>> I'm looking forward to talking to you.
>> > >>>>>>
>> > >>>>>> Best wishes,
>> > >>>>>>
>> > >>>>>> Jan Kaul
>> > >>>>>>
>> > >>>
>> > >>>
>> > >>> --
>> > >>> Ryan Blue
>> > >>> Tabular
>> >
>
>


Re: [DISCUSS] Spark 3.1 support?

2023-04-20 Thread Fokko Driesprong
Spring cleaning! I checked which versions of Spark the cloud vendors are
supporting. Both AWS and GCP are already on 3.3. Azure, however, is still
on 3.1.3, with Spark 3.3 in preview. They are planning to upgrade to Spark
3.2.0, and I think we're fine for the next release of Iceberg.

Kind regards,
Fokko

On Thu, 20 Apr 2023 at 22:22, Anton Okolnychyi wrote:

> Since there are no objections and it is in line with what we planned
> initially, I created a PR to drop 3.1.
>
> https://github.com/apache/iceberg/pull/7390
>
> - Anton
>
> On Apr 19, 2023, at 11:05 AM, Ryan Blue  wrote:
>
> +1
>
> As we said in the 2.4 discussion, the format itself should provide forward
> compatibility with tables and it is more clear that we aren't adding new
> features if you have to use older versions for Spark 3.1.
>
> On Wed, Apr 19, 2023 at 10:08 AM Anton Okolnychyi <
> aokolnyc...@apple.com.invalid> wrote:
>
>> Hey folks,
>>
>> What does everybody think about Spark 3.1 support after we add Spark 3.4
>> support? Our initial plan was to release jars for the last 3 versions. Are
>> there any blockers for dropping 3.1?
>>
>> - Anton
>
>
>
> --
> Ryan Blue
> Tabular
>
>
>


Re: What is the harm of adding partition to iceberg table?

2023-04-24 Thread Fokko Driesprong
Hi ZC C,

Adding partitions to Iceberg tables is easy, and changing them later on
is easy as well. The existing data will continue to exist with the
partitioning that it was initially written with; new data will be written
according to the active partitioning. When you rewrite the data (for
example using the Spark procedure rewrite_data_files), it will use the new
partitioning strategy. Getting back to your question: the harm is that you
need to rewrite the data to benefit from the new partitioning strategy.
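
As a rough sketch of the two steps (change the partitioning, then rewrite), the helpers below only build the Spark SQL statements. The catalog name `my_catalog` and the table `db.events` are hypothetical, and the statements assume the Iceberg Spark SQL extensions are enabled:

```python
def add_partition_field_sql(table: str, field: str) -> str:
    """DDL that changes the active partitioning; existing files stay as written."""
    return f"ALTER TABLE {table} ADD PARTITION FIELD {field}"


def rewrite_data_files_sql(catalog: str, table: str) -> str:
    """Procedure call that rewrites data files so they follow the active partitioning."""
    return f"CALL {catalog}.system.rewrite_data_files(table => '{table}')"


# Hypothetical catalog and table names, for illustration only.
print(add_partition_field_sql("my_catalog.db.events", "org_id"))
print(rewrite_data_files_sql("my_catalog", "db.events"))
```

With a SparkSession at hand, each statement would be passed to `spark.sql(...)`. Only the second step touches existing data, which is where the cost sits.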

Choosing the right partitioning strategy is something that often evolves
over time. You don't want to be too granular because that will create a lot
of files, which will cause more IO overhead. But also not too coarse since
that will read in a lot of data that you're not interested in. It helps to
look at your queries and see which fields are filtered on, and which are
your candidate partitions (one or a combination of).

Hope this helps,

Kind regards,
Fokko



On Mon, 24 Apr 2023 at 19:49, ZC C wrote:

> We are now creating a row data table, and my colleague wants to add org_id as
> the partition. What is the harm of adding a partition to an Iceberg table?


Re: Welcome new committers and PMC!

2023-05-04 Thread Fokko Driesprong
Fantastic! Great having you all aboard.

Cheers, Fokko

On Thu, 4 May 2023 at 07:40, Gidon Gershinsky wrote:

> Congratulations Amogh, Eduard, Szehon!
>
> Cheers, Gidon
>
>
> On Thu, May 4, 2023 at 7:59 AM Péter Váry 
> wrote:
>
>> Congratulations everyone!
>> Well deserved!
>>
>> On Thu, May 4, 2023, 03:42 Steve Zhang 
>> wrote:
>>
>>> Congrats everyone! Well deserved and great job!
>>>
>>> Thanks,
>>> Steve Zhang
>>>
>>>
>>>
>>> On May 3, 2023, at 5:52 PM, Prashant Singh 
>>> wrote:
>>>
>>> Congratulations, Amogh, Eduard, Szehon  Well deserved !
>>>
>>> On Wed, May 3, 2023 at 3:07 PM Steven Wu  wrote:
>>>
>>>> Congrats, Amogh, Eduard, and Szehon! Well deserved. Your contributions
>>>> are much appreciated!
>>>>
>>>> On Wed, May 3, 2023 at 2:25 PM Yufei Gu wrote:
>>>>
>>>>> Congratulations, Amogh, Eduard and Szehon! Great job!
>>>>>
>>>>> Best,
>>>>>
>>>>> Yufei
>>>>>
>>>>>
>>>>> On Wed, May 3, 2023 at 12:27 PM Russell Spitzer <
>>>>> russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> Great news! It's so exciting to have the project continue to grow!
>>>>>>
>>>>>> > On May 3, 2023, at 2:06 PM, Ryan Blue wrote:
>>>>>> >
>>>>>> > Hi everyone,
>>>>>> >
>>>>>> > I want to congratulate Amogh and Eduard, who were just added as
>>>>>> > Iceberg committers, and Szehon, who was just added to the PMC. Thanks
>>>>>> > for all your contributions!
>>>>>> >
>>>>>> > Ryan
>>>>>> >
>>>>>> > --
>>>>>> > Ryan Blue
>>>>>>
>>>>>>
>>>


Re: Orphan files

2023-05-24 Thread Fokko Driesprong
Hey Gaurav,

Orphan files do not affect Iceberg's performance, since Iceberg performs no
list operations. They will only increase your storage bill, since you have
files around that are no longer relevant. Iceberg tables do need periodic
maintenance; for example, it is good to rewrite small files into bigger
ones to avoid many calls to your storage.
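
A toy model may help show why orphans cost storage but not read time. It assumes, as a simplification, that a reader opens only the files referenced by table metadata; the file names are made up:

```python
from __future__ import annotations


def plan_scan(referenced_files: set[str]) -> set[str]:
    """Files a reader opens: only those referenced by table metadata, no listing."""
    return set(referenced_files)


def find_orphans(files_in_storage: set[str], referenced_files: set[str]) -> set[str]:
    """Files that exist in storage but are not referenced by any metadata."""
    return files_in_storage - referenced_files


# Hypothetical file names, for illustration only.
storage = {"data/a.parquet", "data/b.parquet", "data/leftover.parquet"}
referenced = {"data/a.parquet", "data/b.parquet"}

print(sorted(plan_scan(referenced)))              # the orphan is never opened
print(sorted(find_orphans(storage, referenced)))  # ['data/leftover.parquet']
```

The orphan never appears in the scan plan, so it only shows up on the storage bill.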

I hope this answers your question.

Kind regards,
Fokko



On Tue, 23 May 2023 at 21:13, Gaurav Agarwal wrote:

> Hello
>
> We have orphan files in the table. Does this impact the read performance
> of the table if we pass the partition column in the read query?
>
> We would like to know what the impact of orphan files is.
>
> Thanks
>


Re: [VOTE] Release Apache Iceberg 1.3.0 RC0

2023-05-25 Thread Fokko Driesprong
+1 (binding)

Thanks for running this Anton!

- Checked the signature and checksum
- Checked the licenses
- Built with JDK 11
- Tested against Trino
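
For anyone who wants to script the checksum part of these checks, a minimal sketch follows. It assumes the release ships a SHA-512 checksum file whose first whitespace-separated token is the hex digest; the helper names are my own, and signature verification (normally done with gpg against the KEYS file) is not covered:

```python
import hashlib
from pathlib import Path


def sha512_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-512 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(artifact: Path, checksum_file: Path) -> bool:
    """Compare the artifact's digest with the one stored in the checksum file."""
    # Assumption: the file starts with the hex digest, optionally followed by a file name.
    expected = checksum_file.read_text().split()[0].lower()
    return sha512_of(artifact) == expected
```

Usage would be `checksum_matches(Path("release.tar.gz"), Path("release.tar.gz.sha512"))`, with both file names standing in for the real artifact names.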

Kind regards,
Fokko

On Thu, 25 May 2023 at 05:18, Ajantha Bhat wrote:

> +1 (non-binding)
>
> - Verified Nessie integration testing with Spark-3.3_2.12_runtime jar.
> - Validated checksum and signature
> - Checked license docs & ran RAT checks
> - Verified build with JDK8
>
> Thanks,
> Ajantha
>
> On Thu, May 25, 2023 at 3:13 AM Szehon Ho  wrote:
>
>> +1 (binding)
>>
>> 1. verify signatures
>> 2. verify checksum
>> 3. verify license documentation
>> 4. build and run tests
>> 5. Ran simple tests on Spark 3.4
>> - Create simple table and check metadata tables
>> - Ran 'delete from' statement to generate position delete, and run
>> rewrite_position_delete
>>
>> Thanks
>> Szehon
>>
>> On Tue, May 23, 2023 at 1:21 PM Anton Okolnychyi
>>  wrote:
>>
>>> Hi Everyone,
>>>
>>> I propose that we release the following RC as the official Apache
>>> Iceberg 1.3.0 release.
>>>
>>> The commit ID is 7dbdfd33a667a721fbb21c7c7d06fec9daa30b88
>>> * This corresponds to the tag: apache-iceberg-1.3.0-rc0
>>> * https://github.com/apache/iceberg/commits/apache-iceberg-1.3.0-rc0
>>> *
>>> https://github.com/apache/iceberg/tree/7dbdfd33a667a721fbb21c7c7d06fec9daa30b88
>>>
>>> The release tarball, signature, and checksums are here:
>>> *
>>> https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-1.3.0-rc0
>>>
>>> You can find the KEYS file here:
>>> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>>>
>>> Convenience binary artifacts are staged on Nexus. The Maven repository
>>> URL is:
>>> *
>>> https://repository.apache.org/content/repositories/orgapacheiceberg-1134/
>>>
>>> Please download, verify, and test.
>>>
>>> Please vote in the next 72 hours. (Weekends excluded)
>>>
>>> [ ] +1 Release this as Apache Iceberg 1.3.0
>>> [ ] +0
>>> [ ] -1 Do not release this because...
>>>
>>> Only PMC members have binding votes, but other community members are
>>> encouraged to cast
>>> non-binding votes. This vote will pass if there are 3 binding +1 votes
>>> and more binding
>>> +1 votes than -1 votes.
>>>
>>> - Anton
>>>
>>


[VOTE] Release PyIceberg 0.4.0 RC1

2023-06-26 Thread Fokko Driesprong
Hi Everyone,


Excited to start the 0.4.0 PyIceberg release process. The 0.4.0 release is
packed with cool features:

   - Support for converting Parquet schemas into Iceberg ones.
   - Support for reading data using FSSpec.
   - Support fetching a limited number of rows to quickly peek into a dataset.
   - Reduced the number of calls to the object store with PyArrow>=12.0.0.
   - Speed up queries using the Iceberg metrics.
   - Ability to do SQL style filters: row_filter='passengers >= 3'.
   - SigV4 support for the REST catalog.
   - A complete makeover of the docs site.
   - Support for positional deletes.
   - Ability to set table properties.
   - And many bugs have been fixed!

I propose that we release the following RC as the official PyIceberg 0.4.0
release. The commit ID is e85ec9447c08c1a21e9ef21278f3237811f3f67f


* This corresponds to the tag: pyiceberg-0.4.0rc1
(c3579a11b4bfa5387e313185e714c40a0ed1ccfe)

* https://github.com/apache/iceberg/releases/tag/pyiceberg-0.4.0rc1

*
https://github.com/apache/iceberg/tree/e85ec9447c08c1a21e9ef21278f3237811f3f67f


The release tarball, signature, and checksums are here:


* https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.4.0rc1/


You can find the KEYS file here:


* https://dist.apache.org/repos/dist/dev/iceberg/KEYS


Convenience binary artifacts are staged on pypi:


https://pypi.org/project/pyiceberg/0.4.0rc1/


And can be installed using: pip3 install pyiceberg==0.4.0rc1


Please download, verify, and test.


Please vote in the next 72 hours.

[ ] +1 Release this as PyIceberg 0.4.0

[ ] +0

[ ] -1 Do not release this because...


Please consider this email a +1 from my side:


   - Ran some basic table scans
      - Including tables with positional deletes
   - Checked to see if everything still works when PyArrow is not installed
   - Set some table properties

Kind regards,

Fokko


[VOTE] Release PyIceberg 0.4.0 RC2

2023-06-27 Thread Fokko Driesprong
All,


Excited to start the 0.4.0 PyIceberg release process. The 0.4.0 release is
packed with awesome features:

   - Support for converting Parquet schemas into Iceberg ones.
   - Support for reading data using FSSpec.
   - Support fetching a limited number of rows to quickly peek into a dataset.
   - Reduced the number of calls to the object store with PyArrow>=12.0.0.
   - Speed up queries using the Iceberg metrics.
   - Ability to do SQL style filters: row_filter='passengers >= 3'.
   - SigV4 support for the REST catalog.
   - A complete makeover of the docs site.
   - Support for positional deletes.
   - Ability to set table properties.
   - And many bugs have been fixed!

I propose that we release the following RC as the official PyIceberg 0.4.0
release. The commit ID is 51eaf6806361e6e0a5cd163071dce684ec05350b


* This corresponds to the tag: pyiceberg-0.4.0rc2 (
f81c759835672e956c71280394f432463d25463c)

* https://github.com/apache/iceberg/releases/tag/pyiceberg-0.4.0rc2

*
https://github.com/apache/iceberg/tree/51eaf6806361e6e0a5cd163071dce684ec05350b


The release tarball, signature, and checksums are here:


* https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.4.0rc2/


You can find the KEYS file here:


* https://dist.apache.org/repos/dist/dev/iceberg/KEYS


Convenience binary artifacts are staged on pypi:


https://pypi.org/project/pyiceberg/0.4.0rc2/


And can be installed using: pip3 install pyiceberg==0.4.0rc2


Please download, verify, and test.


Please vote in the next 72 hours.

[ ] +1 Release this as PyIceberg 0.4.0

[ ] +0

[ ] -1 Do not release this because...


Please consider this email a +1 from my side:


   - Ran some basic table scans
      - Including tables with positional deletes
   - Checked to see if everything still works when PyArrow is not installed
   - Set some table properties

Kind regards,

Fokko


Re: [VOTE] Release PyIceberg 0.4.0 RC2

2023-07-03 Thread Fokko Driesprong
Thanks all! The vote has passed:

+1:
Ryan Blue (binding)
Jean-Baptiste Onofré (non-binding)
Jack Ye (binding)
Daniel Weeks (binding)
Jonas Jiang (non-binding)
Eduard Tudenhoefner (non-binding)
Fokko Driesprong (binding)

+/-0:
∅

-1:
∅

Thanks to everyone for voting, and I'll publish the artifacts to PyPI shortly.
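
For reference, the tally above can be checked mechanically. The sketch below uses the passing rule quoted in the Iceberg 1.3.0 vote email (3 binding +1 votes and more binding +1 votes than -1 votes). Applying that rule to PyIceberg votes, reading "3" as "at least 3", and counting only binding -1 votes are my assumptions:

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Vote:
    voter: str
    value: int     # +1, 0, or -1
    binding: bool  # True for PMC members


def vote_passes(votes: Iterable[Vote], required_binding: int = 3) -> bool:
    """Apply the rule: enough binding +1 votes, and more binding +1 than binding -1."""
    votes = list(votes)
    binding_plus = sum(1 for v in votes if v.binding and v.value > 0)
    binding_minus = sum(1 for v in votes if v.binding and v.value < 0)
    return binding_plus >= required_binding and binding_plus > binding_minus


# The RC2 votes as listed in this thread.
rc2_votes = [
    Vote("Ryan Blue", 1, True),
    Vote("Jean-Baptiste Onofré", 1, False),
    Vote("Jack Ye", 1, True),
    Vote("Daniel Weeks", 1, True),
    Vote("Jonas Jiang", 1, False),
    Vote("Eduard Tudenhoefner", 1, False),
    Vote("Fokko Driesprong", 1, True),
]

print(vote_passes(rc2_votes))  # prints True: four binding +1 votes, no -1 votes
```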

Kind regards,
Fokko

On Thu, 29 Jun 2023 at 10:41, Eduard Tudenhoefner wrote:

> +1 (non-binding)
>
> Verified sigs/sums/license/test.
>
>
> Eduard
>
> On Thu, Jun 29, 2023 at 3:34 AM Jonas Jiang 
> wrote:
>
>> +1 (non-binding)
>>
>> Verified signature, checksum, license.
>>
>> Ran test, test-coverage, and some checks for conversion from parquet
>> schema to iceberg schema.
>>
>> Best regards,
>> Jonas Jiang
>>
>> On Wed, Jun 28, 2023 at 11:57 AM Daniel Weeks  wrote:
>>
>>> +1 (binding)
>>>
>>> Verified sigs/sums/license/test.
>>>
>>> Also ran some basic tests with row filtering and positional deletes.
>>>
>>> -Dan
>>>
>>> On Tue, Jun 27, 2023 at 10:58 PM Jack Ye  wrote:
>>>
>>>> +1 (binding)
>>>>
>>>> Verified checksum, signature, license, test, test-s3.
>>>>
>>>> Ran basic checks for Glue catalog, also verified the row filter issue
>>>> is fixed:
>>>>
>>>> [image: Screenshot 2023-06-27 at 10.55.47 PM.png]
>>>>
>>>> Best,
>>>> Jack Ye
>>>>
>>>> On Tue, Jun 27, 2023 at 10:25 PM Jean-Baptiste Onofré 
>>>> wrote:
>>>>
>>>>> +1 (non binding)
>>>>>
>>>>> I did quick tests and it looks good. Thanks!
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On Tue, Jun 27, 2023 at 10:37 PM Fokko Driesprong 
>>>>> wrote:
>>>>> >
>>>>> > All,
>>>>> >
>>>>> >
>>>>> > Excited to start the 0.4.0 PyIceberg release process. The 0.4.0
>>>>> release is packed with awesome features:
>>>>> >
>>>>> > Support for converting Parquet schemas into Iceberg ones
>>>>> > Support for reading data using FSSpec.
>>>>> > Support fetching a limited number of rows to quickly peek into a
>>>>> dataset.
>>>>> > Reduced the number of calls to the object store with PyArrow>=12.0.0.
>>>>> > Speed up queries using the Iceberg metrics.
>>>>> > Ability to do SQL style filters: row_filter='passengers >= 3'.
>>>>> > SigV4 support for the REST catalog.
>>>>> > A complete makeover of the docs site.
>>>>> > Support for positional deletes.
>>>>> > Ability to set table properties.
>>>>> > And many bugs have been fixed!
>>>>> >
>>>>> > I propose that we release the following RC as the official PyIceberg
>>>>> 0.4.0 release. The commit ID is 51eaf6806361e6e0a5cd163071dce684ec05350b
>>>>> >
>>>>> >
>>>>> > * This corresponds to the tag: pyiceberg-0.4.0rc2
>>>>> (f81c759835672e956c71280394f432463d25463c)
>>>>> >
>>>>> > * https://github.com/apache/iceberg/releases/tag/pyiceberg-0.4.0rc2
>>>>> >
>>>>> > *
>>>>> https://github.com/apache/iceberg/tree/51eaf6806361e6e0a5cd163071dce684ec05350b
>>>>> >
>>>>> >
>>>>> > The release tarball, signature, and checksums are here:
>>>>> >
>>>>> >
>>>>> > * https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.4.0rc2/
>>>>> >
>>>>> >
>>>>> > You can find the KEYS file here:
>>>>> >
>>>>> >
>>>>> > * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>>>>> >
>>>>> >
>>>>> > Convenience binary artifacts are staged on pypi:
>>>>> >
>>>>> >
>>>>> > https://pypi.org/project/pyiceberg/0.4.0rc2/
>>>>> >
>>>>> >
>>>>> > And can be installed using: pip3 install pyiceberg==0.4.0rc2
>>>>> >
>>>>> >
>>>>> > Please download, verify, and test.
>>>>> >
>>>>> >
>>>>> > Please vote in the next 72 hours.
>>>>> >
>>>>> > [ ] +1 Release this as PyIceberg 0.4.0
>>>>> >
>>>>> > [ ] +0
>>>>> >
>>>>> > [ ] -1 Do not release this because...
>>>>> >
>>>>> >
>>>>> > Please consider this email a +1 from my side:
>>>>> >
>>>>> > Ran some basic table scans
>>>>> >
>>>>> > Including tables with positional deletes
>>>>> >
>>>>> > Checked to see if everything still works when PyArrow is not
>>>>> installed
>>>>> > Set some table properties
>>>>> >
>>>>> > Kind regards,
>>>>> >
>>>>> > Fokko
>>>>>
>>>>


[ANNOUNCE] Apache PyIceberg release 0.4.0

2023-07-03 Thread Fokko Driesprong
Hi everyone!

I'm pleased to announce the release of Apache PyIceberg 0.4.0!

Apache Iceberg is an open table format for huge analytic datasets. Iceberg
delivers high query performance for tables with tens of petabytes of data,
along with atomic commits, concurrent writes, and SQL-compatible table
evolution. PyIceberg is a Python implementation for reading these datasets.

Major features of 0.4.0:

   - Support for converting Parquet schemas into Iceberg ones.
   - Support for reading data using FSSpec.
   - Support fetching a limited number of rows to quickly peek into a dataset.
   - Reduced the number of calls to the object store with PyArrow>=12.0.0.
   - Speed up queries using the Iceberg metrics.
   - Ability to do SQL style filters: row_filter='passengers >= 3'.
   - SigV4 support for the REST catalog.
   - A complete makeover of the docs site.
   - Support for positional deletes.
   - Ability to set table properties.
   - And many bugs have been fixed!
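
To give a feel for the row filter and limit features, here is a small sketch. It relies on `Table.scan(row_filter=..., limit=...)` and `to_arrow()` as I understand the PyIceberg API; the helper name `peek`, the catalog name, and the table identifier are hypothetical:

```python
def peek(table, row_filter: str, limit: int):
    """Fetch at most `limit` rows matching a SQL-style filter as an Arrow table."""
    return table.scan(row_filter=row_filter, limit=limit).to_arrow()


# Usage against a real catalog (requires pyiceberg with PyArrow installed):
#
#   from pyiceberg.catalog import load_catalog
#   catalog = load_catalog("default")        # hypothetical catalog name
#   table = catalog.load_table("nyc.taxis")  # hypothetical table identifier
#   rows = peek(table, "passengers >= 3", limit=10)
```

Passing the filter into the scan, rather than filtering afterwards, is what lets the Iceberg metrics skip files that cannot contain matching rows.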

The PyIceberg release can be downloaded from:
https://pypi.org/project/pyiceberg/0.4.0/
And the docs makeover can be seen here: https://py.iceberg.apache.org/ 😍


Thanks to everyone for contributing, and looking forward to 0.5.0!

Kind regards,
Fokko


Re: [DISCUSS] Apache Iceberg Release 1.3.1

2023-07-12 Thread Fokko Driesprong
Hi Szehon,

Thank you for the updates. I'm in favor of 1.3.1 as well. I got notified of
a discrepancy in Java's TableMetadata reader today. I have a fix against
the master branch. Once that is in, I think it would be great to backport
this to 1.3.x as well.

Kind regards,
Fokko

On Wed, 12 Jul 2023 at 22:09, Szehon Ho wrote:

> Hi guys
>
> Just an update on this.  Another issue came up about the new 1.3.0
> function rewrite_position_deletes (thanks Fokko for adding to the
> milestone).  I'm working on that, hopefully can finish in next day or two,
> for this release.
>
> Milestone for reference:
> https://github.com/apache/iceberg/milestones/Iceberg%201.3.1
>
> Thanks
> Szehon
>
> On Mon, Jul 10, 2023 at 11:14 AM Szehon Ho 
> wrote:
>
>> Thanks Eduard!  Merged all your backport prs, I will commit the last one
>> probably tomorrow and then we can start the release.
>>
>> Thanks
>> Szehon
>>
>> On Sun, Jul 9, 2023 at 11:53 PM Eduard Tudenhoefner 
>> wrote:
>>
>>> I created a 1.3.x branch, so that we can start backporting those bug
>>> fixes.
>>>
>>> Eduard
>>>
>>> On Fri, Jul 7, 2023 at 6:52 PM Szehon Ho 
>>> wrote:
>>>
>>>> Thanks a lot Eduard! I think
>>>> https://github.com/apache/iceberg/pull/7933 is a good candidate
>>>> as well.
>>>>
>>>> Thanks,
>>>> Szehon
>>>>
>>>> On Fri, Jul 7, 2023 at 9:07 AM Eduard Tudenhoefner
>>>> wrote:
>>>>
>>>>> +1 for a 1.3.1 release. I've created a 1.3.1 Milestone
>>>>> and it would be great to also get #7621 in.
>>>>>
>>>>> Eduard
>>>>>
>>>>> On Fri, Jul 7, 2023 at 5:52 PM Ryan Blue wrote:
>>>>>
>>>>>> +1 for a 1.3.1 to fix the Hive issue.
>>>>>>
>>>>>> For the Nessie changes, those seem outside what we would normally put
>>>>>> in a patch release. Patch releases are for bug fixes and aren't usually a
>>>>>> time to get other changes in for convenience. I can understand wanting to
>>>>>> unblock a Trino issue, but it doesn't seem like a good choice to me.
>>>>>>
>>>>>> In addition, why not put some of these classes in the Nessie project
>>>>>> itself? Could NessieUtil go there so that you aren't waiting on Iceberg
>>>>>> releases to fix third-party projects?
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> On Thu, Jul 6, 2023 at 9:02 PM Jean-Baptiste Onofré
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> It sounds good to me to have 1.3.1.
>>>>>>>
>>>>>>> Thanks !
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>> On Fri, Jul 7, 2023 at 12:53 AM Szehon Ho
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > Hi
>>>>>>> >
>>>>>>> > I wanted to start a discussion on whether it's the right time for
>>>>>>> > 1.3.1, a patch release of 1.3.0. It was started based on the issue
>>>>>>> > found by Xiangyang (@ConeyLiu):
>>>>>>> > https://github.com/apache/iceberg/pull/7931#pullrequestreview-1507935277
>>>>>>> >
>>>>>>> > Do people have any other bug fixes that should be included? Also,
>>>>>>> > let me know if anyone wants to be a release manager. If not, I can
>>>>>>> > give it a shot as well.
>>>>>>> >
>>>>>>> > Thanks,
>>>>>>> > Szehon
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>>


Re: [VOTE] Release Apache Iceberg 1.3.1 RC1

2023-07-18 Thread Fokko Driesprong
Hi Szehon,

+1 (binding)

   - Checked the signature and hash
   - Ran the RAT checks
   - Did a local build and ran the tests (all passed, except the
   TestS3RestSigner tests, since #7742 is not backported).
   - Ran against Trino master

Thanks for running the release!

Cheers, Fokko

On Mon, 17 Jul 2023 at 20:01, Szehon Ho wrote:

> Hi Everyone,
>
> I propose that we release the following RC as the official Apache Iceberg
> 1.3.1 release.
>
> The commit ID is 62c34711c3f22e520db65c51255512f6cfe622c4
> * This corresponds to the tag: apache-iceberg-1.3.1-rc1
> * https://github.com/apache/iceberg/commits/apache-iceberg-1.3.1-rc1
> *
> https://github.com/apache/iceberg/tree/62c34711c3f22e520db65c51255512f6cfe622c4
>
> The release tarball, signature, and checksums are here:
> * https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-1.3.1-rc1
>
> You can find the KEYS file here:
> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>
> Convenience binary artifacts are staged on Nexus. The Maven repository URL
> is:
> *
> https://repository.apache.org/content/repositories/orgapacheiceberg-1141/
>
> This release includes several important bug fixes over 1.3.0, including:
> * Fix Spark RewritePositionDeleteFiles failure for certain partition types
> (#8059)
> * Fix Spark RewriteDataFiles concurrency edge-case on commit timeouts
> (#7933)
> * Table Metadata parser now accepts null current-snapshot-id, properties,
> snapshots fields (#8064)
> * FlinkCatalog creation no longer creates the default database (#8039)
> * Fix loading certain V1 table branch snapshots using snapshot references
> (#7621)
> * Fix Spark partition-level DELETE operations for WAP branches (#7900)
> * Fix HiveCatalog deleting metadata on failures in checking lock status
> (#7931)
>
> Please download, verify, and test.
>
> Please vote in the next 72 hours. (Weekends excluded)
>
> [ ] +1 Release this as Apache Iceberg 1.3.1
> [ ] +0
> [ ] -1 Do not release this because...
>
> Only PMC members have binding votes, but other community members are
> encouraged to cast
> non-binding votes. This vote will pass if there are 3 binding +1 votes and
> more binding
> +1 votes than -1 votes.
>
> Thanks
> Szehon
>
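
The table metadata item in the list above (#8064) is about tolerating null values for current-snapshot-id, properties, and snapshots. A minimal sketch of that kind of lenient reading is shown below. It is an illustration in Python, not the Java change itself; the function name and the chosen defaults are assumptions:

```python
import json


def read_table_metadata(raw: str) -> dict:
    """Parse table metadata JSON, treating null or missing optional fields leniently."""
    metadata = json.loads(raw)
    return {
        **metadata,
        # No current snapshot is represented here as None (an assumed convention).
        "current-snapshot-id": metadata.get("current-snapshot-id"),
        # A null or missing map or list is normalized to an empty one.
        "properties": metadata.get("properties") or {},
        "snapshots": metadata.get("snapshots") or [],
    }


raw = '{"format-version": 1, "current-snapshot-id": null, "properties": null}'
print(read_table_metadata(raw))
```

Normalizing at the parsing boundary keeps the rest of the reader free of null checks.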


Re: Location of rust repo

2023-07-21 Thread Fokko Driesprong
Thank you for the context, Chan. This morning I created a separate
repository (iceberg-rust) for the Rust implementation. With PyIceberg we
already did separate releases from the Java artifacts, and the versioning
is different too. I think this is an excellent decision, since I've
experienced similar issues with Avro, where a release was done but nothing
had changed for many languages, which added a lot of friction to the
process.

Kind regards,
Fokko

On Fri, 21 Jul 2023 at 16:09, xxchan wrote:

> I'd like to mention Arrow-rs's decision to move out of the main repo. From
> my understanding, the most important reason is that Rust crates tend to
> release more often so that downstream can enjoy minor or patch updates more
> often, but previously releases of Arrow-rs are in lockstep with other
> language implementations and happen every 4 months.
>
> - https://arrow.apache.org/blog/2021/05/04/rust-dev-workflow/
> - https://lists.apache.org/thread/t7tb6kpgxnpjs120jq04r5nrbq0rpdjl
> - https://lists.apache.org/thread/4f4sm78somg0n9710w5qftc6hgbc9p3r
>
> On 2023/07/19 09:30:34 Jan Kaul wrote:
> > Hey all,
> >
> > we just had our first sync for the rust iceberg developers and it was
> > great to talk to everyone.
> >
> > The most important point that came up was the location where the rust
> > development should take place. The two options are either to have a
> > separate "iceberg-rust" repository or to create a "rust" folder in the
> > existing apache/iceberg repository.
> >
> > The benefits of a separate repository are separate CI, simpler merging
> > of PRs and a more scalable solution if more languages are added.
> >
> > The benefits of a subfolder in the existing repository are more
> > visibility, easier coordination with the java project and more feedback
> > from the community.
> >
> > The developers currently working on the rust implementation slightly
> > favor a separate repository but would be okay with using the existing
> > repository.
> >
> >
> > It would be great if you could share your opinions on the topic. Maybe
> > this could also be a point for the community sync later today.
> >
> > Hope you're all doing well. Best wishes,
> >
> > Jan
> >
> >
>


Re: Proposal to fix the docs - this time it'll be different

2023-07-27 Thread Fokko Driesprong
Hey Brian,

Thanks for raising this. As a release manager, I can confirm that the
current structure is confusing, and I can also see the community
struggling with this because they are willing to contribute to the docs,
but cannot always find the place where to do this. I think the complexity
of the current website mostly comes from the versioned docs. It would be
great if we can find a way to make this easier. Instead of using the
branches, we could also use the release tags and build the docs for those
versions.

I think switching to mkdocs-material is a great idea. We currently also use
this for PyIceberg, and it works really well. My main concern is around
merging everything together. Should we combine Java and Python in the same
documentation? They have a different versioning scheme, so that would
create a matrix of versions. Go and Rust are also in the making, so
that would explode at some point.

Cheers, Fokko

P.S. Currently, PyIceberg uses the gh-pages branch for publishing the docs.


On Thu, 27 Jul 2023 at 00:04, Brian Olsen wrote:

> Hey all,
>
> I have some proposals I'd like to make for fixing the docs. I would want to
> do this in two phases.
>
> In the first phase, I'm proposing that we move all the documentation
> (reference docs, website, and PyIceberg) back into the apache/iceberg
> repository. I explain my reasoning in the attached document. This phase
> would also move us from Hugo to MkDocs but keep all the content the same.
>
> The second phase is focused on iteratively building out the content that
> we've marked missing in the proposal that Sam R. created along with a
> recent community member, Mahfuza. We will also restructure the content to
> follow the Diátaxis method (https://diataxis.fr/).
>
>
> https://docs.google.com/document/d/1WJXzcwC6isfoywcLY2lZ9gZN6i0JU1I2SIZmCkumbZc/edit#heading=h.gli9mc2ghfz1
>
> Let me know what you think and bring on the questions and criticisms
> please! :)
>
> Bits
>


Re: Discussion about the location of language clients

2023-08-10 Thread Fokko Driesprong
Hi everyone,

Today I took a stab at the generation of wheels in Python (here's the PR
if anyone is interested), and when testing this it also kicked off many
unrelated CI jobs. This is just for two languages, and I'm not convinced
that it will scale to many languages. Also, having a different release
cycle for each of the languages will clutter up the tags, releases, etc.
I'm convinced that separate repositories are more scalable in the future;
we just have to make sure that they can be found easily (rename
apache/iceberg to apache/iceberg-java?).
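
To illustrate the bookkeeping a single repository needs to avoid unrelated CI runs, here is a rough sketch of path-based job selection. The directory prefixes and job names are hypothetical and do not describe how the Iceberg CI is actually configured:

```python
from typing import Iterable, Set

# Hypothetical mapping from top-level directories to CI jobs.
JOBS_BY_PREFIX = {
    "python/": "python-ci",
    "rust/": "rust-ci",
}
DEFAULT_JOB = "java-ci"  # assumed fallback for every other path


def jobs_for(changed_paths: Iterable[str]) -> Set[str]:
    """Select the CI jobs to run for a set of changed file paths."""
    jobs = set()
    for path in changed_paths:
        for prefix, job in JOBS_BY_PREFIX.items():
            if path.startswith(prefix):
                jobs.add(job)
                break
        else:
            # No language-specific prefix matched this path.
            jobs.add(DEFAULT_JOB)
    return jobs


print(jobs_for(["python/pyproject.toml"]))  # prints {'python-ci'}
```

Every new language adds another entry to keep in sync, which is part of why separate repositories look easier to scale.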

Cheers, Fokko



On Thu, 10 Aug 2023 at 14:18, Jan Kaul wrote:

> Hi all,
>
> first off, thanks Brian for starting the conversation and thanks Renjie
> for the write up.
>
> I'm also in the camp multi-repo because of the already mentioned benefits.
>
> One point I would like to add is that the potential drawback of having
> less visibility with multi-repos can be mitigated to some extent. I think
> that if the different repos are clearly and visibly presented on the
> iceberg website people should be able to find the desired implementation.
>
> Best wishes,
>
> Jan
> On 10.08.23 13:43, Brian Olsen wrote:
>
> Renjie, you're amazing.
>
> I think you summarized this better than I could, so thank you for that.
>
> I'd like to pull in a user's feedback on Slack
>
>> FWIW, I’m personally a fan of separate repos for the client libraries.
>> It keeps things a bit more isolated (in a good way) and explorable
>> (rather than overwhelming). GitHub search is a bit easier to use. And I
>> think it generally lowers the bar to contributing. Independent versioning
>> and GitHub releases are a big win too, I think.
>>
>> Right now, I don’t actually know where to find PyIceberg release notes.
>> Would love to see release notes in the GitHub releases for them.
>
>
>
> IMO, the most important measure of success when choosing between these
> options is making the contributor experience as smooth as possible.
>
> Monorepo has the advantage of one place to look, changes across
> core/clients that can be modeled in a single PR, and shared resources. At
> first, I considered managing the build to only be a problem for Iceberg
> committers managing the build, but ultimately this is setting us up for a
> longer build and running unnecessary infrastructure for unrelated tasks.
> There are definitely ways that we can verify what parts of the code have
> been changed and which code should be run, but it will not always be clear
> or simple to know if we tested too much or not enough.
>
> For that, I am also in the multi-repo camp (for clients). I think despite
> having to manage different repos for each client, I generally consider the
> work of each client to be independent of the work happening in the main
> repo. In this view, it's possibly better that the work be independent and
> seen on its own. The biggest win IMO is the intentional separation of
> testing and deployment infrastructure. This will make for a better
> experience when folks are contributing, testing, and looking for release
> notes.
>
> But I also really don't care as long as we do the same things across
> clients. ;)
>
> Bits
>
>
> On Thu, Aug 10, 2023 at 2:38 AM Renjie Liu 
> wrote:
>
>> Hi, all:
>>
>>
>>
>> In yesterday’s community sync we talked about the location of different
>> language clients, and I think we all agree that there should be consistent
>> behavior for these clients, but the decision has not been made yet. I want
>> to continue the discussion here on the pros and cons of different sides:
>> mono repo (all in one big repo) or multiple small repos (one for each
>> language client).
>>
>>
>>
>> To make things clear, currently we have four language libraries under
>> development:
>>
>>
>>
>>    1. Java: in main repo (https://github.com/apache/iceberg)
>>    2. Python: in main repo (https://github.com/apache/iceberg)
>>    3. Go: in main repo (https://github.com/apache/iceberg)
>>    4. Rust: in standalone repo (https://github.com/apache/iceberg-rust/)
>>
>>
>>
>> Currently I mainly contribute to the Rust client, and I can share my
>> thoughts on why I voted for a standalone repo:
>>
>>
>>
>>    1. Easier project setup. Iceberg is a complex project with several
>>    components, and is mainly written in Java. As someone not quite familiar
>>    with this project structure, I find it easier to start a new one rather
>>    than fit into an existing one.
>>    2. Faster CI workflow. In the early days of the Rust client’s
>>    development, we only need to touch Rust-related code. If we all live in
>>    one mono repo, it will trigger unnecessary CI runs for other components.
>>
>>
>>
>> I admit that these reasons may not hold for long-term maintenance, but
>> they are good for fast-paced development in the early days.
>>
>>
>>
>> After reviewing some discussions on the web, I have summarized the
>> pros and cons of the two sides:
>>
>>
>>
>> Mono Repo
>>
>>
>>
>> Pros
>>
>>  

Re: Discussion about the location of language clients

2023-08-10 Thread Fokko Driesprong
I should have mentioned that GitHub does automatic redirection when you
rename a repository. But you're all right: the impact is possibly
bigger than we can envision, and it is probably not worth it.

I took the liberty of creating iceberg-python
<https://github.com/apache/iceberg-python> and iceberg-go
<https://github.com/apache/iceberg-go>. For Python, I'd love to do a
release this month. I think right after that (hopefully most PRs are in)
would be a good moment to split out the Python part from the Java
repository.

Kind regards,
Fokko


Op vr 11 aug 2023 om 04:08 schreef Renjie Liu :

> Thanks everyone for nice discussion.
>
>
>
> +1 for multi repo while keeping the core spec and Java implementation in
> apache/iceberg. Currently Java is still the most widely adopted and
> sophisticated implementation. We only need to help people find the other
> implementations by providing links on the web page.
>
>
>
>
>
> *From: *Ryan Blue 
> *Date: *Friday, August 11, 2023 at 05:23
> *To: *dev@iceberg.apache.org 
> *Subject: *Re: Discussion about the location of language clients
>
> I wasn't at the discussion on Wednesday, but it sounds like there is
> support for moving to separate repos. Does anyone strongly object?
>
>
>
> I also agree with Steven on not renaming to iceberg-java. That's the repo
> where we keep the spec and Java is the reference implementation. Plus we
> don't want to break a ton of links.
>
>
>
> Ryan
>
>
>
> On Thu, Aug 10, 2023 at 1:05 PM Steven Wu  wrote:
>
> I am also on the side of separate repos for different languages.
> Otherwise, the main repo can grow too big. The iceberg.apache.org website
> can provide proper links to repos for different languages.
>
>
>
> I would be -1 on renaming apache/iceberg to apache/iceberg-java, as it can
> break external links to the main/original GitHub repo. The tradeoff may not
> be worth it.
>
>
>
> On Thu, Aug 10, 2023 at 8:16 AM Fokko Driesprong  wrote:
>
> Hi everyone,
>
>
>
> Today I took a stab at the generation of wheels in Python (here's the PR
> <https://github.com/apache/iceberg/pull/8287> if anyone is interested),
> and when testing this it would also kick off many unrelated CI jobs. This
> is just for two languages, and I'm not convinced that it will scale to many
> languages. Also, having a different release cycle for each of the languages
> will clutter up the tags, releases, etc. I'm convinced that
> separate repositories are more scalable in the future, we just have to make
> sure that they can be found easily (rename apache/iceberg to
> apache/iceberg-java?).
>
>
>
> Cheers, Fokko
>
>
>
>
>
>
>
> Op do 10 aug 2023 om 14:18 schreef Jan Kaul :
>
> Hi all,
>
> first off, thanks Brian for starting the conversation and thanks Renjie
> for the write up.
>
> I'm also in the camp multi-repo because of the already mentioned benefits.
>
> One point I would like to add is that the potential drawback of having
> less visibility with multi-repos can be mitigated to some extent. I think
> that if the different repos are clearly and visibly presented on the
> iceberg website people should be able to find the desired implementation.
>
> Best wishes,
>
> Jan
>
> On 10.08.23 13:43, Brian Olsen wrote:
>
> Renjie, you're amazing.
>
> I think you summarized this better than I could, so thank you for that.
>
> I'd like to pull in a user's feedback on Slack
>
> FWIW, I’m personally a fan of separate repos for the client libraries.
> It keeps things a bit more isolated (in a good way) and explorable
> (rather than overwhelming). GitHub search is a bit easier to use. And I
> think it generally lowers the bar to contributing. Independent versioning,
> and GitHub releases are a big win too, I think.
>
>
>
> Right now, I don’t actually know where to find PyIceberg release notes.
> Would love to see release notes in the GitHub releases for them.
>
>
>
>
>
> IMO, The most important measurement of success for choosing either of
> these options is about making the contributor experience as smooth as
> possible.
>
> Monorepo has the advantage of one place to look, all changes across
> core/clients can be modeled in a single PR, and sharing resources. At
> first, I considered managing the build to only be a problem for Iceberg
> committers managing the build, but ultimately this is setting us up for a
> longer build and running unnecessary infrastructure for unrelated tasks.
> There are definitely ways that we can verify what parts of the code have
> been changed and which code should be run, but it will not always be 

PyIceberg 0.5.0 release

2023-08-15 Thread Fokko Driesprong
Hi everyone,

As mentioned in the latest Iceberg sync, I'd love to do another Python
release. I would like to reach out to the community to see if there is
anything we should include.

I know that it was promised that the next release would have write
support, but so many features have already accumulated, and since we don't
want to rush write support, I would suggest doing a release in between to
get it out to the public.

A summary of what's on master:

   - Add gzip metadata support
   - PyArrow HDFS support
   - Support serverless environments (AWS Lambda)
   - Many fixes around Avro performance (PRs 1, 2, 3, 4)
   - Remove the upper bound of PyParsing dependency (blocking a PR in
   Airflow)
   - Moving the reading of Avro to Cython (10x speed improvement(!))
   - Support for the SQLCatalog (JDBC in Java)
   - Fix support for UUID columns
   - A lot of bugfixes!

What I think we should include (but please feel free to add anything to
this list):

   - Support for adding columns
   - Optimize concurrency (follow up on the Support serverless environments)
   - Bump Pydantic to v2 (this will unblock PyIceberg Polars integration,
   and improve performance of the JSON (de)serialization)

I've added those to the 0.5.0 milestone. Feel free to reach out if
you think anything is missing; otherwise we can proceed with the release.
Also, everyone is invited to review the open PRs :)

Cheers, Fokko


[VOTE] Release Apache PyIceberg 0.5.0

2023-09-05 Thread Fokko Driesprong
Hi everyone

I propose that we release the following RC as the official PyIceberg 0.5.0
release.

The commit ID is 5bd7c649e4743a61eace5f52517db9b5b56ff8e6

* This corresponds to the tag: pyiceberg-0.5.0rc1
  (4f314fc507dec4ae918d3a3dfba567a28f92ac22)
* https://github.com/apache/iceberg/releases/tag/pyiceberg-0.5.0rc1
* https://github.com/apache/iceberg/tree/5bd7c649e4743a61eace5f52517db9b5b56ff8e6

The release tarball, signature, and checksums are here:

* https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.5.0rc1/

You can find the KEYS file here:

* https://dist.apache.org/repos/dist/dev/iceberg/KEYS

Convenience binary artifacts are staged on pypi:

https://pypi.org/project/pyiceberg/0.5.0rc1/

And can be installed using: pip3 install pyiceberg==0.5.0rc1

Since a lot has changed due to the release of the wheels (binary Python
libraries), I've included the following steps to verify the release
<https://github.com/apache/iceberg/pull/8504>:

curl https://dist.apache.org/repos/dist/dev/iceberg/KEYS -o KEYS
gpg --import KEYS

svn checkout https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.5.0rc1/ /tmp/pyiceberg/

for name in /tmp/pyiceberg/pyiceberg-*.whl /tmp/pyiceberg/pyiceberg-*.tar.gz
do
    gpg --verify "${name}.asc" "${name}"
done

cd /tmp/pyiceberg/
for name in /tmp/pyiceberg/pyiceberg-*.whl.asc.sha512 /tmp/pyiceberg/pyiceberg-*.tar.gz.asc.sha512
do
    shasum -a 512 --check "${name}"
done

tar xzf pyiceberg-0.5.0.tar.gz
cd pyiceberg-0.5.0

./dev/check-license

Please download, verify, and test.

Please vote in the next 72 hours.
[ ] +1 Release this as PyIceberg 0.5.0
[ ] +0
[ ] -1 Do not release this because...

Consider this my +1 (binding). I've tested the license and checksums, and
ran example notebooks against the 0.5.0 rc1
<https://github.com/tabular-io/docker-spark-iceberg/pull/92>.

Cheers, Fokko


Re: [VOTE] Release Apache PyIceberg 0.5.0

2023-09-09 Thread Fokko Driesprong
Hey everyone,

Thanks for casting the vote, appreciate it. I would like to cancel this RC,
and run RC2 which will include 3 PRs:

   - Python: Non-Cython fallback Avro parser
   <https://github.com/apache/iceberg/pull/8521>
   - Python: Fix pyarrow hdfs support
   <https://github.com/apache/iceberg/pull/8524>
   - Python: Issue with Windows cython build
   <https://github.com/apache/iceberg/issues/8530> (still an issue, working
   on a fix)

While we don't officially support Windows (Anybody willing to add it to the
CI? :D), it is evident that support is broken with the new release, and I
think that we should fix that regression. If there is anything that you
want to include as well, please let me know.

Cheers,
Fokko


Op za 9 sep 2023 om 09:37 schreef Jonas Jiang :

> +1 (non-binding)
>
> Verified signature, checksum, license using the updated steps
> Ran tests via "make test-coverage"
> Ran glue integration tests
>
> Best regards,
> Jonas
>
> On Fri, Sep 8, 2023 at 3:19 PM Hussein Awala  wrote:
>
>> +1 (non binding) I ran the example notebooks and tested some queries
>> with PyArrow and Pandas
>>
>> On Tue, Sep 5, 2023 at 9:21 PM Fokko Driesprong  wrote:
>>
>>> Hi everyone
>>>
>>> I propose that we release the following RC as the official PyIceberg
>>> 0.5.0 release.
>>>
>>> The commit ID is 5bd7c649e4743a61eace5f52517db9b5b56ff8e6
>>>
>>> * This corresponds to the tag: pyiceberg-0.5.0rc1 (
>>> 4f314fc507dec4ae918d3a3dfba567a28f92ac22)
>>> * https://github.com/apache/iceberg/releases/tag/pyiceberg-0.5.0rc1
>>> *
>>> https://github.com/apache/iceberg/tree/5bd7c649e4743a61eace5f52517db9b5b56ff8e6
>>>
>>> The release tarball, signature, and checksums are here:
>>>
>>> * https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.5.0rc1/
>>>
>>> You can find the KEYS file here:
>>>
>>> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>>>
>>> Convenience binary artifacts are staged on pypi:
>>>
>>> https://pypi.org/project/pyiceberg/0.5.0rc1/
>>>
>>> And can be installed using: pip3 install pyiceberg==0.5.0rc1
>>>
>>> Since a lot has changed due to the release of the wheels (binary Python
>>> libraries), I've included the following steps to verify the release
>>> <https://github.com/apache/iceberg/pull/8504>:
>>>
>>> curl https://dist.apache.org/repos/dist/dev/iceberg/KEYS -o KEYS
>>> gpg --import KEYS
>>>
>>> svn checkout
>>> https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.5.0rc1/
>>> /tmp/pyiceberg/
>>>
>>> for name in $(ls /tmp/pyiceberg/pyiceberg-*.whl
>>> /tmp/pyiceberg/pyiceberg-*.tar.gz)
>>> do
>>> gpg --verify ${name}.asc ${name}
>>> done
>>>
>>> cd  /tmp/pyiceberg/
>>> for name in $(ls /tmp/pyiceberg/pyiceberg-*.whl.asc.sha512
>>> /tmp/pyiceberg/pyiceberg-*.tar.gz.asc.sha512)
>>> do
>>> shasum -a 512 --check ${name}
>>> done
>>>
>>> tar xzf pyiceberg-0.5.0.tar.gz
>>> cd pyiceberg-0.5.0
>>>
>>> ./dev/check-license
>>>
>>> Please download, verify, and test.
>>>
>>> Please vote in the next 72 hours.
>>> [ ] +1 Release this as PyIceberg 0.5.0
>>> [ ] +0
>>> [ ] -1 Do not release this because...
>>>
>>> Consider this my +1 (binding), I've tested the license, and checksums
>>> and ran example notebooks against the 0.5.0 rc1
>>> <https://github.com/tabular-io/docker-spark-iceberg/pull/92>.
>>>
>>> Cheers, Fokko
>>>
>>


Re: [VOTE] Release Apache PyIceberg 0.5.0

2023-09-09 Thread Fokko Driesprong
An update from my end. There is a PR ready to run the Avro decoder tests
<https://github.com/apache/iceberg/pull/8532/> against the binary
wheel. I'm able to reproduce the issue on my end, and after including the
`.pyd` files I noticed that the tests failed
<https://github.com/apache/iceberg/issues/8530#issuecomment-1712482320>. Rusty
jumped in
<https://github.com/apache/iceberg/issues/8530#issuecomment-1712493430> and
we both learned that an unsigned long on Windows is 32 bits. Once those fixes
are in, we're ready for another RC.
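
The width difference is easy to see from Python. Here is a small sketch using
only the standard library; the helper name `fits_in_native_ulong` is made up
for illustration and is not part of PyIceberg.

```python
import ctypes

# Width of the platform's C `unsigned long`: 4 bytes on 64-bit Windows (LLP64),
# 8 bytes on typical 64-bit Linux and macOS builds (LP64).
native_ulong_bytes = ctypes.sizeof(ctypes.c_ulong)

# A fixed-width type has the same size on every platform.
fixed_u64_bytes = ctypes.sizeof(ctypes.c_uint64)


def fits_in_native_ulong(value: int) -> bool:
    """Return True if `value` can be stored in the platform's C unsigned long."""
    return 0 <= value < 2 ** (8 * native_ulong_bytes)
```

Code that assumes 64 bits should therefore use a fixed-width type rather than
`unsigned long`; otherwise large values are silently truncated on Windows.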

Cheers,
Fokko

Op za 9 sep 2023 om 09:48 schreef Fokko Driesprong :

> Hey everyone,
>
> Thanks for casting the vote, appreciate it. I would like to cancel this
> RC, and run RC2 which will include 3 PRs:
>
>- Python: Non-Cython fallback Avro parser
><https://github.com/apache/iceberg/pull/8521>
>- Python: Fix pyarrow hdfs support
><https://github.com/apache/iceberg/pull/8524>
>- Python: Issue with Windows cython build
><https://github.com/apache/iceberg/issues/8530> (still an issue,
>working on a fix)
>
> While we don't officially support Windows (Anybody willing to add it to
> the CI? :D), it is evident that support is broken with the new release, and
> I think that we should fix that regression. If there is anything that you
> want to include as well, please let me know.
>
> Cheers,
> Fokko
>
>
> Op za 9 sep 2023 om 09:37 schreef Jonas Jiang :
>
>> +1 (non-binding)
>>
>> Verified signature, checksum, license using the updated steps
>> Ran tests via "make test-coverage"
>> Ran glue integration tests
>>
>> Best regards,
>> Jonas
>>
>> On Fri, Sep 8, 2023 at 3:19 PM Hussein Awala  wrote:
>>
>>> +1 (non binding) I ran the example notebooks and tested some queries
>>> with PyArrow and Pandas
>>>
>>> On Tue, Sep 5, 2023 at 9:21 PM Fokko Driesprong 
>>> wrote:
>>>
>>>> Hi everyone
>>>>
>>>> I propose that we release the following RC as the official PyIceberg
>>>> 0.5.0 release.
>>>>
>>>> The commit ID is 5bd7c649e4743a61eace5f52517db9b5b56ff8e6
>>>>
>>>> * This corresponds to the tag: pyiceberg-0.5.0rc1 (
>>>> 4f314fc507dec4ae918d3a3dfba567a28f92ac22)
>>>> * https://github.com/apache/iceberg/releases/tag/pyiceberg-0.5.0rc1
>>>> *
>>>> https://github.com/apache/iceberg/tree/5bd7c649e4743a61eace5f52517db9b5b56ff8e6
>>>>
>>>> The release tarball, signature, and checksums are here:
>>>>
>>>> * https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.5.0rc1/
>>>>
>>>> You can find the KEYS file here:
>>>>
>>>> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>>>>
>>>> Convenience binary artifacts are staged on pypi:
>>>>
>>>> https://pypi.org/project/pyiceberg/0.5.0rc1/
>>>>
>>>> And can be installed using: pip3 install pyiceberg==0.5.0rc1
>>>>
>>>> Since a lot has changed due to the release of the wheels (binary Python
>>>> libraries), I've included the following steps to verify the release
>>>> <https://github.com/apache/iceberg/pull/8504>:
>>>>
>>>> curl https://dist.apache.org/repos/dist/dev/iceberg/KEYS -o KEYS
>>>> gpg --import KEYS
>>>>
>>>> svn checkout
>>>> https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.5.0rc1/
>>>> /tmp/pyiceberg/
>>>>
>>>> for name in $(ls /tmp/pyiceberg/pyiceberg-*.whl
>>>> /tmp/pyiceberg/pyiceberg-*.tar.gz)
>>>> do
>>>> gpg --verify ${name}.asc ${name}
>>>> done
>>>>
>>>> cd  /tmp/pyiceberg/
>>>> for name in $(ls /tmp/pyiceberg/pyiceberg-*.whl.asc.sha512
>>>> /tmp/pyiceberg/pyiceberg-*.tar.gz.asc.sha512)
>>>> do
>>>> shasum -a 512 --check ${name}
>>>> done
>>>>
>>>> tar xzf pyiceberg-0.5.0.tar.gz
>>>> cd pyiceberg-0.5.0
>>>>
>>>> ./dev/check-license
>>>>
>>>> Please download, verify, and test.
>>>>
>>>> Please vote in the next 72 hours.
>>>> [ ] +1 Release this as PyIceberg 0.5.0
>>>> [ ] +0
>>>> [ ] -1 Do not release this because...
>>>>
>>>> Consider this my +1 (binding), I've tested the license, and checksums
>>>> and ran example notebooks against the 0.5.0 rc1
>>>> <https://github.com/tabular-io/docker-spark-iceberg/pull/92>.
>>>>
>>>> Cheers, Fokko
>>>>
>>>


[VOTE] Release Apache PyIceberg 0.5.0

2023-09-11 Thread Fokko Driesprong
Hi Everyone,

I propose that we release the following RC as the official PyIceberg 0.5.0
release. A summary of what's included in 0.5.0:

   - Add gzip metadata support <https://github.com/apache/iceberg/pull/7984>
   - PyArrow HDFS support <https://github.com/apache/iceberg/pull/7997>
   - Support serverless environments (AWS Lambda)
   <https://github.com/apache/iceberg/pull/8061>
   - Many fixes around Avro performance (PRs 1
   <https://github.com/apache/iceberg/pull/8074>, 2
   <https://github.com/apache/iceberg/pull/8075>, 3
   <https://github.com/apache/iceberg/pull/8082>, 4
   <https://github.com/apache/iceberg/pull/8084>)
   - Remove the upper bound of PyParsing dependency
   <https://github.com/apache/iceberg/pull/8116> (blocking a PR in Airflow
   <https://github.com/apache/airflow/pull/32786>)
   - Moving the reading of Avro to Cython
   <https://github.com/apache/iceberg/pull/8134> (10x speed improvement(!))
   - Support for the SQLCatalog
   <https://github.com/apache/iceberg/pull/7921> (JDBC in Java)
   - Fix support for UUID columns
   <https://github.com/apache/iceberg/pull/8267>
   - Support for adding columns
   <https://github.com/apache/iceberg/pull/8174>
   - Optimize concurrency <https://github.com/apache/iceberg/pull/8104>
   (follow up on the Support serverless environments)
   - Bump Pydantic to v2 <https://github.com/apache/iceberg/pull/7782>
   (improved performance of the JSON (de)serialization)
   - A lot of bugfixes!

The commit ID is 3323281045a72f1156d58c261067469e383fb26d

* This corresponds to the tag: pyiceberg-0.5.0rc2
  (92600935834bdf77ba37ac361338712713549a77)
* https://github.com/apache/iceberg/releases/tag/pyiceberg-0.5.0rc2
* https://github.com/apache/iceberg/tree/3323281045a72f1156d58c261067469e383fb26d

The release tarball, signature, and checksums are here:

* https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.5.0rc2/

You can find the KEYS file here:

* https://dist.apache.org/repos/dist/dev/iceberg/KEYS

Convenience binary artifacts are staged on pypi:

https://pypi.org/project/pyiceberg/0.5.0rc2/

And can be installed using: pip3 install pyiceberg==0.5.0rc2

Since a lot has changed due to the release of the wheels (binary Python
libraries), I've included the following steps to verify the release:

curl https://dist.apache.org/repos/dist/dev/iceberg/KEYS -o KEYS
gpg --import KEYS

svn checkout https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.5.0rc2/ /tmp/pyiceberg/

for name in /tmp/pyiceberg/pyiceberg-*.whl /tmp/pyiceberg/pyiceberg-*.tar.gz
do
    gpg --verify "${name}.asc" "${name}"
done

cd /tmp/pyiceberg/
for name in /tmp/pyiceberg/pyiceberg-*.whl.asc.sha512 /tmp/pyiceberg/pyiceberg-*.tar.gz.asc.sha512
do
    shasum -a 512 --check "${name}"
done

tar xzf pyiceberg-0.5.0.tar.gz
cd pyiceberg-0.5.0

./dev/check-license

Please download, verify, and test.

Please vote in the next 72 hours.
[ ] +1 Release this as PyIceberg 0.5.0
[ ] +0
[ ] -1 Do not release this because...

Please consider this my +1, I've checked against the docker-spark-iceberg
<https://github.com/tabular-io/docker-spark-iceberg/pull/92> notebook, and
did some checks.

Kind regards,
Fokko Driesprong


[VOTE] Release Apache PyIceberg 0.5.0 RC3

2023-09-13 Thread Fokko Driesprong
Hi Everyone,


I propose that we release the following RC as the official PyIceberg 0.5.0
release. This includes a fix for the performance issue that was discovered
in RC2. A summary of what's included in 0.5.0:

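   - Add gzip metadata support <https://github.com/apache/iceberg/pull/7984>
   - PyArrow HDFS support <https://github.com/apache/iceberg/pull/7997>
   - Support serverless environments (AWS Lambda)
   <https://github.com/apache/iceberg/pull/8061>
   - Many fixes around Avro performance (PRs 1
   <https://github.com/apache/iceberg/pull/8074>, 2
   <https://github.com/apache/iceberg/pull/8075>, 3
   <https://github.com/apache/iceberg/pull/8082>, 4
   <https://github.com/apache/iceberg/pull/8084>)
   - Remove the upper bound of PyParsing dependency
   <https://github.com/apache/iceberg/pull/8116> (blocking a PR in Airflow
   <https://github.com/apache/airflow/pull/32786>)
   - Moving the reading of Avro to Cython
   <https://github.com/apache/iceberg/pull/8134> (10x speed improvement(!))
   - Support for the SQLCatalog
   <https://github.com/apache/iceberg/pull/7921> (JDBC in Java)
   - Fix support for UUID columns
   <https://github.com/apache/iceberg/pull/8267>
   - Support for adding columns
   <https://github.com/apache/iceberg/pull/8174>
   - Optimize concurrency <https://github.com/apache/iceberg/pull/8104>
   (follow up on the Support serverless environments)
   - Bump Pydantic to v2 <https://github.com/apache/iceberg/pull/7782>
   (improved performance of the JSON (de)serialization)
   - A lot of bugfixes!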

The commit ID is f798b06246e67131d413dfceece5ccaf269e01fe


   - This corresponds to the tag: pyiceberg-0.5.0rc3
   (37fa779b0957644590a03754a733a5b3e3f589d0)
   - https://github.com/apache/iceberg/releases/tag/pyiceberg-0.5.0rc3
   - https://github.com/apache/iceberg/tree/f798b06246e67131d413dfceece5ccaf269e01fe

The release tarball, signature, and checksums are here:


   - https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.5.0rc3/

You can find the KEYS file here:


   - https://dist.apache.org/repos/dist/dev/iceberg/KEYS

Convenience binary artifacts are staged on pypi:


https://pypi.org/project/pyiceberg/0.5.0rc3/


And can be installed using: pip3 install pyiceberg==0.5.0rc3


Please download, verify, and test.


Please vote in the next 72 hours.


[ ] +1 Release this as PyIceberg 0.5.0

[ ] +0

[ ] -1 Do not release this because...


Cheers, Fokko


Re: [VOTE] Release Apache PyIceberg 0.5.0 RC3

2023-09-14 Thread Fokko Driesprong
Hey everyone,

A small follow-up on how to easily run the checks:

curl https://dist.apache.org/repos/dist/dev/iceberg/KEYS -o KEYS
gpg --import KEYS

svn checkout https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.5.0rc3/ /tmp/pyiceberg/

for name in /tmp/pyiceberg/pyiceberg-*.whl /tmp/pyiceberg/pyiceberg-*.tar.gz
do
    gpg --verify "${name}.asc" "${name}"
done

cd /tmp/pyiceberg/
for name in /tmp/pyiceberg/pyiceberg-*.whl.asc.sha512 /tmp/pyiceberg/pyiceberg-*.tar.gz.asc.sha512
do
    shasum -a 512 --check "${name}"
done

tar xzf pyiceberg-0.5.0.tar.gz
cd pyiceberg-0.5.0

./dev/check-license

This will be part of the docs once released.

Kind regards,
Fokko



Op wo 13 sep 2023 om 14:18 schreef Fokko Driesprong :

> Hi Everyone,
>
>
> I propose that we release the following RC as the official PyIceberg 0.5.0
> release. This includes the performance issue that was discovered in RC2. A
> summary of what's included in 0.5.0:
>
>- Add gzip metadata support
><https://github.com/apache/iceberg/pull/7984>
>- PyArrow HDFS support <https://github.com/apache/iceberg/pull/7997>
>- Support serverless environments (AWS Lambda)
><https://github.com/apache/iceberg/pull/8061>
>- Many fixes around Avro performance (PRs 1
><https://github.com/apache/iceberg/pull/8074>, 2
><https://github.com/apache/iceberg/pull/8075>, 3
><https://github.com/apache/iceberg/pull/8082>, 4
><https://github.com/apache/iceberg/pull/8084>)
>- Remove the upper bound of PyParsing dependency
><https://github.com/apache/iceberg/pull/8116> (blocking a PR in Airflow
><https://github.com/apache/airflow/pull/32786>)
>- Moving the reading of Avro to Cython
><https://github.com/apache/iceberg/pull/8134> (10x speed
>improvement(!))
>- Support for the SQLCatalog
><https://github.com/apache/iceberg/pull/7921> (JDBC in Java)
>- Fix support for UUID columns
><https://github.com/apache/iceberg/pull/8267>
>- Support for adding columns
><https://github.com/apache/iceberg/pull/8174>
>- Optimize concurrency <https://github.com/apache/iceberg/pull/8104> 
> (follow
>up on the Support serverless environments)
>- Bump Pydantic to v2 <https://github.com/apache/iceberg/pull/7782> 
> (improved
>performance of the JSON (de)serialization)
>- A lot of bugfixes!
>
> The commit ID is f798b06246e67131d413dfceece5ccaf269e01fe
>
>
>- This corresponds to the tag: pyiceberg-0.5.0rc3
>(37fa779b0957644590a03754a733a5b3e3f589d0)
>- https://github.com/apache/iceberg/releases/tag/pyiceberg-0.5.0rc3
>-
>
> https://github.com/apache/iceberg/tree/f798b06246e67131d413dfceece5ccaf269e01fe
>
> The release tarball, signature, and checksums are here:
>
>
>- https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.5.0rc3/
>
> You can find the KEYS file here:
>
>
>- https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>
> Convenience binary artifacts are staged on pypi:
>
>
> https://pypi.org/project/pyiceberg/0.5.0rc3/
>
>
> And can be installed using: pip3 install pyiceberg==0.5.0rc3
>
>
> Please download, verify, and test.
>
>
> Please vote in the next 72 hours.
>
>
> [ ] +1 Release this as PyIceberg 0.5.0
>
> [ ] +0
>
> [ ] -1 Do not release this because...
>
>
> Cheers, Fokko
>
>
>


Re: [VOTE] Release Apache PyIceberg 0.5.0 RC3

2023-09-16 Thread Fokko Driesprong
Hey Ryan,

Thanks for catching that. It slipped in here
<https://github.com/apache/iceberg/commit/bf748dab6cf986f54f4272a6ff0d4adc1effd93a#diff-90e4425a7659a30dd1a8ac7ec38beab8d29ca79714584029b322d6b8077cad07>.
I made the fix part of PR #8504
<https://github.com/apache/iceberg/pull/8504/>. I think it would be good to
get that one in.
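
For reference, the check that should pass is the SHA-512 of the artifact
itself, not of its `.asc` signature. Here is a minimal Python sketch of that
check, assuming the `.sha512` file uses the `shasum` layout of a hex digest
followed by a file name. The function names are illustrative only.

```python
import hashlib
from pathlib import Path


def sha512_of(path: Path) -> str:
    """Return the hex SHA-512 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha512()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(artifact: Path, checksum_file: Path) -> bool:
    """Check that checksum_file names `artifact` and records its SHA-512."""
    fields = checksum_file.read_text().split()
    if len(fields) < 2:
        return False
    # shasum prefixes the file name with '*' when it hashes in binary mode.
    expected, recorded_name = fields[0], fields[-1].lstrip("*")
    return (
        Path(recorded_name).name == artifact.name
        and expected == sha512_of(artifact)
    )
```

A checksum file generated from `pyiceberg-0.5.0.tar.gz.asc` fails this check
for the tarball because both the recorded name and the digest differ.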

Cheers, Fokko

Op za 16 sep 2023 om 22:11 schreef Ryan Blue :

> -1 I went to verify the sha512 sum, but it is missing.
>
> If we can add the correct checksum, then I'll update my vote. Looks like
> the sha512 sum for the GPG signature file was added instead:
> pyiceberg-0.5.0.tar.gz*.asc*.sha512
>
> Everything else looks good.
>
> - Ran license checks
> - Verified signature
> - Ran tests
>
> On Thu, Sep 14, 2023 at 6:47 PM Xuanwo  wrote:
>
>> Hi,
>>
>> +1 (non-binding)
>>
>> I have checked:
>>
>> - signature that made by fo...@apache.org
>>
>> ```
>> :) for name in $(ls /tmp/pyiceberg/pyiceberg-*.whl
>> /tmp/pyiceberg/pyiceberg-*.tar.gz)
>> do
>> gpg --verify ${name}.asc ${name}
>> done
>> gpg: Signature made Wed 13 Sep 2023 08:07:40 PM CST
>> gpg:    using RSA key FCD3779E399C53D995FC82A35171BA3E54493550
>> gpg:issuer "fo...@apache.org"
>> gpg: Good signature from "Fokko Driesprong " [ultimate]
>> gpg: Signature made Wed 13 Sep 2023 08:07:40 PM CST
>> gpg:using RSA key FCD3779E399C53D995FC82A35171BA3E54493550
>> gpg:issuer "fo...@apache.org"
>> gpg: Good signature from "Fokko Driesprong " [ultimate]
>> gpg: Signature made Wed 13 Sep 2023 08:07:40 PM CST
>> gpg:using RSA key FCD3779E399C53D995FC82A35171BA3E54493550
>> gpg:issuer "fo...@apache.org"
>> gpg: Good signature from "Fokko Driesprong " [ultimate]
>> gpg: Signature made Wed 13 Sep 2023 08:07:40 PM CST
>> gpg:using RSA key FCD3779E399C53D995FC82A35171BA3E54493550
>> gpg:issuer "fo...@apache.org"
>> gpg: Good signature from "Fokko Driesprong " [ultimate]
>> gpg: Signature made Wed 13 Sep 2023 08:07:41 PM CST
>> gpg:using RSA key FCD3779E399C53D995FC82A35171BA3E54493550
>> gpg:issuer "fo...@apache.org"
>> gpg: Good signature from "Fokko Driesprong " [ultimate]
>> gpg: Signature made Wed 13 Sep 2023 08:07:41 PM CST
>> gpg:    using RSA key FCD3779E399C53D995FC82A35171BA3E54493550
>> gpg:issuer "fo...@apache.org"
>> gpg: Good signature from "Fokko Driesprong " [ultimate]
>> gpg: Signature made Wed 13 Sep 2023 08:07:41 PM CST
>> gpg:using RSA key FCD3779E399C53D995FC82A35171BA3E54493550
>> gpg:issuer "fo...@apache.org"
>> gpg: Good signature from "Fokko Driesprong " [ultimate]
>> gpg: Signature made Wed 13 Sep 2023 08:07:41 PM CST
>> gpg:using RSA key FCD3779E399C53D995FC82A35171BA3E54493550
>> gpg:issuer "fo...@apache.org"
>> gpg: Good signature from "Fokko Driesprong " [ultimate]
>> gpg: Signature made Wed 13 Sep 2023 08:07:38 PM CST
>> gpg:using RSA key FCD3779E399C53D995FC82A35171BA3E54493550
>> gpg:issuer "fo...@apache.org"
>> gpg: Good signature from "Fokko Driesprong " [ultimate]
>> gpg: Signature made Wed 13 Sep 2023 08:07:39 PM CST
>> gpg:using RSA key FCD3779E399C53D995FC82A35171BA3E54493550
>> gpg:    issuer "fo...@apache.org"
>> gpg: Good signature from "Fokko Driesprong " [ultimate]
>> gpg: Signature made Wed 13 Sep 2023 08:07:39 PM CST
>> gpg:using RSA key FCD3779E399C53D995FC82A35171BA3E54493550
>> gpg:issuer "fo...@apache.org"
>> gpg: Good signature from "Fokko Driesprong " [ultimate]
>> gpg: Signature made Wed 13 Sep 2023 08:07:39 PM CST
>> gpg:using RSA key FCD3779E399C53D995FC82A35171BA3E54493550
>> gpg:issuer "fo...@apache.org"
>> gpg: Good signature from "Fokko Driesprong " [ultimate]
>> gpg: Signature made Wed 13 Sep 2023 08:07:39 PM CST
>> gpg:using RSA key FCD3779E399C53D995FC82A35171BA3E54493550
>> gpg:issuer "fo...@apache.org"
>> gpg: Good signature from "Fokko Driesprong " [ultimate]
>> gpg: Signature made Wed 13 Sep 2023 08:07:39 PM CST
>> gpg:using RSA

Re: [VOTE] Release Apache PyIceberg 0.5.0 RC3

2023-09-17 Thread Fokko Driesprong
I've corrected it for RC3 as well, so we don't need to send out another RC.

Cheers, Fokko

Op za 16 sep 2023 om 22:32 schreef Fokko Driesprong :

> Hey Ryan,
>
> Thanks for catching that. It slipped in here
> <https://github.com/apache/iceberg/commit/bf748dab6cf986f54f4272a6ff0d4adc1effd93a#diff-90e4425a7659a30dd1a8ac7ec38beab8d29ca79714584029b322d6b8077cad07>.
> I made the fix part of PR #8504
> <https://github.com/apache/iceberg/pull/8504/>, I think it would be good
> to get that one in.
>
> Cheers, Fokko
>
> Op za 16 sep 2023 om 22:11 schreef Ryan Blue :
>
>> -1 I went to verify the sha512 sum, but it is missing.
>>
>> If we can add the correct checksum, then I'll update my vote. Looks like
>> the sha512 sum for the GPG signature file was added instead:
>> pyiceberg-0.5.0.tar.gz*.asc*.sha512
>>
>> Everything else looks good.
>>
>> - Ran license checks
>> - Verified signature
>> - Ran tests
>>
>> On Thu, Sep 14, 2023 at 6:47 PM Xuanwo  wrote:
>>
>>> Hi,
>>>
>>> +1 (non-binding)
>>>
>>> I have checked:
>>>
>>> - signature made by fo...@apache.org
>>>
>>> ```
>>> :) for name in $(ls /tmp/pyiceberg/pyiceberg-*.whl
>>> /tmp/pyiceberg/pyiceberg-*.tar.gz)
>>> do
>>> gpg --verify ${name}.asc ${name}
>>> done
>>> gpg: Signature made Wed 13 Sep 2023 08:07:40 PM CST
>>> gpg:using RSA key
>>> FCD3779E399C53D995FC82A35171BA3E54493550
>>> gpg:issuer "fo...@apache.org"
>>> gpg: Good signature from "Fokko Driesprong "
>>> [ultimate]
>>> gpg: Signature made Wed 13 Sep 2023 08:07:40 PM CST
>>> gpg:using RSA key
>>> FCD3779E399C53D995FC82A35171BA3E54493550
>>> gpg:issuer "fo...@apache.org"
>>> gpg: Good signature from "Fokko Driesprong "
>>> [ultimate]
>>> gpg: Signature made Wed 13 Sep 2023 08:07:40 PM CST
>>> gpg:using RSA key
>>> FCD3779E399C53D995FC82A35171BA3E54493550
>>> gpg:issuer "fo...@apache.org"
>>> gpg: Good signature from "Fokko Driesprong "
>>> [ultimate]
>>> gpg: Signature made Wed 13 Sep 2023 08:07:40 PM CST
>>> gpg:using RSA key
>>> FCD3779E399C53D995FC82A35171BA3E54493550
>>> gpg:issuer "fo...@apache.org"
>>> gpg: Good signature from "Fokko Driesprong "
>>> [ultimate]
>>> gpg: Signature made Wed 13 Sep 2023 08:07:41 PM CST
>>> gpg:using RSA key
>>> FCD3779E399C53D995FC82A35171BA3E54493550
>>> gpg:issuer "fo...@apache.org"
>>> gpg: Good signature from "Fokko Driesprong "
>>> [ultimate]
>>> gpg: Signature made Wed 13 Sep 2023 08:07:41 PM CST
>>> gpg:    using RSA key
>>> FCD3779E399C53D995FC82A35171BA3E54493550
>>> gpg:issuer "fo...@apache.org"
>>> gpg: Good signature from "Fokko Driesprong "
>>> [ultimate]
>>> gpg: Signature made Wed 13 Sep 2023 08:07:41 PM CST
>>> gpg:using RSA key
>>> FCD3779E399C53D995FC82A35171BA3E54493550
>>> gpg:issuer "fo...@apache.org"
>>> gpg: Good signature from "Fokko Driesprong "
>>> [ultimate]
>>> gpg: Signature made Wed 13 Sep 2023 08:07:41 PM CST
>>> gpg:using RSA key
>>> FCD3779E399C53D995FC82A35171BA3E54493550
>>> gpg:    issuer "fo...@apache.org"
>>> gpg: Good signature from "Fokko Driesprong "
>>> [ultimate]
>>> gpg: Signature made Wed 13 Sep 2023 08:07:38 PM CST
>>> gpg:using RSA key
>>> FCD3779E399C53D995FC82A35171BA3E54493550
>>> gpg:issuer "fo...@apache.org"
>>> gpg: Good signature from "Fokko Driesprong "
>>> [ultimate]
>>> gpg: Signature made Wed 13 Sep 2023 08:07:39 PM CST
>>> gpg:using RSA key
>>> FCD3779E399C53D995FC82A35171BA3E54493550
>>> gpg:issuer "fo...@apache.org"
>>> gpg: Good signature from "Fokko Driesprong "
>>> [ultimate]
>>> gpg: Signature made Wed 13 Sep 2023 08:07:39 PM CST
>>> gpg:using RSA key
>>> FCD3779E399C53D995FC82A35171BA3E54493550
>>> gpg:issuer "fo...@apach

[ANNOUNCE] PyIceberg 0.5.0

2023-09-18 Thread Fokko Driesprong
I'm pleased to announce the release of Apache PyIceberg 0.5.0!

PyIceberg 0.5.0 comes with many new features:

   - Add gzip metadata support
   - PyArrow HDFS support
   - Support serverless environments (AWS Lambda)
   - Many fixes around Avro performance (PRs 1, 2, 3, 4)
   - Remove the upper bound of PyParsing dependency (blocking a PR in
   Airflow)
   - Moving the reading of Avro to Cython (10x speed improvement(!))
   - Support for the SQLCatalog (JDBC in Java)
   - Fix support for UUID columns
   - Support for adding columns
   - Optimize concurrency (follow up on the Support serverless environments)
   - Bump Pydantic to v2 (improved performance of the JSON
   (de)serialization)
   - A lot of bugfixes!

The docs have been updated to the latest release.

Apache Iceberg is an open table format for huge analytic datasets. Iceberg
delivers high query performance for tables with tens of petabytes of data,
along with atomic commits, concurrent writes, and SQL-compatible table
evolution.

This Python release can be downloaded from:
https://pypi.org/project/pyiceberg/0.5.0/

If you have any questions or run into anything, feel free to reach out to
the #python channel on Slack or open an issue on GitHub.

Thanks to everyone for contributing!


Re: [VOTE] Release Apache PyIceberg 0.5.0 RC3

2023-09-18 Thread Fokko Driesprong
Thanks everyone,

Concluding the vote:

+1 by:
- Xuanwo
- Ryan Blue (binding)
- Jean-Baptiste Onofré
- Daniel Weeks (binding)
- Fokko (binding)

Thanks everyone for testing, appreciate it! The release is already out.

Kind regards, Fokko




On Mon, Sep 18, 2023 at 19:01, Daniel Weeks wrote:

> +1 (binding)
>
> Verified sigs/sums/license/tests
>
> Also tested some schema evolution and various other local tests and
> everything worked great.
>
> -Dan
>
> On Mon, Sep 18, 2023 at 8:15 AM Ryan Blue  wrote:
>
>> Changing my vote to +1. Thanks for fixing the checksum, Fokko!
>>
>> On Mon, Sep 18, 2023 at 8:04 AM Jean-Baptiste Onofré 
>> wrote:
>>
>>> +1 (non binding)
>>>
>>> quickly tested the "legal" part:
>>> - signatures/hash has been fixed thanks for that !
>>> - ASF header looks ok
>>> - no binaries found in the pyiceberg distribution which is good
>>>
>>> Thanks !
>>> Regards
>>> JB
>>>
>>> On Wed, Sep 13, 2023 at 2:18 PM Fokko Driesprong 
>>> wrote:
>>> >
>>> > Hi Everyone,
>>> >
>>> >
>>> > I propose that we release the following RC as the official PyIceberg
>>> 0.5.0 release. This includes a fix for the performance issue that was discovered in
>>> RC2. A summary of what's included in 0.5.0:
>>> >
>>> > Add gzip metadata support
>>> > PyArrow HDFS support
>>> > Support serverless environments (AWS Lambda)
>>> > Many fixes around Avro performance (PRs 1, 2, 3, 4)
>>> > Remove the upper bound of PyParsing dependency (blocking a PR in
>>> Airflow)
>>> > Moving the reading of Avro to Cython (10x speed improvement(!))
>>> > Support for the SQLCatalog (JDBC in Java)
>>> > Fix support for UUID columns
>>> > Support for adding columns
>>> > Optimize concurrency (follow up on the Support serverless environments)
>>> > Bump Pydantic to v2 (improved performance of the JSON
>>> (de)serialization)
>>> > A lot of bugfixes!
>>> >
>>> > The commit ID is f798b06246e67131d413dfceece5ccaf269e01fe
>>> >
>>> > This corresponds to the tag: pyiceberg-0.5.0rc3
>>> (37fa779b0957644590a03754a733a5b3e3f589d0)
>>> > https://github.com/apache/iceberg/releases/tag/pyiceberg-0.5.0rc3
>>> >
>>> https://github.com/apache/iceberg/tree/f798b06246e67131d413dfceece5ccaf269e01fe
>>> >
>>> > The release tarball, signature, and checksums are here:
>>> >
>>> > https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.5.0rc3/
>>> >
>>> > You can find the KEYS file here:
>>> >
>>> > https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>>> >
>>> > Convenience binary artifacts are staged on pypi:
>>> >
>>> >
>>> > https://pypi.org/project/pyiceberg/0.5.0rc3/
>>> >
>>> >
>>> > And can be installed using: pip3 install pyiceberg==0.5.0rc3
>>> >
>>> >
>>> > Please download, verify, and test.
>>> >
>>> >
>>> > Please vote in the next 72 hours.
>>> >
>>> >
>>> > [ ] +1 Release this as PyIceberg 0.5.0
>>> >
>>> > [ ] +0
>>> >
>>> > [ ] -1 Do not release this because...
>>> >
>>> >
>>> > Cheers, Fokko
>>> >
>>> >
>>>
>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>


Re: [VOTE] Release Apache Iceberg 1.4.0 RC1

2023-09-29 Thread Fokko Driesprong
+1 (binding)

Thanks Anton for running the release and everyone who contributed! Checks I
did:

   - Updated the docker-spark-iceberg repo, and everything runs fine (still
   with Spark 3.4 since there were some problems with Jupyter's Scala 2.13
   kernel). This includes the new aws-bundle 🥳
   - Tested against Trino, and found three differences, all of them expected:
      - More defensive cleaning up of files on a failed commit, to make
      table recovery easier when needed.
      - A new property that's set on the table, indicating zstd compression.
      - Changes in the exceptions when binding a transform to a column type
      that is not allowed.

Kind regards, Fokko


On Fri, Sep 29, 2023 at 07:35, Jean-Baptiste Onofré wrote:

> +1 (non binding)
>
> I checked:
> - signatures and hash are ok
> - asf headers are present
> - no binary in the source distribution
> - build is ok
>
> NB: I’m working on a set of use cases with different data sets but it’s
> not yet complete. I should have it for next release and be able to compare
> queries time and behavior between releases.
>
> Thanks !
> Regards
> JB
>
> On Thu, Sep 28, 2023 at 04:02, Anton Okolnychyi wrote:
>
>> Hi Everyone,
>>
>> I propose that we release the following RC as the official Apache Iceberg
>> 1.4.0 release.
>>
>> The commit ID is 8f37faa6a21e863551b17992370edc0f8706465d
>> * This corresponds to the tag: apache-iceberg-1.4.0-rc1
>> * https://github.com/apache/iceberg/commits/apache-iceberg-1.4.0-rc1
>> *
>> https://github.com/apache/iceberg/tree/8f37faa6a21e863551b17992370edc0f8706465d
>>
>> The release tarball, signature, and checksums are here:
>> * https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-1.4.0-rc1
>>
>> You can find the KEYS file here:
>> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>>
>> Convenience binary artifacts are staged on Nexus. The Maven repository
>> URL is:
>> *
>> https://repository.apache.org/content/repositories/orgapacheiceberg-1145/
>>
>> Please download, verify, and test.
>>
>> Please vote in the next 72 hours. (Weekends excluded)
>>
>> [ ] +1 Release this as Apache Iceberg 1.4.0
>> [ ] +0
>> [ ] -1 Do not release this because...
>>
>> Only PMC members have binding votes, but other community members are
>> encouraged to cast non-binding votes. This vote will pass if there are 3
>> binding +1 votes and more binding +1 votes than -1 votes.
>>
>> - Anton
>
>


Migration of PyIceberg to iceberg-python repository

2023-09-29 Thread Fokko Driesprong
Hey everyone 👋

A while ago we discussed that Rust and Go are going into a separate
repository: https://lists.apache.org/thread/4s02lmwf1kyrxxdpj3q9w2fqnxq2llbn

Since we just did the PyIceberg 0.5.0 release, I think it is a good moment to
migrate PyIceberg to iceberg-python as well:
<https://github.com/apache/iceberg-python/pull/2>. I went over the PRs that
are ready to merge and got them in. If there is anything missing, please
let me know.

I would suggest merging the PR and leaving the source code in the main
repository for another week or so to make sure that we didn't miss anything.

Since PyIceberg now also hosts the docs on the Github pages of the Iceberg
repository, moving PyIceberg will also free up the Github pages for the
migration of the docs back into the main repository.

Let me know if there are any concerns.

Kind regards,
Fokko Driesprong


Re: Migration of PyIceberg to iceberg-python repository

2023-09-29 Thread Fokko Driesprong
Hey Ajantha,

That's a great suggestion. I've followed the steps and created a new PR
here: https://github.com/apache/iceberg-python/pull/3

The subdirectory-filter command moves a subdirectory to the root directory.
Because of this, I still had to add some files afterward (.github/*,
.gitignore, etc.); these are in a separate commit. Please take a look.
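
As a rough mental model of why those files went missing: a subdirectory filter (presumably `git filter-branch --subdirectory-filter`, as in the gist linked in the quoted message) keeps only the files under one directory and re-roots them at the top level. The sketch below models just that path rewrite in Python. It is not git's implementation, and the directory name `python` and the file listing are assumptions for illustration.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def subdirectory_filter(paths: Iterable[str], subdir: str) -> List[str]:
    """Model which paths survive a subdirectory filter, and where they land.

    Files under `subdir` are re-rooted at the top level; everything else
    is dropped from the rewritten history.
    """
    root = PurePosixPath(subdir)
    kept = []
    for raw in paths:
        path = PurePosixPath(raw)
        try:
            kept.append(str(path.relative_to(root)))
        except ValueError:
            # Outside the subdirectory: not part of the new repository.
            continue
    return kept


# Hypothetical file listing, for illustration only.
before = [
    ".github/workflows/python-ci.yml",
    ".gitignore",
    "python/pyiceberg/__init__.py",
    "python/pyproject.toml",
]
after = subdirectory_filter(before, "python")
print(after)
```

In this model the repository-level files (`.github/*`, `.gitignore`) do not survive the rewrite, which is consistent with having to add them back in a separate commit.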

Thanks,

Fokko

On Fri, Sep 29, 2023 at 13:39, Ajantha Bhat wrote:

> I think we are gonna lose the history of commits if we merge the above PR.
>
> There are ways to move the subfolder into a new repo by retaining commit
> history.
> For example:
> -
> https://medium.com/@ayushya/move-directory-from-one-repository-to-another-preserving-git-history-d210fa049d4b
>
> - https://gist.github.com/trongthanh/2779392
>
> Please give it a try.
>
> Thanks,
> Ajantha
>
> On Fri, Sep 29, 2023 at 4:55 PM Fokko Driesprong  wrote:
>
>> Hey everyone 👋
>>
>> A while ago we discussed that Rust and Go are going into a separate
>> repository:
>> https://lists.apache.org/thread/4s02lmwf1kyrxxdpj3q9w2fqnxq2llbn
>>
>> Since we just did the PyIceberg 0.5.0 release, I think it is a good moment
>> to migrate PyIceberg to iceberg-python as well:
>> <https://github.com/apache/iceberg-python/pull/2>. I went over the PRs
>> that are ready to merge and got them in. If there is anything missing,
>> please let me know.
>>
>> I would suggest merging the PR and leaving the source code in the main
>> repository for another week or so to make sure that we didn't miss anything.
>>
>> Since PyIceberg now also hosts the docs on the Github pages of the
>> Iceberg repository, moving PyIceberg will also free up the Github pages for
>> the migration of the docs back into the main repository.
>>
>> Let me know if there are any concerns.
>>
>> Kind regards,
>> Fokko Driesprong
>>
>


Re: Migration of PyIceberg to iceberg-python repository

2023-09-30 Thread Fokko Driesprong
e client
>>> side.
>>>
>>> On Fri, Sep 29, 2023 at 9:39 PM Brian Olsen 
>>> wrote:
>>>
>>>> +1
>>>>
>>>> Great work Fokko!
>>>>
>>>> Pucheng,
>>>>
>>>> We still want to maintain all of the issues in the Python repository.
>>>> The one thing we will lose is pull requests, but I assume there are very
>>>> few.
>>>>
>>>> On Fri, Sep 29, 2023 at 10:34 AM Pucheng Yang
>>>>  wrote:
>>>>
>>>>> Thanks for doing this. I wonder how do we deal with all the issues
>>>>> filed for python module but still open in iceberg repo?
>>>>>
>>>>> On Fri, Sep 29, 2023 at 7:55 AM Eduard Tudenhoefner 
>>>>> wrote:
>>>>>
>>>>>> +1 on moving to a separate repo and maintaining git history
>>>>>>
>>>>>> On Fri, Sep 29, 2023 at 3:30 PM Jean-Baptiste Onofré 
>>>>>> wrote:
>>>>>>
>>>>>>> Awesome, it looks even better ;)
>>>>>>>
>>>>>>> Thanks !
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>> On Fri, Sep 29, 2023 at 2:31 PM Fokko Driesprong 
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > Hey Ajantha,
>>>>>>> >
>>>>>>> > That's a great suggestion. I've followed the steps and created a
>>>>>>> new PR here: https://github.com/apache/iceberg-python/pull/3
>>>>>>> >
>>>>>>> > The subdirectory-filter command moves a subdirectory to the root
>>>>>>> directory. Because of this, I still had to add some files afterward
>>>>>>> (.github/*, .gitignore, etc.); these are in a separate commit. Please
>>>>>>> take a look.
>>>>>>> >
>>>>>>> > Thanks,
>>>>>>> >
>>>>>>> > Fokko
>>>>>>> >
>>>>>>> > On Fri, Sep 29, 2023 at 13:39, Ajantha Bhat <
>>>>>>> ajanthab...@gmail.com> wrote:
>>>>>>> >>
>>>>>>> >> I think we are gonna lose the history of commits if we merge the
>>>>>>> above PR.
>>>>>>> >>
>>>>>>> >> There are ways to move the subfolder into a new repo by retaining
>>>>>>> commit history.
>>>>>>> >> For example:
>>>>>>> >> -
>>>>>>> https://medium.com/@ayushya/move-directory-from-one-repository-to-another-preserving-git-history-d210fa049d4b
>>>>>>> >> - https://gist.github.com/trongthanh/2779392
>>>>>>> >>
>>>>>>> >> Please give it a try.
>>>>>>> >>
>>>>>>> >> Thanks,
>>>>>>> >> Ajantha
>>>>>>> >>
>>>>>>> >> On Fri, Sep 29, 2023 at 4:55 PM Fokko Driesprong <
>>>>>>> fo...@apache.org> wrote:
>>>>>>> >>>
>>>>>>> >>> Hey everyone 👋
>>>>>>> >>>
>>>>>>> >>> A while ago we discussed that Rust and Go are going into a
>>>>>>> separate repository:
>>>>>>> https://lists.apache.org/thread/4s02lmwf1kyrxxdpj3q9w2fqnxq2llbn
>>>>>>> >>>
>>>>>>> >>> Since we just did the PyIceberg 0.5.0 release, I think it is a
>>>>>>> good moment to migrate PyIceberg to iceberg-python as well:
>>>>>>> https://github.com/apache/iceberg-python/pull/2 I went over the PRs
>>>>>>> that are ready to merge and got them in. If there is anything missing,
>>>>>>> please let me know.
>>>>>>> >>>
>>>>>>> >>> I would suggest merging the PR and leaving the source code in
>>>>>>> the main repository for another week or so to make sure that we didn't
>>>>>>> miss anything.
>>>>>>> >>>
>>>>>>> >>> Since PyIceberg now also hosts the docs on the Github pages of
>>>>>>> the Iceberg repository, moving PyIceberg will also free up the Github
>>>>>>> pages for the migration of the docs back into the main repository.
>>>>>>> >>>
>>>>>>> >>> Let me know if there are any concerns.
>>>>>>> >>>
>>>>>>> >>> Kind regards,
>>>>>>> >>> Fokko Driesprong
>>>>>>>
>>>>>>


Re: Migration of PyIceberg to iceberg-python repository

2023-10-02 Thread Fokko Driesprong
Hey everyone,

Update from my side. I've moved all the issues
<https://github.com/apache/iceberg-python/issues> and my PRs
<https://github.com/apache/iceberg-python/pulls>. Not all issues needed to
be migrated since a lot of them were already fixed. I've closed the
remaining PRs that were still open; those were either abandoned, failing
CI, or had changes pending. Of course, this comes with the kind request to
re-open them against the iceberg-python repository.

Ajantha already created a PR <https://github.com/apache/iceberg/pull/8695>
(thanks for that!) to remove Python from the iceberg repo.

Kind regards, Fokko


On Sat, Sep 30, 2023 at 21:06, Fokko Driesprong wrote:

> Hey everyone,
>
> Pucheng: I wonder how do we deal with all the issues filed for python
>> module but still open in iceberg repo?
>
>
> That's a good point. I think we should migrate them. I checked and it is
> only 3 pages
> <https://github.com/apache/iceberg/issues?q=is%3Aissue+is%3Aopen+python>.
> Likely a few more if we query on other keywords. I think migrating them by
> hand is feasible. It also gives us a chance to clean them up (all the
> issues on the last page I linked above are not relevant anymore, and can be
> closed).
>
> Brian: The one thing we will lose is pull requests, but I assume there are
>> very few.
>
>
> I've checked those as well, and as Brian already mentioned, there are just
> a few
> <https://github.com/apache/iceberg/pulls?q=is%3Apr+is%3Aopen+label%3Apython>.
> There is never a perfect moment since there are always PRs open that will
> break, but just after the release I think is the best worst moment :) The
> PRs that are open are trivial to move to the new repo as well.
>
> Hussain: I checked the discussion thread, and one of the motivations for
>> this separation was to avoid triggering unrelated CI jobs after each
>> change. However, I wonder if it isn't (and will not be) necessary to check
>> the compatibility between the main repository and the client after each
>> change. Otherwise, we will need to trigger the CI across the different
>> repositories using the GHA API, not necessarily to block the PR, but just
>> to give quick feedback and notification that something needs to be changed
>> on the client side.
>
>
> Checking between dev versions is not something we do today, and PyIceberg
> lives isolated in the main repository. We might want to do some integration
> tests at some point, but I'm not sure if we should start testing dev
> versions against each other. The main issue with triggering the CI is to
> not exponentially explode the ignore list
> <https://github.com/apache/iceberg/blob/master/.github/workflows/flink-ci.yml#L20-L51>
> of a Github action. An example here
> <https://github.com/apache/iceberg/pull/8546#issuecomment-1712958280> is
> where the Python GA file was not properly excluded.
>
> I would much rather rely on some reference tests that Jean-Baptiste
> mentioned at the Java Iceberg 1.4.0 release, and that we're also working on
> at Tabular (disclaimer: I'm working for Tabular). Python is inspired by
> Java, and we've recently uncovered some issues
> <https://github.com/apache/iceberg/pull/8673> (thanks Jan Finis!) with
> respect to adhering to the spec, so I think a strict approach to validate
> the implementations would be preferred.
>
> That said, in PyIceberg we use Spark (which uses the Java library) to run
> integration tests. This is based on the released versions which works very
> well. Not sure if we should create matrices between
> Python/Go/Rust/Iceberg/Athena/Snowflake/... (you're seeing where this is
> going) :) But these are just my thoughts today and might change in the
> future.
>
> Thanks everyone, I'll go ahead and merge the PR that includes the history.
>
> Cheers, Fokko
>
> Ps. The repo might look a bit funky, but that's because I've created the
> pr-branch before the main branch. I didn't know that the branch that was
> created first would be promoted to the default branch. I'm working with
> Apache Infra <https://issues.apache.org/jira/browse/INFRA-25029> to get it
> fixed.
>
> On Sat, Sep 30, 2023 at 20:29, Daniel Weeks wrote:
>
>> +1 to relocate with history.
>>
>> On Sat, Sep 30, 2023, 10:24 AM Brian Olsen 
>> wrote:
>>
>>> This shouldn’t be too hard and can likely be a nightly build that occurs
>>> with each client repository.
>>>
>>> We’re already planning on doing the documentation using git submodule to
>>> pull all the documentation under a single build in the central repo. We can
>>> likely go the other direction

Re: [DISCUSSION] Rename master branch as main for the main repository

2023-10-02 Thread Fokko Driesprong
Big +1!

Thanks for raising this JB!

Kind regards,
Fokko

On Tue, Oct 3, 2023 at 07:56, Jean-Baptiste Onofré wrote:

> Thanks all for your feedback.
>
> I will prepare the renaming then, I will keep you posted.
>
> Regards
> JB
>
> On Tue, Oct 3, 2023 at 2:36 AM Renjie Liu  wrote:
> >
> > +1
> >
> > Sent from my iPhone
> >
> > On Oct 3, 2023, at 08:18, John Zhuge  wrote:
> >
> > 
> > +1
> >
> > On Mon, Oct 2, 2023 at 2:48 PM Brian Olsen 
> wrote:
> >>
> >> As with any of these changes, the one and only inescapable side-effect
> is that users' local environments will not be able to be updated. GitHub
> has otherwise made it very simple to rename branches to accommodate this
> use case. https://github.com/github/renaming Any old references to master
> on the GitHub site itself will reroute to main.
> >>
> >> It's a small annoyance to make the Iceberg community more inclusive.
> For those that aren't aware of the why:
> https://en.wikipedia.org/wiki/Master/slave_(technology)#Terminology_concerns
> .
> >>
> >> On Mon, Oct 2, 2023 at 4:34 PM Hussein Awala  wrote:
> >>>
> >>> +1
> >>>
> >>> On Mon, Oct 2, 2023 at 11:27 PM Anton Okolnychyi <
> aokolnyc...@apache.org> wrote:
> 
>  +1
> 
>  On 2023/10/02 20:12:37 Bryan Keller wrote:
>  > Hearty +1 from me
>  >
>  >
>  >
>  > > On Sep 29, 2023, at 5:37 AM, Brian Olsen 
> wrote:
>  > >
>  > >
>  >
>  > > 
>  > >
>  > > +1000
>  > >
>  > >
>  > >
>  > >
>  > > Let me know how I can help!
>  > >
>  > >
>  > >
>  > >
>  > > On Fri, Sep 29, 2023 at 7:35 AM Jean-Baptiste Onofré
>  > > <[j...@nanthrax.net](mailto:j...@nanthrax.net)> wrote:
>  > >
>  > >
>  >
>  > >> Hi guys,
>  > >
>  > >  The Apache CoC (<
> https://www.apache.org/foundation/policies/conduct>)
>  > >  especially contains section 5 about the wording we use. Several
> Apache
>  > >  projects renamed the master branch to the main branch (Apache
> Karaf,
>  > >  ActiveMQ, Airflow, ...).
>  > >  As we already use main for go, rust, and python repositories, I
> wonder
>  > >  (for consistency) if we should not rename master to main on the
> "main"
>  > >  repository.
>  > >
>  > >  Apache INFRA can do this "smoothly" but we would have to do some
> changes:
>  > >  \- update build.gradle
>  > >  \- update README.md
>  > >  \- update to GH Actions (in .github/workflows/*)
>  > >
>  > >  Thoughts ?
>  > >
>  > >  Regards
>  > >  JB
>  > >
>  >
>  >
> >
> >
> >
> > --
> > John Zhuge
>


Re: [PROPOSAL] Regular release pace & some post release actions

2023-10-07 Thread Fokko Driesprong
My 2ct,

There is no harm in stating it explicitly, however, I'm not in favor of
making it so explicit by pinning a date onto it (Jan 24). I would rather
say that releases can be expected at least every quarter (so it doesn't
need to be updated :)

I noticed that the releases of Iceberg are also driven by the release
cadence of query engines. Once there is a new Flink or Spark release, an
Iceberg release follows. I like to add *at least* because I see an uptick
in activity and I think we want to release as often as possible, without
introducing too much pressure on testing.
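
To put numbers on the "every quarter" wording, the feature-release dates listed in Ryan's message (quoted further down in this thread) can be turned into gaps between consecutive releases. This is only a small sketch over those six dates; the function name is made up for illustration.

```python
from datetime import date

# Feature-release dates as listed in Ryan's message quoted in this thread.
releases = [
    ("0.14.0", date(2022, 7, 16)),
    ("1.0.0", date(2022, 10, 14)),
    ("1.1.0", date(2022, 11, 29)),
    ("1.2.0", date(2023, 3, 20)),
    ("1.3.0", date(2023, 5, 26)),
    ("1.4.0", date(2023, 10, 4)),
]


def gaps_in_days(entries):
    """Return (version, days since the previous release) for each release."""
    ordered = sorted(entries, key=lambda entry: entry[1])
    return [
        (later[0], (later[1] - earlier[1]).days)
        for earlier, later in zip(ordered, ordered[1:])
    ]


for version, days in gaps_in_days(releases):
    print(f"{version}: {days} days after the previous feature release")
```

A quarter is roughly 91 days, so the printed gaps give a quick way to see how closely the past releases tracked that rhythm.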

Cheers, Fokko



On Sat, Oct 7, 2023 at 18:52, Jean-Baptiste Onofré wrote:

> Yes, agree. Patch release is whenever needed.
> The pace is more for « feature releases » and also the information on
> website.
>
> Regards
> JB
>
> On Sat, Oct 7, 2023 at 11:41, Renjie Liu wrote:
>
>> I think there are two kinds of releases:
>> 1. Feature release. That means to upgrade the minor part of the version
>> number, e.g. 1.4.0, 1.5.0, etc.
>> 2. Patch release. That's bug fixes to minor releases, which upgrades to
>> the last part of each release version, e.g. 1.4.1, 1.4.2.
>>
>> I think the quarterly release should be applied to feature release, while
>> patch release should be more frequent to fix bugs.
>>
>> On Sat, Oct 7, 2023 at 3:20 PM Jean-Baptiste Onofré 
>> wrote:
>>
>>> Just to be concrete about "regular & predictable releases pace", the
>>> proposal is to have one line on https://iceberg.apache.org/releases/
>>> like this:
>>>
>>> "Apache Iceberg releases are expected every quarter. Next target
>>> release is 1.4.1 planned on Jan 24."
>>>
>>> To be honest, only a few Apache projects do that (Karaf, Camel,
>>> ActiveMQ, Subversion, ...), I like this to give "vision" to the
>>> community :)
>>>
>>> Regards
>>> JB
>>>
>>> On Sat, Oct 7, 2023 at 6:59 AM Jean-Baptiste Onofré 
>>> wrote:
>>> >
>>> > Hi Ryan,
>>> >
>>> > For the pace, yes, it's what I saw with the previous release date. My
>>> > proposal is to clearly state that on website (on release page),
>>> > something like "We target a release per quarter". Just to inform the
>>> > community.
>>> >
>>> > About the other points:
>>> > 2.1. Great, thanks !
>>> > 2.2. Yes, release notes on releases page are fine. The proposal is
>>> > more to have some details about specific highlight points, with
>>> > examples for instance. Something like
>>> >
>>> http://nanthrax.blogspot.com/2022/04/apache-karaf-runtime-440-has-been.html
>>> .
>>> > It's a bit long for a release notes page, so it could be "linked" on
>>> > release notes page. About your point, I agree, but we already have
>>> > https://iceberg.apache.org/blogs/ with posts from different people.
>>> > How do we choose the blog posts here ? I guess these blog posts have
>>> > been submitted as PR and reviewed/merged. Maybe we can use the same
>>> > for release highlights ?
>>> > 2.3. The cleanup should be done as soon as a new release is uploaded
>>> > to dist.apache.org (for instance, we still have Iceberg 0.14.1 on
>>> > https://dist.apache.org/repos/dist/release/iceberg/). The tags cleanup
>>> > is up to us, but for dist, ASF INFRA asks for cleanup (we should have
>>> > only the latest release on dist.apache.org) to limit the space use.
>>> > 2.4. Cool, thanks ! I'm updating the PR with the DOAP.
>>> >
>>> > Thanks again ! Much appreciated :)
>>> >
>>> > Regards
>>> > JB
>>> >
>>> > On Fri, Oct 6, 2023 at 8:34 PM Ryan Blue  wrote:
>>> > >
>>> > > The Iceberg community has already established a regular release
>>> cadence, which is once per quarter. Here's the recent release history,
>>> minus patch releases:
>>> > >
>>> > > - 1.4.0: 2023-10-04
>>> > > - 1.3.0: 2023-05-26
>>> > > - 1.2.0: 2023-03-20
>>> > > - 1.1.0: 2022-11-29
>>> > > - 1.0.0: 2022-10-14
>>> > > - 0.14.0: 2022-07-16
>>> > >
>>> > > As you can see, we've generally met the target, so I'm not sure why
>>> you're suggesting a change.
>>> > >
>>> > > If your aim is for more strict adherence to the quarterly release
>>> target, I don't think that's a good idea. I think I've mentioned this
>>> before, but I think we want to avoid strict policies that inhibit our
>>> ability to make reasonable decisions as a community, as was the case here
>>> to get Spark 3.5 out as soon as possible.
>>> > >
>>> > > For your other suggestions:
>>> > > 2.1. Sure, let's send announcements to the announce list. Note that
>>> this has to happen after the website is updated, which causes delays right
>>> now. We're working on fixing this.
>>> > > 2.2. I don't think it is a good idea for the project to host blog
>>> posts because it puts the community in a very awkward position of choosing
>>> who can post and what content can be there. And I think what you're asking
>>> for is release notes, which we do post on the releases page. If you'd like
>>> to help make these better, please do! We always need help translating from
>>> PR descriptions to release notes that help people understand what is
>>> 

Re: [PROPOSAL] Regular release pace & some post release actions

2023-10-07 Thread Fokko Driesprong
There is also one post-release action that I would like to propose:
<https://github.com/apache/iceberg-docs/pull/280>. We also publish the
releases on GitHub <https://github.com/apache/iceberg/releases>, and I think
that gives a nice changelog. Now that Python is in its own repository, there
is no need to clean up the PyIceberg-related pull requests.

Kind regards, Fokko

On Sat, Oct 7, 2023 at 20:31, Daniel Weeks wrote:

> I would agree with Fokko here.  We want flexibility with releases and
> tracking to specific dates on the website just lends to unnecessary process.
>
> We also tend to track the release progress using github milestones
> especially as we get closer to the release date, which provides more
> context.  Tracking in multiple places just leads to inconsistency.
>
> -Dan
>
>
>
> On Sat, Oct 7, 2023 at 11:09 AM Fokko Driesprong  wrote:
>
>> My 2ct,
>>
>> There is no harm in stating it explicitly, however, I'm not in favor of
>> making it so explicit by pinning a date onto it (Jan 24). I would rather
>> say that releases can be expected at least every quarter (so it doesn't
>> need to be updated :)
>>
>> I noticed that the releases of Iceberg are also driven by the release
>> cadence of query engines. Once there is a new Flink or Spark release, an
>> Iceberg release follows. I like to add *at least* because I see an
>> uptick in activity and I think we want to release as often as possible,
>> without introducing too much pressure on testing.
>> Cheers, Fokko
>>
>>
>>
>> On Sat, Oct 7, 2023 at 18:52, Jean-Baptiste Onofré wrote:
>>
>>> Yes, agree. Patch release is whenever needed.
>>> The pace is more for « feature releases » and also the information on
>>> website.
>>>
>>> Regards
>>> JB
>>>
>>> On Sat, Oct 7, 2023 at 11:41, Renjie Liu wrote:
>>>
>>>> I think there are two kinds of releases:
>>>> 1. Feature release. That means to upgrade the minor part of the version
>>>> number, e.g. 1.4.0, 1.5.0, etc.
>>>> 2. Patch release. That's bug fixes to minor releases, which upgrades to
>>>> the last part of each release version, e.g. 1.4.1, 1.4.2.
>>>>
>>>> I think the quarterly release should be applied to feature release,
>>>> while patch release should be more frequent to fix bugs.
>>>>
>>>> On Sat, Oct 7, 2023 at 3:20 PM Jean-Baptiste Onofré 
>>>> wrote:
>>>>
>>>>> Just to be concrete about "regular & predictable releases pace", the
>>>>> proposal is to have one line on https://iceberg.apache.org/releases/
>>>>> like this:
>>>>>
>>>>> "Apache Iceberg releases are expected every quarter. Next target
>>>>> release is 1.4.1 planned on Jan 24."
>>>>>
>>>>> To be honest, only a few Apache projects do that (Karaf, Camel,
>>>>> ActiveMQ, Subversion, ...), I like this to give "vision" to the
>>>>> community :)
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On Sat, Oct 7, 2023 at 6:59 AM Jean-Baptiste Onofré 
>>>>> wrote:
>>>>> >
>>>>> > Hi Ryan,
>>>>> >
>>>>> > For the pace, yes, it's what I saw with the previous release date. My
>>>>> > proposal is to clearly state that on website (on release page),
>>>>> > something like "We target a release per quarter". Just to inform the
>>>>> > community.
>>>>> >
>>>>> > About the other points:
>>>>> > 2.1. Great, thanks !
>>>>> > 2.2. Yes, release notes on releases page are fine. The proposal is
>>>>> > more to have some details about specific highlight points, with
>>>>> > examples for instance. Something like
>>>>> > http://nanthrax.blogspot.com/2022/04/apache-karaf-runtime-440-has-been.html.
>>>>> > It's a bit long for a release notes page, so it could be "linked" on
>>>>> > release notes page. About your point, I agree, but we already have
>>>>> > https://iceberg.apache.org/blogs/ with posts from different people.
>>>>> > How do we choose the blog posts here ? I guess these blog posts have
>>>>> > been submitted as PR and reviewed/merged. Maybe we can use the same
>>>>> > for rel

Re: Migration of PyIceberg to iceberg-python repository

2023-10-08 Thread Fokko Driesprong
Hey everyone,

It has been a week since PyIceberg migrated to its own repository. Should
we move forward by removing the Python codebase from the main repository?
Ajantha already raised a pull-request
<https://github.com/apache/iceberg/pull/8695> to do this (thank you for
that 🙌).

Kind regards,
Fokko

Op ma 2 okt 2023 om 16:16 schreef Fokko Driesprong :

> Hey everyone,
>
> Update from my side. I've moved all the issues
> <https://github.com/apache/iceberg-python/issues> and my PRs
> <https://github.com/apache/iceberg-python/pulls>. Not all issues needed
> to be migrated since a lot of them were already fixed. I've closed the
> remaining PRs that were still open, those are either abandoned, failed on
> CI, or had changes pending. Of course, with the kind request to re-open
> them to the iceberg-python repository.
>
> Ajantha already created a PR <https://github.com/apache/iceberg/pull/8695>
> (thanks for that!) to remove Python from the iceberg repo.
>
> Kind regards, Fokko
>
>
> Op za 30 sep 2023 om 21:06 schreef Fokko Driesprong :
>
>> Hey everyone,
>>
>> Pucheng: I wonder how do we deal with all the issues filed for python
>>> module but still open in iceberg repo?
>>
>>
>> That's a good point. I think we should migrate them. I checked and it is
>> only 3 pages
>> <https://github.com/apache/iceberg/issues?q=is%3Aissue+is%3Aopen+python>.
>> Likely a few more if we query on other keywords. I think migrating them by
>> hand is feasible. It also gives us a chance to clean them up (all the
>> issues on the last page I linked above are not relevant anymore, and can be
>> closed).
>>
>> Brian: The one thing we will lose is pull requests, but I assume there
>>> are very few.
>>
>>
>> I've checked those as well, and as Brian already mentioned, there are just
>> a few
>> <https://github.com/apache/iceberg/pulls?q=is%3Apr+is%3Aopen+label%3Apython>.
>> There is never a perfect moment since there are always PRs open that will
>> break, but just after the release I think is the best worst moment :) The
>> PRs that are open are trivial to move to the new repo as well.
>>
>> Hussain: I checked the discussion thread, and one of the motivations for
>>> this separation was to avoid triggering unrelated CI jobs after each
>>> change. However, I wonder if it isn't (and will not be) necessary to check
>>> the compatibility between the main repository and the client after each
>>> change. Otherwise, we will need to trigger the CI across the different
>>> repositories using the GHA API, not necessarily to block the PR, but just
>>> to give quick feedback and notification that something needs to be changed
>>> on the client side.
>>
>>
>> Checking between dev versions is not something we do today, and PyIceberg
>> lives isolated in the main repository. We might want to do some integration
>> tests at some point, but I'm not sure if we should start testing dev
>> versions against each other. The main issue with triggering the CI is to
>> not exponentially explode the ignore list
>> <https://github.com/apache/iceberg/blob/master/.github/workflows/flink-ci.yml#L20-L51>
>> of a Github action. An example here
>> <https://github.com/apache/iceberg/pull/8546#issuecomment-1712958280> is
>> where the Python GA file was not properly excluded.
>>
>> I would much rather rely on some reference tests that Jean-Baptiste
>> mentioned at the Java Iceberg 1.4.0 release, and that we're also working on
>> at Tabular (disclaimer: I'm working for Tabular). Python is inspired by
>> Java, and we've recently uncovered some issues
>> <https://github.com/apache/iceberg/pull/8673> (thanks Jan Finis!) with
>> respect to adhering to the spec, so I think a strict approach to validate
>> the implementations would be preferred.
>>
>> That said, in PyIceberg we use Spark (which uses the Java library) to run
>> integration tests. This is based on the released versions which works very
>> well. Not sure if we should create matrices between
>> Python/Go/Rust/Iceberg/Athena/Snowflake/... (you're seeing where this is
>> going) :) But these are just my thoughts today and might change in the
>> future.
>>
>> Thanks everyone, I'll go ahead and merge the PR that includes the history.
>>
>> Cheers, Fokko
>>
>> Ps. The repo might look a bit funky, but that's because I've created the
>> pr-branch before the main branch. I didn't know that the branch that was
>> created fir

PyIceberg 0.5.1 patch release

2023-10-15 Thread Fokko Driesprong
Hey everyone,

This week we've discovered a serious bug when parsing SQL-like string
expressions (thanks Pucheng for reporting this)! Ryan suggested doing a
quick patch release to get this fix out to the public ASAP, and I think
that's a great idea. Since it is just a patch release, I'm happy to run it.
I've created the pyiceberg-0.5.x branch and a milestone. I've cherry-picked
8 PRs to this branch that don't introduce any behavioral changes. Let me
know if anything is missing.

There is one PR that needs to get in before I can start the release process.

Cheers, Fokko


Re: Iceberg Slack invite

2023-10-16 Thread Fokko Driesprong
Hey Lin,

Can you try this link:
https://join.slack.com/t/apache-iceberg/shared_invite/zt-2561tq9qr-UtISlHgsdY3Virs3Z2_btQ
The link you mentioned is working for me, but I'm already part of the
workspace. Can you share the error that you're seeing?

Kind regards,
Fokko

Op ma 16 okt 2023 om 22:39 schreef Lin, Jon :

> I’m evaluating Iceberg for my team’s transition to Open Table Formats.
>
>
>
> Is it possible to get access to the Slack workspace? The invite link
> wouldn’t let me sign up with my Gmail account.
>


[VOTE] Release Apache PyIceberg 0.5.1 (RC1)

2023-10-16 Thread Fokko Driesprong
Hi Everyone,


I propose that we release the following RC as the official PyIceberg 0.5.1
release.


This is a patch release due to a bug that has been found. Smaller bugs have
also been backported.


The commit ID is ea9da8856a686eaeda0d5c2be78d5e3102b67c44


* This corresponds to the tag: pyiceberg-0.5.1rc1
(320b0f499d14178210c3b9cb7d94dab1e1b149e6)

* https://github.com/apache/iceberg-python/releases/tag/pyiceberg-0.5.1rc1

* https://github.com/apache/iceberg-python/tree/ea9da8856a686eaeda0d5c2be78d5e3102b67c44



The release tarball, signature, and checksums are here:


* https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.5.1rc1/


You can find the KEYS file here:


* https://dist.apache.org/repos/dist/dev/iceberg/KEYS


Convenience binary artifacts are staged on pypi:


https://pypi.org/project/pyiceberg/0.5.1rc1/


And can be installed using: pip3 install pyiceberg==0.5.1rc1


Please download, verify, and test.


Please vote in the next 72 hours.

[ ] +1 Release this as PyIceberg 0.5.1

[ ] +0

[ ] -1 Do not release this because...


Kind regards,

Fokko
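
For anyone verifying an RC for the first time, the checksum part of
"download, verify, and test" can be sketched in Python as below. This is only
a sketch under assumptions: the helper names are made up for illustration,
and the checksum file is assumed to start with the hex SHA-512 digest. The
signature check against the KEYS file still has to be done with gpg.

```python
import hashlib
from pathlib import Path


def sha512_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(artifact: Path, checksum_file: Path) -> bool:
    """Compare an artifact against a checksum file.

    Assumption: the first whitespace-separated token of the checksum file
    is the hex digest (the layout written by tools such as `shasum`).
    """
    expected = checksum_file.read_text().split()[0].lower()
    return sha512_of(artifact) == expected
```

Usage would look like checksum_matches(Path("release.tar.gz"),
Path("release.tar.gz.sha512")), where both file names are placeholders for
whatever was downloaded from the dist directory.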


Re: [VOTE] Release Apache PyIceberg 0.5.1 (RC1)

2023-10-19 Thread Fokko Driesprong
+1 (binding) from me as well.

I ran the example notebooks against the REST catalog
<https://github.com/tabular-io/docker-spark-iceberg/pull/108>.

Cheers, Fokko

Op do 19 okt 2023 om 02:37 schreef Rushan Jiang :

> +1 (non-binding)
>
> - Verified signatures and checksums
> - Verified license
> - Ran unit tests and integration tests via make test-coverage
>
> Thanks,
> Jonas
>
>
> > On Oct 16, 2023, at 23:14, Jean-Baptiste Onofré  wrote:
> >
> > +1 (non binding)
> >
> > I checked:
> > - hash and signature are good
> > - source distribution is good
> > - run a quick test locally
> >
> > Thanks,
> > Regards
> > JB
> >
> > On Mon, Oct 16, 2023 at 11:28 PM Fokko Driesprong 
> wrote:
> >>
> >> Hi Everyone,
> >>
> >>
> >> I propose that we release the following RC as the official PyIceberg
> 0.5.1 release.
> >>
> >>
> >> This is a patch release due to a bug that has been found. Smaller bugs
> also have been backported.
> >>
> >>
> >> The commit ID is ea9da8856a686eaeda0d5c2be78d5e3102b67c44
> >>
> >>
> >> * This corresponds to the tag: pyiceberg-0.5.1rc1
> (320b0f499d14178210c3b9cb7d94dab1e1b149e6)
> >>
> >> *
> https://github.com/apache/iceberg-python/releases/tag/pyiceberg-0.5.1rc1
> >>
> >> *
> https://github.com/apache/iceberg-python/tree/ea9da8856a686eaeda0d5c2be78d5e3102b67c44
> >>
> >>
> >> The release tarball, signature, and checksums are here:
> >>
> >>
> >> * https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.5.1rc1/
> >>
> >>
> >> You can find the KEYS file here:
> >>
> >>
> >> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
> >>
> >>
> >> Convenience binary artifacts are staged on pypi:
> >>
> >>
> >> https://pypi.org/project/pyiceberg/0.5.1rc1/
> >>
> >>
> >> And can be installed using: pip3 install pyiceberg==0.5.1rc1
> >>
> >>
> >> Please download, verify, and test.
> >>
> >>
> >> Please vote in the next 72 hours.
> >>
> >> [ ] +1 Release this as PyIceberg 0.5.1
> >>
> >> [ ] +0
> >>
> >> [ ] -1 Do not release this because...
> >>
> >>
> >> Kind regards,
> >>
> >> Fokko
>
>


Re: [VOTE] Release Apache Iceberg 1.4.1 RC0

2023-10-19 Thread Fokko Driesprong
Thanks Eduard for running this release!

+1 (binding):

   - Checked the sha/signature
   - Ran our example notebooks against 1.4.1 and it looks good

Xuanwo, if you want to learn more about voting, there is also an Apache
page on it (that includes some suggestions :). But also feel welcome to ask
on the devlist here.

Kind regards,
Fokko


Op do 19 okt 2023 om 11:02 schreef Xuanwo :

> That said, from a community standpoint, it's good to take any -1 (binding
> or non binding) into account.
>
> In your case, I would have voted -0 (to avoid confusion).
>
>
> Lesson learned. Next time, if the same situation occurs, I'll vote -0 to
> make my statement more clear.
>
> On Thu, Oct 19, 2023, at 16:23, Jean-Baptiste Onofré wrote:
>
> By the way, at Apache, it's not really possible to veto or block a
> release: you need three binding votes, even if we have a fourth binding
> vote with -1, the release can pass.
> That said, from a community standpoint, it's good to take any -1 (binding
> or non binding) into account.
>
> In your case, I would have voted -0 (to avoid confusion).
>
> You can see that I voted +1 because:
> - the release is the same as the previous ones
> - the issues have been identified and so we can fix it
>
> Regards
> JB
>
> On Thu, Oct 19, 2023 at 10:15 AM Xuanwo  wrote:
>
>
> You can see it’s what I mentioned in my vote email. However, as it’s like
> this for a while, I voted +1 and I have PRs ready to be submitted
> (including rat execution).
>
> So do you think it’s blocking ?
>
>
> Thanks for the clarification.
>
> I'm voting -1 due to the reasons mentioned, but it doesn't block this
> release (especially since it's non-binding). This release can proceed once
> it garners enough +1 votes. My -1 vote is simply to highlight areas we
> could improve in future releases.
>
>
> On Thu, Oct 19, 2023, at 13:11, Jean-Baptiste Onofré wrote:
>
> Hi
>
> You can see it’s what I mentioned in my vote email. However, as it’s like
> this for a while, I voted +1 and I have PRs ready to be submitted
> (including rat execution).
>
> So do you think it’s blocking ?
>
> Regards
> JB
>
> Le mer. 18 oct. 2023 à 16:27, Xuanwo  a écrit :
>
>
> -1 (non-binding)
>
> - checksum and signature are good
>
> - the following files do not have a license header
>   - .baseline/idea/intellij-java-palantir-style.xml
>   - .baseline/checkstyle/checkstyle.xml
>   - gradle/libs.versions.toml
>   - .baseline/checkstyle/checkstyle-suppressions.xml
>   - .baseline/checkstyle/checkstyle-suppressions.xml
>
> - release contains binary files
>   -
> core/src/test/resources/org/apache/iceberg/puffin/v1/empty-puffin-uncompressed.bin
>   -
> core/src/test/resources/org/apache/iceberg/puffin/v1/sample-metric-data-compressed-zstd.bin
>   -
> core/src/test/resources/org/apache/iceberg/puffin/v1/sample-metric-data-uncompressed.bin
>
> On Wed, Oct 18, 2023, at 21:55, Eduard Tudenhoefner wrote:
>
> +1 (non-binding)
>
> * validated checksum and signature
> * checked license docs & ran RAT checks
> * ran build and tests with JDK8
> * ran into one test failure, which is reported in
> https://github.com/apache/iceberg/issues/8824, but this shouldn't block
> the release
> * tested with Trino in https://github.com/trinodb/trino/pull/19434
>
> On Wed, Oct 18, 2023 at 3:15 PM Jean-Baptiste Onofré 
> wrote:
>
> +1 (non binding)
>
> I checked:
> * hashes and signatures are OK
> * I did quick tests using spark 3.5
>
> I found the following issues that we should fix:
> * the source distribution contains two binary files (used for
> tests, empty-puffin-uncompressed.bin
> and sample-metric-data-uncompressed.bin). Binary files should not be
> included in the source distribution.
> * some files don't contain ASF header
>
> I will work to fix these issues, and also, I will propose to include rat
> to test our distribution.
>
> Regards
> JB
>
>
> On Wed, Oct 18, 2023 at 11:15 AM Eduard Tudenhoefner 
> wrote:
>
> Hi Everyone,
>
> I propose that we release the following RC as the official Apache Iceberg
> 1.4.1 release.
>
> The commit ID is 445664fb8d82950215872cbfec91e37c5fa0920f
> * This corresponds to the tag: apache-iceberg-1.4.1-rc0
> * https://github.com/apache/iceberg/commits/apache-iceberg-1.4.1-rc0
> *
> https://github.com/apache/iceberg/tree/445664fb8d82950215872cbfec91e37c5fa0920f
>
> The release tarball, signature, and checksums are here:
> * https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-1.4.1-rc0
>
> You can find the KEYS file here:
> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>
> Convenience binary artifacts are staged on Nexus. The Maven repository URL
> is:
> *
> https://repository.apache.org/content/repositories/orgapacheiceberg-1147/
>
> Please download, verify, and test.
>
> Please vote in the next 72 hours.
>
> [ ] +1 Release this as Apache Iceberg 1.4.1
> [ ] +0
> [ ] -1 Do not relea

Re: Request access to iceberg slack channel

2023-10-19 Thread Fokko Driesprong
Hey Alessio,

Everyone is welcome on the Iceberg slack. What kind of error are you
seeing? Can you try this link:
https://join.slack.com/t/apache-iceberg/shared_invite/zt-2561tq9qr-UtISlHgsdY3Virs3Z2_btQ

Kind regards,
Fokko

Op do 19 okt 2023 om 19:11 schreef Alessio Izzo :

> Hello,
> I'd like to request access to the slack channel because I'm working with
> apache iceberg and I also did a small contribution on the python api (
> https://github.com/apache/iceberg/pull/8286).
> I tried to follow the slack link in the website but I got the error that I
> do not have the apache.org email.
> I'm sorry but maybe I did not understand how permissions works, but since
> I am using iceberg at work and I would like to contribute in the future,
> being in the slack channel would help.
> Please let me know if there is anything I can do.
> Thanks a lot, regards.
>
> Alessio
>


Re: [VOTE] Release Apache PyIceberg 0.5.1 (RC1)

2023-10-22 Thread Fokko Driesprong
Hey Dan,

Thanks for running these additional checks, and I agree that this should be
fixed. Let's cancel RC1 and I'll cherry pick the fixes and come up with RC2.

Kind regards,
Fokko

Op za 21 okt 2023 om 14:59 schreef Daniel Weeks :

> I created #88 <https://github.com/apache/iceberg-python/pull/88> to
> address the last statement because the parsing wasn't configured to require
> a full expression statement match.
>
> Looking at the 'like' tests, I think it missed on the syntax, which should
> require a `%` to be sql compliant (currently appears to just be evaluating
> to "starts with").
>
> -Dan
>
> On Fri, Oct 20, 2023 at 6:15 PM Daniel Weeks  wrote:
>
>> Fokko, I think I found similar filter problems while experimenting:
>>
>> Using a filter like: t.scan().filter("location_id in (1,2,3)").to_arrow()
>> appears to filter correctly.
>>
>> However, a "like" query silently filters everything out:
>> t.scan().filter("location_id in (1,2,3) and zone_name like 'Jam%'").to_arrow()
>>
>> A query like: t.scan().filter("location_id in (1,2,3) and
>> lower(zone_name) = 'Jamaica Bay'").to_arrow() only applies the first
>> predicate and silently ignores the second.
>>
>> Overall, I'm -1 as I think we have larger issues than just the one case.
>>
>> -Dan
>>
>>
>> On Fri, Oct 20, 2023 at 12:51 PM Ryan Blue  wrote:
>>
>>> Fokko clarified offline that the commit I was looking for wasn't moved
>>> over to the iceberg-python repo because it didn't affect files in the
>>> python/ directory. The last commit that did was
>>> https://github.com/apache/iceberg/commit/187c9441a1830d323c862136e74f83876ab400c8,
>>> which is in the 0.5.x branch's history.
>>>
>>> Looks good to me, so I'll vote +1 (binding)
>>>
>>> On Fri, Oct 20, 2023 at 12:39 PM Ryan Blue  wrote:
>>>
>>>> The release build looks fine:
>>>> - Ran RAT checks
>>>> - Validated signature and checksum
>>>> - Ran tests in Python 3.10.5
>>>>
>>>> Unfortunately, I haven't been able to verify the set of changes. I was
>>>> looking at the 0.5.0-rc3 tag in the main repo:
>>>> https://github.com/apache/iceberg/commit/f798b06246e67131d413dfceece5ccaf269e01fe
>>>>
>>>> I don't see that commit in the 0.5.x branch. Where did 0.5.x branch
>>>> from?
>>>>
>>>> Ryan
>>>>
>>>> On Thu, Oct 19, 2023 at 5:12 AM Amogh Jahagirdar 
>>>> wrote:
>>>>
>>>>> +1 non-binding
>>>>>
>>>>> Verified signature and checksum, RAT checks, and ran all
>>>>> unit/integration tests.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Amogh
>>>>>
>>>>> On Thu, Oct 19, 2023 at 2:23 AM Fokko Driesprong 
>>>>> wrote:
>>>>>
>>>>>> +1 (binding) from me as well.
>>>>>>
>>>>>> I ran the example notebooks against the REST catalog
>>>>>> <https://github.com/tabular-io/docker-spark-iceberg/pull/108>.
>>>>>>
>>>>>> Cheers, Fokko
>>>>>>
>>>>>> Op do 19 okt 2023 om 02:37 schreef Rushan Jiang <
>>>>>> jonasjiang@gmail.com>:
>>>>>>
>>>>>>> +1 (non-binding)
>>>>>>>
>>>>>>> - Verified signatures and checksums
>>>>>>> - Verified license
>>>>>>> - Ran unit tests and integration tests via make test-coverage
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jonas
>>>>>>>
>>>>>>>
>>>>>>> > On Oct 16, 2023, at 23:14, Jean-Baptiste Onofré 
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > +1 (non binding)
>>>>>>> >
>>>>>>> > I checked:
>>>>>>> > - hash and signature are good
>>>>>>> > - source distribution is good
>>>>>>> > - run a quick test locally
>>>>>>> >
>>>>>>> > Thanks,
>>>>>>> > Regards
>>>>>>> > JB
>>>>>>> >
>>>>>>> > On Mon, Oct 16, 2023 at 11:28 PM Fokko Driespro

[VOTE] Release Apache PyIceberg 0.5.1 RC2

2023-10-24 Thread Fokko Driesprong
Hi Everyone,

I propose that we release the following RC as the official PyIceberg 0.5.1
release.

This is a patch release due to bugs that have been found:

- Part of the expression is ignored when multiple and/or expressions are
  specified
- Update like statements to reflect SQL behaviors

Smaller bugs have also been backported.

The commit ID is 891b4c7f4214fb9118080ce2215a210a770a5019

* This corresponds to the tag: pyiceberg-0.5.1rc2
(c5085159079fe100b7fbd38b5037d1408525dc46)
* https://github.com/apache/iceberg-python/releases/tag/pyiceberg-0.5.1rc2
* https://github.com/apache/iceberg-python/tree/891b4c7f4214fb9118080ce2215a210a770a5019


The release tarball, signature, and checksums are here:

* https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.5.1rc2/

You can find the KEYS file here:

* https://dist.apache.org/repos/dist/dev/iceberg/KEYS

Convenience binary artifacts are staged on pypi:

https://pypi.org/project/pyiceberg/0.5.1rc2/

And can be installed using: pip3 install pyiceberg==0.5.1rc2

Please download, verify, and test.

Please vote in the next 72 hours.
[ ] +1 Release this as PyIceberg 0.5.1
[ ] +0
[ ] -1 Do not release this because...

Consider this mail my +1 vote (binding) after running against our example
notebooks.

Kind regards, Fokko
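
To illustrate the two bug classes listed above, here is a toy sketch. It is
not the PyIceberg parser: the grammar, the regular expression, and all
function names are assumptions made up for illustration. The first half shows
how a parser that only needs a valid prefix silently drops the rest of a
filter, while one that requires a full match rejects it. The second half
shows SQL-style `like` handling where only a trailing `%` means "starts with"
and a pattern without wildcards is an exact match.

```python
import re

# Toy grammar: a single predicate of the form  <column> in (<int>, <int>, ...)
PREDICATE = re.compile(r"\s*(\w+)\s+in\s+\(([\d,\s]+)\)")


def _to_result(match):
    """Turn a regex match into (column, [values])."""
    return match.group(1), [int(v) for v in match.group(2).split(",")]


def parse_prefix(expr: str):
    """Lenient: succeeds as soon as a valid predicate is found at the start."""
    match = PREDICATE.match(expr)
    if match is None:
        raise ValueError(f"cannot parse: {expr!r}")
    return _to_result(match)


def parse_full(expr: str):
    """Strict: the whole input must be consumed, otherwise it is an error."""
    match = PREDICATE.fullmatch(expr.strip())
    if match is None:
        raise ValueError(f"cannot parse: {expr!r}")
    return _to_result(match)


def like_matches(value: str, pattern: str) -> bool:
    """Evaluate a restricted LIKE: 'abc%' is starts-with, 'abc' is equality.

    Any other wildcard use ('%' elsewhere, or '_') is rejected instead of
    being quietly treated as starts-with.
    """
    is_prefix = pattern.endswith("%")
    body = pattern[:-1] if is_prefix else pattern
    if "%" in body or "_" in body:
        raise ValueError(f"unsupported LIKE pattern: {pattern!r}")
    return value.startswith(body) if is_prefix else value == body
```

With the filter "location_id in (1,2,3) and lower(zone_name) = 'Jamaica Bay'",
parse_prefix returns only the first predicate and says nothing about the
rest, whereas parse_full raises an error, which is the safer behavior for a
row filter.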


Re: [VOTE] Release Apache Iceberg 1.4.2 RC0

2023-10-30 Thread Fokko Driesprong
Thanks for the quick follow-up, Amogh!

+1 (binding)

Verified sigs/sums/license/build and ran against our example notebooks.

Kind regards,
Fokko


Op ma 30 okt 2023 om 04:42 schreef Daniel Weeks :

> +1 (binding)
>
> Verified sigs/sums/license/build/test (Java 11)
>
> -Dan
>
> On Sat, Oct 28, 2023 at 10:56 PM Jean-Baptiste Onofré 
> wrote:
>
>> +1 (non binding)
>>
>> I checked:
>> - hash and signature are good
>> - ASF headers are still missing on some files (.baseline, etc): it has
>> been fixed on main but not cherry picked on 1.4.x branch
>> - still puffin binary files in the source distribution (I will work on
>> a fix about that)
>> - build ok from source distribution
>> - NB: dist area (both release and dev) should be cleaned from old
>> releases. I will tackle that.
>>
>> Thanks,
>> Regards
>> JB
>>
>> On Sat, Oct 28, 2023 at 11:09 PM Amogh Jahagirdar 
>> wrote:
>> >
>> > Hi Everyone,
>> >
>> > I propose that we release the following RC as the official Apache
>> Iceberg 1.4.2 release.
>> >
>> > The commit ID is f6bb9173b13424d77e7ad8439b5ef9627e530cb2
>> > * This corresponds to the tag: apache-iceberg-1.4.2-rc0
>> > * https://github.com/apache/iceberg/commits/apache-iceberg-1.4.2-rc0
>> > *
>> https://github.com/apache/iceberg/tree/f6bb9173b13424d77e7ad8439b5ef9627e530cb2
>> >
>> > The release tarball, signature, and checksums are here:
>> > *
>> https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-1.4.2-rc0
>> >
>> > You can find the KEYS file here:
>> > * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>> >
>> > Convenience binary artifacts are staged on Nexus. The Maven repository
>> URL is:
>> > * https://repository.apache.org/content/repositories/orgapacheiceberg-1148
>> >
>> > This release includes a patch for ensuring engines can successfully
>> read tables when their split offset metadata was corrupted due to a bug in
>> 1.4.0. See https://github.com/apache/iceberg/pull/8925 for more details.
>> >
>> > Please download, verify, and test.
>> >
>> > Please vote in the next 72 hours.
>> > [ ] +1 Release this as Apache Iceberg 1.4.2
>> > [ ] +0
>> > [ ] -1 Do not release this because...
>> >
>> > Only PMC members have binding votes, but other community members are
>> encouraged to cast
>> > non-binding votes. This vote will pass if there are 3 binding +1 votes
>> and more binding
>> > +1 votes than -1 votes.
>>
>
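
The pass rule quoted above (3 binding +1 votes and more binding +1 votes than
-1 votes) can be written down as a small check. This is a sketch, not an
official tool: the Vote class and function name are hypothetical, and "3
binding +1 votes" is read here as "at least three".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    voter: str
    value: int      # +1, 0 or -1
    binding: bool   # True for PMC members


def vote_passes(votes: list[Vote]) -> bool:
    """Apply the rule: >= 3 binding +1 and more binding +1 than binding -1."""
    binding_plus = sum(1 for v in votes if v.binding and v.value > 0)
    binding_minus = sum(1 for v in votes if v.binding and v.value < 0)
    return binding_plus >= 3 and binding_plus > binding_minus
```

Non-binding votes do not enter the count, which matches the earlier remark
in this archive that a non-binding -1 cannot block a release, although it is
still good practice to take it into account.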


Re: [VOTE] Release Apache PyIceberg 0.5.1 RC2

2023-10-30 Thread Fokko Driesprong
Thanks everyone for voting! I'll go ahead with the release since we have 5
positive votes:

+1 Rushan (non-binding)
+1 JB (non-binding)
+1 Dan Weeks (binding)
+1 Ryan Blue (binding)
+1 Fokko Driesprong (binding)

Thanks everyone, and I'll send out the announcement when all the artifacts
are published.

Kind regards,
Fokko


Op za 28 okt 2023 om 21:24 schreef Ryan Blue :

> +1 (binding)
>
> Verified the same way I did last time.
>
> On Fri, Oct 27, 2023 at 5:21 PM Daniel Weeks  wrote:
>
>> +1 (binding)
>>
>> Verified sigs/sums/license/install/test (python 3.10)
>>
>> Ran extensive filter tests and everything worked as expected with
>> Arrow/Pandas/DuckDB.
>>
>> -Dan
>>
>> On Fri, Oct 27, 2023 at 3:02 PM Hussein Awala  wrote:
>>
>>> +1 (non-binding) I ran the example notebooks and tested some queries
>>> with PyArrow and Pandas, all looks good.
>>>
>>> On Fri, Oct 27, 2023 at 11:46 AM Jean-Baptiste Onofré 
>>> wrote:
>>>
>>>> +1 (non binding)
>>>>
>>>> I checked:
>>>> - hash and signatures are good
>>>> - I will check NOTICE (copyright is 2022 and I think some deps are
>>>> missing there), not release blocker
>>>> - ASF headers are present
>>>> - no binary file detected
>>>> - very quick test
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On Tue, Oct 24, 2023 at 8:48 PM Fokko Driesprong 
>>>> wrote:
>>>> >
>>>> > Hi Everyone,
>>>> >
>>>> > I propose that we release the following RC as the official PyIceberg
>>>> 0.5.1 release.
>>>> >
>>>> > This is a patch release due to bugs:
>>>> >
>>>> > - Part of the expression is ignored when multiple and/or expressions
>>>> are specified
>>>> > - Update like statements to reflect sql behaviors
>>>> >
>>>> > That has been found. Smaller bugs also have been backported.
>>>> >
>>>> > The commit ID is 891b4c7f4214fb9118080ce2215a210a770a5019
>>>> >
>>>> > * This corresponds to the tag: pyiceberg-0.5.1rc2
>>>> (c5085159079fe100b7fbd38b5037d1408525dc46)
>>>> > *
>>>> https://github.com/apache/iceberg-python/releases/tag/pyiceberg-0.5.1rc2
>>>> > *
>>>> https://github.com/apache/iceberg-python/tree/891b4c7f4214fb9118080ce2215a210a770a5019
>>>> >
>>>> > The release tarball, signature, and checksums are here:
>>>> >
>>>> > * https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.5.1rc2/
>>>> >
>>>> > You can find the KEYS file here:
>>>> >
>>>> > * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>>>> >
>>>> > Convenience binary artifacts are staged on pypi:
>>>> >
>>>> > https://pypi.org/project/pyiceberg/0.5.1rc2/
>>>> >
>>>> > And can be installed using: pip3 install pyiceberg==0.5.1rc2
>>>> >
>>>> > Please download, verify, and test.
>>>> >
>>>> > Please vote in the next 72 hours.
>>>> > [ ] +1 Release this as PyIceberg 0.5.1
>>>> > [ ] +0
>>>> > [ ] -1 Do not release this because...
>>>> >
>>>> > Consider this mail my +1 vote (binding) after running against our
>>>> example notebooks.
>>>> >
>>>> > Kind regards, Fokko
>>>>
>>>
>
> --
> Ryan Blue
> Tabular
>


[ANNOUNCE] PyIceberg 0.5.1 release

2023-10-30 Thread Fokko Driesprong
Hi everyone,

I'm delighted to announce the PyIceberg 0.5.1 release. This is a patch
release that fixes a bug in the parsing of SQL statements. It also brings
in some smaller bugfixes; for details, please check the 0.5.1
highly recommended to update to this version. Thanks everyone for
contributing!

Kind regards,
Fokko Driesprong


Re: Updating the Iceberg table architecture diagram

2023-11-03 Thread Fokko Driesprong
Hey Jason, thanks for updating the chart.

I like it a lot. However, there are a lot of boxes and new terms. What do
you think of keeping both files, and indicating that the old one applies to
V1 tables and the new one to V2 tables?

Kind regards,
Fokko

Op vr 3 nov 2023 om 14:37 schreef Aaron Niskode-Dossett
:

> An update would be greatly appreciated, thank you!
>
> On Thu, Nov 2, 2023 at 12:42 PM Jason Hughes 
> wrote:
>
>> Hey all,
>>
>> The current architecture diagram for an iceberg table hasn't been updated
>> in over 3 years, and there are some aspects to the architecture of an
>> iceberg table that have changed, most notably delete files and puffin
>> files. since this diagram gets a lot of use in enablement content around
>> the community and isn't totally accurate anymore, @Ajantha Bhat U and I
>> discussed updating it to be more accurate
>>
>> here's an updated version of the diagram we put together
>>
>> a few points for discussion that we're interested in others' thoughts on:
>>
>>    1. the diagram is obviously somewhat more visually complicated than
>>    the current one, but IMO the benefit of being more accurate for people
>>    learning iceberg outweighs the additional complexity
>>    2. since the partition stats spec PR just got merged, we thought it'd
>>    be good to include that too while we're updating it, and combine puffin
>>    files with partition stats files into one category of files in the
>>    diagram labeled "statistics files". we combined them in the diagram,
>>    rather than splitting them up, because 1. it provides a simpler
>>    diagram, 2. gets the primary point across, and 3. they both serve the
>>    purpose of providing statistics for tools to leverage (albeit for
>>    different use cases)
>>    3. we put statistics files in place in the diagram for both s0 and s1,
>>    though we could only have statistics files for s1, which would 1. make
>>    the diagram simpler, and 2. show a simple example of the use case of
>>    not needing stats files initially, but then as data grows and/or query
>>    patterns change, now stats files are needed
>>
>> if folks are on board with updating the diagram, and after we come to a
>> conclusion on the above discussion points and any others that come up, I
>> can export it to a png and create a PR to update the arch diagram image on
>> the site
>>
>> thanks!
>>
>>
>> Jason Hughes
>>
>>
>> Dremio | Director of Technical Advocacy
>>
>>
>>
>>
>>
>
> --
> Aaron Niskode-Dossett, Data Engineering -- Etsy
>


Re: Add me to slack channel

2023-11-05 Thread Fokko Driesprong
Hey Sardar,

Please use the following URL:
https://join.slack.com/t/apache-iceberg/shared_invite/zt-2561tq9qr-UtISlHgsdY3Virs3Z2_btQ
The Slack channel should be public; let me know what you ran into. We're
always happy to answer questions.

Kind regards,
Fokko


Op zo 5 nov 2023 om 20:02 schreef Sardar Khan
:

> Hi,
> I have a few questions regarding the new Delta Lake migration method here:
> https://iceberg.apache.org/docs/1.3.0/delta-lake-migration/
>
> Could you please add me to the slack channel, so I could ask my questions,
>
> Best,
> Sardar Khan
>
>
>
> --
>
> The information contained in this e-mail may be confidential and/or
> proprietary to Capital One and/or its affiliates and may only be used
> solely in performance of work or services for Capital One. The information
> transmitted herewith is intended only for use by the individual or entity
> to which it is addressed. If the reader of this message is not the intended
> recipient, you are hereby notified that any review, retransmission,
> dissemination, distribution, copying or other use of, or taking of any
> action in reliance upon this information is strictly prohibited. If you
> have received this communication in error, please contact the sender and
> delete the material from your computer.
>
>
>
>
>


Re: Slack community

2023-11-20 Thread Fokko Driesprong
Hey,

Thanks for reaching out. I'll make sure to update the Slack URL. Can you
check using:
https://join.slack.com/t/apache-iceberg/shared_invite/zt-27f22riz7-o8nCsl5Vbc_2h6~3DF6qlw


Kind regards,
Fokko

Op ma 20 nov 2023 om 19:00 schreef ​ :

> Hello,
>
> I was hoping to contribute to and ask questions on the slack. Does this
> require an @apache email address?
>
> When I try to join the slack via the invite it claims I can't join unless
> I have an @apache email address.
>


Re: [PROPOSAL] Apache Iceberg 1.4.3 release

2023-12-02 Thread Fokko Driesprong
Hey JB,

I think there is no harm in doing a patch release.

There was another request to backport an issue, I've created a PR:
https://github.com/apache/iceberg/pull/8969#issuecomment-1837286383

Kind regards,
Fokko

Op wo 22 nov 2023 om 18:50 schreef Jean-Baptiste Onofré :

> Hi guys
>
> Quick update about that:
> 1. I took a deeper look today about the Avro CVE issue. I don't think
> we are impacted on Iceberg (the CVE is about deserialization of
> corrupted data potentially causing out of memory). The fix
> (https://github.com/apache/avro/commit/a12a7e44d) introduces
> SystemLimitException that uses system properties to define boundaries
> and avoid the OOM (even if the deserialization still won't work :)).
> So, nothing really changes from an Iceberg perspective.
> 2. As discussed during the community meeting today, as (1) doesn't
> really have an impact on Iceberg, there's no urgency to release 1.4.3.
> We agreed to wait new fixes for 1.4.3 release.
>
> I'm still volunteering to cut the 1.4.3 patch release when ready (I
> did all the build checks on my machine :)), and I'm doing a pass on GH
> issues.
>
> Thanks !
> Regards
> JB
>
> On Tue, Nov 21, 2023 at 8:49 PM Jean-Baptiste Onofré 
> wrote:
> >
> > Hi
> >
> > We chatted about the 1.4.3 release with Ed.
> >
> > We have a few PRs we want to include and as it’s Thanksgiving this week, I
> > will submit the release to vote on Tuesday next week.
> >
> > Regards
> > JB
> >
> > On Mon, Nov 20, 2023 at 17:24, Jean-Baptiste Onofré wrote:
> >>
> >> Thanks Fokko !
> >>
> >> I'm on the local build check and issue pass. I plan to start the
> >> release tomorrow.
> >>
> >> Regards
> >> JB
> >>
> >> On Mon, Nov 20, 2023 at 8:56 AM Driesprong, Fokko 
> wrote:
> >> >
> >> > I took the liberty and created a 1.4.3 milestone to track any issues
> that we want to backport.
> >> >
> >> > Kind regards,
> >> > Fokko Driesprong
> >> >
> >> > On Mon, Nov 20, 2023 at 08:50, Driesprong, Fokko wrote:
> >> >>
> >> >> Hey JB,
> >> >>
> >> >> Late to the party here, but 1.4.3 sounds like a great idea. Let me
> know if you need any help with any release steps.
> >> >>
> >> >> Kind regards,
> >> >> Fokko Driesprong
> >> >>
> >> >> On Mon, Nov 20, 2023 at 08:16, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> >> >>>
> >> >>> Hi
> >> >>>
> >> >>> As there's no objection, I will move forward and prepare the
> release to vote.
> >> >>>
> >> >>> I will keep you posted asap.
> >> >>>
> >> >>> Thanks,
> >> >>> Regards
> >> >>> JB
> >> >>>
> >> >>> On Wed, Nov 15, 2023 at 6:11 AM Jean-Baptiste Onofré <
> j...@nanthrax.net> wrote:
> >> >>> >
> >> >>> > Hi guys,
> >> >>> >
> >> >>> > Avro 1.11.3 has been released, fixing CVE-2023-39410.
> >> >>> > We already updated to Avro 1.11.3 on main.
> >> >>> >
> >> >>> > About CVE, we also already use guava 32.1.3, fixing CVE-2023-2976.
> >> >>> >
> >> >>> > As the Avro CVE is classified high (see
> >> >>> > https://nvd.nist.gov/vuln/detail/CVE-2023-39410), I propose to
> bump to
> >> >>> > Avro 1.11.3 on our 1.4.x branch and release Iceberg 1.4.3
> including
> >> >>> > this.
> >> >>> >
> >> >>> > Thoughts ?
> >> >>> >
> >> >>> > If there are no objections, I volunteer to drive this release.
> >> >>> >
> >> >>> > Thanks,
> >> >>> > Regards
> >> >>> > JB
>


Re: Is there a way to distcp iceberg table from hadoop?

2023-12-02 Thread Fokko Driesprong
Hi Dongjun,

Thanks for reaching out on the mailing list. Another option might be to copy
the data, and then use the Spark procedure add_files to add the files to the
table. Let me know if this works for you.
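
As a rough sketch of what that could look like from PySpark: the add_files
procedure is invoked with a CALL statement, and a path-based source is
written as `format`.`path`. The catalog name, table name, and path below are
placeholders rather than values from this thread, the helper functions are
made up for illustration, and the target Iceberg table is assumed to exist
already.

```python
def add_files_sql(catalog: str, table: str, source_path: str,
                  file_format: str = "parquet") -> str:
    """Build the CALL statement for Iceberg's add_files Spark procedure."""
    # A path-based source is written as `format`.`path`, each part in backticks.
    source = f"`{file_format}`.`{source_path}`"
    return (
        f"CALL {catalog}.system.add_files("
        f"table => '{table}', "
        f"source_table => '{source}')"
    )


def register_copied_files(spark, catalog: str, table: str, source_path: str) -> str:
    """Run add_files through an existing SparkSession supplied by the caller.

    Hypothetical helper: it only wraps spark.sql(...) and returns the SQL it ran.
    """
    statement = add_files_sql(catalog, table, source_path)
    spark.sql(statement).show()
    return statement
```

With a live session this would be called as, for example,
register_copied_files(spark, "my_catalog", "db.events",
"hdfs://new-cluster/warehouse/db/events/data"). Trying it on a small table
first is probably a good idea.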

Kind regards,
Fokko

On Sat, Dec 2, 2023 at 02:43, Ajantha Bhat wrote:

> Hi,
>
> You are right. Moving Iceberg tables from storage and expecting them to
> function at the new location is not currently feasible.
> The issue lies in the metadata files, which store the absolute path.
>
> To address this, we need support for relative paths, but it appears that
> progress on this front has been slow.
> You can monitor the status of this feature at
> https://github.com/apache/iceberg/pull/8260.
>
> As a temporary fix, you can use the CTAS method to create a duplicate copy
> of the table at the desired new path.
>
> Thanks,
> Ajantha
>
> On Fri, Dec 1, 2023 at 10:01 PM Dongjun Hwang 
> wrote:
>
>> Hello! My name is Dongjun Hwang.
>>
>> I recently performed distcp on the iceberg table in Hadoop.
>>
>> Data search was not possible because all file paths in the metadata
>> directory were not changed.
>>
>> Is there a way to distcp the iceberg table?
>>
>> Thank you!!
>>
>


Re: Proposal for REST APIs for Iceberg table scans

2023-12-12 Thread Fokko Driesprong
Hey Rahil and Jack,

Thanks for bringing this up. Ryan and I also discussed this briefly in the
early days of PyIceberg and it would have helped a lot in the speed of
development. We went for the traditional approach because that would also
support all the other catalogs, but now that the REST catalog is taking
off, I think it still makes a lot of sense to get it in.

I do share the concern raised by Ryan around the concepts of shards and
pagination. For PyIceberg (but also for Go, Rust, and DuckDB), which run in
a single process today, the concept of shards doesn't add value. I
see your concern with long-running jobs, but for the non-distributed cases,
it will add additional complexity.

Some suggestions that come to mind:

   - Stream the tasks directly back using a chunked response, reducing the
   latency to the first task. This would also address the pagination
   concern. The only downside I can think of is delete files: you first
   need to determine which deletes are relevant to a task, which might
   increase the latency to the first task.
   - Make the sharding optional. If you want to shard, you call CreateScan
   first and then call GetScanTask with the IDs. If you don't want to shard,
   you omit the shard parameter and fetch the tasks directly (here we also
   need to replace the scan string with the full
   column/expression/snapshot-id etc.).
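
To make the two suggestions a bit more concrete, here is a small local
simulation, not the proposed REST API. A Python generator stands in for the
chunked response, and the shard parameter is optional. The function names,
the (path, size) manifest tuples, and the greedy size-based grouping are all
assumptions for illustration; the 25 manifests of about 4 MB each are taken
from the example quoted further down in this thread.

```python
from typing import Iterator, List, Optional, Sequence, Tuple

Manifest = Tuple[str, int]  # (path, size in bytes); hypothetical shape


def make_shards(manifests: Sequence[Manifest], target_bytes: int) -> List[List[Manifest]]:
    """Greedily pack manifests into shards of at most target_bytes each."""
    shards: List[List[Manifest]] = []
    current: List[Manifest] = []
    current_size = 0
    for manifest in manifests:
        size = manifest[1]
        # Close the current shard when adding this manifest would exceed the target.
        if current and current_size + size > target_bytes:
            shards.append(current)
            current, current_size = [], 0
        current.append(manifest)
        current_size += size
    if current:
        shards.append(current)
    return shards


def plan_tasks(manifests: Sequence[Manifest], target_bytes: int,
               shard: Optional[int] = None) -> Iterator[str]:
    """Yield task descriptions lazily, one at a time.

    shard=None streams every task (the single-process case); otherwise only
    the tasks of that shard are produced.
    """
    shards = make_shards(manifests, target_bytes)
    selected = shards if shard is None else [shards[shard]]
    for group in selected:
        for path, _size in group:
            yield f"scan-task:{path}"


MB = 1024 * 1024
example = [(f"m{i}.avro", 4 * MB) for i in range(25)]
print(len(make_shards(example, 12 * MB)))         # number of shards
print(next(plan_tasks(example, 12 * MB)))         # first task, available immediately
```

A single-process client would simply iterate over plan_tasks(...) without a
shard argument, while a distributed engine could hand one shard index to each
worker.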

Looking forward to discussing this tomorrow in the community sync!

Kind regards,
Fokko



On Mon, Dec 11, 2023 at 19:05, Jack Ye wrote:

> Hi Ryan, thanks for the feedback!
>
> I was a part of this design discussion internally and can provide more
> details. One reason for separating the CreateScan operation was to make the
> API asynchronous and thus keep HTTP communications short. Consider the case
> where we only have GetScanTasks API, and there is no shard specified. It
> might take tens of seconds, or even minutes to read through all the
> manifest list and manifests before being able to return anything. This
> means the HTTP connection has to remain open during that period, which is
> not really a good practice in general (consider connection failure, load
> balancer and proxy load, etc.). And when we shift the API to asynchronous,
> it basically becomes something like the proposal, where a stateful ID is
> generated to be able to immediately return back to the client, and the
> client gets results by referencing the ID. So in our current prototype
> implementation we are actually keeping this ID and the whole REST service
> is stateful.
>
> There were some thoughts we had about the possibility to define a "shard
> ID generator" protocol: basically the client agrees with the service a way
> to deterministically generate shard IDs, and service uses it to create
> shards. That sounds like what you are suggesting here, and it pushes the
> responsibility to the client side to determine the parallelism. But in some
> bad cases (e.g. there are many delete files and we need to read all those
> in each shard to apply filters), it seems like there might still be the
> long open connection issue above. What is your thought on that?
>
> -Jack
>
> On Sun, Dec 10, 2023 at 10:27 AM Ryan Blue  wrote:
>
>> Rahil, thanks for working on this. It has some really good ideas that we
>> hadn't considered before like a way for the service to plan how to break up
>> the work of scan planning. I really like that idea because it makes it much
>> easier for the service to keep memory consumption low across requests.
>>
>> My primary feedback is that I think it's a little too complicated (with
>> both sharding and pagination) and could be modified slightly so that the
>> service doesn't need to be stateful. If the service isn't necessarily
>> stateful then it should be easier to build implementations.
>>
>> To make it possible for the service to be stateless, I'm proposing that
>> rather than creating shard IDs that are tracked by the service, the
>> information for a shard can be sent to the client. My assumption here is
>> that most implementations would create shards by reading the manifest list,
>> filtering on partition ranges, and creating a shard for some reasonable
>> size of manifest content. For example, if a table has 100MB of metadata in
>> 25 manifests that are about 4 MB each, then it might create 9 shards with
>> 1-4 manifests each. The service could send those shards to the client as a
>> list of manifests to read and the client could send the shard information
>> back to the service to get the data files in each shard (along with the
>> original filter).
>>
>> There's a slight trade-off that the protocol needs to define how to break
>> the work into shards. I'm interested in hearing if that would work with how
>> you were planning on building the service on your end. Another option is to
>> let the service send back arbitrary JSON that would get returned for each
>> shard. Eit

Re: [DISCUSS] Apache Iceberg 1.4.3

2023-12-19 Thread Fokko Driesprong
+1

Would be great to have the abovementioned fix out to the public, and some
other small fixes are worth releasing (see milestone)!

Thanks,

Fokko

On Tue, Dec 19, 2023 at 08:55, Ajantha Bhat wrote:

> +1 for 1.4.3 release with #9227 ASAP.
>
> Looks like Trino does manual retry and this issue is marked as release
> blocker
> https://github.com/trinodb/trino/issues/20092
>
> Thanks,
> Ajantha
>
> On Tue, Dec 19, 2023 at 1:10 PM Eduard Tudenhoefner <
> etudenhoef...@apache.org> wrote:
>
>> Hey everyone,
>>
>> I'd like to start a discussion to have a 1.4.3 release.
>>
>> #9227 discovered a
>> correctness issue, where changes might only be partially applied after
>> retries.
>>
>> I'm starting this thread to see if people have any other bug fixes that
>> should go out with 1.4.3 (besides the ones that are already included in the
>> 1.4.3 milestone).
>>
>> Thanks
>> Eduard
>>
>


Re: [PROPOSAL] Improvement on our PR flows

2024-01-03 Thread Fokko Driesprong
Nice! I fully agree with the abovementioned. I originally set up the
stalebot for the issues because I noticed that there were many issues
around old Spark versions that weren't even maintained anymore. I feel it
is better to either close or take action on an issue. For me, it makes
sense to extend this to PRs as well.

As Amogh said, always feel free to ping me when a PR or issue is
lingering and you need some eyes on it.

Kind regards,
Fokko

On Thu, Jan 4, 2024 at 07:42, Jean-Baptiste Onofré wrote:

> Hi
>
> That's also the purpose of the reviewers file: having multiple
> reviewers per tag.
>
> Thanks guys for your feedback, I will move forward with the PR :)
>
> Regards
> JB
>
> On Thu, Jan 4, 2024 at 6:38 AM Ajantha Bhat  wrote:
> >
> > +1,
> >
> > Some of my PRs have been open for a long time, and sometimes they don't
> > get the attention they require.
> > Notifying both the reviewer and the author can help expedite the review
> > process and facilitate quicker handling of new contributions.
> > I think having more than one committer assigned to a PR can also
> > definitely help speed up the process if one of the committers is busy
> > or on holiday.
> >
> > But we also need to think on the next steps. What if we still don't
> receive the necessary response even after sending notifications?
> > Should we have a slack channel for those PRs to conclude by discussing
> (or some guidelines on how to take it further).
> >
> > We can have a trial run for some days and see how it goes.
> >
> > Thanks,
> > Ajantha
> >
> > On Thu, Jan 4, 2024 at 8:19 AM Amogh Jahagirdar 
> wrote:
> >>
> >> +1, I think this is a step in the right direction. One other
> consideration I wanted to bring up was dependabot and if there's any unique
> handling we want to do there because I've noticed that PRs from dependabot
> tend to pile up. I think with the proposal we won't really need to do
> anything unique and just treat it as a normal PR (it would be a build label
> with its own set of reviewers) and we'll get notified the same way.
> >>
> >> I'll also say for reviews (speaking for myself, but I think many others
> probably feel this way as well), always feel free to ping on Slack and
> follow up :) But overall I do like having more of a mechanism.
>


[ANNOUNCE] New committer: Honah J.

2024-01-12 Thread Fokko Driesprong
On behalf of the Iceberg PMC, I'm happy to announce that Honah has accepted
an invitation to become a committer on Apache (Py)Iceberg. Welcome, and
thank you for your contributions!

Kind regards,
Fokko


Re: Proposed PyIceberg logo art

2024-01-15 Thread Fokko Driesprong
Love it Rick, thanks for sharing! I would love to have it as the
official PyIceberg logo!

I've checked the trademark of the Python logo, and they are okay with using
the logo for non-proprietary use. They recommend checking it with the PSF
anyway since we combine logos here, so I'll do that now. I'll let you know
what comes out of it.

Kind regards,
Fokko



On Mon, Jan 15, 2024 at 18:28, Jean-Baptiste Onofré wrote:

> Hi Rick,
>
> Thanks ! It looks great :)
>
> Regards
> JB
>
> On Mon, Jan 15, 2024 at 5:43 PM Rick Bilodeau  wrote:
> >
> > The artwork is linked here.
> >
> >
> https://drive.google.com/file/d/1oV9qolfyIi5YkEvA0NJb3ipECGBdsUqX/view?usp=sharing
> >
> >
> >
> > On Mon, Jan 15, 2024 at 8:35 AM Jean-Baptiste Onofré 
> wrote:
> >>
> >> Hi Rick,
> >>
> >> It's not possible to attach files on the mailing list. Can you please
> >> share the logo somewhere ?
> >>
> >> Thanks !
> >> Regards
> >> JB
> >>
> >> On Mon, Jan 15, 2024 at 5:31 PM Rick Bilodeau  wrote:
> >> >
> >> >
> >> > Hello,
> >> >
> >> > I'd like to propose adopting the attached artwork as the official
> PyIceberg logomark. If it is accepted, Tabular would donate the creative
> assets to the ASF.
> >> >
> >> > Best,
> >> >
> >> > Rick Bilodeau
> >> > Head of Marketing
> >> > Tabular
>


Re: Proposed PyIceberg logo art

2024-01-23 Thread Fokko Driesprong
Thanks everyone for the input.

In the meantime, I reached out to the PSF, and it turns out that altering
the original Python logo is not allowed, so the gap between the two
snakes is not permitted. In addition, the Python logo must be distinct from
any other logos. Rick is looking into doing another iteration to address the
issues raised by the PSF.

Kind regards,
Fokko

On Wed, Jan 17, 2024 at 01:45, Renjie Liu wrote:

> If we have reached consensus on the multiple logo approach, then +1 for
> other projects such as for iceberg-rust and iceberg-go.
>
> On Wed, Jan 17, 2024 at 1:29 AM Jack Ye  wrote:
>
>> The logo looks great!
>>
>> For the multiple logo concern, if we go with this approach, I think we
>> should host all the logo materials at some place, such as in a /logos
>> folder and published at iceberg.apache.org/logos. This also gives a
>> centralized place for people to more easily consume those logos with the
>> related licensing information.
>>
>> A related question, do we plan to create logos also for the other
>> projects iceberg-rust and iceberg-go?
>>
>> Best,
>> Jack Ye
>>
>> On Tue, Jan 16, 2024 at 1:09 AM Jean-Baptiste Onofré 
>> wrote:
>>
>>> That's a good point. That said, I don't see it as blocker as the
>>> PyIceberg logo "ships" the Iceberg logo.
>>>
>>> Regards
>>> JB
>>>
>>> On Tue, Jan 16, 2024 at 7:25 AM Zheng Hu  wrote:
>>> >
>>> > The logo looks great and lovely,  thanks Rick.
>>> >
>>> > But I'm not sure whether we need a separate Python Iceberg logo for
>>> the pyiceberg, which is another implementation for the apache iceberg.
>>> > Is there a potential risk of the same project's logo being split into
>>> two logos ?
>>> >
>>> > Best regards.
>>> >
>>> > On Tue, Jan 16, 2024 at 1:45 AM Fokko Driesprong 
>>> wrote:
>>> >>
>>> >> Love it Rick, thanks for sharing! I would love to have it as the
>>> official PyIceberg logo!
>>> >>
>>> >> I've checked the trademark of the Python logo, and they are okay with
>>> using the logo for non-proprietary use. They recommend checking it with the
>>> PSF anyway since we combine logos here, so I'll do that now. I'll let you
>>> know what comes out of it
>>> >>
>>> >> Kind regards,
>>> >> Fokko
>>> >>
>>> >>
>>> >>
>>> >> On Mon, Jan 15, 2024 at 18:28, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>> >>>
>>> >>> Hi Rick,
>>> >>>
>>> >>> Thanks ! It looks great :)
>>> >>>
>>> >>> Regards
>>> >>> JB
>>> >>>
>>> >>> On Mon, Jan 15, 2024 at 5:43 PM Rick Bilodeau 
>>> wrote:
>>> >>> >
>>> >>> > The artwork is linked here.
>>> >>> >
>>> >>> >
>>> https://drive.google.com/file/d/1oV9qolfyIi5YkEvA0NJb3ipECGBdsUqX/view?usp=sharing
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > On Mon, Jan 15, 2024 at 8:35 AM Jean-Baptiste Onofré <
>>> j...@nanthrax.net> wrote:
>>> >>> >>
>>> >>> >> Hi Rick,
>>> >>> >>
>>> >>> >> It's not possible to attach files on the mailing list. Can you
>>> please
>>> >>> >> share the logo somewhere ?
>>> >>> >>
>>> >>> >> Thanks !
>>> >>> >> Regards
>>> >>> >> JB
>>> >>> >>
>>> >>> >> On Mon, Jan 15, 2024 at 5:31 PM Rick Bilodeau 
>>> wrote:
>>> >>> >> >
>>> >>> >> >
>>> >>> >> > Hello,
>>> >>> >> >
>>> >>> >> > I'd like to propose adopting the attached artwork as the
>>> official PyIceberg logomark. If it is accepted, Tabular would donate the
>>> creative assets to the ASF.
>>> >>> >> >
>>> >>> >> > Best,
>>> >>> >> >
>>> >>> >> > Rick Bilodeau
>>> >>> >> > Head of Marketing
>>> >>> >> > Tabular
>>>
>>


[DISCUSS] PyIceberg 0.6.0 release

2024-01-26 Thread Fokko Driesprong
Hey everyone,

I want to discuss the 0.6.0 release that will bring a lot of functionality
to the public:

   - Write support for writing to unpartitioned tables
      - Includes snapshot generation
      - Constructing Avro writer trees
   - Support for writing metadata, which enables commit support for the Hive,
   SQL, and Glue catalogs
   - Support for name-mapping
   - Easy evolution of schema using the union_by_name method
   - And a lot of bug fixes and improvements
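
For a feel of how the write path and union_by_name fit together, here is a
minimal sketch. It assumes the 0.6.0 API exposes catalog.load_table,
table.update_schema() as a context manager with union_by_name, and
table.append for PyArrow tables; the helper name and the identifiers are
made up for illustration, so please treat it as a starting point and not as
the reference usage.

```python
def append_with_schema_union(catalog, identifier, df):
    """Evolve the table schema to cover df's columns, then append df.

    `catalog` is expected to be a PyIceberg catalog and `df` a PyArrow table;
    both are passed in by the caller, so this sketch imports nothing itself.
    """
    table = catalog.load_table(identifier)
    # union_by_name adds columns that exist in df's schema but not in the table.
    with table.update_schema() as update:
        update.union_by_name(df.schema)
    # Unpartitioned tables only for now; the data is added as a fast append.
    table.append(df)
    return table


# Hypothetical usage (catalog name and table identifier are placeholders):
#   from pyiceberg.catalog import load_catalog
#   import pyarrow as pa
#   catalog = load_catalog("default")
#   df = pa.table({"city": ["Amsterdam"], "lat": [52.37]})
#   append_with_schema_union(catalog, "default.cities", df)
```

Doing the schema union before the append keeps the example to a single
helper; in practice you may prefer to evolve the schema explicitly and review
the change before writing data.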

The write support is still limited, for example, partitioned writes or
tables with sort-orders are not supported. Also, as Ryan mentioned during
the last community sync, we're doing fast appends by default, and we're
unable to compact yet. I've created issues on GitHub
<https://github.com/apache/iceberg-python/issues> to track all these
limitations. However, I think it is good to get the current work out to the
public so they can try it and we can uncover any impediments as soon as
possible. And we can follow up with 0.7.0.

Kind regards,
Fokko Driesprong


Re: [DISCUSS] PyIceberg 0.6.0 release

2024-01-26 Thread Fokko Driesprong
Thanks everyone for the responses and great to see everyone is as excited
as I am :D

I have some good news. The guys from Eventual have been working on
integrating PyIceberg into their Daft dataframe
<https://www.getdaft.io/projects/docs/en/latest/user_guide/integrations/data_catalogs.html#apache-iceberg>.
They are integrating on the scan-tasks level where they leverage their own
Parquet reader to read in a distributed fashion. Feel free to join the
#daft channel on the Iceberg Slack
<https://iceberg.apache.org/community/#slack> if you're interested in this.
We're in the process of making sure that all the Iceberg features work well
(schema and partition evolution, projection, etc). The query planning is
done in PyIceberg in a single process (we do use multi-threading), and we're
doing some profiling on the PyIceberg code to identify bottlenecks to scale
to 1M+ partitions.

Similar to the read-path, for writing, we're designing the API in such a
way that this also can be distributed.

As I mentioned, I created issues
<https://github.com/apache/iceberg-python/issues> around the gaps. There is
a good discussion going on around the partitioned writes
<https://github.com/apache/iceberg-python/issues/208>, and writing using a
sort order <https://github.com/apache/iceberg-python/issues/271> is still
up for grabs.

Kind regards,
Fokko

On Fri, Jan 26, 2024 at 19:45, Ryan Blue wrote:

> Like the Java implementation, we've been building toward a library that
> can be used in distributed applications as well as directly on a single
> node. For example, job planning can produce a set of file scan tasks or a
> scan can be pushed to duckdb (to_duckdb) or pandas (to_pandas). The write
> side is similar where we have methods that accept Arrow dataframes and
> write files and an API for committing those files to a table. The write
> side isn't as well developed yet (no support for partitions, for example),
> but the basics are there and we would love to work with Ray and other
> communities to add native Iceberg support!
>
> On Fri, Jan 26, 2024 at 10:40 AM Pucheng Yang 
> wrote:
>
>> I have similar questions as Yufei's. My organization has interest in Ray
>> Iceberg integration and during the conversation with the Ray team, we know
>> they would also like to have Iceberg integration as well. I think this is
>> a good opportunity for both projects to collaborate.
>>
>> On Fri, Jan 26, 2024 at 10:32 AM Sung Yun  wrote:
>>
>>> It’s so exciting to see the project take another step forward, Fokko!
>>>
>>> Really great job to everyone involved.
>>>
>>> Best,
>>> Sung
>>>
>>> On Jan 26, 2024, at 11:48 AM, Ryan Blue  wrote:
>>>
>>> 
>>> It's great to see all the progress in PyIceberg. Thanks to everyone
>>> that's been contributing!
>>>
>>> I'm all for getting a release out as soon as possible and following up
>>> with more features in the write path in 0.7.0.
>>>
>>> On Fri, Jan 26, 2024 at 5:22 AM Fokko Driesprong 
>>> wrote:
>>>
>>>> Hey everyone,
>>>>
>>>> I want to discuss the 0.6.0 release that will bring a lot of
>>>> functionality to the public:
>>>>
>>>>    - Write support for writing to unpartitioned tables
>>>>       - Includes snapshot generation
>>>>       - Constructing Avro writer trees
>>>>    - Support for writing metadata, which enables commit support for the
>>>>    Hive, SQL, and Glue catalogs
>>>>    - Support for name-mapping
>>>>    - Easy evolution of schema using the union_by_name method
>>>>    - And a lot of bug fixes and improvements
>>>>
>>>> The write support is still limited, for example, partitioned writes or
>>>> tables with sort-orders are not supported. Also, as Ryan mentioned during
>>>> the last community sync, we're doing fast appends by default, and we're
>>>> unable to compact yet. I've created issues on GitHub
>>>> <https://github.com/apache/iceberg-python/issues> to track all these
>>>> limitations. However, I think it is good to get the current work out to the
>>>> public so they can try it and we can uncover any impediments as soon as
>>>> possible. And we can follow up with 0.7.0.
>>>>
>>>> Kind regards,
>>>> Fokko Driesprong
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>>
>
> --
> Ryan Blue
> Tabular
>


Re: [DISCUSS] Release new Iceberg docs site in the main repository

2024-01-29 Thread Fokko Driesprong
I did some reviews of the PRs that led up to this, and I think the new site
is much easier to maintain and deploy. +1 from my end :)

Cheers, Fokko

On Mon, Jan 29, 2024 at 15:15, Jean-Baptiste Onofré wrote:

> +1
>
> Regards
> JB
>
> On Fri, Jan 26, 2024 at 11:40 PM Brian Olsen 
> wrote:
> >
> > Hey everyone,
> >
> > As discussed during the community sync, I'd like to get a vote on moving
> > forward with the documentation. I have created a PR
> > (https://github.com/apache/iceberg/pull/9520) that references the changes
> > that have happened up to this point.
> >
> >    - Simpler contribution by collocating the website and documentation in
> >    the same repository.
> >    - We don't want the versioned docs or javadoc files to be tracked in the
> >    main branch to avoid multiple copies of the docs being indexed in GitHub
> >    or IDEs.
> >    - We need a top level (non-versioned) Iceberg website that links
> >    versioned docs and contains evergreen constructs.
> >    - The current docs release process is cumbersome and the code lives
> >    across multiple repositories making it difficult to know where to
> >    contribute for documentation:
> >    https://github.com/apache/iceberg/issues/8151.
> >    - We wanted there to be an easy way to apply retroactive fixes to older
> >    doc versions.
> >    - A simple release process can now be automated once we validate things
> >    work well manually by starting a workflow and reviewing a PR.
> >    - Restyle MkDocs default theme to look like the existing Iceberg theme.
> >    - Fix broken links (there were a lot).
> >
> > It would be great to get a quick vote on moving forward with this
> > process. Thanks!
> > - Bits
>


Re: [DISCUSS] PyIceberg 0.6.0 release

2024-01-29 Thread Fokko Driesprong
Hey everyone,

Since #305 <https://github.com/apache/iceberg-python/pull/305> has been
merged, I think we're good for the release. Thank you Sung for the PR and
Honah for the great review! I think it would be nice to get #311
<https://github.com/apache/iceberg-python/pull/311> in to get people started
with the write API. Let me know if anything is missing.

I'm happy to run the release, but always open to anyone else to run the
release <https://py.iceberg.apache.org/how-to-release/>.

Today at 1700 UTC we have the monthly PyIceberg sync. Feel free to join if
you're interested in contributing or if you have any questions. You can
attend by joining the Google group
<https://groups.google.com/search?q=iceberg-python-sync>, or by following
the link to the Google Calendar directly
<https://calendar.google.com/calendar/event?action=TEMPLATE&tmeid=MG5oZnYxa2NhZjdvaHE5a2ZlMHJ0aG91OTZfMjAyNDAxMzBUMTcwMDAwWiBmb2trb0Bkcmllc3Byb25nZW4ubmw&tmsrc=fokko%40driesprongen.nl&scp=ALL>
.

Kind regards,
Fokko

On Sun, Jan 28, 2024 at 23:23, Honah J. wrote:

> Really excited for the upcoming 0.6.0 release and its new features! Big
> thanks to everyone for their hard work.
>
> I'm looking forward to the community feedback and future enhancements.
>
> Best regards,
> Honah
>
> On Fri, Jan 26, 2024 at 1:56 PM Daniel Weeks  wrote:
>
>> I'm also strongly in favor of getting this release out even with the
>> limitations as it's still a huge step forward and we can build
>> incrementally on the write support.
>>
>> Incredible work everyone, I'm really excited about the progress here.
>>
>> -Dan
>>
>> On Fri, Jan 26, 2024 at 11:16 AM Fokko Driesprong 
>> wrote:
>>
>>> Thanks everyone for the responses and great to see everyone is as
>>> excited as I am :D
>>>
>>> I have some good news. The guys from Eventual have been working on
>>> integrating PyIceberg into their Daft dataframe
>>> <https://www.getdaft.io/projects/docs/en/latest/user_guide/integrations/data_catalogs.html#apache-iceberg>.
>>> They are integrating on the scan-tasks level where they leverage their own
>>> Parquet reader to read in a distributed fashion. Feel free to join the
>>> #daft channel on the Iceberg Slack
>>> <https://iceberg.apache.org/community/#slack> if you're interested in
>>> this. We're in the process of making sure that all the Iceberg features
>>> work well (schema and partition evolution, projection, etc). The query
>>> planning is done in PyIceberg in a single process (we do use
>>> multi-threading), we're doing some profiling on the PyIceberg code to
>>> identify bottlenecks to scale to at least 1M+ partitions.
>>>
>>> Similar to the read-path, for writing, we're designing the API in such a
>>> way that this also can be distributed.
>>>
>>> As I mentioned, I created issues
>>> <https://github.com/apache/iceberg-python/issues> around the gaps.
>>> There is a good discussion going on around the partitioned writes
>>> <https://github.com/apache/iceberg-python/issues/208>, and writing
>>> using a sort order <https://github.com/apache/iceberg-python/issues/271>
>>> is still up for grabs.
>>>
>>> Kind regards,
>>> Fokko
>>>
>>> On Fri, Jan 26, 2024 at 19:45, Ryan Blue wrote:
>>>
>>>> Like the Java implementation, we've been building toward a library that
>>>> can be used in distributed applications as well as directly on a single
>>>> node. For example, job planning can produce a set of file scan tasks or a
>>>> scan can be pushed to duckdb (to_duckdb) or pandas (to_pandas). The write
>>>> side is similar where we have methods that accept Arrow dataframes and
>>>> write files and an API for committing those files to a table. The write
>>>> side isn't as well developed yet (no support for partitions, for example),
>>>> but the basics are there and we would love to work with Ray and other
>>>> communities to add native Iceberg support!
>>>>
>>>> On Fri, Jan 26, 2024 at 10:40 AM Pucheng Yang
>>>>  wrote:
>>>>
>>>>> I have similar questions as Yufei's. My organization has interest in
>>>>> Ray Iceberg integration and during the conversation with the Ray team, we
>>>>> know they would also like to have Iceberg integration as well. I think
>>>>> this is a good opportunity for both projects to collaborate.
>>>>>
>>>>>

Re: [DISCUSS] iceberg-rust 0.2.0 release

2024-01-31 Thread Fokko Driesprong
I'm all for the 0.2.0 release. Kudos to all the work so far. While the
functionality is limited today, a lot of things are already in progress and
it looks very promising. Also, running a release now will help to
streamline the release process.

Kind regards,
Fokko Driesprong

On Wed, Jan 31, 2024 at 17:28, Jack Ye wrote:

> Excited about the progress in Rust! +1 for releasing 0.2.0
>
> -Jack
>
> On Wed, Jan 31, 2024 at 8:26 AM Ryan Blue  wrote:
>
>> Thanks, Renjie! It's great to see all of the progress in Rust. I agree
>> with getting the code released and I'm looking forward to testing it out!
>>
>> On Wed, Jan 31, 2024 at 12:35 AM Xuanwo  wrote:
>>
>>> We have been working on the iceberg-rust project for a while. Although
>>> there is still much work to be done, I believe it is important to release
>>> it in order to attract more users and developers to join us.
>>>
>>> At current stage, we have implemented basic features that users need to
>>> read a table. It could be a good start!
>>>
>>> On Wed, Jan 31, 2024, at 14:46, Renjie Liu wrote:
>>>
>>> Hi, everyone:
>>>
>>> iceberg-rust <https://github.com/apache/iceberg-rust/> has been under
>>> active development for several months, and it now has several features, so
>>> I want to use this thread to discuss delivering the first release of this
>>> crate.
>>>
>>> Why this first release 0.2.0?
>>>
>>> Before iceberg-rust <https://github.com/apache/iceberg-rust/> was
>>> developed, there were already efforts trying to develop a rust version of
>>> iceberg, and it has been registered in crates.io under the name
>>> `iceberg`. We contacted the authors and they kindly transferred the
>>> ownership of this crate to apache, so that we can use `iceberg` as our
>>> crate's name. But due to the immutability of crates.io
>>> <https://crates.io/> packages, we can only start with version 0.2.0.
>>>
>>> Where are we?
>>>
>>> Currently we have delivered following features:
>>>
>>> 1. Rest catalog, including manipulating namespaces, load table, create
>>> table, etc.
>>> 2. Serialization/deserialization of table metadata, manifests.
>>> 3. Documentation for this crate: https://rust.iceberg.apache.org/
>>>
>>> What's next?
>>>
>>> Eventually we will reach feature parity with java/python api, so that we
>>> can bring full feature support of the iceberg to rust ecosystems. For
>>> details of feature status, please check the `README.md` in github repo
>>> <https://github.com/apache/iceberg-rust/> .
>>>
>>> Xuanwo
>>>
>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>


Re: [DISCUSS] Change iceberg-rust CI Settings to only require approval for new github users

2024-01-31 Thread Fokko Driesprong
Thanks for raising this Xuanwo,

I dislike having to give approval for the CI, both from a contributor and a
committer standpoint. Often I see people raising PRs that cause the CI to
fail without knowing it, because the CI hasn't run yet. Having direct
feedback from the CI makes the PR cycle much faster. Also, from a reviewer
perspective, I like to know if the CI passes before reviewing and this also
takes a bit of time. I don't think there is much risk since the Actions
have limited permissions, and all the repositories are actively looked at.

Kind regards,
Fokko Driesprong

On Wed, Jan 31, 2024 at 18:43, Daniel Weeks wrote:

> I agree with this change.  The defaults are not very community friendly
> and make contributing hard.
>
> -Dan
>
> On Wed, Jan 31, 2024 at 6:01 AM Xuanwo  wrote:
>
>> Hello, everyone
>>
>> I'm starting this thread to discuss the possibility of changing the CI
>> settings for
>> iceberg-rust to only require approval for new GitHub users. This will
>> improve the
>> experience for our contributors.
>>
>> We do not have self-hosted runners, so there is no attack surface from
>> the actions
>> side. Additionally, iceberg-rust does not involve many heavy load CI
>> tasks. Enabling
>> it for PRs by default does not add to the overall burden on ASF runners.
>>
>> I think this change could speed up our iteration process.
>>
>> **Footnotes**
>>
>> I started discussion on slack[1] without objection. So I opened a ticket
>> at [2]. The
>> INFRA team is interested in hearing our community's thoughts on this
>> list. Feel
>> free to leave your comments here.
>>
>> [1]
>> https://apache-iceberg.slack.com/archives/C05HTENMJG4/p1706686077901739
>> [2] https://issues.apache.org/jira/browse/INFRA-25444
>>
>> Xuanwo
>>
>


Re: [PROPOSAL] Create user mailing list ?

2024-02-02 Thread Fokko Driesprong
±0 for having a user mailing list.

I don't believe that having more channels will lead to better support. I
agree that the archiving capabilities of Slack are limited, and the search
is sub-optimal. But we should also make sure that the questions asked are
integrated into the documentation. The new website will have search
capabilities, which will make the content easier to find.

I'm not against it if people feel there is added value in it. I just want
to make sure that we as a community monitor all the channels if we go down
that route, and that we know how the Slackbot would synchronize the two
channels.

Kind regards,
Fokko

Op di 30 jan 2024 om 23:55 schreef Jack Ye :

> +1 for having a user mailing list.
>
> Do we envision the slack bot to be used for people in slack to participate
> in user list conversations, or the other way around, or both?
>
> Allowing people in slack to participate in user list conversations seems
> pretty achievable. Allowing people in the user list to participate in slack
> conversations means to forward all conversations in slack to the user list
> and might be quite noisy. But the advantage is that people would be able to
> search for those Slack questions and answers on the internet. Maybe we can
> consider a separated "slack" mailing list for that purpose.
>
> Best,
> Jack Ye
>
> On Tue, Jan 30, 2024 at 8:16 AM Jean-Baptiste Onofré 
> wrote:
>
>> AFAIR, some ASF projects are using slackbot to receive users requests
>> from the mailing list and can send messages to the mailing list.
>>
>> Let me do a quick research and get back to you.
>>
>> Regards
>> JB
>>
>> On Tue, Jan 30, 2024 at 3:14 PM Brian Olsen 
>> wrote:
>> >
>> > I do like the idea of making the Slack threads available through the
>> mailing list. Is there a slack bot you have in mind? How would the threads
>> appear in the mailing list?
>> >
>> > On Tue, Jan 30, 2024 at 7:13 AM Jean-Baptiste Onofré 
>> wrote:
>> >>
>> >> Hi guys,
>> >>
>> >> If we have a few user questions on the dev mailing list, we have quite
>> >> a number on Slack.
>> >> It's completely fine but not easy to search the questions and find the
>> >> concrete answer.
>> >>
>> >> As most other Apache projects do, I propose to create a user mailing
>> >> list to invite people to ask questions and request help.
>> >> This mailing list can be browsed and searched on
>> >> https://lists.apache.org/ and can be moderated.
>> >> We can use slackbot to create a "bridge" between slack and the user
>> >> mailing list.
>> >>
>> >> Thoughts ?
>> >>
>> >> Regards
>> >> JB
>>
>


Re: [Discuss] Change iceberg-python and iceberg-go CI Settings to only require approval for first time contributors

2024-02-02 Thread Fokko Driesprong
+1

Op vr 2 feb 2024 om 08:47 schreef Eduard Tudenhoefner :

> +1
>
>
>
> On Fri 2. Feb 2024 at 04:56 Drew  wrote:
>
>> +1
>>
>> Thanks for bringing this up for PyIceberg Honah
>>
>> On Thu, Feb 1, 2024 at 5:35 PM Honah J.  wrote:
>>
>>> Hello everyone
>>>
>>> Inspired by our recent discussion regarding iceberg-rust's CI setting, I
>>> am starting this thread to gather feedback on changing the CI settings for
>>> iceberg-python and iceberg-go to only require approvals for new
>>> contributors.
>>>
>>> I think this will benefit contributors and align CI settings across all
>>> Iceberg repos.
>>>
>>> If there's consensus on this, I'll happily open a JIRA ticket to request
>>> these changes.
>>>
>>> Previous discussion:
>>> https://lists.apache.org/thread/sp1853jgp1lbdybgzdvv2m5cqhny5skr
>>>
>>> Best regards,
>>> Honah
>>>
>>


Re: [DISCUSS] iceberg-rust 0.2.0 release

2024-02-06 Thread Fokko Driesprong
Hey everyone,

Thanks for the great responses here. It looks like we can move this
forward. I've just merged bumping the version to 0.2.0
<https://github.com/apache/iceberg-rust/pull/181>. Is anyone interested in
running this release?

Kind regards,
Fokko

Op do 1 feb 2024 om 02:39 schreef Renjie Liu :

> Thanks everyone for the discussion.
>
> Since we have reached consensus on this, I'm happy to run the first
> release.
>
> On Thu, Feb 1, 2024 at 1:41 AM Daniel Weeks 
> wrote:
>
>> +1 for 0.2.0 release as well.  Really excited about the progress here.
>>
>> -Dan
>>
>> On Wed, Jan 31, 2024 at 9:36 AM Fokko Driesprong 
>> wrote:
>>
>>> I'm all for the 0.2.0 release. Kudos to all the work so far. While the
>>> functionality is limited today, a lot of things are already in progress and
>>> it looks very promising. Also, running a release now will help to
>>> streamline the release process.
>>>
>>> Kind regards,
>>> Fokko Driesprong
>>>
>>> Op wo 31 jan 2024 om 17:28 schreef Jack Ye :
>>>
>>>> Excited about the progress in Rust! +1 for releasing 0.2.0
>>>>
>>>> -Jack
>>>>
>>>> On Wed, Jan 31, 2024 at 8:26 AM Ryan Blue  wrote:
>>>>
>>>>> Thanks, Renjie! It's great to see all of the progress in Rust. I agree
>>>>> with getting the code released and I'm looking forward to testing it out!
>>>>>
>>>>> On Wed, Jan 31, 2024 at 12:35 AM Xuanwo  wrote:
>>>>>
>>>>>> We have been working on the iceberg-rust project for a while.
>>>>>> Although there is still much work to be done, I believe it is important
>>>>>> to release it in order to attract more users and developers to join us.
>>>>>>
>>>>>> At current stage, we have implemented basic features that users need
>>>>>> to read a table. It could be a good start!
>>>>>>
>>>>>> On Wed, Jan 31, 2024, at 14:46, Renjie Liu wrote:
>>>>>>
>>>>>> Hi, everyone:
>>>>>>
>>>>>> iceberg-rust <https://github.com/apache/iceberg-rust/> has been
>>>>>> under active development for several months, and it now has several
>>>>>> features, so I want to use this thread to discuss delivering the first
>>>>>> release of this crate.
>>>>>>
>>>>>> Why this first release 0.2.0?
>>>>>>
>>>>>> Before iceberg-rust <https://github.com/apache/iceberg-rust/> was
>>>>>> developed, there were already efforts trying to develop a rust version of
>>>>>> iceberg, and it has been registered in crates.io under the name
>>>>>> `iceberg`. We contacted the authors and they kindly transferred the
>>>>>> ownership of this crate to apache, so that we can use `iceberg` as our
>>>>>> crate's name. But due to the immutability of crates.io
>>>>>> <https://crates.io/>'s package, we can only start with version 0.2.0.
>>>>>>
>>>>>> Where are we?
>>>>>>
>>>>>> Currently we have delivered the following features:
>>>>>>
>>>>>> 1. REST catalog, including manipulating namespaces, load table,
>>>>>> create table, etc.
>>>>>> 2. Serialization/deserialization of table metadata, manifests.
>>>>>> 3. Documentation for this crate: https://rust.iceberg.apache.org/
>>>>>>
>>>>>> What's next?
>>>>>>
>>>>>> Eventually we will reach feature parity with java/python api, so that
>>>>>> we can bring full feature support of the iceberg to rust ecosystems. For
>>>>>> details of feature status, please check the `README.md` in github
>>>>>> repo <https://github.com/apache/iceberg-rust/> .
>>>>>>
>>>>>> Xuanwo
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>


Re: [VOTE] Release Apache PyIceberg 0.6.0rc4

2024-02-10 Thread Fokko Driesprong
Hi Justin, Dan,

Thanks for checking this.

For the Avro one, we copied parts of the decompression and binary decoder
for the internal PyIceberg implementation (that reads from an Iceberg
schema, rather than from an Avro schema). I checked the Avro NOTICE, and
there isn't anything relevant. I noticed that it was also a bit outdated.

For the Thrift and Hive ones, we have an optional dependency that ships the
content under the vendor/ directory:
https://github.com/apache/iceberg-python/tree/main/vendor This is the
Python Thrift code for talking Thrift to the Hive metastore. Both the Hive
and Thrift NOTICES are empty; we can add the "This product includes
software developed at the Apache Software Foundation...", but that seems to
be optional. I've raised a PR to add these anyway.

Kind regards,
Fokko


Op zo 11 feb 2024 om 01:24 schreef Daniel Weeks :

> I think Fokko will need to weigh in on the avro usage.
>
> However, I don't think the notice for thrift or hive needs to be
> included.  The license file states that the project uses the thrift
> definitions, but it does not include them in the project or dependencies.
> There are no direct artifacts from those projects.  Since nothing is
> bundled from those projects, I don't believe this applies.
>
> -Dan
>
> On Sat, Feb 10, 2024 at 4:14 PM Justin Mclean 
> wrote:
>
>> Hi,
>>
>> Sure, the LICENSE file says it includes code from other ASF projects i.e.
>> Avro, Thrift and Hive. All of these have NOTICE files [1][2][3]. The
>> content from those files (or rather a subset) needs to be included in this
>> release's NOTICE file.
>>
>> Kind Regards,
>> Justin
>>
>> 1. https://github.com/apache/avro/blob/main/NOTICE.txt
>> 2. https://github.com/apache/thrift/blob/master/NOTICE
>> 3. https://github.com/apache/hive/blob/master/NOTICE
>
>


Re: [DISCUSS] iceberg-rust 0.2.0 release

2024-02-10 Thread Fokko Driesprong
Hey Renjie,

That would be great. I'm happy to do the committer/PMC side of things.

Let's coordinate on the release tracking issue:
https://github.com/apache/iceberg-rust/issues/180

Kind regards,
Fokko




Op wo 7 feb 2024 om 03:36 schreef Xuanwo :

> Absolutely, we need someone at the level of an iceberg committer or,
> ideally, a PMC member to assist with this release.
>
> I'm more than ready to help make any necessary code changes if needed.
>
> On Wed, Feb 7, 2024, at 10:24, Renjie Liu wrote:
>
> I’m happy to run it, but the remaining parts (creating tags, uploading
> artifacts, etc.) require more permissions, so it would be great if someone
> else could volunteer to do that.
>
> On Wed, Feb 7, 2024 at 01:07 Fokko Driesprong  wrote:
>
> Hey everyone,
>
> Thanks for the great responses here. It looks like we can move this
> forward. I've just merged bumping the version to 0.2.0
> <https://github.com/apache/iceberg-rust/pull/181>. Is anyone interested
> in running this release?
>
> Kind regards,
> Fokko
>
> Op do 1 feb 2024 om 02:39 schreef Renjie Liu :
>
> Thanks everyone for the discussion.
>
> Since we have reached consensus on this, I'm happy to run the first
> release.
>
>
> On Thu, Feb 1, 2024 at 1:41 AM Daniel Weeks 
> wrote:
>
> +1 for 0.2.0 release as well.  Really excited about the progress here.
>
> -Dan
>
> On Wed, Jan 31, 2024 at 9:36 AM Fokko Driesprong  wrote:
>
> I'm all for the 0.2.0 release. Kudos to all the work so far. While the
> functionality is limited today, a lot of things are already in progress and
> it looks very promising. Also, running a release now will help to
> streamline the release process.
>
> Kind regards,
> Fokko Driesprong
>
> Op wo 31 jan 2024 om 17:28 schreef Jack Ye :
>
> Excited about the progress in Rust! +1 for releasing 0.2.0
>
> -Jack
>
> On Wed, Jan 31, 2024 at 8:26 AM Ryan Blue  wrote:
>
> Thanks, Renjie! It's great to see all of the progress in Rust. I agree
> with getting the code released and I'm looking forward to testing it out!
>
> On Wed, Jan 31, 2024 at 12:35 AM Xuanwo  wrote:
>
>
> We have been working on the iceberg-rust project for a while. Although
> there is still much work to be done, I believe it is important to release
> it in order to attract more users and developers to join us.
>
> At current stage, we have implemented basic features that users need to
> read a table. It could be a good start!
>
> On Wed, Jan 31, 2024, at 14:46, Renjie Liu wrote:
>
> Hi, everyone:
>
> iceberg-rust <https://github.com/apache/iceberg-rust/> has been under
> active development for several months, and it now has several features, so
> I want to use this thread to discuss delivering the first release of this
> crate.
>
> Why this first release 0.2.0?
>
> Before iceberg-rust <https://github.com/apache/iceberg-rust/> was
> developed, there were already efforts trying to develop a rust version of
> iceberg, and it has been registered in crates.io under the name
> `iceberg`. We contacted the authors and they kindly transferred the
> ownership of this crate to apache, so that we can use `iceberg` as our
> crate's name. But due to the immutability of crates.io <https://crates.io/>'s
> package, we can only start with version 0.2.0.
>
> Where are we?
>
> Currently we have delivered the following features:
>
> 1. REST catalog, including manipulating namespaces, load table, create
> table, etc.
> 2. Serialization/deserialization of table metadata, manifests.
> 3. Documentation for this crate: https://rust.iceberg.apache.org/
>
> What's next?
>
> Eventually we will reach feature parity with java/python api, so that we
> can bring full feature support of the iceberg to rust ecosystems. For
> details of feature status, please check the `README.md` in github repo
> <https://github.com/apache/iceberg-rust/> .
>
> Xuanwo
>
>
>
> --
> Ryan Blue
> Tabular
>
> Xuanwo
>
>


Re: [VOTE] Release Apache PyIceberg 0.6.0rc4

2024-02-11 Thread Fokko Driesprong
That makes sense. I've updated the PR:
https://github.com/apache/iceberg-python/pull/410/ PTAL.

Kind regards,
Fokko

Op zo 11 feb 2024 om 03:58 schreef Justin Mclean :

> Hi,
>
> For the Thrift and Hive ones, we have an optional dependency that ships
> the content under the vendor/ directory:
> https://github.com/apache/iceberg-python/tree/main/vendor These are the
> Python Thrift code for talking Thrift to the Hive metastore. Both the Hive
> and Thrift NOTICES are empty, we can add the "This product includes
> software developed at the Apache Software Foundation...", but that seems
> to be optional.
>
>
> Yes, there is no need to repeat, "This product includes software developed
> at The Apache Software Foundation (http://www.apache.org/).” It only
> needs to be included once.
>
> Kind Regards,
> Justin
>


[VOTE] Release Apache Iceberg Rust 0.2.0 RC1

2024-02-15 Thread Fokko Driesprong
Hello, Apache Iceberg Rust Community,

This is a call for a vote to release Apache Iceberg Rust version 0.2.0.

The tag to be voted on is 0.2.0-rc.1.

This first release provides integration with the REST catalog and a lot of
scaffolding that's needed for reading the data.

The release candidate:

https://dist.apache.org/repos/dist/dev/iceberg/iceberg-rust-0.2.0-rc.1/

Keys to verify the release candidate:

https://downloads.apache.org/iceberg/KEYS

Git tag for the release:

https://github.com/apache/iceberg-rust/releases/tag/v0.2.0-rc.1

Please download, verify, and test.

The VOTE will be open for at least 72 hours and until the necessary
number of votes are reached.

[ ] +1 approve
[ ] +0 no opinion
[ ] -1 disapprove with the reason

To learn more about Apache Iceberg, please see
https://rust.iceberg.apache.org/

Checklist for reference:

[ ] Download links are valid.
[ ] Checksums and signatures.
[ ] LICENSE/NOTICE files exist
[ ] No unexpected binary files
[ ] All source files have ASF headers
[ ] Can compile from source
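
For the "Checksums and signatures" item, the checksum part can be
spot-checked with a few lines of Python. This is only a sketch and not the
project's own verification script: the helper names are hypothetical, and it
assumes the checksum file uses the sha512sum layout (hex digest first, then
the file name). Signature verification against the KEYS file still needs gpg.

```python
import hashlib
from pathlib import Path


def sha512_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(artifact: Path, checksum_file: Path) -> bool:
    """Compare an artifact with a `<hex digest>  <file name>` checksum file.

    Assumption: the first whitespace-separated token is the digest.
    """
    expected = checksum_file.read_text().split()[0].lower()
    return sha512_of(artifact) == expected
```

Calling checksum_matches() with the downloaded tarball and its checksum file
returns True only when the digests agree.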

For a more detailed checklist, please refer to:
https://github.com/apache/iceberg-rust/tree/main/scripts

To compile from the source, please refer to:
https://github.com/apache/iceberg-rust/blob/main/CONTRIBUTING.md

Huge thanks to: Amogh Jahagirdar, Chengxu Bian, Christian Daudt, Farooq
Qaiser, JanKaul, Manu Zhang, Mark Grey, Renjie Liu, Tyler Schauer, Xiaoyang
Liu, Xuanwo, ZENOTME, barronw, hiirrxnn, y0psolo, yi wang, zhjwpku and of
course dependabot[bot] for working on this first release!

The release includes a Python script to help you verify the release candidate:

./scripts/verify.py

Please consider this my +1 (binding) vote. I've run the license checks and
tested against the REST catalog, and it worked like a charm. Code can be
found here: https://github.com/Fokko/hello-iceberg/blob/master/src/main.rs

Thanks, Fokko


Java Iceberg 2.0: Hadoop upgrade

2024-02-16 Thread Fokko Driesprong
Hi everyone,

I want to discuss adding the Hadoop upgrade to the list after moving to
Iceberg 2.0. We still compile against Hadoop 2.7.3 to ensure we support as
many users as possible. Hadoop 2.7.3 was released in August 2016 and has
not been maintained for a long time.

My main reason for doing the upgrade is that on the Parquet MR project,
I've been pushing back the Hadoop upgrade to ensure compatibility with
Iceberg. However, at some point, we have to pull the trigger here. This
will simplify things on the Parquet side and avoid having to check if the
Java API exists and such.

Since Hadoop 3.3+ officially supports Java 11, I would suggest dropping
everything below that. I wanted to check on the mailing list if there are
any thoughts or concerns.

Kind regards,
Fokko


Re: [VOTE] Release Apache PyIceberg 0.6.0rc6

2024-02-19 Thread Fokko Driesprong
+1 (binding)

I've checked signatures and checksums, checked the licenses, and did some
checks around writing.

Kind regards,
Fokko

Op ma 19 feb 2024 om 03:07 schreef Amogh Jahagirdar :

> +1 non-binding
> Verified signatures, checksum, and license
> Ran unit/integ tests on Python 3.10.4
> Ran ad-hoc tests w/ Rest Catalog
>
> Thanks all,
> Amogh Jahagirdar
>
> On Sun, Feb 18, 2024 at 4:32 PM Hussein Awala  wrote:
>
>> +1 (non-binding)
>>
>> - Tested the new writing feature with a non-partitioned table
>> - Created a non-partitioned table using PyArrow Schema
>> - Tested the new MacOS arm wheel
>>
>> All looks good!
>>
>> On Mon, Feb 19, 2024 at 1:20 AM Honah J.  wrote:
>>
>>> +1 (non-binding)
>>>
>>> - Verified signatures and checksums
>>> - Verified license
>>> - Ran unit tests and integration tests
>>>
>>> Best regards,
>>> Honah
>>>
>>> On Sun, Feb 18, 2024 at 2:54 PM Daniel Weeks 
>>> wrote:
>>>
 +1 (binding)

 Verified sigs/sums/license/tests (python 3.11)

 Also ran local tests against Hive and REST catalogs using
 appends/overwrites.

 -Dan

 On Sun, Feb 18, 2024 at 1:37 PM Ryan Blue  wrote:

> +1 (binding)
>
> * Checked checksum, signature, recent license changes
> * Built and tested in Python 3.10
> * Ran CLI checks against a REST catalog
>
> Ryan
>
> On Thu, Feb 15, 2024 at 7:55 AM Uwe L. Korn  wrote:
>
>> Hello all,
>>
>> just wanted to give a heads-up that I started publishing the release
>> candidates on a separate channel on conda-forge:
>> https://github.com/conda-forge/pyiceberg-feedstock/tree/rc. Thus, if
>> you want to test them out with conda/micromamba, you can also get them via
>>
>> conda install -c conda-forge/label/pyiceberg_rc pyiceberg
>>
>> or
>>
>> micromamba install -c conda-forge/label/pyiceberg_rc pyiceberg
>>
>> Best,
>> Uwe
>>
>> On Wed, Feb 14, 2024, at 3:16 PM, Sung Yun (BLOOMBERG/ 120 PARK)
>> wrote:
>>
>> Hi Everyone,
>>
>> We are moving onto the next RC with some important fixes. This RC
>> includes:
>>
>> * Bug Fix in passing configuration through environment variables #423
>> * Arm wheels #416
>> * Correction to the NOTICE and LICENSE #413
>>
>> Again, here's a summary of the high level features included in this
>> release:
>>
>> * Write support for writing to unpartitioned tables
>> * Includes snapshot generation
>> * Constructing Avro writer trees
>> * Support for writing metadata, which enables commit support for the
>> Hive, SQL, and Glue catalogs.
>> * Support for name-mapping
>> * Easy evolution of schema using the union_by_name method
>> * Support for creating unpartitioned tables using PyArrow Schema
>>
>> The commit ID is cc449266e7fe0e97f23e61b3c732b75a0d0a8dec
>>
>> * This corresponds to the tag: pyiceberg-0.6.0rc6
>> (a6cf17c301561595f7cfa497a1df1ec49e682a3b)
>> *
>> https://github.com/apache/iceberg-python/releases/tag/pyiceberg-0.6.0rc6
>> *
>> https://github.com/apache/iceberg-python/tree/cc449266e7fe0e97f23e61b3c732b75a0d0a8dec
>>
>> The release tarball, signature, and checksums are here:
>>
>> * https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.6.0rc6/
>>
>> You can find the KEYS file here:
>>
>> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>>
>> Convenience binary artifacts are staged on pypi:
>>
>> https://pypi.org/project/pyiceberg/0.6.0rc6/
>>
>> And can be installed using: pip3 install pyiceberg==0.6.0rc6
>>
>> Please download, verify, and test.
>>
>> Please vote in the next 72 hours.
>> [ ] +1 Release this as PyIceberg 0.6.0
>> [ ] +0
>> [ ] -1 Do not release this because...
>>
>>
>>
>
> --
> Ryan Blue
> Tabular
>



Re: [VOTE] Release Apache Iceberg Rust 0.2.0 RC1

2024-02-20 Thread Fokko Driesprong
his line
> <https://github.com/DevinR528/cargo-sort/blob/55ec89082466f6bb246d870a8d56d166a8e1f08b/src/main.rs#L56>
>  ,
> e.g. since the path(/home/blue/tmp/apache-iceberg-rust-0.2.0-src)
> contains '.', it thinks it's a file without checking, and then it tries to
> read it as a file, and failed. A simple workaround is to rename
> apache-iceberg-rust-0.2.0-src to something without dots, and it works. I'll
> report a bug for cargo sort.
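
The failure mode described above can be illustrated with a small sketch.
This is not cargo-sort's actual code (which is written in Rust); the function
names are hypothetical and only mirror the described behaviour: treating any
path that contains a '.' as a file, compared with asking the filesystem.

```python
import os
import tempfile


def looks_like_file_naive(path: str) -> bool:
    """Hypothetical heuristic: a dot anywhere in the path means "file"."""
    return "." in path


def is_file_checked(path: str) -> bool:
    """Ask the filesystem instead of guessing from the name."""
    return os.path.isfile(path)


with tempfile.TemporaryDirectory() as parent:
    # A directory named like the top-level folder of the release tarball.
    src_dir = os.path.join(parent, "apache-iceberg-rust-0.2.0-src")
    os.mkdir(src_dir)
    naive = looks_like_file_naive(src_dir)  # the version dots mislead it
    checked = is_file_checked(src_dir)      # the path is a directory
```

With the naive check, the versioned directory name is classified as a file,
which matches the "no file found" error; renaming the directory to one
without dots sidesteps the heuristic.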
>
>
> On Tue, Feb 20, 2024 at 8:54 AM Ryan Blue  wrote:
>
>> +1 (binding)
>>
>>- Checked signature and checksum
>>- Ran the license check using docker run -it --rm -v
>>$(pwd):/github/workspace apache/skywalking-eyes header check
>>(I found this in the release.sh script)
>>- Verified .licenserc.yaml, LICENSE, and NOTICE
>>- Spot checked occurrences of ‘[Ff]rom’, ‘http’, and ‘[Cc]opied’ in
>>source to check for undocumented copied code (none)
>>- Compiled and tested in 1.75.0 using make test
>>- Built with cargo build --release
>>- Ran several makefile checks
>>
>> Non-blocking issues:
>>
>>- The REST catalog was unable to resolve icebergdata.minio causing 2
>>test failures. I had to switch over to local FS to run tests or else
>>rest_catalog_test cases test_create_table and test_update_table would 
>> fail.
>>I suspect this is a docker problem because there is a link in the
>>docker-compose.yaml file to provide that alias
>>- The LICENSE file doesn’t contain any third-party code
>>documentation. That’s fine if there isn’t any copied code in the whole
>>project, but seems a little suspicious. Copying code is fairly common.
>>Please help us make sure code taken from other places is properly
>>documented!
>>- The release script creates the tarball with git archive — which is
>>good — but doesn’t specify the specific files to include so you get
>>everything, including .gitignore, .github, .asf.yaml, and others that
>>aren't needed. I prefer being explicit about what is included to minimize
>>unnecessary files.
>>- make check failed in a cargo sort command. It looks like this is
>>not intended to work in a release tarball?
>>
>>cargo sort -c -w
>>error: no file found at: /home/blue/tmp/apache-iceberg-rust-0.2.0-src
>>make: *** [Makefile:33: cargo-sort] Error 1
>>
>>
>>
>> On Mon, Feb 19, 2024 at 11:00 AM Jack Ye  wrote:
>>
>>> +1 (binding)
>>>
>>> Verified checksum, signature, license, note, ASF header
>>> Ran build and test
>>> Checked no unexpected binary files
>>>
>>> Best,
>>> Jack Ye
>>>
>>> On Mon, Feb 19, 2024 at 2:33 AM Jean-Baptiste Onofré 
>>> wrote:
>>>
>>>> +1 (non binding)
>>>>
>>>> I checked:
>>>> - checksum and signature are correct
>>>> - ASF headers are there (not in the tsv files but not a problem)
>>>> - no binary found in the source distribution
>>>>
>>>> Good improvement for next releases: update NOTICE file to mention non
>>>> ASF dependencies (listed in DEPENDENCIES.rust.tsv) with summary of the
>>>> licenses. I will propose a PR about that.
>>>>
>>>> Thanks !
>>>> Regards
>>>> JB
>>>>
>>>> On Thu, Feb 15, 2024 at 1:52 PM Fokko Driesprong 
>>>> wrote:
>>>> >
>>>> > Hello, Apache Iceberg Rust Community,
>>>> >
>>>> > This is a call for a vote to release Apache Iceberg Rust version
>>>> 0.2.0.
>>>> >
>>>> > The tag to be voted on is 0.2.0-rc.1.
>>>> >
>>>> > This first release provides integration with the REST catalog and a
>>>> lot of scaffolding that's needed for reading the data.
>>>> >
>>>> > The release candidate:
>>>> >
>>>> >
>>>> https://dist.apache.org/repos/dist/dev/iceberg/iceberg-rust-0.2.0-rc.1/
>>>> >
>>>> > Keys to verify the release candidate:
>>>> >
>>>> > https://downloads.apache.org/iceberg/KEYS
>>>> >
>>>> > Git tag for the release:
>>>> >
>>>> > https://github.com/apache/iceberg-rust/releases/tag/v0.2.0-rc.1
>>>> >
>>>> > Please download, verify, and test.
>>>> >
>>>> > The VOTE will be open for at least 72 hours and

Re: [VOTE] Release Apache Iceberg 1.5.0 RC0

2024-02-20 Thread Fokko Driesprong
Just using this thread to come back to the NOTICE discussion. This also
came up with the latest Python release, and I spent quite a bit of time on
it.

> If it's "used" section is not strictly required in NOTICE from a legal
> perspective, the embedded dependencies should be mentioned (either
> under the Apache license as soon as they are not a ASF project),
> that's the "are not satisfied by either the text of LICENSE or the
> presence of licensing information embedded within the bundled
> dependency" part of the policy.


The source of truth that I follow is the ASF how-to guide.

> By embedded, I mean distributed in the source distribution but also in
> binary distributions (as soon as we publish/distribute it).


The term in the how-to guide is bundling. For me, this means code that is
packaged in a Java fat jar and redistributed under the name of Iceberg.

> For instance, here https://github.com/apache/karaf/blob/main/NOTICE
> you can see the included software (used software is not strictly
> required).


I think this conflicts with the guide as it states:

> Do not add anything to NOTICE which is not legally required.


This will add a burden to anyone who wants to redistribute Iceberg, because
they have to check notices that are not legally required and carry them over
into their own NOTICE. Notices that are not required are mentioned in the
LICENSE file, where attribution to the original author is given.

This is how I interpret the legalese from the how-to guide after going
through it for PyIceberg. I think we should follow the guide, and this also
avoids having to keep the NOTICE file up to date.

Kind regards,
Fokko



Op di 20 feb 2024 om 11:06 schreef Ajantha Bhat :

> Thanks Eduard,
>
> I will share a new RC info with the fix.
>
> - Ajantha
>
> On Tue, Feb 20, 2024 at 12:17 PM Jean-Baptiste Onofré 
> wrote:
>
>> Hi Ryan,
>>
>> If it's "used" section is not strictly required in NOTICE from a legal
>> perspective, the embedded dependencies should be mentioned (either
>> under the Apache license as soon as they are not a ASF project),
>> that's the "are not satisfied by either the text of LICENSE or the
>> presence of licensing information embedded within the bundled
>> dependency" part of the policy.
>>
>> By embedded, I mean distributed in the source distribution but also in
>> binary distributions (as soon as we publish/distribute it).
>>
>> For instance, here https://github.com/apache/karaf/blob/main/NOTICE
>> you can see the included software (used software is not strictly
>> required).
>>
>> Regards
>> JB
>>
>> On Mon, Feb 19, 2024 at 5:52 PM Ryan Blue  wrote:
>> >
>> > JB,
>> >
>> > Can you help me understand your rationale for updating NOTICE? We are
>> strict about what goes into the NOTICE file to comply with ASF guidance:
>> >
>> > The NOTICE file is reserved for a certain subset of legally required
>> notifications which are not satisfied by either the text of LICENSE or the
>> presence of licensing information embedded within the bundled dependency.
>> > …
>> > It is important to keep NOTICE as brief and simple as possible, as each
>> addition places a burden on downstream consumers.
>> >
>> > Do not add anything to NOTICE which is not legally required.
>> >
>> > It sounds like the content you’re talking about would be better located
>> in the README instead.
>> >
>> > Ryan
>> >
>> >
>> > On Mon, Feb 19, 2024 at 2:27 AM Jean-Baptiste Onofré 
>> wrote:
>> >>
>> >> +1 (non binding)
>> >>
>> >> I checked:
>> >> - checksum and signature are correct
>> >> - ASF headers are OK
>> >> - no binary found in the source distribution
>> >> - build is OK from the source distribution
>> >>
>> >> To be improved for next releases (not blocker at all):
>> >> - NOTICE file should mention dependencies and tools used (not
>> >> necessary included). I'm thinking about openapi, palantir plugins, aws
>> >> sdk, jackson, ... I will do a PR about that.
>> >> - doap.rdf file can be updated as part of the RC
>> >>
>> >> Thanks !
>> >> Regards
>> >> JB
>> >>
>> >> On Mon, Feb 19, 2024 at 11:02 AM Ajantha Bhat 
>> wrote:
>> >> >
>> >> > Hi Everyone,
>> >> >
>> >> > I propose that we release the following RC as the official Apache
>> Iceberg 1.5.0 release.
>> >> >
>> >> > The commit ID is bff665278245128a71982ba5ac5981a9e71c4509
>> >> > * This corresponds to the tag: apache-iceberg-1.5.0-rc0
>> >> > * https://github.com/apache/iceberg/commits/apache-iceberg-1.5.0-rc0
>> >> > *
>> https://github.com/apache/iceberg/tree/bff665278245128a71982ba5ac5981a9e71c4509
>> >> >
>> >> > The release tarball, signature, and checksums are here:
>> >> > *
>> https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-1.5.0-rc0
>> >> >
>> >> > You can find the KEYS file here:
>> >> > * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>> >> >
>> >> > Convenience binary artifacts are staged on Nexus. The Maven
>> repository URL is:
>> >> > *
>> http

Re: Gravitino an Iceberg REST catalog service

2024-02-29 Thread Fokko Driesprong
Hey everyone,

Thanks for raising this. I think a test-jar would be a great first step.

> We already maintain "service" considering JDBC, Hive, etc catalogs. REST
> Catalog ref impl in Iceberg would be the same.


What I think Ryan means by a service is having to maintain Postgres (JDBC
backend), Hive Metastore (Hive backend), etc. There is a lot to it to
properly scale these backends.

For PyIceberg we decided to build the examples backed by the SqlCatalog.
This can be either in memory or on a local disk (SQLite). Of course, it has
limited parallelism, but it makes it easy to give Iceberg a try. One of the
main motivations for doing it this way was that it doesn't require any
additional services. Running additional services would require having a
JRE, Docker, etc. installed, and potentially also an RDBMS backend to
persist the data.
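
A minimal sketch of such a setup, assuming PyIceberg and SQLAlchemy are
installed. SqlCatalog with the `uri` and `warehouse` properties follows the
PyIceberg documentation; the helper names, the catalog name "local", and the
database file name are illustrative assumptions.

```python
from pathlib import Path


def sqlite_catalog_properties(warehouse_dir: str) -> dict:
    """Build properties for a SQLite-backed catalog stored on local disk."""
    root = Path(warehouse_dir).resolve()
    return {
        # SQLAlchemy-style URL pointing at a database file inside the warehouse.
        "uri": f"sqlite:///{root / 'pyiceberg_catalog.db'}",
        # Data and metadata files are written below this local directory.
        "warehouse": root.as_uri(),
    }


def open_local_catalog(warehouse_dir: str):
    """Open a SqlCatalog on local disk (requires PyIceberg and SQLAlchemy)."""
    # Imported lazily so the helper above works without PyIceberg installed.
    from pyiceberg.catalog.sql import SqlCatalog

    Path(warehouse_dir).mkdir(parents=True, exist_ok=True)
    return SqlCatalog("local", **sqlite_catalog_properties(warehouse_dir))
```

Calling open_local_catalog("/tmp/warehouse") would then give a catalog that
needs no separate server process; everything lives in one directory.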

Kind regards,
Fokko


Op vr 1 mrt 2024 om 07:34 schreef Jean-Baptiste Onofré :

> Hi Ryan
>
> If we plan to reduce the number of catalogs (and I think it makes
> sense and I'm with you on that), we will need an impl/service in
> Iceberg for the REST Catalog API, else the users won't be able to use
> Iceberg "out of the box".
> We already maintain "service" considering JDBC, Hive, etc catalogs.
> REST Catalog ref impl in Iceberg would be the same.
>
> So, in order to promote the REST Catalog API as the Catalog "unique"
> façade for Iceberg, I would be in favor of having a simple REST
> service in Iceberg.
> It would be the entry point for Iceberg users and they can use other
> REST catalogs depending on their needs (Gravitino, Tabular, ...).
>
> Regards
> JB
>
> On Fri, Mar 1, 2024 at 1:28 AM Ryan Blue  wrote:
> >
> > There is a reference implementation in the project, in the
> CatalogHandlers class. That implements REST requests using a catalog and
> returns REST responses. I believe this is what Gravitino relies on and I
> mentioned it above in the discussion about whether we should have a catalog
> service.
> >
> > Catalog tests also use catalog handlers, but use a simple HTTP wrapper
> to test the HTTP client. There is also a test class that accepts HTTP calls
> directly and also runs JSON serialization on requests and responses.
> >
> > So far, the Iceberg community has avoided maintaining a service. That
> brings in a lot of complications. So far, we’ve preferred to remain focused
> on providing a library that can be used to wire up something like a REST
> catalog, but not provide a runtime service.
> >
> > Ryan
> >
> >
> > On Thu, Feb 29, 2024 at 2:59 AM Jean-Baptiste Onofré 
> wrote:
> >>
> >> Hi Ajantha,
> >>
> >> Thanks for sharing your thoughts.
> >>
> >> It makes sense for Gravitino to be a TLP (after the incubation period)
> >> because Gravitino is "more" than an Iceberg catalog. It implements the
> >> Iceberg REST Catalog API, but it's also a metadata catalog/repo with
> >> additional features.
> >>
> >> That said, I agree with what you said:
> >> 1. We have the openapi yaml in the Iceberg project, but no reference
> >> implementation in the project itself. I think REST Catalog is a good
> >> approach as a "central" Catalog API because any Iceberg engine/layer
> >> could use this API (even if written in Python, rust, go, whatever),
> >> and it allows new use cases (like easily move data from an engine to
> >> another as the catalog API would be the same).
> >> 2. From an ASF standpoint, I would not talk about "subproject" but
> >> more repositories. The reason is because in terms of governance, it's
> >> still the Iceberg project (PMC member or committer has the same
> >> permission on all repositories in the Iceberg project, it's not
> >> possible to have a committer only on iceberg-rust for instance).
> >> Generally speaking, we should limit the number of subprojects.
> >> 3. I think it would be fair to have REST Catalog resources (openapi
> >> yaml + a ref impl) in a iceberg-catalog repository.
> >> 4. However, It's important to have a more global discussion within the
> >> community about Iceberg 2.0 and the roadmap about catalogs: do we
> >> deprecate Iceberg Java Catalog API in favor of the REST Catalog API ?
> >> What do we do with the existing catalogs ? etc. I think it's a fair
> >> discussion to have for Iceberg 2.0.
> >>
> >> It's an important discussion, community driven.
> >>
> >> Regards
> >> JB
> >>
> >> On Thu, Feb 29, 2024 at 9:44 AM Ajantha Bhat 
> wrote:
> >> >
> >> > I apologize for the delay in responding.
> >> >
> >> > I'm pleased to see the development of an open-source REST catalog
> implementation, and the potential transition of Gravitino to an ASF project
> is certainly promising.
> >> > But a REST catalog server implementation will be a small part of the
> Gravitino ASF project, which has many other things along with the catalog.
> >> >
> >> > While I understand Iceberg's focus on the table format specification
> and its implementation,
> >> > I would like to propose the creation of a sub-project for the REST
> catalog server implementation under the Iceb

Re: [VOTE] Release Apache Iceberg 1.5.0 RC4

2024-03-01 Thread Fokko Driesprong
+1 (binding)

- Checked checksum and signature
- Ran a modified version of dbt-spark to take advantage of the views, and
it worked like a charm! 🥳
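
For anyone who wants to script the checksum part of that check, a minimal
sketch follows. It assumes the .sha512 file starts with the hex digest
(optionally followed by the file name), and the function names are made up
for illustration. The signature still needs a separate gpg verification
against the KEYS file, which this does not cover.

```python
import hashlib
from pathlib import Path


def sha512_of(path, chunk_size=1 << 20):
    """Return the SHA-512 hex digest of a file, read in chunks so that a
    large release tarball does not have to fit in memory."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(tarball, checksum_file):
    """Compare a tarball against its published checksum file.

    Assumption: the first whitespace-separated token in the checksum file
    is the expected hex digest.
    """
    expected = Path(checksum_file).read_text().split()[0].lower()
    return sha512_of(tarball) == expected
```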

Cheers, Fokko

Op vr 1 mrt 2024 om 06:43 schreef Ajantha Bhat :

> Gentle reminder.
>
> On Wed, Feb 28, 2024 at 8:34 PM Eduard Tudenhoefner 
> wrote:
>
>> +1 (non-binding)
>>
>> * validated checksum and signature
>> * checked license docs & ran RAT checks
>> * ran build and tests with JDK11
>> * built new docker images and ran through
>> https://iceberg.apache.org/spark-quickstart/
>> * tested with Trino & Presto
>> * tested view support with Spark 3.5 + JDBC/REST catalog
>> * tested view behavior when creating/reading/dropping views from
>> Spark/Trino using the diff from
>> https://github.com/trinodb/trino/pull/19818
>>
>> Eduard
>>
>> On Wed, Feb 28, 2024 at 1:55 PM Jean-Baptiste Onofré 
>> wrote:
>>
>>> +1 (non binding)
>>>
>>> I checked:
>>> - Signature and checksum are OK
>>> - Build is OK on the source distribution
>>> - ASF headers are present
>>> - No binary file found in the source distribution
>>> - Tested on iceland (sample project) + trino and also JDBC Catalog
>>>
>>> Thanks !
>>> Regards
>>> JB
>>>
>>> On Tue, Feb 27, 2024 at 1:16 PM Ajantha Bhat 
>>> wrote:
>>> >
>>> > Hi Everyone,
>>> >
>>> > I propose that we release the following RC as the official Apache
>>> Iceberg 1.5.0 release.
>>> >
>>> > The commit ID is e39ec185d7879c1a310769d33e0b1b6ad12486a9
>>> > * This corresponds to the tag: apache-iceberg-1.5.0-rc4
>>> > * https://github.com/apache/iceberg/commits/apache-iceberg-1.5.0-rc4
>>> > *
>>> https://github.com/apache/iceberg/tree/e39ec185d7879c1a310769d33e0b1b6ad12486a9
>>> >
>>> > The release tarball, signature, and checksums are here:
>>> > *
>>> https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-1.5.0-rc4
>>> >
>>> > You can find the KEYS file here:
>>> > * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>>> >
>>> > Convenience binary artifacts are staged on Nexus. The Maven repository
>>> URL is:
>>> > *
>>> https://repository.apache.org/content/repositories/orgapacheiceberg-1158/
>>> >
>>> > Please download, verify, and test.
>>> >
>>> > Please vote in the next 72 hours.
>>> >
>>> > [ ] +1 Release this as Apache Iceberg 1.5.0
>>> > [ ] +0
>>> > [ ] -1 Do not release this because...
>>> >
>>> > Only PMC members have binding votes, but other community members are
>>> encouraged to cast
>>> > non-binding votes. This vote will pass if there are 3 binding +1 votes
>>> and more binding
>>> > +1 votes than -1 votes.
>>> >
>>> > - Ajantha
>>>
>>


New committer: Bryan Keller

2024-03-05 Thread Fokko Driesprong
Hi everyone,

The Project Management Committee (PMC) for Apache Iceberg has invited Bryan
Keller to become a committer and we are pleased to announce that he has
accepted.

Bryan was contributing to Iceberg before it was even open-source, did a lot
of work on the topic of metadata generation, and is now leading the effort
of migrating the Kafka Connect integration into OSS Iceberg.

Being a committer enables easier contribution to the project since there is
no need to go via the patch submission process. This should enable better
productivity. A PMC member helps manage and guide the direction of the
project.

Please join me in congratulating Bryan.

Cheers,
Fokko


Re: [VOTE] Release Apache Iceberg 1.5.0 RC6

2024-03-08 Thread Fokko Driesprong
+1 (binding)

Thanks again for working on this Ajantha and Eduard.

- Checked checksum and signature
- Ran a modified version of dbt-spark to take advantage of the views and it
worked great!

Cheers, Fokko


Op za 9 mrt 2024 om 06:35 schreef Szehon Ho :

> +1 (binding)
>
> * Verified signature
> * Verified checksum
> * RAT check
> * built JDK 11
> * Ran basic tests on Spark 3.5
>
> Thanks
> Szehon
>
> On Fri, Mar 8, 2024 at 5:50 PM Amogh Jahagirdar  wrote:
>
>> +1 non-binding
>>
>> Verified signatures, checksums, RAT checks, build, and tests with JDK11. I
>> also ran ad-hoc tests for views in Trino with the rest catalog.
>>
>> Thanks,
>>
>> Amogh Jahagirdar
>>
>> On Fri, Mar 8, 2024 at 5:04 PM Ryan Blue  wrote:
>>
>>> +1 (binding)
>>>
>>> - Normal tarball verification
>>> - Read from my broken view successfully
>>>
>>> On Fri, Mar 8, 2024 at 3:07 PM Daniel Weeks  wrote:
>>>
 +1 (binding)

 Verified sigs/sums/license/build/tests (Java 17)

 -Dan

 On Thu, Mar 7, 2024 at 2:10 PM Hussein Awala  wrote:

> +1 (non-binding)
> - checked checksum and signature
> - built from source with jdk11
> - tested read and write with Spark 3.5.1 and Glue catalog
>
> All looks good
>
> On Thu, Mar 7, 2024 at 10:49 PM Drew  wrote:
>
>> +1 (non-binding)
>>
>> - verified signature and checksum
>> - verified RAT license check
>> - verified build/tests passing with JDK17
>> - ran some manual tests on Spark3.5 with GlueCatalog
>>
>> Drew
>>
>> On Thu, Mar 7, 2024 at 4:38 AM Ajantha Bhat 
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> * validated checksum and signature
>>> * checked license docs & ran RAT checks
>>> * ran build and tests with JDK11
>>> * *verified view support for Nessie catalog with Spark 3.5.*
>>> * *verified this RC against Trino
>>> (https://github.com/trinodb/trino/pull/20957
>>> )*
>>>
>>> - Ajantha
>>>
>>>
>>> On Wed, Mar 6, 2024 at 7:25 PM Jean-Baptiste Onofré 
>>> wrote:
>>>
 +1 (non binding)

 - checksums and signatures are OK
 - ASF headers are present
 - No unexpected binary files in the source distribution
 - Build OK with JDK11
 - JdbcCatalog tested on Trino and Iceland
 - No unexpected artifact distributed

 Thanks !

 Regards
 JB

 On Wed, Mar 6, 2024 at 12:04 AM Ajantha Bhat 
 wrote:
 >
 > Hi Everyone,
 >
 > I propose that we release the following RC as the official Apache
 Iceberg 1.5.0 release.
 >
 > The commit ID is 2519ab43d654927802cc02e19c917ce90e8e0265
 > * This corresponds to the tag: apache-iceberg-1.5.0-rc6
 > *
 https://github.com/apache/iceberg/commits/apache-iceberg-1.5.0-rc6
 > *
 https://github.com/apache/iceberg/tree/2519ab43d654927802cc02e19c917ce90e8e0265
 >
 > The release tarball, signature, and checksums are here:
 > *
 https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-1.5.0-rc6
 >
 > You can find the KEYS file here:
 > * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
 >
 > Convenience binary artifacts are staged on Nexus. The Maven
 repository URL is:
 > *
 https://repository.apache.org/content/repositories/orgapacheiceberg-1161/
 >
 > Please download, verify, and test.
 >
 > Please vote in the next 72 hours.
 >
 > [ ] +1 Release this as Apache Iceberg 1.5.0
 > [ ] +0
 > [ ] -1 Do not release this because...
 >
 > Only PMC members have binding votes, but other community members
 are encouraged to cast
 > non-binding votes. This vote will pass if there are 3 binding +1
 votes and more binding
 > +1 votes than -1 votes.
 >
 > - Ajantha

>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>


New committer: Renjie Liu

2024-03-08 Thread Fokko Driesprong
Hi everyone,

The Project Management Committee (PMC) for Apache Iceberg has invited
Renjie Liu to become a committer and we are pleased to announce that he has
accepted. We're very excited to have Renjie as a committer as he's leading
the effort of bringing Iceberg to the Rust world.

Being a committer enables easier contribution to the project since there is
no need to go via the patch submission process. This should enable better
productivity. A PMC member helps manage and guide the direction of the
project.

Please join me in congratulating Renjie.

Cheers,
Fokko


Re: [ANNOUNCE] Apache Iceberg release 1.5.0

2024-03-12 Thread Fokko Driesprong
Thanks for running the release Ajantha. It is great to see view
support being released on the Java side 🎉 Thanks everyone for the hard
work in making this release happen, including all our new contributors!

Kind regards,
Fokko



Op di 12 mrt 2024 om 18:53 schreef Yufei Gu :

> Congrats! Thanks for working on this!
>
> Yufei
>
>
> On Mon, Mar 11, 2024 at 6:15 PM Ajantha Bhat 
> wrote:
>
>> I'm pleased to announce the release of Apache Iceberg 1.5.0!
>>
>> Apache Iceberg is an open table format for huge analytic datasets. Iceberg
>> delivers high query performance for tables with tens of petabytes of data,
>> along with atomic commits, concurrent writes, and SQL-compatible table
>> evolution.
>>
>> This release can be downloaded from:
>> https://www.apache.org/dyn/closer.cgi/iceberg/apache-iceberg-1.5.0/apache-iceberg-1.5.0.tar.gz
>>
>> Release notes: https://iceberg.apache.org/releases/#150-release
>>
>> Java artifacts are available from Maven Central.
>>
>> Thanks to everyone for contributing!
>>
>> - Ajantha
>>
>


Re: [DISCUSS] Iceberg board report - March 2024

2024-03-12 Thread Fokko Driesprong
Thanks Ryan,

That looks comprehensive, thanks for taking the time to compile the report.
I have a few suggestions for the release section:

   - Name the releases by name: Python → PyIceberg. If people want to look
   it up, just googling the name will bring them to it directly.
   - Split the releases out by language, and include PyIceberg 0.5.1
   (30-10-2023).
   - I would add that we want to shorten the release cycle of PyIceberg
   substantially to get features out quicker to the community, now that the
   activity on the repository has rapidly increased in the last few weeks.

Apart from that, it looks good to me!

Kind regards,
Fokko


Op di 12 mrt 2024 om 19:02 schreef Ryan Blue :

> Hi everyone,
>
> Here’s my draft for Iceberg’s ASF board report. If you have anything to
> add, please reply!
>
> Ryan
> Description:
>
> Apache Iceberg is a table format for huge analytic datasets that is
> designed
> for high performance and ease of use.
> Project Status:
>
> Current project status: Ongoing
> Issues for the board: None
> Membership Data:
>
> Apache Iceberg was founded 2020-05-19 (4 years ago)
> There are currently 27 committers and 16 PMC members in this project.
> The Committer-to-PMC ratio is roughly 7:4.
>
> Community changes, past quarter:
>
> - No new PMC members. Last addition was Szehon Ho on 2023-04-20.
> - Bryan Keller was added as committer on 2024-03-02
> - Honah J. was added as committer on 2024-01-11
> - Renjie Liu was added as committer on 2024-03-06
>
> Project Activity:
>
> Releases:
>
> - Java 1.5.0 was released on 2024-03-11
> - Rust 0.2.0 was released on 2024-02-20 (first release!)
> - Python 0.6.0 was released on 2024-02-19
> - Java 1.4.3 was released on 2023-12-27
>
> Java implementation:
>
> - 1.5.0 is the first release supporting Iceberg Views
> - Added View resolution support in Spark engine integration
> - Added View commands to Spark (SHOW/CREATE/DROP/etc.)
> - View support in Trino is unblocked by the 1.5.0 release
> - Added View support to REST, Nessie, and JDBC catalogs
> - Discussing Materialized View extensions to Iceberg specs
> - Added EncryptingFileIO to minimize encryption-related API changes
> - Added StandardEncryptionManager to implement Iceberg Encryption spec
> - Added Parquet (native) and Avro (AES GCM) encryption support
> - Added pagination to listing in the REST catalog protocol
> - Discussing multiple extensions to the REST protocol (appends,
>   planning)
> - Added delete file cache to Spark
> - Added support for Flink 1.18
> - Removed support for Spark 3.2
>
> Python implementation
>
> - 0.6.0 is the first release supporting native writes
> - Append and full table overwrite are supported
> - Only writes to unpartitioned tables are supported
> - Added commit support to JDBC, Glue, and Hive catalogs
> - Implemented name mapping support for reading Parquet files without
>   field IDs
> - Actively working on writes to partitioned tables and engine
>   integration
>
> Rust implementation:
>
> - 0.2.0 is the first Rust release
> - Supports reading metadata files
> - Supports REST catalog interaction
> - Scan planning is the next active area of work
>
> Documentation:
>
> - Switched to new site build in the iceberg repository so contributing
>   is easier
>
> Community Health:
>
> The Iceberg community continues to be healthy. Although commit and PR
> activity
> declined, the metrics indicate that activity was still strong (with 70
> contributors and nearly 1,000 commits). This quarter also included holidays
> (which usually have decreased activity) and a huge increase in mailing list
> traffic (60%) because the community has been having many design discussions
> about evolving the REST spec, introducing new specs (materialized views),
> and
> discussions around how to keep track of new design proposals.
>
> The community also started organizing an Iceberg Summit, to be held May
> 14-15.
> The summit has been cleared by trademarks and the call for proposals has
> been
> posted. More information can be found at:
>
> - The Iceberg Summit website: https://iceberg-summit.org/
> - The Call for Proposals: https://sessionize.com/iceberg-summit-2024/
>
> --
> Ryan Blue
>


Re: [DISCUSS] What do we plan for Iceberg 2.0.0 ?

2024-03-13 Thread Fokko Driesprong
Hey JB,

Thanks for raising this. Sorry for the late reply, but I was OOO last week.
I think in general the progress is being tracked on the spec itself.
Also, some features
are already available (default values in Python, and nanosecond timestamps
are being worked on in Java), and I would rather expose these features
already using a feature flag, rather than waiting for the spec to be
finalized. It would be nice to finalize the Spec at some point, to allow
engines that support Iceberg to say that we support up to Spec v3.

> * Data Injection (e.g. Kafka Connect sink)


I'd rather organize these integrations bottom-up than top-down. We only
want to ensure that similar solutions are being developed in parallel. For
Kafka Connect it is part of the Iceberg repository, but it makes more sense
to push this to the project itself (for example in Beam) when possible.
With Hive 4.0.0
the Iceberg integration will also be moved to the Hive side, so that's also
a good opportunity to remove it from the Iceberg repository.

> We have this page https://iceberg.apache.org/roadmap/. I'm not sure it's
> actually up to date.


It is very outdated, and I believe it is best to remove it for now. Every
project is adopting the V3 spec already (default values in PyIceberg,
nanoseconds in Java).

> I also proposed this https://github.com/apache/iceberg/pull/9666 to give a
> rough idea.


We're almost doing a release (roughly) every quarter and I agree it is good
to establish that as a cadence. I've left a small comment on the PR.

> That's a raw discussion start, I propose to create a GitHub "Discussion"
> issue (flagged with 2.0.0 milestone) for each topic where we have consensus.


There is already a 2.0.0 milestone, and we should use it to indicate what
we want to get into 2.0.0. I'm open to creating a Discussion issue if more
people think this is a good idea (typically this was discussed on the
mailing list within the ASF context).

Thanks,
Fokko

Op ma 11 mrt 2024 om 07:34 schreef Jean-Baptiste Onofré :

> Hi folks,
>
> I forgot to provide some background about this thread. The reason for
> this thread is because I think it's important to give visibility to
> our community, not necessarily with strong dates, but more about when
> roughly what could be expected. Without this, it's pretty hard for our
> users to define their own roadmap.
>
> We have this page https://iceberg.apache.org/roadmap/. I'm not sure
> it's actually up to date.
> I also proposed this https://github.com/apache/iceberg/pull/9666 to
> give a rough idea.
>
> So I think it would be good to have a consensus about the roadmap and
> update roadmap page on the website to have some visibility (it would
> be helpful for us too :)).
>
> Thoughts ?
>
> Regards
> JB
>
> On Thu, Mar 7, 2024 at 7:43 PM Jean-Baptiste Onofré 
> wrote:
> >
> > Hi Ryan
> >
> > Yeah I agree to separate discussions on each topic. Actually that was my
> intention ;)
> >
> > I just wanted to have thoughts from everyone about roadmap/timeline.
> >
> > Jack and I will start a dedicated thread about REST catalog.
> >
> > Thanks !
> >
> > Regards
> > JB
> >
> >
> > Le jeu. 7 mars 2024 à 18:34, Ryan Blue  a écrit :
> >>
> >> Hi JB,
> >>
> >> Specs and libraries are versioned separately. In fact, the v2 spec has
> already been voted on and adopted. The next spec version is v3.
> >>
> >> I think we do want to get to a 2.0 of the Java library sometime soon to
> drop some deprecated APIs and clean up a few things, but I don't think that
> we're quite ready to take that on right now, which is likely why there has
> been little activity on this thread.
> >>
> >> I also think that most of these things are going to be discussion
> points that we cover as separate topics, rather than one big "everything
> 2.0" thread. It just doesn't seem manageable to me to cover them all at
> once. Maybe that's just me though.
> >>
> >> Ryan
> >>
> >> On Thu, Mar 7, 2024 at 7:49 AM Jean-Baptiste Onofré 
> wrote:
> >>>
> >>> Hi guys,
> >>>
> >>> Let me ping again on this thread ;)
> >>>
> >>> I think it would be great to give some visibility to the community,
> >>> especially about Spec v3 and Iceberg 2.0.0.
> >>>
> >>> Any comments about Spec V2 / Iceberg 2.0.0 ?
> >>>
> >>> Thanks !
> >>> Regards
> >>> JB
> >>>
> >>> On Fri, Feb 16, 2024 at 4:52 PM Jean-Baptiste Onofré 
> wrote:
> >>> >
> >>> > Hi guys,
> >>> >
> >>> > During the last community meeting, we started to quickly discuss
> Iceberg 2.0.
> >>> > I was quite surprised it came during the community meeting because I
> >>> > don't remember having a previous discussion (on the mailing list)
> >>> > about that.
> >>> >
> >>> > So, I would like to have to start an open discussi

Re: [PROPOSAL] Improvement on our PR flows

2024-03-20 Thread Fokko Driesprong
Hey everyone,

This is a gentle bump from my end on this thread since I like the idea.
Several people have already approved Dan's PR
<https://github.com/apache/iceberg/pull/9932/> about formalizing the
proposal process. Are there any questions or concerns from the PMC before
adopting this?

Kind regards,
Fokko Driesprong

Op wo 13 mrt 2024 om 13:17 schreef Renjie Liu :

> Hi, JB:
>
> Your proposal looks great to me. We should definitely have a vote for a
> proposal impacting the spec, and the model is great.
>
> On Tue, Mar 12, 2024 at 10:55 PM Jean-Baptiste Onofré 
> wrote:
>
>> Hi
>>
>> I think a vote would be necessary only if we don't have consensus on a
>> proposal. If everyone is OK with the proposal (no clear "concern" in the
>> doc and/or the GitHub issue), a vote is not required.
>> That said, any proposal impacting a spec should be voted (as part of
>> the spec proposal).
>>
>> I think it's fair to identify a proposal vote as a "code modification"
>> vote.
>> It means that it follows this model: a negative vote constitutes a
>> veto , which the voting group (generally the PMC of a project) cannot
>> override. Again, this model may be modified by a lazy consensus
>> declaration when the request for a vote is raised, but the full-stop
>> nature of a negative vote does not change. Under normal (non-lazy
>> consensus) conditions, the proposal requires three positive votes and
>> no negative votes in order to pass; if it fails to garner the
>> requisite amount of support, it doesn't. Then the proposer either
>> withdraws the proposal or modifies the code and resubmits it, or the
>> proposal simply languishes as an open issue until someone gets around
>> to removing it.
>>
>> We can link to https://www.apache.org/foundation/voting.html.
>>
>> Regards
>> JB
>>
>> On Tue, Mar 12, 2024 at 2:21 AM Renjie Liu 
>> wrote:
>> >
>> > Hi, Daniel:
>> >
>> > Thanks for this summary.
>> >
>> > I think one thing missing is that do we need a vote for the proposal to
>> be accepted or rejected? If required, what should the voting process be?
>> >
>> > On Tue, Mar 12, 2024 at 9:04 AM Daniel Weeks  wrote:
>> >>
>> >> Hey everyone, I synced up with JB about the proposal process and
>> wanted to see if we could make some initial progress.
>> >>
>> >> Based on some of the earlier discussions, we want to leverage as much
>> of the informal process as possible, but improve discoverability and a
>> little structure.  This probably means using github for tracking, google
>> docs where possible for the early proposal implementation comments, and the
>> dev list for discussion threads, awareness and voting.
>> >>
>> >> That said, I propose we adopt the following:
>> >>
>> >> 1. A simple issue template for initiating a proposal and applying a
>> 'proposal' label to the issue
>> >> 2. Use a github search link to document current proposals (based on
>> the 'proposal' label)
>> >> 3. Continue using google docs for proposals documentation/comments
>> (referenced from the github issue)
>> >> 4. Continue to create DISCUSS threads on the dev list for communication
>> >> 5. Backfill current proposals by creating issues for them
>> >>
>> >> I've created this PR to capture the initial template and docs.
>> >>
>> >> I think we want to introduce this with as little overhead as
>> possible.  Please follow up with questions/comments so we can close this
>> out.
>> >>
>> >> Thanks,
>> >> Dan
>> >>
>> >>
>> >> On Sun, Mar 10, 2024 at 11:30 PM Jean-Baptiste Onofré 
>> wrote:
>> >>>
>> >>> Hi Manu
>> >>>
>> >>> Yup, it's on my TODO. Thanks for the reminder, I will be back on this
>> >>> one this week :)
>> >>>
>> >>> Regards
>> >>> JB
>> >>>
>> >>> On Mon, Mar 11, 2024 at 4:07 AM Manu Zhang 
>> wrote:
>> >>> >
>> >>> > Hi JB,
>> >>> >
>> >>> > Are you still working on this nice proposal?
>> >>> >
>> >>> > Regards,
>> >>> > Manu
>> >>> >
>> >>> > On Thu, Jan 4, 2024 at 3:35 PM Fokko Driesprong 
>> wrote:
>> >>> >>
>

Re: [VOTE] Release Apache PyIceberg 0.6.1rc1

2024-04-04 Thread Fokko Driesprong
+1 (binding)

- Checked the signature and the checksum
- Ran the example notebooks against 0.6.1rc1
- Did some checks locally and all looks good!

Thanks Honah for running the release!

Kind regards,
Fokko

Op do 4 apr 2024 om 17:56 schreef Honah J. :

> Hi Justin,
>
> Thanks for reviewing the release. There were some discussions about the
> NOTICE file in the 0.6.0 release: PR#410 and PR#413. Here are the reasons
> why the following projects have not been included in the NOTICE
> (quoted from the PR):
>
> - Avro: Since we don't bundle the code, but just took some part of it,
>   and want to attribute the author.
> - Thrift/Hive: We don't bundle any code, but just take the
>   Python-compiled Thrift definitions.
>
> Does the above explanation address your concern? Please let me know if you
> have any other questions.
>
> Best regards,
> Honah
>
> On Thu, Apr 4, 2024 at 1:43 AM Justin Mclean 
> wrote:
>
>> Hi,
>>
>> I took a look at this, and the NOTICE file doesn't include the required
>> information from the included Apache projects NOTICE files [1]
>>
>> Kind Regards,
>> Justin
>>
>> 1. https://infra.apache.org/licensing-howto.html#alv2-dep
>>
>


Re: [VOTE] Release Apache PyIceberg 0.6.1rc1

2024-04-05 Thread Fokko Driesprong
Hey everyone,

First of all thanks for all the votes.

Regarding the discussion around the NOTICE: we all agree that when
something is bundled, it needs to be added to the NOTICE. However, Layne's
Law of Debate comes into play: what's the definition of bundling? To expand
on #413:
   - Hive/Thrift: We don't include anything in the release that comes from
   Thrift or Hive directly. We use Thrift to compile the Hive definitions
   to Python, and those are included. I believe an update to the NOTICE is
   not required here.
   - Avro: We took the de/encoders as a starting point for our Avro
   implementation. This code was modified to make it work with an Iceberg
   schema, rather than an Avro schema. The LICENSE was updated to attribute
   the original authors. I can see that this is considered bundling.

> This is the problem when people copy what other projects are doing. In
> general, it is right, but sometimes, it is not. Also, you may not
> understand why it was done in a certain way. I frequently have to point
> this out to Incubating projects. I hate to have to say to an Incubating
> project not to follow this project's example.


It would be great to update the licensing to explicitly mention how to
handle borrowed code. Currently, it is unclear since every project does it
differently. Take Hive, for example, where the NOTICE is empty. If you
search in the repository for the word "borrowed", on the first page you
already see code from HBase and Hadoop. I would love to see the how-to
updated to explicitly mention how to handle borrowed code.

Kind regards,
Fokko

Op vr 5 apr 2024 om 09:08 schreef Justin Mclean :

> Hi,
>
> > I think you are right, some 3rd parties (including Apache projects)
> > are missing in the NOTICE file (you and I already mentioned that in a
> > previous release). We should at least mention this. I pointed Apache
> > Karaf NOTICE as example.
>
> This is the problem when people copy what other projects are doing. In
> general, it is right, but sometimes, it is not. Also, you may not
> understand why it was done in a certain way. I frequently have to point
> this out to Incubating projects. I hate to have to say to an Incubating
> project not to follow this project's example.
>
> > I propose to not block releases due to that (as it's like this for a
> > while) and propose to PR to fix that and discuss/document why the
> > change.
>
> I'm not on the PMC, so even if I did vote, it wouldn't count. I would not
> treat it as a blocker for this release, but it would be great if it could
> be fixed before the next release.
>
> Kind Regards,
> Justin


Re: Looking for help with Pyflink and Iceberg

2024-04-10 Thread Fokko Driesprong
Hey Frank,

Thanks for reaching out here. I spent some cycles a while ago to remove the
Hadoop requirement from Flink. There were a lot of APIs that needed to
change, which caused me not to follow through with it. But this might help you
in getting PyFlink up and running since it contains an example similar to
what you're trying to do: https://github.com/apache/iceberg/pull/7369
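
As a rough sketch (not taken from the PR above), this is how I would
expect an Iceberg JDBC catalog to be declared from PyFlink. The host,
database, credentials, bucket, and helper names are placeholders. It
assumes the Iceberg Flink runtime jar, the Iceberg AWS bundle, and a
PostgreSQL JDBC driver are on the Flink classpath. Note that catalog-impl
points at Iceberg's own JdbcCatalog rather than the Flink JDBC connector
class, and that 'connector' is left out since it is a table-level option:

```python
def render_create_catalog(name, properties):
    """Render a Flink SQL CREATE CATALOG statement from a dict.

    Values are inserted verbatim, so this sketch assumes they contain no
    single quotes.
    """
    body = ",\n".join(f"  '{key}'='{value}'" for key, value in properties.items())
    return f"CREATE CATALOG {name} WITH (\n{body}\n)"


# All values below are hypothetical placeholders.
ICEBERG_JDBC_PROPERTIES = {
    "type": "iceberg",
    "catalog-impl": "org.apache.iceberg.jdbc.JdbcCatalog",
    "uri": "jdbc:postgresql://db-host:5432/iceberg",
    "jdbc.user": "iceberg",
    "jdbc.password": "change-me",
    "io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "warehouse": "s3://my-s3-bucket/",
}


def print_table(table_name):
    """Register the catalog and print a table; needs PyFlink and the jars."""
    from pyflink.table import EnvironmentSettings, TableEnvironment

    table_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
    table_env.execute_sql(
        render_create_catalog("flink_iceberg", ICEBERG_JDBC_PROPERTIES)
    )
    table_env.use_catalog("flink_iceberg")
    table_env.sql_query(f"SELECT * FROM {table_name}").execute().print()
```

Creating the catalog in the PyFlink session this way only registers a
connection to the existing metadata in the database; it should not create
new tables by itself.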

Let me know if this helps.

Kind regards,
Fokko

Op di 9 apr 2024 om 20:38 schreef Frank :

> Hey folks.  I apologize if this isn't the place, but I'm really struggling
> to put together a proper config/example that utilizes pyflink and our
> organization's managed Iceberg. There are bits and pieces of helpful
> examples in the Flink, Iceberg and pyflink docs but nothing I can get to
> work with our setup.  Our datalake team uses a Postgres database for
> managing metadata and an S3 compatible store for the files.
>
> I'm struggling both with the overloaded language used in the documentation
> and resolving jar file dependencies. It's unclear to me whether I need to
> create a catalog in my pyflink runtime and use that in queries, or whether
> I can configure the connection in a different way so I can query the
> existing catalogs/tables. It's also unclear to me, based on the errors I'm
> getting, whether I have all the proper jar files in my flink setup for use
> with pyflink. In my trial and error with the config, I seem to
> oscillate between errors related to missing classes in the underlying java
> code, or errors related to not finding the configured catalog/table.
>
> Has anyone on the list used Pyflink together with iceberg and a jdbc
> catalog implementation similar to ours?  Anyone know of useful example
> pyflink code that does something similar to the below? I'd love to pick
> your brain.
>
> env_settings = EnvironmentSettings.in_streaming_mode()
>
> table_env = TableEnvironment.create(environment_settings=env_settings)
>
> table_env.execute_sql(f"""
>
> CREATE CATALOG flink_iceberg WITH (
>
> 'type'='iceberg',
>
> 'connector'='iceberg',
>
>
> 'catalog-impl'='org.apache.flink.connector.jdbc.catalog.JdbcCatalog',
>
> 'uri'='jdbc:comdb2://my-jdbc-connect-string',
>
>  'io-impl'='org.apache.iceberg.aws.s3.S3FileIO',
>
>  'warehouse'='s3a://my-s3-bucket/'
>
>  )
>
> """)
>
>
> table_env.use_catalog("flink_iceberg")
>
> result = table_env.sql_query("SELECT * FROM prometheus_prometheus")
>
> result.execute().print()
>
> Frank Gilroy
>
>


Re: [VOTE] Release Apache PyIceberg 0.6.1rc2

2024-04-17 Thread Fokko Driesprong
Hey everyone,

First of all, thanks Honah for running the release! +1 (binding) from my end

- I checked the signature, hashes, and licenses and all look good.
- Ran some local tests.

Kind regards,
Fokko


Op di 16 apr 2024 om 05:55 schreef Honah J. :

> Hi Everyone,
>
> I propose that we release the following RC as the official PyIceberg 0.6.1
> release.
>
> This is a patch release due to the following bugs:
>
> - Fail to create version 1 table with non-empty partition-spec and
>   sort-order
> - Hive Catalog cannot create table with TimestamptzType field
> - Fail to read parquet file with special characters in column names
> - Hive Catalog commit consistency issue
>
> Smaller bugs have also been backported.
>
> The commit ID is 0161e5c6b9bea2b6cf47245efd8df85da2c3d9b0
>
> * This corresponds to the tag: pyiceberg-0.6.1rc2
> (139fdff1ff6cff97264a61db8e9ed9ee3520d6d2)
> * https://github.com/apache/iceberg-python/releases/tag/pyiceberg-0.6.1rc2
> *
> https://github.com/apache/iceberg-python/tree/0161e5c6b9bea2b6cf47245efd8df85da2c3d9b0
>
> The release tarball, signature, and checksums are here:
>
> * https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.6.1rc2/
>
> You can find the KEYS file here:
>
> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>
> Convenience binary artifacts are staged on pypi:
>
> https://pypi.org/project/pyiceberg/0.6.1rc2/
>
> And can be installed using: pip3 install pyiceberg==0.6.1rc2
>
> Please download, verify, and test.
>
> Please vote in the next 72 hours.
> [ ] +1 Release this as PyIceberg 0.6.1
> [ ] +0
> [ ] -1 Do not release this because...
>


Re: [VOTE] Release Apache PyIceberg 0.6.1rc2

2024-04-17 Thread Fokko Driesprong
Thanks everyone for voting. And Dan, thanks for reporting the issue. I went
down the rabbit hole 🐇 It is being tracked here
<https://github.com/pypi/warehouse/issues/15749>, and a fix is inbound here
<https://github.com/pypi/warehouse/pull/15795/>. This issue should fix
itself because the version containing post1 was released on the 9th of
April, and a follow-up was done on the 10th
<https://pypi.org/project/docutils/0.21.1/#history>. Locally I have
installed an older version because docutils>0.20.1 requires Python 3.9
or later, and I'm still on 3.8:

poetry show docutils
 name : docutils
 version  : 0.20.1
 description  : Docutils -- Python Documentation Utilities

required by
 - pytest-checkdocs >=0.15

As you can see it is luckily only being used by a dev-dependency. I'm able
to reproduce it locally using Python 3.10 (using Docker), and also tried
the workaround in the issue, and that fixes the problem
<https://github.com/python-poetry/poetry/issues/9293#issuecomment-2048205226>.
Another option is to exclude the bodged version of docutils explicitly, which
will cause the resolver to skip it entirely:
https://github.com/apache/iceberg-python/pull/615 I think this is a good
workaround until the patch is released. It is not likely that we run
into this again since only 0.004% of the releases have this post notation
in the filename that causes the issue :) WDYT?

Kevin, looking at your error, that seems to be a different issue where pip
cannot be found. I would also not recommend using poetry inside of a venv.
Poetry provides similar functionality to venv (poetry shell). Another thing
to take into consideration is that checking out the GitHub tag is different
from extracting the .tar.gz. On GitHub there is the poetry.lock file that
provides reproducible CI builds, and this is missing from the tar.gz (where
it will try to install the latest and greatest).

Kind regards,
Fokko Driesprong


On Thu, 18 Apr 2024 at 04:21, Kevin Liu wrote:

> +1 (non binding)
>
> Downloaded specific commit from the repo, and ran both the Python tests
> and integration tests.
>
> Steps:
> ```
> git clone --depth=1 --branch pyiceberg-0.6.1rc2 g...@github.com:apache/iceberg-python.git
> python -m venv ./venv
> source ./venv/bin/activate
> make install
> make test
> make test-integration
> ```
>
> Also ran into the issue Dan mentioned; a subsequent `make install` ran
> successfully. Here's the stack trace:
> ```
> Preparing build environment with build-system requirements
> poetry-core>=1.0.0, wheel, Cython>=3.0.0, setuptools
> Command
> ['/var/folders/f1/3_vzsn7x1jq9hszb3z9y6f0mgn/T/tmph283p6rj/.venv/bin/python',
> '/private/tmp/iceberg-python/venv/lib/python3.11/site-packages/virtualenv/seed/wheels/embed/pip-24.0-py3-none-any.whl/pip',
> 'install', '--disable-pip-version-check', '--ignore-installed',
> '--no-input', 'poetry-core>=1.0.0', 'wheel', 'Cython>=3.0.0', 'setuptools']
> errored with the following return code 2
>
> Output:
> /var/folders/f1/3_vzsn7x1jq9hszb3z9y6f0mgn/T/tmph283p6rj/.venv/bin/python:
> can't open file
> '/private/tmp/iceberg-python/venv/lib/python3.11/site-packages/virtualenv/seed/wheels/embed/pip-24.0-py3-none-any.whl/pip':
> [Errno 2] No such file or directory
>
> make: *** [install-dependencies] Error 1
> ```
>
> Thanks,
> Kevin
>
> On Wed, Apr 17, 2024 at 3:06 PM Daniel Weeks  wrote:
>
>> I tried running the verification process but ran into issues resolving
>> some of the dependencies:
>>
>> make install
>> Updating dependencies
>> Resolving dependencies... (3.1s)
>>
>> Package docutils (0.21.post1) not found.
>> make: *** [install-dependencies] Error 1
>>
>> I found this related issue
>> <https://github.com/python-poetry/poetry/issues/9293#issuecomment-2048205226>
>> which indicates pip is trying to install a "post release" version.
>>
>> This was with python 3.10 and pip 22.0.4
>>
>> I haven't been able to get the install to properly resolve the
>> dependencies
>>
>> -Dan
>>
>>
>> On Wed, Apr 17, 2024 at 2:10 PM Jean-Baptiste Onofré 
>> wrote:
>>
>>> +1 (non binding)
>>>
>>> I checked:
>>> - Hash and signature are good
>>> - LICENSE and NOTICE look good
>>> - No binary file found in the source distribution
>>> - Ran a few tests
>>>
>>> Regards
>>> JB
>>>
>>> On Tue, Apr 16, 2024 at 4:53 AM Honah J.  wrote:
>>> >
>>> > Hi Everyone,
>>> >

Re: [VOTE] Release Apache PyIceberg 0.6.1rc3

2024-04-18 Thread Fokko Driesprong
Thanks Honah for the quick follow-up with RC3.

+1 binding

- Ran the signatures, checksums, and licenses.
- Double-checked that it installs from a clean Python 3.10 docker-container
(the above-mentioned docutils issue)
- Ran some simple checks against example notebooks

Kind regards,
Fokko

On Thu, 18 Apr 2024 at 09:23, Honah J. wrote:

> Hi Everyone,
>
> I propose that we release the following RC as the official PyIceberg 0.6.1
> release.
>
> This is a patch release due to the following bugs:
>
>    - Fail to create version 1 table with non-empty partition-spec and
>    sort-order
>    - Hive Catalog cannot create table with TimestamptzType field
>    - Fail to read parquet file with special characters in column names
>    - Hive Catalog commit consistency issue
>    - docutils=0.21 installation issue
>
> Smaller bugs also have been backported.
>
> The commit ID is 910dd783f16280b46704dd9679a4d003fb8a2e18
>
> * This corresponds to the tag: pyiceberg-0.6.1rc3
> (876a9fb3963ab0dc80485dedfee7cee2f4a8dd13)
> * https://github.com/apache/iceberg-python/releases/tag/pyiceberg-0.6.1rc3
> *
> https://github.com/apache/iceberg-python/tree/910dd783f16280b46704dd9679a4d003fb8a2e18
>
> The release tarball, signature, and checksums are here:
>
> * https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.6.1rc3/
>
> You can find the KEYS file here:
>
> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>
> Convenience binary artifacts are staged on pypi:
>
> https://pypi.org/project/pyiceberg/0.6.1rc3/
>
> And can be installed using: pip3 install pyiceberg==0.6.1rc3
>
> Please download, verify, and test.
>
> Please vote in the next 72 hours.
> [ ] +1 Release this as PyIceberg 0.6.1
> [ ] +0
> [ ] -1 Do not release this because...
>


Re: [VOTE] Release Apache Iceberg 1.5.1 RC0

2024-04-23 Thread Fokko Driesprong
Sorry for being late to the party!

+1 (binding)

- Checked checksum, signature and licenses
- Ran example notebooks

Kind regards,
Fokko


On Tue, 23 Apr 2024 at 22:58, Drew wrote:

> +1 (non binding)
>
> * verified signature and checksums
> * verified RAT license check
> * verified build/tests passing with JDK17
> * ran some manual tests on Spark 3.5 with GlueCatalog
>
> - Drew
>
> On Mon, Apr 22, 2024 at 1:31 PM Szehon Ho  wrote:
>
>> +1 (binding)
>>
>> * Verify signature
>> * Verify checksum
>> * Verify licenses
>> * Build and run basic test with Spark 3.5
>>
>> Thanks
>> Szehon
>>
>> On Sun, Apr 21, 2024 at 11:45 PM Ajantha Bhat 
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> * validated checksum and signature
>>> * checked license docs & ran RAT checks
>>> * ran build and tests with JDK11
>>>
>>> - Ajantha
>>>
>>> On Mon, Apr 22, 2024 at 2:49 AM Hussein Awala  wrote:
>>>
>>>> +1 (non-binding)
>>>> - checked signatures, checksums and licences
>>>> - tested with Spark 3.5.1 and Glue and Hive catalogs
>>>>
>>>> On Sunday, April 21, 2024, Jean-Baptiste Onofré
>>>> wrote:
>>>>
>>>>> +1 (non binding)
>>>>>
>>>>> I checked the fixes on JDBC Catalog.
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On Fri, 19 Apr 2024 at 01:07, Amogh Jahagirdar
>>>>> wrote:
>>>>>
>>>>>> Hi Everyone,
>>>>>>
>>>>>> I propose that we release the following RC as the official Apache
>>>>>> Iceberg 1.5.1 release.
>>>>>>
>>>>>> The commit ID is cbb853073e681b4075d7c8707610dceecbee3a82
>>>>>> * This corresponds to the tag: apache-iceberg-1.5.1-rc0
>>>>>> * https://github.com/apache/iceberg/commits/apache-iceberg-1.5.1-rc0
>>>>>> *
>>>>>> https://github.com/apache/iceberg/tree/cbb853073e681b4075d7c8707610dceecbee3a82
>>>>>>
>>>>>> The release tarball, signature, and checksums are here:
>>>>>> *
>>>>>> https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-1.5.1-rc0
>>>>>>
>>>>>> You can find the KEYS file here:
>>>>>> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>>>>>>
>>>>>> Convenience binary artifacts are staged on Nexus. The Maven
>>>>>> repository URL is:
>>>>>> * https://repository.apache.org/content/repositories/orgapacheiceberg-1162/
>>>>>>
>>>>>> Please download, verify, and test.
>>>>>>
>>>>>> Please vote in the next 72 hours.
>>>>>>
>>>>>> [ ] +1 Release this as Apache Iceberg 1.5.1
>>>>>> [ ] +0
>>>>>> [ ] -1 Do not release this because...
>>>>>>
>>>>>> Only PMC members have binding votes, but other community members are
>>>>>> encouraged to cast
>>>>>> non-binding votes. This vote will pass if there are 3 binding +1
>>>>>> votes and more binding
>>>>>> +1 votes than -1 votes.
>>>>>>
>>>>>


Re: [ANNOUNCE] Apache PyIceberg release 0.6.1

2024-04-30 Thread Fokko Driesprong
Awesome! Thanks for running this release Honah 🙌

Kind regards,
Fokko

On Wed, 1 May 2024 at 06:48, Honah J. wrote:

> I'm pleased to announce the release of Apache PyIceberg 0.6.1!
>
> Apache Iceberg is an open table format for huge analytic datasets. Iceberg
> delivers high query performance for tables with tens of petabytes of data,
> along with atomic commits, concurrent writes, and SQL-compatible table
> evolution.
>
> This Python release can be downloaded from:
> https://pypi.org/project/pyiceberg/0.6.1/
>
> Thanks to everyone for contributing!
>


Re: [VOTE] Release Apache Iceberg 1.5.2 RC0

2024-05-02 Thread Fokko Driesprong
+1 (binding)

Thanks for going through this once more!

- Ran the signatures and checksums
- Checked the licenses
- Ran some sample checks with Spark 3.5 (Scala 2.12)

Kind regards,
Fokko

On Thu, 2 May 2024 at 15:51, Eduard Tudenhoefner wrote:

> +1 (non-binding)
>
> * validated checksum and signature
> * checked license docs & ran RAT checks
> * ran build and tests with JDK11
>
> On Thu, May 2, 2024 at 9:47 AM Jean-Baptiste Onofré 
> wrote:
>
>> +1 (non binding)
>>
>> I tested the JDBC Catalog fixes and the artifacts look good regarding
>> Scala versions.
>>
>> Thanks !
>> Regards
>> JB
>>
>> On Wed, May 1, 2024 at 7:25 PM Amogh Jahagirdar  wrote:
>> >
>> > Hi Everyone,
>> >
>> > I propose that we release the following RC as the official Apache
>> Iceberg 1.5.2 release.
>> >
>> > The commit ID is cbb853073e681b4075d7c8707610dceecbee3a82
>> > * This corresponds to the tag: apache-iceberg-1.5.2-rc0
>> > * https://github.com/apache/iceberg/commits/apache-iceberg-1.5.2-rc0
>> > *
>> https://github.com/apache/iceberg/tree/cbb853073e681b4075d7c8707610dceecbee3a82
>> >
>> > The release tarball, signature, and checksums are here:
>> > *
>> https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-1.5.2-rc0
>> >
>> > You can find the KEYS file here:
>> > * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>> >
>> > Convenience binary artifacts are staged on Nexus. The Maven repository
>> URL is:
>> > *
>> https://repository.apache.org/content/repositories/orgapacheiceberg-1163/
>> >
>> > Please download, verify, and test.
>> >
>> > Please vote in the next 72 hours.
>> >
>> > [ ] +1 Release this as Apache Iceberg 1.5.2
>> > [ ] +0
>> > [ ] -1 Do not release this because...
>> >
>> > Only PMC members have binding votes, but other community members are
>> encouraged to cast
>> > non-binding votes. This vote will pass if there are 3 binding +1 votes
>> and more binding
>> > +1 votes than -1 votes.
>>
>


Re: GitHub issue labels

2024-05-27 Thread Fokko Driesprong
Hey Manu,

I don't explicitly use the labels, but they help me to categorize the
issues mentally. I agree that there is room for improvement as there are
more issues being raised every day.

Other communities also have interesting approaches, such as:

   - Triage label: When a new bug, improvement, proposal, or question is
   raised, it gets a triage label to make sure that someone from the
   community looks at it, assesses the severity of the issue, estimates the
   effort, etc. After an initial follow-up, the triage label can be removed;
   this can be done by either a committer or someone who has contributor
   rights.
   - Full process: Arrow takes this to the next level and has a full
   process for it: https://github.com/apache/arrow/pulls. There are many
   different labels to indicate the stage: awaiting review, awaiting
   committer review, awaiting changes, awaiting change review, and probably a
   few more. For me, this feels a bit too much.

I think it might be good to add a triage label, and this might also be
included in a report. You could also easily see the newest issues by
filtering on this label.

WDYT?

Kind regards,
Fokko


On Mon, 27 May 2024 at 18:03, Manu Zhang wrote:

> Hi all,
>
> Currently, a label, one of bug, improvement, proposal and question, is
> applied automatically
> to an issue if it's created from a template. However, I'm not sure we are
> actually making use of those labels. We have questions without answers,
> bugs without double checks and improvements without discussions.
>
> Do you think we can send a weekly report of those issues somewhere to get
> more attention from the community? I remember @Jean-Baptiste Onofré
> having a similar proposal before.
>
> Thanks,
> Manu
>


Re: Addressing security questions in the Iceberg REST specification

2024-05-28 Thread Fokko Driesprong
Hey Robert,

Sorry for the late reply as I was out last week. I'm not an OAuth guru
either, but some context from my end.

> * Credentials (for example username/password) must _never_ be sent to
> the resource server, only to the authorization server.


In an earlier discussion, it
was agreed that the resource server can also function as the
authorization server. But the roles can also be separate.

> 1.2. As long as OAuth2 is the only mechanism supported by the Iceberg
> client, make the existing client parameter “oauth2-server-uri”
> mandatory. The Iceberg REST catalog must fail to initialize if the
> “oauth2-server-uri” parameter is not defined.


It can also be that there is no authentication in the case of an internal
REST catalog. For example, the iceberg-rest-image that we use for
integration tests in PyIceberg.
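
To make the two positions concrete, here is a rough Python sketch of how a
client could behave: fail fast when a credential is configured without an
explicit `oauth2-server-uri`, but still allow a catalog with no authentication
at all. `oauth2-server-uri` is the parameter named in the proposal; the
`credential` key, the function name, and the policy itself are assumptions for
illustration, not how any Iceberg client behaves today.

```python
from typing import Mapping, Optional


def resolve_token_endpoint(properties: Mapping[str, str]) -> Optional[str]:
    """Return the authorization-server URI to send credentials to, or None.

    Hypothetical policy for illustration only.
    """
    credential = properties.get("credential")  # assumed property name
    token_uri = properties.get("oauth2-server-uri")

    if credential is None:
        # No credential configured: treat the catalog as unauthenticated,
        # e.g. an internal REST catalog used for integration tests.
        return None

    if not token_uri:
        # Refuse to guess an endpoint, so credentials are never sent to
        # the resource server by accident.
        raise ValueError(
            "credential is set but oauth2-server-uri is missing; "
            "refusing to send credentials to the catalog server"
        )

    return token_uri
```

With this policy a misconfigured client errors out at initialization instead
of posting its credentials to whatever server happens to host the catalog.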

> We think that Apache Iceberg REST Catalog spec should not mandate that a
> catalog implementation responds to requests to produce Auth Tokens
> (since the REST spec v1 defines a /v1/tokens endpoint, current
> implementations have to take deliberate actions when responding to those
> requests, whether with successful token responses or with “access
> denied” or “unsupported” responses).

The `/v1/tokens` endpoint is optional.

> * Credentials (for example username/password) must _never_ be sent to
> the resource server, only to the authorization server.


I fully agree!

> Even if an Iceberg REST server does not implement the ‘/v1/oauth/tokens’
> endpoint, it can still receive requests to ‘/v1/oauth/tokens’ containing
> clear text credentials, if clients are misconfigured (humans do make
> mistakes) - it’s a non-zero risk - bad actors can implement/intercept
> that  ‘/v1/oauth/tokens’ endpoint and just wait for misconfigured
> clients to send credentials.


I think the wording is chosen badly. It should not send any credentials,
but the code (as in this example by GCS).

> I think Jack makes a good point with AWS SigV4 Authentication. I suppose,
> in REST Catalog implementations that support that auth method, the
> /v1/oauth/token Catalog REST endpoint is redundant.
>

There are other cloud providers besides AWS.

Kind regards,
Fokko



On Thu, 23 May 2024 at 15:49, Dmitri Bourlatchkov wrote:

> I think Jack makes a good point with AWS SigV4 Authentication. I suppose,
> in REST Catalog implementations that support that auth method, the
> /v1/oauth/token Catalog REST endpoint is redundant.
>
> Cheers,
> Dmitri.
>
> On Thu, May 23, 2024 at 9:20 AM Jack Ye  wrote:
>
>> I do not know enough details about OAuth to make comments about this
>> issue, but just regarding the statement "OAuth2 is the only mechanism
>> supported by the Iceberg client", AWS Sigv4 auth is also supported, at
>> least in the Java client implementation.
>> It would be nice if we formalize that in the spec, at least define it as a
>> generic authorization header.
>>
>> Best,
>> Jack Ye
>>
>>
>>
>> On Thu, May 23, 2024 at 2:51 AM Robert Stupp  wrote:
>>
>>> Hi all,
>>>
>>> Iceberg REST implementations, either accessible on the public internet
>>> or inside an organization, are usually being secured using appropriate
>>> authorization mechanisms. The Nessie team is looking at implementing the
>>> Iceberg REST specification and have some questions around the security
>>> endpoint(s) defined in the spec.
>>>
>>> TL;DR we have questions (potentially concerns) about having the
>>> ‘/v1/oauth/tokens’ endpoint, for the reasons explained below. We think
>>> that ‘/v1/oauth/tokens’ poses potential security and OAuth2 compliance
>>> issues, and imposes how authorization should be implemented.
>>> * As an open table format, it would be good for Iceberg to focus on the
>>> table format / catalog and not how authorization is implemented. The
>>> existence of an OAuth endpoint pushes implementations to adopt
>>> authorization using only OAuth, whereas the implementers might choose
>>> several other ways to implement authorization (e.g. SAML). In our
>>> opinion the spec should leave it open to the implementation to decide
>>> how authorization will be implemented.
>>> * The existence of that endpoint also pushes operators of Iceberg REST
>>> endpoints into the authorization service business.
>>> * Clients might expose their clear-text credentials to the wrong
>>> service, if the (correct) OAuth endpoint is not configured (humans do
>>> make mistakes).
>>> * (Naive) Iceberg REST servers may proxy requests received for
>>> ‘/v1/oauth/tokens’ - and effectively become a “man-in-the-middle”, which
>>> is not fully compliant with the OAuth 2.0 specification.
>

Re: Addressing security questions in the Iceberg REST specification

2024-05-31 Thread Fokko Driesprong
st,
>>> > Jack Ye
>>> >
>>> > On Wed, May 29, 2024 at 10:28 AM Steven Wu 
>>> wrote:
>>> >>
>>> >> Wondering if the auth endpoints can be separated out to a separate
>>> OpenAPI spec file. Then we still have some reference for interactions with
>>> auth server and make it clear it is not required as part of the REST
>>> catalog server. In most enterprise environments, auth server is likely a
>>> separate server.
>>> >>
>>> >> On Tue, May 28, 2024 at 1:25 PM Alex Dutra
>>>  wrote:
>>> >>>
>>> >>> Hi,
>>> >>>
>>> >>>>
>>> >>>> On point 4, isn't that possible today? Can't that be achieved with
>>> the current token exchange approach, and the internal implementation of the
>>> endpoint?
>>> >>>
>>> >>>
>>> >>> Unfortunately, no. Token exchange is not widely adopted yet: for
>>> example, Keycloak has only partial support for it, and Authelia, or
>>> Authentik, have no support for it at all.
>>> >>>
>>> >>> This, and a few other technical issues with the current internals of
>>> the REST client, makes it nearly impossible to achieve a good integration
>>> of Iceberg REST with the majority of popular OSS authorization servers.
>>> >>>
>>> >>> I am planning to start another email thread to discuss these
>>> practicalities, but let's first reach consensus on the broader security
>>> issues voiced here, before we tackle the details.
>>> >>>
>>> >>> Thanks,
>>> >>>
>>> >>> Alex Dutra
>>> >>>
>>> >>> On Tue, May 28, 2024 at 8:41 PM Amogh Jahagirdar 
>>> wrote:
>>> >>>>
>>> >>>> I disagree with removing "/v1/oauth/tokens" and I think I also
>>> disagree with the premise that implementing that endpoint is required, but
>>> I can understand how that's not clear in the spec. I think we can address
>>> the required vs non-required discussion with the capabilities PR.
>>> >>>>
>>> >>>> It seems like another part of what's driving this discussion is
>>> some concern around how we make sure that REST catalog implementations
>>> which do implement this endpoint are secure (for example to avoid the MITM
>>> scenario that was brought up). This is ultimately
>>> a runtime detail. To me it seems like if we make it clear that such an
>>> endpoint should be implemented respecting OAuth2 standards, and we know
>>> that OAuth2 compliance requires avoiding that MITM situation, then runtime
>>> implementations should just follow the spec there.
>>> >>>>
>>> >>>> >3. Enable flexibility for Iceberg REST servers to opt for other
>>> >>>> authorization mechanisms than OAuth 2.0.
>>> >>>> >4. Enable REST servers to opt for integrating with any standard
>>> OAuth2 /
>>> >>>> OIDC provider (e.g. Okta, Keycloak, Authelia).
>>> >>>>
>>> >>>> I agree with both of these points; again I don't think the
>>> intention is that OAuth2 is the only way, but I think the capabilities PR will
>>> make that even more clear.
>>> >>>> On point 4, isn't that possible today? Can't that be achieved with
>>> the current token exchange approach, and the internal implementation of the
>>> endpoint? Sorry if I missed that explanation.
>>> >>>>
>>> >>>> Thanks,
>>> >>>>
>>> >>>> Amogh Jahagirdar
>>> >>>>
>>> >>>> On Tue, May 28, 2024 at 11:13 AM Yufei Gu 
>>> wrote:
>>> >>>>>
>>> >>>>> Not an expert on authentication, but reading from the context, I
>>> agree that it’s not a good practice to use a resource server as a token
>>> server. The resource server would need to securely handle and store
>>> credentials or tokens, increasing the risk of credential theft or leakage.
>>> Making the token endpoint optional will mitigate the issue a bit. But if we
>>> want to disable it completely, it's better to do it now to prevent any
>>> issues and migration costs in the future. Can we have a consensus on it?

Re: [INFO] Preparing the Apache Iceberg 1.6.0 release

2024-06-12 Thread Fokko Driesprong
Hi JB, thanks for raising this.

> - With the Gradle version update, we will be able to upgrade to Parquet
> 1.14.0

We might want to defer this until Parquet 1.14.1 gets released. There
is an issue found with Jackson that prevents Spark from upgrading to
1.14.0. It might be that this is also the case with other query engines.
Gang already started a DISCUSS thread on running the minor release.

> - Depending on the timing, I will include new Avro releases

I would love that! The release allows us to use the
BlockingDirectBinaryEncoder, which will allow efficient skipping of lists
and maps in the file. This will speed up operations like expire snapshots
quite a bit because we can jump easily over all the statistics.
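
For readers less familiar with the Avro binary format, here is a small Python
sketch of why block sizes make skipping cheap. In Avro's encoding an array is
written as a series of blocks; when a block's count is negative, the count is
followed by the block's size in bytes, so a reader can seek past the items
without decoding them. This only illustrates the wire format. It is not the
BlockingDirectBinaryEncoder API or Iceberg's reader, and the function names
are made up for this example.

```python
import io


def write_long(value: int) -> bytes:
    """Encode one Avro long (zigzag, then variable-length bytes)."""
    zz = (value << 1) ^ (value >> 63)
    out = bytearray()
    while zz & ~0x7F:
        out.append((zz & 0x7F) | 0x80)
        zz >>= 7
    out.append(zz)
    return bytes(out)


def read_long(buf: io.BytesIO) -> int:
    """Decode one Avro long from the current position of the buffer."""
    shift = 0
    accum = 0
    while True:
        byte = buf.read(1)
        if not byte:
            raise EOFError("unexpected end of data")
        accum |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)


def skip_array(buf: io.BytesIO) -> None:
    """Skip a block-encoded array whose blocks all carry a byte size."""
    while True:
        count = read_long(buf)
        if count == 0:
            return  # a zero count terminates the array
        if count > 0:
            raise ValueError("block has no byte size; items must be decoded")
        size = read_long(buf)  # negative count: the block size follows
        buf.seek(size, io.SEEK_CUR)  # jump over the items without decoding


# One block holding three longs, written with a negative count and a size.
items = b"".join(write_long(v) for v in (1, 2, 3))
payload = write_long(-3) + write_long(len(items)) + items + write_long(0) + b"tail"

buf = io.BytesIO(payload)
skip_array(buf)
remaining = buf.read()  # whatever follows the array in the stream
```

A block written with a positive count has no size prefix, so skipping it
means decoding every item.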

It would be great if there were some Flink reviewers to get some eyes on
this PR to run Flink without Hadoop. We have to change some APIs,
and with the introduction of Flink 1.19 in Iceberg 1.6.0 this is a great
moment.

Kind regards,
Fokko

On Mon, 10 Jun 2024 at 00:24, Jean-Baptiste Onofré wrote:

> Hi folks,
>
> As discussed during the last community meeting, we are heading to the
> Apache Iceberg 1.6.0 release.
>
> The Iceberg 1.6.0 milestone is present for both GitHub Issues and PRs.
>
> I'm targeting major updates for this release:
> - the Kafka commit coordinator (PR is in review)
> - Revapi "fix" to be able to upgrade to latest Gradle version (I have
> one PR ready and I have another exploring a new option)
> - With the Gradle version update, we will be able to upgrade to Parquet
> 1.14.0
> - Depending on the timing, I will include new Avro releases
>
> If you have anything you want to include in the 1.6.0 release,
> please let me know and create an issue on GitHub with the "Iceberg
> 1.6.0" milestone.
>
> Thanks !
>
> Regards
> JB
>


Re: Agenda Community Sync 19th June

2024-06-18 Thread Fokko Driesprong
Hey Jan,

Thanks for raising this. Let me jot down the highlights, and feel free to
add what you'd like to discuss. I'm personally looking forward to an update
on the materialized views.

Kind regards,
Fokko

On Tue, 18 Jun 2024 at 20:28, Jan Kaul wrote:

> Hi all,
>
> I was wondering whether there was an agenda for the community sync
> tomorrow. There currently is no entry in the google doc.
>
> Best wishes,
>
> Jan
>
>


Re: Agenda Community Sync 19th June

2024-06-19 Thread Fokko Driesprong
Hey everyone,

Thanks for the input. I've collected everything in the notes;
feel free to make suggestions or edits. Thanks Brian for running the
recording. Looking forward to seeing everyone later today!

Kind regards,
Fokko

On Wed, 19 Jun 2024 at 16:07, Jean-Baptiste Onofré wrote:

> Hi Brian,
>
> Thanks ! See you later today then.
>
> Regards
> JB
>
> On Wed, Jun 19, 2024 at 3:59 PM Brian Olsen 
> wrote:
> >
> > Hey all!
> >
> > So I just spoke with Fokko. I’ll be happy to hop on to continue the
> recordings and I still owe you all some sync notes from the last few
> meetings (those are still coming).
> >
> > I’m not sure if Ryan will be joining given the holiday but if anything
> Fokko will be back up.
> >
> > On Wed, Jun 19, 2024 at 8:27 AM Renjie Liu 
> wrote:
> >>
> >> Hi, all:
> >>
> >> I want to share progress on iceberg-rust and discuss the 0.3.0
> release.
> >>
> >> On Wed, Jun 19, 2024 at 9:07 PM Jean-Baptiste Onofré 
> wrote:
> >>>
> >>> Hi Jan,
> >>>
> >>> Thanks for your message.
> >>>
> >>> The document has been updated.
> >>>
> >>> I will provide two major updates from my side:
> >>> 1. Gradle update and revapi (both alternative and plugin fix)
> >>> 2. Iceberg Java 1.6.0 release preparation (including some dependency
> updates)
> >>>
> >>> As it's Juneteenth today, if the US part of the community is not there, I
> >>> propose Fokko moderates/drives the meeting.
> >>>
> >>> Regards
> >>> JB
> >>>
> >>> On Tue, Jun 18, 2024 at 8:28 PM Jan Kaul 
> wrote:
> >>> >
> >>> > Hi all,
> >>> >
> >>> > I was wondering whether there was an agenda for the community sync
> >>> > tomorrow. There currently is no entry in the google doc.
> >>> >
> >>> > Best wishes,
> >>> >
> >>> > Jan
> >>> >
>

