from:"Kevin Liu"

Re: [VOTE] Release Apache PyIceberg 0.6.1rc2

2024-04-17 Thread Kevin Liu

+1 (non binding)

Downloaded the specific commit from the repo and ran both the Python tests
and the integration tests.

Steps:
```
git clone --depth=1 --branch pyiceberg-0.6.1rc2 g...@github.com:apache/iceberg-python.git
python -m venv ./venv
source ./venv/bin/activate
make install
make test
make test-integration
```

I also ran into the issue Dan mentioned; a subsequent `make install` ran
successfully. Here's the stack trace:
```
Preparing build environment with build-system requirements
poetry-core>=1.0.0, wheel, Cython>=3.0.0, setuptools
Command
['/var/folders/f1/3_vzsn7x1jq9hszb3z9y6f0mgn/T/tmph283p6rj/.venv/bin/python',
'/private/tmp/iceberg-python/venv/lib/python3.11/site-packages/virtualenv/seed/wheels/embed/pip-24.0-py3-none-any.whl/pip',
'install', '--disable-pip-version-check', '--ignore-installed',
'--no-input', 'poetry-core>=1.0.0', 'wheel', 'Cython>=3.0.0', 'setuptools']
errored with the following return code 2

Output:
/var/folders/f1/3_vzsn7x1jq9hszb3z9y6f0mgn/T/tmph283p6rj/.venv/bin/python:
can't open file
'/private/tmp/iceberg-python/venv/lib/python3.11/site-packages/virtualenv/seed/wheels/embed/pip-24.0-py3-none-any.whl/pip':
[Errno 2] No such file or directory

make: *** [install-dependencies] Error 1
```

Thanks,
Kevin

On Wed, Apr 17, 2024 at 3:06 PM Daniel Weeks wrote:

> I tried running the verification process but ran into issues resolving
> some of the dependencies:
>
> make install
> Updating dependencies
> Resolving dependencies... (3.1s)
>
> Package docutils (0.21.post1) not found.
> make: *** [install-dependencies] Error 1
>
> I found this related issue
> 
> which indicates pip is trying to install a "post release" version.
>
> This was with python 3.10 and pip 22.0.4
>
> I haven't been able to get the install to properly resolve the dependencies
>
> -Dan
>
>
> On Wed, Apr 17, 2024 at 2:10 PM Jean-Baptiste Onofré 
> wrote:
>
>> +1 (non binding)
>>
>> I checked:
>> - Hash and signature are good
>> - LICENSE and NOTICE look good
>> - No binary file found in the source distribution
>> - Ran a few tests
>>
>> Regards
>> JB
>>
>> On Tue, Apr 16, 2024 at 4:53 AM Honah J.  wrote:
>> >
>> > Hi Everyone,
>> >
>> > I propose that we release the following RC as the official PyIceberg
>> 0.6.1 release.
>> >
>> > This is a patch release due to the following bugs:
>> >
>> > Fail to create version 1 table with non-empty partition-spec and
>> sort-order
>> > Hive Catalog cannot create table with TimestamptzType field
>> > Fail to read parquet file with special characters in column names
>> > Hive Catalog commit consistency issue
>> >
>> > Smaller bugs also have been backported.
>> >
>> > The commit ID is 0161e5c6b9bea2b6cf47245efd8df85da2c3d9b0
>> >
>> > * This corresponds to the tag: pyiceberg-0.6.1rc2
>> (139fdff1ff6cff97264a61db8e9ed9ee3520d6d2)
>> > *
>> https://github.com/apache/iceberg-python/releases/tag/pyiceberg-0.6.1rc2
>> > *
>> https://github.com/apache/iceberg-python/tree/0161e5c6b9bea2b6cf47245efd8df85da2c3d9b0
>> >
>> > The release tarball, signature, and checksums are here:
>> >
>> > * https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.6.1rc2/
>> >
>> > You can find the KEYS file here:
>> >
>> > * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>> >
>> > Convenience binary artifacts are staged on pypi:
>> >
>> > https://pypi.org/project/pyiceberg/0.6.1rc2/
>> >
>> > And can be installed using: pip3 install pyiceberg==0.6.1rc2
>> >
>> > Please download, verify, and test.
>> >
>> > Please vote in the next 72 hours.
>> > [ ] +1 Release this as PyIceberg 0.6.1
>> > [ ] +0
>> > [ ] -1 Do not release this because...
>>
>

Re: [VOTE] Release Apache PyIceberg 0.6.1rc3

2024-04-18 Thread Kevin Liu

+1 nonbinding
- Checked the signatures, checksums, and licenses.
- Ran tests (`make test`, `make test-integration`)

I also found this page very helpful for learning how to verify a release:
https://py.iceberg.apache.org/verify-release/
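
For anyone who wants to script the checksum step, here is a minimal sketch in
Python. It assumes the `.sha512` file uses the common
`<hex digest>  <filename>` layout; the function names and file names are my own
placeholders and are not taken from the verify-release page.

```python
import hashlib
from pathlib import Path


def sha512_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(artifact: Path, checksum_file: Path) -> bool:
    """Compare the artifact's digest with the first token of the checksum file.

    Assumption: the checksum file starts with the hex digest, followed by
    whitespace and the file name (the layout written by `sha512sum`).
    """
    expected = checksum_file.read_text().split()[0].lower()
    return sha512_of(artifact) == expected
```

Calling `checksum_matches(Path("release.tar.gz"), Path("release.tar.gz.sha512"))`
(hypothetical file names) returns `True` only when the digests agree. This does
not replace the signature check, which still needs GPG and the KEYS file.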

Best,
Kevin Liu


On Thu, Apr 18, 2024 at 4:14 AM Fokko Driesprong wrote:

> Thanks Honah for the quick follow-up with RC3.
>
> +1 binding
>
> - Ran <https://gist.github.com/Fokko/640018d6471656948783defd20c4f7f0>
> the signatures, checksums, and licenses.
> - Double-checked
> <https://gist.github.com/Fokko/ee94104a98742f4ca699bedccaf34a8f> that it
> installs from a clean Python 3.10 docker-container (the abovementioned
> docutils issue)
> - Ran some simple checks
> <https://github.com/tabular-io/docker-spark-iceberg/pull/142> against
> example notebooks
>
> Kind regards,
> Fokko
>
> Op do 18 apr 2024 om 09:23 schreef Honah J. :
>
>> Hi Everyone,
>>
>> I propose that we release the following RC as the official PyIceberg
>> 0.6.1 release.
>>
>> This is a patch release due to the following bugs:
>>
>> - Fail to create version 1 table with non-empty partition-spec and
>>   sort-order <https://github.com/apache/iceberg-python/pull/544>
>> - Hive Catalog cannot create table with TimestamptzType field
>>   <https://github.com/apache/iceberg-python/issues/583>
>> - Fail to read parquet file with special characters in column names
>>   <https://github.com/apache/iceberg-python/pull/597>
>> - Hive Catalog commit consistency issue
>>   <https://github.com/apache/iceberg-python/pull/607>
>> - docutils=0.21 installation issue
>>   <https://github.com/apache/iceberg-python/pull/615>
>>
>> Smaller bugs also have been backported
>> <https://github.com/apache/iceberg-python/milestone/5?closed=1>.
>>
>> The commit ID is 910dd783f16280b46704dd9679a4d003fb8a2e18
>>
>> * This corresponds to the tag: pyiceberg-0.6.1rc3
>> (876a9fb3963ab0dc80485dedfee7cee2f4a8dd13)
>> *
>> https://github.com/apache/iceberg-python/releases/tag/pyiceberg-0.6.1rc3
>> *
>> https://github.com/apache/iceberg-python/tree/910dd783f16280b46704dd9679a4d003fb8a2e18
>>
>> The release tarball, signature, and checksums are here:
>>
>> * https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.6.1rc3/
>>
>> You can find the KEYS file here:
>>
>> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>>
>> Convenience binary artifacts are staged on pypi:
>>
>> https://pypi.org/project/pyiceberg/0.6.1rc3/
>>
>> And can be installed using: pip3 install pyiceberg==0.6.1rc3
>>
>> Please download, verify, and test.
>>
>> Please vote in the next 72 hours.
>> [ ] +1 Release this as PyIceberg 0.6.1
>> [ ] +0
>> [ ] -1 Do not release this because...
>>
>

Re: [ANNOUNCE] Apache PyIceberg release 0.6.1

2024-05-01 Thread Kevin Liu

Thanks for the release!

On Wed, May 1, 2024 at 2:47 AM Fokko Driesprong wrote:

> Awesome! Thanks for running this release Honah 🙌
>
> Kind regards,
> Fokko
>
> Op wo 1 mei 2024 om 06:48 schreef Honah J. :
>
>> I'm pleased to announce the release of Apache PyIceberg 0.6.1!
>>
>> Apache Iceberg is an open table format for huge analytic datasets. Iceberg
>> delivers high query performance for tables with tens of petabytes of data,
>> along with atomic commits, concurrent writes, and SQL-compatible table
>> evolution.
>>
>> This Python release can be downloaded from:
>> https://pypi.org/project/pyiceberg/0.6.1/
>>
>> Thanks to everyone for contributing!
>>
>

Seattle Apache Iceberg Meetup

2024-05-08 Thread Kevin Liu

Hey folks,

We're starting a community meetup in Seattle; more below.


*Seattle Apache Iceberg Meetup* May 15th, 5 PM - 8 PM in the Seattle area

Come to meet and greet folks working with Apache Iceberg and Open Table
Formats! We are hosting a meetup in Seattle in honor of the inaugural Iceberg
Summit <https://iceberg-summit.org/> on May 14-15th. Attend to discuss
interesting ideas from the conference and bring your best Open Table Format
hot takes.

More information here
<https://www.meetup.com/na-apache-iceberg-meetups/events/30042/>.
Please RSVP via Google Form
<https://docs.google.com/forms/d/1pVwoyk5pNBF-bh7maNbw-V9MKR5w4fp2UJ2MW98_Z1E/edit>.
Also, join the #meetup-seattle Slack channel for updates on future events.

Cheers,
Kevin Liu

Re: Seattle Apache Iceberg Meetup

2024-05-08 Thread Kevin Liu

https://lists.apache.org/thread/tw9bpqyh76vfw7284rooboomhzzx8xnd

On Wed, May 8, 2024 at 10:09 AM Kevin Liu wrote:

> Hey folks,
>
> We're starting this community meetup in Seattle, more below.
>
>
> *Seattle Apache Iceberg Meetup *May 15th, 5 PM - 8 PM in the Seattle area
>
> Come to meet and greet folks working with Apache Iceberg and Open Table
> Formats! We are hosting a meetup in Seattle in honor of the inaugural Iceberg
> Summit <https://iceberg-summit.org/> on May 14-15th. Attend to discuss
> interesting ideas from the conference and bring your best Open Table Format
> hot takes.
>
> More information here
> <https://www.meetup.com/na-apache-iceberg-meetups/events/30042/>.
> Please RSVP via Google Form
> <https://docs.google.com/forms/d/1pVwoyk5pNBF-bh7maNbw-V9MKR5w4fp2UJ2MW98_Z1E/edit>
> Also, join #meetup-seattle slack channel for updates on future events.
>
> Cheers,
> Kevin Liu
>

Re: [VOTE] spec: remove the JSON spec for content file and file scan task sections

2024-07-11 Thread Kevin Liu

+1 (non-binding)

Thanks,
Kevin Liu

On Thu, Jul 11, 2024 at 12:00 PM Szehon Ho wrote:

> +1
>
> Thanks
> Szehon
>
> On Thu, Jul 11, 2024 at 11:02 AM Daniel Weeks  wrote:
>
>> +1 (binding)
>>
>> On Thu, Jul 11, 2024 at 10:54 AM Anurag Mantripragada
>>  wrote:
>>
>>> +1 (non-binding). Thanks Steve
>>>
>>>
>>> Anurag Mantripragada
>>>
>>> On Jul 11, 2024, at 10:27 AM, Yufei Gu  wrote:
>>>
>>> +1 (binding) Thanks for doing this, Steven.
>>> Yufei
>>>
>>>
>>> On Thu, Jul 11, 2024 at 10:16 AM Amogh Jahagirdar <2am...@gmail.com>
>>> wrote:
>>>
>>>> + 1 (non-binding).
>>>>
>>>> Thanks,
>>>>
>>>> Amogh Jahagirdar
>>>>
>>>> On Thu, Jul 11, 2024 at 10:25 AM Péter Váry <
>>>> peter.vary.apa...@gmail.com> wrote:
>>>>
>>>>> +1 (non-binding)
>>>>>
>>>>> On Thu, Jul 11, 2024, 17:31 Jack Ye  wrote:
>>>>>
>>>>>> +1 (binding)
>>>>>>
>>>>>> On Thu, Jul 11, 2024 at 3:37 AM Piotr Findeisen <
>>>>>> piotr.findei...@gmail.com> wrote:
>>>>>>
>>>>>>> it looks like it's part of the spec that's not connected to the other
>>>>>>> parts of the spec (like "dead code")
>>>>>>>
>>>>>>> +1 (non binding)
>>>>>>>
>>>>>>>
>>>>>>> On Thu, 11 Jul 2024 at 08:30, Eduard Tudenhöfner <
>>>>>>> etudenhoef...@apache.org> wrote:
>>>>>>>
>>>>>>>> +1 (non-binding)
>>>>>>>>
>>>>>>>> On Thu, Jul 11, 2024 at 8:29 AM Ajantha Bhat 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> +1 (non-binding)
>>>>>>>>>
>>>>>>>>> - Ajantha
>>>>>>>>>
>>>>>>>>> On Thu, Jul 11, 2024 at 11:02 AM Jean-Baptiste Onofré <
>>>>>>>>> j...@nanthrax.net> wrote:
>>>>>>>>>
>>>>>>>>>> +1 (non binding)
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>> JB
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 11, 2024 at 12:50 AM Steven Wu 
>>>>>>>>>> wrote:
>>>>>>>>>> >
>>>>>>>>>> > Following the latest community guidelines, I would like to
>>>>>>>>>> > start a voting thread on removing the JSON spec for content
>>>>>>>>>> > file and file scan task. Here is the PR for the spec change [1]
>>>>>>>>>> >
>>>>>>>>>> > This was previously discussed in the dev mailing list [2].
>>>>>>>>>> > While it is good to add the JSON serializer in iceberg-core for
>>>>>>>>>> > ContentFile and FileScanTask, their JSON formats don't need to
>>>>>>>>>> > be added to the core table spec.
>>>>>>>>>> >
>>>>>>>>>> > Please vote in the next 72 hours.
>>>>>>>>>> >
>>>>>>>>>> > Thanks,
>>>>>>>>>> > Steven
>>>>>>>>>> >
>>>>>>>>>> > [1] https://github.com/apache/iceberg/pull/9771
>>>>>>>>>> > [2]
>>>>>>>>>> https://lists.apache.org/thread/2ty27yx4q0zlqd5h71cyyhb5k47yf9bv
>>>>>>>>>> >
>>>>>>>>>>
>>>>>>>>>
>>>

Seattle Apache Iceberg Meetup - July 2024

2024-07-14 Thread Kevin Liu

Hey folks,

The next Seattle area Apache Iceberg meetup will be on July 18th, 2024 from
5:00 PM to 8:00 PM. More information is available at
https://sites.google.com/view/icebergmeetup
Be sure to RSVP before the event!

Come for a night of networking and lively discussions. We also have
presentations by several folks in the community on various topics.
* Variant Data Type Support - Tyler Akidau @ Snowflake
* Change Data Capture (CDC) with Iceberg - Roy Hasson @ Upsolver
* Reviews of Apache Iceberg Proposals - Alex Merced @ Dremio

For updates on future events, please join the seattle-apache-iceberg-meetup
<https://groups.google.com/g/seattle-apache-iceberg-meetup/> Google Group
and the Iceberg Slack
<https://join.slack.com/t/apache-iceberg/shared_invite/zt-287g3akar-K9Oe_En5j1UL7Y_Ikpai3A>
#meetup-seattle channel.

P.S. If you're interested in organizing similar events in your local area,
please feel free to reach out and I can connect you with the right folks.

Cheers,
Kevin Liu

Re: [ANNOUNCE] Welcoming new committers and PMC members

2024-07-23 Thread Kevin Liu

Thank you, Fokko!

And thank you to everyone for contributing to a great open-source
community. It’s been a pleasure to be part of this project. Looking forward
to the future!

On Tue, Jul 23, 2024 at 8:59 AM Walaa Eldin Moustafa 
wrote:

> Congratulations everyone! Great to see the community growing.
>
> Thanks,
> Walaa.
>
> On Tue, Jul 23, 2024 at 8:51 AM Alex Dutra 
> wrote:
>
>> Congratulations to you all!
>>
>> Thanks,
>> Alex
>>
>> On Tue, Jul 23, 2024 at 5:30 PM Jack Ye  wrote:
>>
>>> Congratulations!!
>>>
>>> Best,
>>> Jack Ye
>>>
>>> On Tue, Jul 23, 2024 at 8:16 AM Ryan Blue 
>>> wrote:
>>>
>>>> Congratulations, everyone! And thanks for all your contributions!
>>>>
>>>> On Tue, Jul 23, 2024 at 8:06 AM Mehul Batra 
>>>> wrote:
>>>>
>>>>> Congratulations Everyone!
>>>>>
>>>>> On Tue, Jul 23, 2024 at 6:36 PM Fokko Driesprong 
>>>>> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> The Iceberg PMC is excited to announce new committers and PMC members
>>>>>> to the Apache Iceberg project.
>>>>>>
>>>>>> New committers:
>>>>>>
>>>>>>
>>>>>> - Kevin Liu (kevinjqliu)
>>>>>> - Piotr Findeisen (findepi)
>>>>>> - Sung Yun (syun64)
>>>>>> - Xuanwo (xuanwo)
>>>>>>
>>>>>>
>>>>>> New members of the PMC:
>>>>>>
>>>>>>
>>>>>> - Honah (honahx)
>>>>>> - Renjie Liu (liurenjie1024)
>>>>>>
>>>>>>
>>>>>> We’re very excited to see the project grow in many ways by supporting
>>>>>> new languages and setting new standards.
>>>>>>
>>>>>> Please join us in welcoming the new committers and PMC members!
>>>>>>
>>>>>> On behalf of the Iceberg PMC,
>>>>>>
>>>>>> Fokko
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Databricks
>>>>
>>>

Re: [ANNOUNCE] Apache PyIceberg release 0.7.0

2024-07-30 Thread Kevin Liu

Woot! Thank you, Sung, for managing the release! I'm very excited about
this new version.

I want to highlight the many contributors who have improved PyIceberg since
the last release. There have been 34 unique contributors (source).
Also, as seen in issue #511, the community has come together to contribute
various metadata table implementations to PyIceberg.

Onwards and upwards,

Kevin

On Tue, Jul 30, 2024 at 12:08 PM Sung Yun wrote:

> I'm pleased to announce the release of Apache PyIceberg 0.7.0!
>
> Once again, this large release includes the following features on a high
> level:
>
> * Write support to partitioned tables with IdentityTransform and
> TimeTransform partitions
> * Support for deletes using predicates. It will drop whole files when it
> is able to based on the Iceberg statistics, otherwise it will perform a
> copy-on-write.
> * Parallelizing writes for a given partition based on a target file size
> * A new API for rendering PyArrow tables that show metadata about the
> tables’ manifests, partitions, etc
> * Support for evolving table partitions
> * Updated schema compatibility check to be more permissive, by supporting
> promotable types and subset of schemas on write
> * Option to merge manifests on write when number of manifests exceeds a
> threshold
> * Support staging a table for creation and building a transaction
> * A new table scan API to return an Arrow RecordBatchReader as opposed to
> a fully materialized Arrow table
> * Support for categorical and large PyArrow types on write
> * A new API to add existing parquet files to a table without rewriting them
> * Support for loading custom catalog
>
> This Python release can be downloaded from:
> https://pypi.org/project/pyiceberg/0.7.0/
>
> Thank you everyone again for the amazing contributions and engagement
> since the last release.
>
> Sincerely,
> Sung
>

Re: [DISCUSS] Use iceberg-rust as pyiceberg file io

2024-08-06 Thread Kevin Liu

> First, we need to establish a workflow that allows us to gradually
> integrate new features into pyiceberg-core. Additionally, pyiceberg should
> be able to import and optionally use classes from pyiceberg-core in an
> additive manner. While developing this workflow, our community will learn
> how to collaborate, manage releases, and more.

+1 I would like to learn more about how to integrate pyiceberg-core into
PyIceberg. The initial setup should give us a framework for future
integrations.

I also think there's some prerequisite work on the PyIceberg side to clean
up FileIO. A lot of features, such as writing to the table, are built
specifically around the PyArrow dependency.


Thanks,
Kevin


On Mon, Aug 5, 2024 at 12:19 AM Xuanwo wrote:

> For the FileIO part, just curious—since Rust's FileIO currently also uses
> OpenDAL, will there be any functional differences in terms of supported
> storage services or configurations (like profile_name, signer, etc.)
> compared to using opendalfs directly in Python in the future? Will Rust's
> FileIO introduce any customizations/optimizations/extensions beyond what
> OpenDAL supports?
>
>
> Hi, Honah
>
> I believe there should be no functional differences. We can implement the
> exact same thing for both pyiceberg_core FileIO and opendalfs fsspec FileIO.
>
> The main difference I've noticed is in where the configuration parsing
> occurs.
>
> The pyiceberg_core FileIO directly exposes the FileIO class, which can
> inherently understand iceberg properties. We can pass these properties
> directly to initialize file IO without any additional effort on the
> pyiceberg side.
>
> However, for opendalfs fsspec FileIO, we need to parse the properties and
> convert them into appropriate opendalfs options for it to function properly.
>
> On Mon, Aug 5, 2024, at 15:04, Honah J. wrote:
>
> Thanks Xuanwo for driving this and everyone for discussing,
>
> I like the idea of pushing down low-level logic to Iceberg-rust
> (pyiceberg_core). It’s great to have another option besides PyArrow for
> reading and writing data in PyIceberg. Thanks, Xuanwo, for moving this
> forward with the initial PR to add pyiceberg_core.
>
> For the FileIO part, just curious—since Rust's FileIO currently also uses
> OpenDAL, will there be any functional differences in terms of supported
> storage services or configurations (like profile_name, signer, etc.)
> compared to using opendalfs directly in Python in the future? Will Rust's
> FileIO introduce any customizations/optimizations/extensions beyond what
> OpenDAL supports?
>
> Best regards,
> Honah
>
>
>
> On Sat, Aug 3, 2024 at 4:12 PM timog...@proton.me.INVALID
>  wrote:
>
> Fantastic work! I think this is a great direction, and this provides a
> good base to start iterating.
>
> It makes the most sense to me for the Python bindings (and others) to live
> in the same repo as iceberg-rust, especially at this early stage.
>
> - Tim O'Guin
>
>
>  Original Message 
> On 8/3/24 12:33 AM, Xuanwo wrote:
>
>
> Let's rock! Welcome to take a review:
> https://github.com/apache/iceberg-rust/pull/518
>
> On Sat, Aug 3, 2024, at 12:13, Xuanwo wrote:
>
> I also support integrating iceberg-rust with pyiceberg rather than
> building something new on OpenDAL.
>
> OpenDAL backed FileIO will be usable in Python once opendalfs[1], the
> native fsspec support for OpenDAL, is ready. Users can use opendalfs as a
> FileIO class directly in pure python. It's not an action item for our
> community to take.
>
> The consensus we've reached is that iceberg-rust will be the core of
> PyIceberg. The main question now is "How?" How can we implement it without
> disrupting our valued users? This is my top priority.
>
> *Naming is so hard! Let's refer to the new iceberg-rust based pyiceberg
> core as `pyiceberg-core` until we decide on a project name.*
>
> First, we need to establish a workflow that allows us to gradually
> integrate new features into pyiceberg-core. Additionally, pyiceberg should
> be able to import and optionally use classes from pyiceberg-core in an
> additive manner. While developing this workflow, our community will learn
> how to collaborate, manage releases, and more.
>
> We will then incorporate additional Rust-backed features into
> pyiceberg-core. Eventually, we may make pyiceberg-core our default
> implementation.
>
> My current plan is to implement this pyiceberg-core under iceberg-rust
> repo under `bindings/python`.
>
> - Iceberg-rust is currently under active development. I plan to release
> pyiceberg-core independently of iceberg-rust's release, as they feature
> distinct public APIs (and languages!).
> - Most of the work involves maintaining a few Python stubs and classes,
> with the majority related to Rust.
> - The python integration is just a start: we can expect `bindings/nodejs`
> to happen here too.
>
> The setup work has already been started. I will update my PR here once
> it's ready to review.
>
> [1]: https://github.com/fsspec/opendalfs

Re: [DISCUSS] PyIceberg 0.7.1 release

2024-08-06 Thread Kevin Liu

> Typically we only push patches into the minor versions, we could also go
> to version 0.8.0 immediately.

The issues above sound like patches to me, fixing issues discovered during
the 0.7.0 release. Is there a reason to move to 0.8.0?

> I'm still on the fence regarding 17.0.0 upgrade. There are clear
> functional upsides, but I feel that constraining PyIceberg to just one
> published version would make the adoption of PyIceberg difficult for our
> users.

+1 on this concern. Is it possible to make the Arrow 17.0.0 upgrade
optional first? So that folks who want the upgrade can test it out.
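
To illustrate what "optional" could look like, here is a rough sketch of a
version gate in Python. The helper names and the idea of gating on the
installed distribution version are assumptions for discussion only; this is not
how PyIceberg handles it today, and the parsing deliberately ignores
pre-release ordering.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_release(version: str) -> Tuple[int, ...]:
    """Return the leading numeric release segment, e.g. "17.0.0" -> (17, 0, 0)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if ch not in "0123456789":
                break
            digits += ch
        if not digits:
            break  # e.g. "post1" or "dev5": stop at the first non-numeric piece
        parts.append(int(digits))
        if digits != piece:
            break  # e.g. "0rc1": keep the 0 and ignore the suffix
    return tuple(parts)


def installed_version(dist: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


def meets_minimum(installed: Optional[str], minimum: str, width: int = 3) -> bool:
    """True when the installed release is at least the minimum release."""
    if installed is None:
        return False

    def padded(release: Tuple[int, ...]) -> Tuple[int, ...]:
        # Pad with zeros so that "17" compares equal to "17.0.0".
        return release + (0,) * (width - len(release))

    return padded(parse_release(installed)) >= padded(parse_release(minimum))
```

A caller could then enable the newer code path only when
`meets_minimum(installed_version("pyarrow"), "17.0.0")` is true and fall back to
the existing behavior otherwise.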

Thanks,
Kevin Liu



On Fri, Aug 2, 2024 at 11:33 AM Sung Yun wrote:

> Hi Fokko,
>
> That makes sense, thank you for the suggestion! The issue was quite severe
> for us that we had to fork the repo and have a fix ourselves in order to
> run PyIceberg without our applications going OOM. So I think there will be
> value in getting the proposed config property out as early as possible for
> the larger community.
>
> I'm still on the fence regarding 17.0.0 upgrade. There are clear
> functional upsides, but I feel that constraining PyIceberg to just one
> published version would make the adoption of PyIceberg difficult for our
> users. Users writing new applications won't have trouble with it, but users
> intending to use PyIceberg in an existing application may have to upgrade
> their PyArrow versions which could be a deterrent (or a welcome nudge).
> Would it be worth starting that discussion on a separate thread?
>
> Sung
>
> On 2024/08/02 17:57:17 Fokko Driesprong wrote:
> > Hey Sung,
> >
> > Typically we only push patches into the minor versions, we could also go
> to
> > version 0.8.0 immediately.
> >
> > Regarding the memory consumption, thanks for putting those numbers
> > together! I would also love to get #929
> > <https://github.com/apache/iceberg-python/pull/929>, so we can push down
> > the large/small type to PyArrow (only for to_arrow), and apply #986
> > <https://github.com/apache/iceberg-python/pull/986> on top if you want
> to
> > force it to either small or large types.
> >
> > WDYT?
> >
> > Kind regards,
> > Fokko
> >
> >
> > Op vr 2 aug 2024 om 19:46 schreef Sung Yun :
> >
> > > Hi folks,
> > >
> > > We identified inefficient memory usage hikes with the current way of
> > > upcasting pyarrow types to large_ on read, when reading tables
> with
> > > certain characteristics. A detailed set of example benchmarks of this
> issue
> > > is on the google document linked on PR #986:
> > > https://github.com/apache/iceberg-python/pull/986
> > >
> > > The proposed solution introduces a config to override this behavior to
> use
> > > small types instead, and I'd like to add this into the patch release to
> > > give users better control over their memory usage.
> > >
> > > Also, this is just a gentle reminder that this DISCUSS thread is still
> > > open for any new issues that are identified from 0.7.0 release, that we
> > > should fix in the patch release.
> > >
> > > Thank you,
> > > Sung
> > >
> > > On 2024/07/30 23:57:04 Sung Yun wrote:
> > > > Hi folks,
> > > >
> > > > We are starting to compile the list of issues to fix and port into
> the
> > > > 0.7.1 release.
> > > >
> > > > The current list of known issues is as follows:
> > > >
> > > > Fix pydantic warning on table commit: #972
> > > > <https://github.com/apache/iceberg-python/pull/972> (thanks for the
> > > quick
> > > > fix ndrluis!)
> > > > Issue when rewriting an unpartitioned table: #979
> > > > <https://github.com/apache/iceberg-python/issues/979>
> > > > Issue when evolving and writing in the same transaction: #980
> > > > <https://github.com/apache/iceberg-python/issues/980>
> > > >
> > > > Please feel free to respond to this thread with any issues that
> should be
> > > > tracked for the patch release.
> > > >
> > > > Thank you!
> > > > Sung
> > > >
> > >
> >
>

Re: [DISCUSS] Use iceberg-rust as pyiceberg file io

2024-08-06 Thread Kevin Liu

I had some ideas to refactor the current FileIO implementations in
PyIceberg to consolidate the behaviors for FsSpec and PyArrow.
https://github.com/apache/iceberg-python/issues/310

There are also some additional concerns around URI parsing based on the
specific FileIO implementation.

Perhaps a good litmus test for a well-defined FileIO interface is to
introduce nanoarrow
<https://arrow.apache.org/nanoarrow/latest/getting-started/python.html> and
see what breaks.
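
As a rough illustration of what "well-defined" might mean, here is a minimal
sketch of a backend-agnostic interface with a trivial in-memory
implementation. All names here (`MinimalFileIO`, `InMemoryFileIO`,
`round_trip`) are hypothetical and are not PyIceberg's actual FileIO classes;
the point is only that callers depend on the interface and never on PyArrow or
fsspec types.

```python
from typing import Dict, Protocol


class MinimalFileIO(Protocol):
    """Hypothetical interface: only bytes and location strings cross it."""

    def read_bytes(self, location: str) -> bytes: ...

    def write_bytes(self, location: str, data: bytes) -> None: ...

    def delete(self, location: str) -> None: ...


class InMemoryFileIO:
    """Toy backend that satisfies MinimalFileIO without any third-party library."""

    def __init__(self) -> None:
        self._files: Dict[str, bytes] = {}

    def read_bytes(self, location: str) -> bytes:
        if location not in self._files:
            raise FileNotFoundError(location)
        return self._files[location]

    def write_bytes(self, location: str, data: bytes) -> None:
        self._files[location] = data

    def delete(self, location: str) -> None:
        if location not in self._files:
            raise FileNotFoundError(location)
        del self._files[location]


def round_trip(io: MinimalFileIO, location: str, payload: bytes) -> bytes:
    """Write, read back, and delete a payload using only the interface."""
    io.write_bytes(location, payload)
    data = io.read_bytes(location)
    io.delete(location)
    return data
```

If a new backend can pass something like `round_trip` without changes to the
calling code, that is a hint the interface is not leaking backend-specific
details.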

Thanks,
Kevin Liu





On Tue, Aug 6, 2024 at 11:21 AM Kevin Liu wrote:

> > First, we need to establish a workflow that allows us to gradually
> integrate new features into pyiceberg-core. Additionally, pyiceberg should
> be able to import and optionally use classes from pyiceberg-core in an
> additive manner. While developing this workflow, our community will learn
> how to collaborate, manage releases, and more.
>
> +1 I would like to learn more about how to integrate pyiceberg-core into
> PyIceberg. The initial setup should give us a framework for future
> integrations.
>
> I also think there's some prerequisite work on the Pyiceberg side to
> clean up FileIO. A lot of features are built specifically with PyArrow
> dependency, such as writing to the table.
>
>
> Thanks,
> Kevin
>
>
> On Mon, Aug 5, 2024 at 12:19 AM Xuanwo  wrote:
>
>> For the FileIO part, just curious—since Rust's FileIO currently also uses
>> OpenDAL, will there be any functional differences in terms of supported
>> storage services or configurations (like profile_name, signer, etc.)
>> compared to using opendalfs directly in Python in the future? Will Rust's
>> FileIO introduce any customizations/optimizations/extensions beyond what
>> OpenDAL supports?
>>
>>
>> Hi, Honah
>>
>> I believe there should be no functional differences. We can implement the
>> exact same thing for both pyiceberg_core FileIO and opendalfs fsspec FileIO.
>>
>> The main difference I've noticed is in where the configuration parsing
>> occurs.
>>
>> The pyiceberg_core FileIO directly exposes the FileIO class, which can
>> inherently understand iceberg properties. We can pass these properties
>> directly to initialize file IO without any additional effort on the
>> pyiceberg side.
>>
>> However, for opendalfs fsspec FileIO, we need to parse the properties and
>> convert them into appropriate opendalfs options for it to function properly.
>>
>> On Mon, Aug 5, 2024, at 15:04, Honah J. wrote:
>>
>> Thanks Xuanwo for driving this and everyone for discussing,
>>
>> I like the idea of pushing down low-level logic to Iceberg-rust
>> (pyiceberg_core). It’s great to have another option besides PyArrow for
>> reading and writing data in PyIceberg. Thanks, Xuanwo, for moving this
>> forward with the initial PR to add pyiceberg_core.
>>
>> For the FileIO part, just curious—since Rust's FileIO currently also uses
>> OpenDAL, will there be any functional differences in terms of supported
>> storage services or configurations (like profile_name, signer, etc.)
>> compared to using opendalfs directly in Python in the future? Will Rust's
>> FileIO introduce any customizations/optimizations/extensions beyond what
>> OpenDAL supports?
>>
>> Best regards,
>> Honah
>>
>>
>>
>> On Sat, Aug 3, 2024 at 4:12 PM timog...@proton.me.INVALID
>>  wrote:
>>
>> Fantastic work! I think this is a great direction, and this provides a
>> good base to start iterating.
>>
>> It makes the most sense to me for the Python bindings (and others) to
>> live in the same repo as iceberg-rust, especially at this early stage.
>>
>> - Tim O'Guin
>>
>>
>>  Original Message 
>> On 8/3/24 12:33 AM, Xuanwo wrote:
>>
>>
>> Let's rock! Welcome to take a review:
>> https://github.com/apache/iceberg-rust/pull/518
>>
>> On Sat, Aug 3, 2024, at 12:13, Xuanwo wrote:
>>
>> I also support integrating iceberg-rust with pyiceberg rather than
>> building something new on OpenDAL.
>>
>> OpenDAL backed FileIO will be usable in Python once opendalfs[1], the
>> native fsspec support for OpenDAL, is ready. Users can use opendalfs as a
>> FileIO class directly in pure python. It's not an action item for our
>> community to take.
>>
>> The consensus we've reached is that iceberg-rust will be the core of
>> PyIceberg. The main question now is "How?" How can we implement it without
>> disrupting our valued users? This is my top priority.
>>
>> *Naming is so hard!

Re: [DISCUSS] PyIceberg 0.7.1 release

2024-08-06 Thread Kevin Liu

I'm +1 for getting 0.7.1 out fast with patches and targeting the Arrow 17.0.0
upgrade for the 0.8.0 release.

Thanks,
Kevin Liu

On Tue, Aug 6, 2024 at 11:43 AM Fokko Driesprong wrote:

> The issues above sound like patches to me, fixing issues discovered during
>> the 0.7.0 release. Is there a reason to move to 0.8.0?
>>
>
> That would also allow us to add new features :) I'm also okay with a 0.7.1
> release
>
> +1 on this concern. Is it possible to make the Arrow 17.0.0 upgrade
>> optional first? So that folks who want the upgrade can test it out.
>
> If we go with a 0.7.1. How about targeting Arrow 17.0.0 to PyIceberg 0.8.0?
>
> Kind regards,
> Fokko
>
> Op di 6 aug 2024 om 20:30 schreef Kevin Liu :
>
>> > Typically we only push patches into the minor versions, we could also
>> go to version 0.8.0 immediately.
>>
>> The issues above sound like patches to me, fixing issues discovered
>> during the 0.7.0 release. Is there a reason to move to 0.8.0?
>>
>> > I'm still on the fence regarding 17.0.0 upgrade. There are clear
>> functional upsides, but I feel that constraining PyIceberg to just one
>> published version would make the adoption of PyIceberg difficult for our
>> users.
>>
>> +1 on this concern. Is it possible to make the Arrow 17.0.0 upgrade
>> optional first? So that folks who want the upgrade can test it out.
>>
>> Thanks,
>> Kevin Liu
>>
>>
>>
>> On Fri, Aug 2, 2024 at 11:33 AM Sung Yun  wrote:
>>
>>> Hi Fokko,
>>>
>>> That makes sense, thank you for the suggestion! The issue was quite
>>> severe for us that we had to fork the repo and have a fix ourselves in
>>> order to run PyIceberg without our applications going OOM. So I think there
>>> will be value in getting the proposed config property out as early as
>>> possible for the larger community.
>>>
>>> I'm still on the fence regarding 17.0.0 upgrade. There are clear
>>> functional upsides, but I feel that constraining PyIceberg to just one
>>> published version would make the adoption of PyIceberg difficult for our
>>> users. Users writing new applications won't have trouble with it, but users
>>> intending to use PyIceberg in an existing application may have to upgrade
>>> their PyArrow versions which could be a deterrent (or a welcome nudge).
>>> Would it be worth starting that discussion on a separate thread?
>>>
>>> Sung
>>>
>>> On 2024/08/02 17:57:17 Fokko Driesprong wrote:
>>> > Hey Sung,
>>> >
>>> > Typically we only push patches into the minor versions, we could also
>>> go to
>>> > version 0.8.0 immediately.
>>> >
>>> > Regarding the memory consumption, thanks for putting those numbers
>>> > together! I would also love to get #929
>>> > <https://github.com/apache/iceberg-python/pull/929>, so we can push
>>> down
>>> > the large/small type to PyArrow (only for to_arrow), and apply #986
>>> > <https://github.com/apache/iceberg-python/pull/986> on top if you
>>> want to
>>> > force it to either small or large types.
>>> >
>>> > WDYT?
>>> >
>>> > Kind regards,
>>> > Fokko
>>> >
>>> >
>>> > Op vr 2 aug 2024 om 19:46 schreef Sung Yun :
>>> >
>>> > > Hi folks,
>>> > >
>>> > > We identified inefficient memory usage hikes with the current way of
>>> > > upcasting pyarrow types to large_ on read, when reading tables
>>> with
>>> > > certain characteristics. A detailed set of example benchmarks of
>>> this issue
>>> > > is on the google document linked on PR #986:
>>> > > https://github.com/apache/iceberg-python/pull/986
>>> > >
>>> > > The proposed solution introduces a config to override this behavior
>>> to use
>>> > > small types instead, and I'd like to add this into the patch release
>>> to
>>> > > give users better control over their memory usage.
>>> > >
>>> > > Also, this is just a gentle reminder that this DISCUSS thread is
>>> still
>>> > > open for any new issues that are identified from 0.7.0 release, that
>>> we
>>> > > should fix in the patch release.
>>> > >
>>> > > Thank you,
>>> > > Sung
>>> > >
>>> > > On 2024/07/30 23:57:04 Sung Yun wrote:
>>> > > > Hi folks,
>>> > > >
>>> > > > We are starting to compile the list of issues to fix and port into
>>> the
>>> > > > 0.7.1 release.
>>> > > >
>>> > > > The current list of known issues is as follows:
>>> > > >
>>> > > > Fix pydantic warning on table commit: #972
>>> > > > <https://github.com/apache/iceberg-python/pull/972> (thanks for
>>> the
>>> > > quick
>>> > > > fix ndrluis!)
>>> > > > Issue when rewriting an unpartitioned table: #979
>>> > > > <https://github.com/apache/iceberg-python/issues/979>
>>> > > > Issue when evolving and writing in the same transaction: #980
>>> > > > <https://github.com/apache/iceberg-python/issues/980>
>>> > > >
>>> > > > Please feel free to respond to this thread with any issues that
>>> should be
>>> > > > tracked for the patch release.
>>> > > >
>>> > > > Thank you!
>>> > > > Sung
>>> > > >
>>> > >
>>> >
>>>
>>

Re: [DISCUSS] PyIceberg: Remove optional support for instance-level identifier in Catalog and Table APIs

2024-08-06 Thread Kevin Liu

Thanks, Sung for the great writeup and explaining the context.


I am +1 to remove the catalog name as part of the table identifier. In
PyIceberg, each instance of a catalog has an associated name. So any
functions of the catalog instance must be related to that catalog name. In
the case where a single REST endpoint can represent many different catalog
instances, we can use multiple instances of the catalog object. This is
similar to Trino/Spark’s `USE ` feature.


The only potential issue I can see is if, for some reason, we would want to
refer to multiple catalogs in the same statement, i.e. “SELECT * FROM
catalog1.foo.bar join catalog2.foo.bar USING col_a”. But I don’t see this
being used in any of the current functions.


Regarding the deprecation, I think it’ll be good to do it in multiple
steps. Perhaps we can start with a WARNING log whenever a catalog name is
used while allowing table identifiers both with and without the catalog
name. Then, in a future release, we can completely remove the catalog name.
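
A minimal sketch of the first deprecation step described above, assuming a
hypothetical helper named `strip_catalog_name` (it is not part of PyIceberg's
API). It accepts identifiers with or without the catalog name and warns when
the catalog name is present:

```python
import warnings
from typing import Tuple


def strip_catalog_name(catalog_name: str, identifier: str) -> Tuple[str, ...]:
    """Hypothetical helper: split a dotted identifier and drop a leading catalog name.

    Note the ambiguity discussed in this thread: a namespace that happens to
    share the catalog's name is indistinguishable from the catalog prefix,
    which is one reason to eventually remove this behavior altogether.
    """
    parts = tuple(identifier.split("."))
    # Only treat the first part as the catalog name if something remains after it.
    if len(parts) > 1 and parts[0] == catalog_name:
        warnings.warn(
            f"Including the catalog name '{catalog_name}' in the identifier "
            "is deprecated and will be removed in a future release.",
            DeprecationWarning,
            stacklevel=2,
        )
        return parts[1:]
    return parts
```

With a catalog named "animals", both `"animals.cats.whiskers"` and
`"cats.whiskers"` resolve to the same tuple, but only the former emits a
warning. A later release could replace the warning branch with an error.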


Thanks,

Kevin Liu

On Wed, Jul 31, 2024 at 2:56 PM Sung Yun  wrote:

> Today in PyIceberg, we have support for identifier parsing in public APIs
> belonging to two different classes:
>
>
>- Catalog class: load_table, purge_table, drop_table
>- Table class: scan
>
>
> These APIs currently have optional support for the identifier that the
> instance itself belongs to.
>
> For example, the catalog class APIs support:
>
>
> *catalog = load_catalog(“animals”, **properties)*
> *catalog.load_table(“cats.whiskers”)*
>
> But it also supports:
>
> *catalog.load_table(“animals.cats.whiskers”)*
>
> Which is redundant, because the catalog.name is already “animals”.
>
> Similarly, row_filter in the Table scan API supports:
>
>
> *table = catalog.load_table(“cats.whiskers”)*
> *table.scan(row_filter=”n_legs == 4”)*
>
> But we also support
>
>
> *table.scan(row_filter=”whiskers.n_legs == 4”)*
> Which is also redundant, because the table name is already “whiskers” (or
> cats.whiskers)
>
> While it sounds like a good change, I’d still like to open this thread to
> discuss the possibility of removing this optional support for the
> instance-level identifier as it will result in a backward incompatible API
> behavior change.
>
> The benefits of this change are as follows:
>
>- As observed above, specifying the instance-level identifier in these
>APIs is redundant.
>- This optional support adds a lot of complexity to the code base and
>leads to issues like #742
><https://github.com/apache/iceberg-python/issues/742>. It would be
>really great to clean this up as we prepare for a 1.0.0 release later
>this year.
>- The optional support opens up the possibility of correctness issues
>if a name in the level below matches the instance-level identifier.
>   - For example, if in the above catalog we have a table namespace
>   named “animals.lower”, catalog.load_table(“animals.lower.cats”) can be
>   construed as table name “cats” in the namespace “animals.lower”, but
>   it will be interpreted as table name “cats” in the namespace “lower”,
>   which is erroneous.
>   - We would see a similar issue with tables and field names as well.
>   Field name parsing is already complicated because we have to represent
>   nested fields as flat representations. So it would be great to remove
>   one unnecessary level of complication here.
>
>
> I'd love to hear from the community on their thoughts on this topic. If
> there are any folks in the community using the optional feature, it would
> be especially great to hear from you as well, on what this change will mean
> for your applications.
>
> Related PR: #963 <https://github.com/apache/iceberg-python/pull/963>
>
> Sung
>

Re: [DISCUSS] Formalized File IO Properties

2024-08-06 Thread Kevin Liu

+1 on standardizing, and possibly extending this to include catalog
properties.

On the PyIceberg side, a recent development is the ability to separate S3
FileIO configurations from the Glue Catalog configurations, with an
optional configuration to use the same for both if specified. See Unified
AWS Credentials
<https://py.iceberg.apache.org/configuration/#unified-aws-credentials> and
Github Issue #892 <https://github.com/apache/iceberg-python/issues/892>

So for AWS credentials, there are currently 3 different properties for
`access-key-id`
* `s3.access-key-id` (S3 FileIO specific)
* `glue.access-key-id` (Glue Catalog specific)
* `client.access-key-id` (Unified)
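
A rough sketch of how a lookup across these three properties could work. The
helper name `resolve_property` is hypothetical, and the precedence shown
(service-specific value first, unified `client.*` value as the fallback) is an
assumption based on how the Unified AWS Credentials documentation describes
the feature, not a copy of PyIceberg's code:

```python
from typing import Mapping, Optional


def resolve_property(
    properties: Mapping[str, str], service: str, key: str
) -> Optional[str]:
    """Hypothetical lookup: prefer '<service>.<key>', fall back to 'client.<key>'."""
    specific = properties.get(f"{service}.{key}")
    if specific is not None:
        return specific
    # Assumed fallback: the unified property applies when no specific one is set.
    return properties.get(f"client.{key}")
```

For example, with only `client.access-key-id` and `glue.access-key-id` set,
the Glue lookup returns the Glue-specific value while the S3 lookup falls back
to the unified one.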

Thanks,
Kevin Liu




On Wed, Jul 31, 2024 at 10:05 AM Xuanwo  wrote:

> Thank you all. I'm going to prepare a proposal PR for this.
>
> On Fri, Jul 12, 2024, at 10:06, Honah J. wrote:
>
> Hello everyone,
>
> Thank you all for the valuable insights. I am also +1 on having
> standardized names for File IO properties. Creating a dedicated section to
> summarize property names in the Java implementation is a good starting
> point. Since pyiceberg, icebergRust, and IcebergGolang will support only
> subsets of these properties for some time (with the rest to be added in
> future development), the existing Java implementation will serve as a
> useful reference. Additionally, we could establish general naming
> conventions in the doc, such as using the “s3.” prefix for S3 properties
> and hyphens to connect words.
>
> Best regards,
> Honah
>
> On Wed, Jul 10, 2024 at 10:47 AM  wrote:
>
>
>
> I don't know what the recommended way to start standardizing is. We can
> start a proposal for each context or have one proposal to handle all.
>
> Suggested contexts to start with:
>
>- Rest Catalog
>
>
>- FileIO
>
>
> I believe that most of the other cases are supported by the configuration
> topic in the Table section[1], but this is about the Java implementation.
> Maybe we need to create a page in the project section[2] to handle the
> properties in the table section and the Rest and FileIO contexts.
>
>
> [1]: https://iceberg.apache.org/docs/latest/configuration/
> [2]: https://iceberg.apache.org/community/
> On Wednesday, July 10th, 2024 at 11:58 AM, Russell Spitzer <
> russell.spit...@gmail.com> wrote:
>
> Sounds reasonable to me
>
> On Wed, Jul 10, 2024 at 9:28 AM Renjie Liu 
> wrote:
>
> Hi:
>
> +1 for standardizing iceberg properties. This will help to align different
> language implementations.
>
> On Wed, Jul 10, 2024 at 9:44 PM  wrote:
>
>
> Hello Everyone,
>
> I was considering discussing the standardization of Iceberg properties,
> and I believe this thread could be a great place to start.
>
> I'm writing an Iceberg client in Elixir and using the Java, Python, and
> Rust implementations as references. However, I've had some difficulty
> determining which configurations we must support and what each client has
> implemented. Therefore, I agree with Xuanwo about having a separate
> section as a single source of truth (SSOT).
>
> Additionally, I think it would be beneficial for each client to show what
> it does not support. This would make it easier for users to know that a
> particular client might not work with some configuration that their catalog
> could define as default or override. It would also help us, as
> contributors, to know which configurations we need to implement support for.
>
> For example, the "s3.signer"[1] and "s3.proxy-uri"[2] configurations only
> exist in the Python implementation. I believe it is not clear that these
> configurations are exclusive to Python, and they might be configurations
> that the catalog could override or define as defaults in the get info
> endpoint. Without an SSOT, this could be harder to track.
>
> Another example is the "rest.authorization-url" in Python and Rust versus
> "oauth2_server_uri" in Java. Although this is a bit out of scope for this
> thread, I will open another discussion topic about broader standardization
> of available properties.
>
>
> [1]:
> https://github.com/search?q=repo%3Aapache%2Ficeberg-python+s3.signer&type=code
> [2]:
> https://github.com/search?q=repo%3Aapache%2Ficeberg-python%20S3_PROXY_URI&type=code
>
> On Wednesday, July 10th, 2024 at 7:51 AM, Fokko Driesprong <
> fo...@apache.org> wrote:
>
> Hey Xuanwo,
>
> Thanks for raising this.
>
>- The S3 properties are largely covered under the S3FileIO page:
>https://iceberg.apache.org/docs/nightly/aws/#s3-fileio. But it looks
>like some important ones are missing indeed. I've raised an issue here
><htt

Re: [VOTE] Release Apache PyIceberg 0.7.1rc1

2024-08-09 Thread Kevin Liu

+1 (non-binding)
Verified signatures/checksums/licenses. Ran unit and integration tests.

Sidenote: the new "Verifying a release" instructions work like a charm!

On Fri, Aug 9, 2024 at 12:30 PM Sung Yun  wrote:

> Hi Everyone,
>
> I propose that we release the following RC as the official PyIceberg 0.7.1
> release.
>
> This is a patch release due to the following bugs:
>
> * Fix correctness of applying positional deletes on Merge-On-Read tables
> * Fix overwrite when filtering data
> * Bug fix for deletes across multiple partition specs on partition
> evolution
> * Fix evolving the table and writing in the same transaction
> * Fix scans when result is empty
> * Fix ListNamespace response in REST Catalog
> * Exclude Python 3.9.7 from list of supported versions
> * Allow setting write.parquet.row-group-limit
> * Allow setting write.parquet.page-row-limit
> * Fix pydantic warning during commit
>
> The commit ID is 5e89fc5c55c09a17ea38a03f0139b54a35786adc
>
> * This corresponds to the tag: pyiceberg-0.7.1rc1
> (3df95c4e6161a1fff7cd8f4a5ff80e45f7c704d3)
> * https://github.com/apache/iceberg-python/releases/tag/pyiceberg-0.7.1rc1
> *
> https://github.com/apache/iceberg-python/tree/5e89fc5c55c09a17ea38a03f0139b54a35786adc
>
> The release tarball, signature, and checksums are here:
>
> * https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.7.1rc1/
>
> You can find the KEYS file here:
>
> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>
> Convenience binary artifacts are staged on pypi:
>
> https://pypi.org/project/pyiceberg/0.7.1rc1/
>
> And can be installed using: pip3 install pyiceberg==0.7.1rc1
>
> Instructions for verifying a release can be found here:
>
> * https://py.iceberg.apache.org/verify-release/
>
> Please download, verify, and test.
>
> Please vote in the next 72 hours.
> [ ] +1 Release this as PyIceberg 0.7.1
> [ ] +0
> [ ] -1 Do not release this because...
>

Re: [DISCUSS] How about setup a iceberg meetup in Beijing?

2024-08-11 Thread Kevin Liu

Hi Xuanwo,

Love the idea! We've been hosting the Seattle area meetup for the last
couple of months and are also helping to coordinate the Bay Area meetup in
September.

Jonathan and I put together this document on some of our learnings and
recommendations for hosting a meetup. Hope this helps!
Apache Iceberg Developer Community Meetup Guidelines
<https://docs.google.com/document/d/1y87k-8nCMczzORIcsePDu6CWTTO0h9qI67W_go8BmPA/edit>

Cheers,
Kevin Liu

On Sun, Aug 11, 2024 at 5:14 AM Renjie Liu  wrote:

> +1
>
> On Sun, Aug 11, 2024 at 16:53 Junwang Zhao  wrote:
>
>> On Sun, Aug 11, 2024 at 3:59 PM Xiaojing Fang 
>> wrote:
>> >
>> > Cool, I’d like to join the meetup.
>> >
>> > > 2024年8月11日 13:24，Xuanwo  写道：
>> > >
>> > > Hello, everyone
>> > >
>> > > I'm starting this thread to discuss the possibility of organizing an
>> iceberg meetup in Beijing.
>> > >
>> > > The proposal is available at
>> https://hackmd.io/@xuanwo/apache-iceberg-beijing-meetup-2024-10
>> > >
>> > > Do you love this idea?
>> > >
>> > > Xuanwo
>> > >
>> > > https://xuanwo.io/
>> >
>>
>> +1
>>
>>
>> --
>> Regards
>> Junwang Zhao
>>
>

Re: [VOTE] Release Apache PyIceberg 0.7.1rc2

2024-08-14 Thread Kevin Liu

+1 (non-binding)
Verified signatures/checksums/licenses. Ran unit and integration tests.

On Thu, Aug 15, 2024 at 2:42 AM Fokko Driesprong  wrote:

> +1 (binding)
>
> Thanks Sung for running this 🙌
>
> - Validated signatures/checksums/license
> - Ran some basic tests (3.10)
>
> Kind regards,
> Fokko
>
> Op wo 14 aug 2024 om 19:57 schreef André Luis Anastácio
> :
>
>>
>>- validated signatures and checksums
>>
>>
>>- checked license
>>
>>
>>- ran tests and test-coverage with Python 3.9.12
>>
>>
>> +1 (non-binding)
>>
>> André Anastácio
>>
>> On Tuesday, August 13th, 2024 at 10:19 PM, Sung Yun 
>> wrote:
>>
>> Hi Everyone,
>>
>> I propose that we release the following RC as the official PyIceberg
>> 0.7.1 release.
>>
>> A summary of the high level features:
>>
>> * Fix `delete` to trace existing manifests when a data file is partially
>> rewritten
>> * Fix 'to_arrow_batch_reader' to respect the limit input arg
>> * Fix correctness of applying positional deletes on Merge-On-Read tables
>> * Fix overwrite when filtering data
>> * Bug fix for deletes across multiple partition specs on partition
>> evolution
>> * Fix evolving the table and writing in the same transaction
>> * Fix scans when result is empty
>> * Fix ListNamespace response in REST Catalog
>> * Exclude Python 3.9.7 from list of supported versions
>> * Allow setting write.parquet.row-group-limit
>> * Allow setting write.parquet.page-row-limit
>> * Fix pydantic warning during commit
>>
>> The commit ID is f92994e85e526502a620506b964665b9afd385fe
>>
>> * This corresponds to the tag: pyiceberg-0.7.1rc2
>> (d33192a3f64e1b5840c691b24a6071768a9fc79b)
>> *
>> https://github.com/apache/iceberg-python/releases/tag/pyiceberg-0.7.1rc2
>> *
>> https://github.com/apache/iceberg-python/tree/f92994e85e526502a620506b964665b9afd385fe
>>
>> The release tarball, signature, and checksums are here:
>>
>> * https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.7.1rc2/
>>
>> You can find the KEYS file here:
>>
>> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>>
>> Convenience binary artifacts are staged on pypi:
>>
>> https://pypi.org/project/pyiceberg/0.7.1rc2/
>>
>> And can be installed using: pip3 install pyiceberg==0.7.1rc2
>>
>> Instructions for verifying a release can be found here:
>>
>> * https://py.iceberg.apache.org/verify-release/
>>
>> Please download, verify, and test.
>>
>> Please vote in the next 72 hours.
>> [ ] +1 Release this as PyIceberg 0.7.1
>> [ ] +0
>> [ ] -1 Do not release this because...
>>
>>
>>

Re: [DISCUSS] Iceberg Rust Sync Meeting

2024-10-09 Thread Kevin Liu

+1 on sync meeting for iceberg rust. I want to get involved and catch up on
the recent developments. For reference, here's the doc we've been using for
the pyiceberg sync
https://docs.google.com/document/d/1oMKodaZJrOJjPfc8PDVAoTdl02eGQKHlhwuggiw7s9U

Best,
Kevin

On Wed, Oct 9, 2024 at 5:30 AM Xuanwo  wrote:

> Hi,
>
> I'm starting this thread to explore the idea of hosting an Iceberg Rust
> Sync Meeting. In this meeting, we will discuss recent major changes,
> pending PR reviews, and features in development. It will offer a space for
> Iceberg Rust contributors to connect and become familiar with each other,
> helping us identify and remove contribution barriers to the best of our
> ability.
>
> Details about this meeting:
>
> I suggest hosting our meeting at the same time of day, but one week
> earlier than the Iceberg Sync Meeting. For example, if the Iceberg Sync
> Meeting is scheduled for Thursday, October 24, 2024, from 00:00 to 01:00
> GMT+8, the Iceberg Rust Sync Meeting would take place one week before, on
> Thursday, October 17, 2024, from 00:00 to 01:00 GMT+8.
>
> I also suggest using the same Google Meet code (if possible) so we don't
> get confused.
>
> These meetings will not be recorded, but I will take notes in a Google
> Doc, similar to what we do in the Iceberg Sync Meeting.
>
> What are your thoughts? I'm open to other options as well.
>
> Xuanwo
>
> https://xuanwo.io/
>

[Discuss] Replace Hadoop Catalog Examples with JDBC Catalog in Documentation

2024-10-08 Thread Kevin Liu

Hi all,

I wanted to bring up a suggestion regarding our current documentation. The
existing examples for Iceberg often use the Hadoop catalog, as seen in:

   - Adding a Catalog - Spark Quickstart [1]
   - Adding Catalogs - Spark Getting Started [2]

Since we generally advise against using Hadoop catalogs in production
environments, I believe it would be beneficial to replace these examples
with ones that use the JDBC catalog. The JDBC catalog, configured with a
local SQLite database file, offers similar convenience but aligns better
with production best practices.

I've created an issue [3] and a PR [4] to address this. Please take a look,
and I'd love to hear your thoughts on whether this is a direction we want
to pursue.

Best,
Kevin Liu

[1] https://iceberg.apache.org/spark-quickstart/#adding-a-catalog
[2]
https://iceberg.apache.org/docs/nightly/spark-getting-started/#adding-catalogs
[3] https://github.com/apache/iceberg/issues/11284
[4] https://github.com/apache/iceberg/pull/11285

Re: Bayarea Iceberg meetup in November

2024-10-04 Thread Kevin Liu

Excited about this, looking forward to it!

Best,
Kevin

On Thu, Oct 3, 2024 at 6:11 PM Aihua Xu 
wrote:

> Hi community!
>
> The Apache Iceberg community is gathering in San Francisco on November
> 4th. Whether you’re interested in presenting or just want to join us for
> some networking, you can find the event details and RSVP here:
> https://lu.ma/fholq6oz . Hope to see you there!
>
> Thanks,
> Aihua
>

Re: Clarification on DayTransform Result Type

2024-10-07 Thread Kevin Liu

Thanks for confirming!

To close the loop on this issue, we have added more documentation about the
`result_type` function in PyIceberg. This clarifies the physical and
display representations of partition transforms. For DayTransform, the
physical representation is `int`, while the display representation is
`date`. This conforms to the spec and aligns with Spark's behavior. The
changes have been made in apache/iceberg-python#1211
<https://github.com/apache/iceberg-python/pull/1211>.
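
To illustrate the two representations, here is a small standard-library
sketch. The function names `day_transform` and `day_to_human` are hypothetical
and are not PyIceberg's API. The physical value is the number of days since
the Unix epoch, and the display value is the corresponding calendar date:

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def day_transform(value: date) -> int:
    """Physical representation: days since the Unix epoch, stored as an int."""
    return (value - EPOCH).days


def day_to_human(ordinal: int) -> str:
    """Display representation: the same value rendered as an ISO date string."""
    return (EPOCH + timedelta(days=ordinal)).isoformat()
```

Both functions describe the same value. Converting a date to its ordinal and
back returns the original date, which is why an engine can show the partition
value as a date without changing what is stored.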

Thanks,
Kevin

On Mon, Oct 7, 2024 at 4:53 PM rdb...@gmail.com  wrote:

> Yes. When we return the Spark type, it shows up as date and Spark
> correctly displays the value.
>
> On Mon, Sep 30, 2024 at 9:56 AM Kevin Liu  wrote:
>
>> Thank you both for the insights and context.
>>
>> As Russell pointed out, the "day partition transform" result is indeed of
>> int type. The Types.DateType
>> <https://github.com/apache/iceberg/blob/dddb5f423b353d961b8a08eb2cb4371d453c2959/api/src/main/java/org/apache/iceberg/transforms/Days.java#L47>
>>  corresponds
>> to TypeID.DATE
>> <https://github.com/apache/iceberg/blob/09370ddbc39fc3920fb8cbd3dff11b377dd37e40/api/src/main/java/org/apache/iceberg/types/Types.java#L181>,
>> which is also an Integer type
>> <https://github.com/apache/iceberg/blob/113c6e7d62e53d3e3cb15b1712f3a1db473ca940/api/src/main/java/org/apache/iceberg/types/Type.java#L37>.
>> So, this behavior conforms to the spec.
>>
>> The issue with DayTransform in PyIceberg (#1208
>> <https://github.com/apache/iceberg-python/pull/1208>) is due to the
>> changes in the PR. The problem arises from how the partition value is
>> displayed in the partition metadata table. As Ryan mentioned, Spark
>> displays the partition value as `date`. However, the PR removes
>> `DateType` as the `result_type`, which causes PyIceberg to display the
>> partition value as `int` since the epoch.
>>
>> > if we just change the type to `date`, engines could correctly display
>> the value
>>
>> I found a related discussion in apache/iceberg/#279
>> <https://github.com/apache/iceberg/issues/279#issuecomment-521322801>,
>> specifically: "That will cause the partition tuple's field type to be a
>> date, which should also cause the metadata table to display formatted dates
>> instead of the day ordinal in Spark." I want to confirm my understanding:
>> is this behavior due to the Iceberg-to-Spark DateType conversion in
>> `TypeToSparkType`
>> <https://github.com/apache/iceberg/blob/09370ddbc39fc3920fb8cbd3dff11b377dd37e40/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/TypeToSparkType.java#L103-L104>?
>>
>> Best,
>> Kevin
>>
>>
>>
>> On Fri, Sep 27, 2024 at 1:52 PM rdb...@gmail.com 
>> wrote:
>>
>>> The background is that the result of the day function and dates are
>>> basically the same: the number of days from the Unix epoch. When we started
>>> using metadata tables, we realized that a lot of people use the day
>>> function but then get a weird ordinal value out, but if we just change the
>>> type to `date`, engines could correctly display the value. This isn't
>>> required by the spec, it's just a convenience.
>>>
>>> On Fri, Sep 27, 2024 at 8:30 AM Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
>>>> Good thing DateType is an Integer :)
>>>> https://github.com/apache/iceberg/blob/113c6e7d62e53d3e3cb15b1712f3a1db473ca940/api/src/main/java/org/apache/iceberg/types/Type.java#L37
>>>>
>>>> On Thu, Sep 26, 2024 at 8:38 PM Kevin Liu 
>>>> wrote:
>>>>
>>>>> Hey folks,
>>>>>
>>>>> While reviewing a PR to fix DayTransform in PyIceberg (#1208
>>>>> <https://github.com/apache/iceberg-python/pull/1208>), we found an
>>>>> inconsistency between the spec and the Java Iceberg library.
>>>>>
>>>>> According to the spec
>>>>> <https://iceberg.apache.org/spec/#partition-transforms>, the result
>>>>> type for the "day partition transform" should be `int`, similar to other
>>>>> time-based partition transforms (year/month/hour). However, in the Java
>>>>> Iceberg library, the result type for day partition transform is 
>>>>> `DateType` (
>>>>> source
>>>>> <https://github.com/apache/iceberg/blob/dddb5f423b353d961b8a08eb2cb4371d453c2959/api/src/main/java/org/apache/iceberg/transforms/Days.java#L47>).
>>>>> This seems to be a discrepancy from the spec, as the day partition
>>>>> transform is the only time-based transform with a non-int result
>>>>> type—whereas the others use IntegerType (source
>>>>> <https://grep.app/search?q=getResultType&filter[repo][0]=apache/iceberg&filter[path][0]=api/src/main/java/org/apache/iceberg/>
>>>>> ).
>>>>>
>>>>> Could someone confirm if my understanding is correct? If so, is there
>>>>> any historical context for this difference? Lastly, how should we approach
>>>>> resolving this moving forward?
>>>>>
>>>>> Best,
>>>>> Kevin
>>>>>
>>>>>

Re: Greater Seattle Iceberg Meetup

2024-10-18 Thread Kevin Liu

Thanks, Jonathan. Looking forward to seeing everyone!

Please let us know if you would like to present by filling out this form
https://docs.google.com/forms/d/1vic-6nUYbUTsf_WmyQ0kB_7BrSdpT97Rfh2cL3cs2kw/edit

Best,
Kevin Liu

On Fri, Oct 18, 2024 at 4:16 PM Jonathan Leang 
wrote:

> Hi everyone,
>
> We're continuing to do community meetups in the Seattle area! Details are
> below:
>
> Connect with fellow enthusiasts, share insights, and dive into the latest
> developments in the Apache Iceberg ecosystem! Whether you're a seasoned pro
> or new to Apache Iceberg, this meetup is the perfect place to exchange
> ideas and spark innovation.
>
> We will be hosting the event on November 13th from 5:00 PM to 8:30 PM in
> Bellevue. Please RSVP using this link: <https://lu.ma/kxi04g2m>
>
> In this meetup we are looking to host a couple talks, so if you're
> working on something or want to share an idea please respond to the call
> for talks in the registration list above!
>
> See you there!
> Jonathan Leang
>

Re: [DISCUSS] Discrepancy Between Iceberg Spec and Java Implementation for Snapshot summary's 'operation' key

2024-10-20 Thread Kevin Liu

Hey folks,

Thanks, everyone, for the discussion, and thanks Ryan for providing the
historical context.

Enforce the `operation` key in Snapshot’s `summary` field

When deserializing the `Snapshot` object from JSON, the Java implementation
does not enforce that the `summary` field must contain an `operation` key.
In the V1 spec, the `summary` field is optional, while in the V2 spec, it
is required. However, in both versions, if a `summary` field is present, it
must include an `operation` key. Any `summary` field lacking an `operation`
key should be considered invalid.

I’ve addressed this issue in PR 11354 [1] by adding this constraint when
parsing JSON.
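
The rule can be sketched in a few lines of Python. This is only an
illustration of the constraint, not the Java change in the PR, and the
function name `parse_snapshot_summary` is hypothetical:

```python
import json
from typing import Any, Dict, Optional


def parse_snapshot_summary(snapshot_json: str) -> Optional[Dict[str, Any]]:
    """Return the snapshot's summary, enforcing that it carries an 'operation' key."""
    snapshot = json.loads(snapshot_json)
    summary = snapshot.get("summary")
    if summary is None:
        # V1 metadata may omit the summary entirely; that is still valid.
        return None
    if "operation" not in summary:
        raise ValueError("Invalid snapshot summary: missing required key 'operation'")
    return summary
```

A snapshot without a `summary` is accepted, a `summary` with an `operation`
is returned as-is, and a `summary` without an `operation` is rejected.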

> We initially did not have the snapshot summary or operation. When I added
> the summary, the operation was intended to be required in cases where the
> summary is present. It should always be there if the summary is and the
> summary should always be there unless you wrote the metadata.json file way
> back in 2017 or 2018.

@Ryan, does this constraint also apply to `metadata.json` files from
2017/2018? Was it ever valid to have a `summary` field without the
`operation` key?

> Well, the spec says nothing about a top-level `operation` field in JSON
> [1]. Yet the Java implementation produces it [2] and removes the operation
> from the summary map. This seems inconsistent?

@Anton, the Java `Snapshot` object includes both the `summary` and
`operation` fields. When serializing to JSON, the `operation` field is
included in the `summary` map [2], rather than as a top-level field. During
deserialization from JSON, the `operation` field is extracted from the
`summary` map [3].

I believe this is consistent with the table spec, which defines the JSON
output, not how the `Snapshot` object is implemented in Java.

On REST spec and Table spec

Thanks, Yufei, for highlighting the difference between the REST spec and
the table spec. I mistakenly used the REST spec
(`rest-catalog-open-api.yaml` [4]) as the source of truth for V2 tables.

Looking at the REST spec file, it can be challenging to determine how a
REST server should handle V1 versus V2 tables. Even for V2 tables, the
current version of the file combines features from V2, along with
additional changes made in preparation for the upcoming V3 spec.

Would it be helpful to create alternative versions of the REST spec
specifically for referencing V1 and V2 tables? The goal would be to have a
"frozen" version of the REST spec dedicated to V1 tables and another for V2
tables while allowing the current REST spec file to evolve as needed.

Taking a step back, I think we need more documentation on the REST spec,
including support for different table versions and guidance on upgrading
from one version to another. I’d love to hear everyone’s thoughts on this.

Best,

Kevin Liu

[1] https://github.com/apache/iceberg/pull/11354

[2]
https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L63-L66

[3]
https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L124-L137

[4]
https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml

On Sat, Oct 19, 2024 at 7:48 PM Sung Yun  wrote:

> Hi Ryan, thank you for your response!
>
> That detailed context is very helpful in allowing me to understand why
> the REST catalog spec has evolved the way it has, and how the Table Spec
> and the REST Catalog Spec should each be referenced in the sub-communities
> (like in PyIceberg). I'll keep those motivations in mind as we discuss
> those Specs in the future.
>
> Also, here's a small PR to specify more explicitly that the operation
> field should be a required field in the summary field:
> https://github.com/apache/iceberg/pull/11355
>
> Sung
>
> On 2024/10/19 22:14:59 "rdb...@gmail.com" wrote:
> > I can provide some historical context here about how the table spec
> evolved
> > and how the REST spec works with respect to table versions.
> >
> > We initially did not have the snapshot summary or operation. When I added
> > the summary, the operation was intended to be required in cases where the
> > summary is present. It should always be there if the summary is and the
> > summary should always be there unless you wrote the metadata.json file
> way
> > back in 2017 or 2018. It looks like the spec could be more clear that the
> > operation is required when summary is present. Anyone want to open a PR?
> >
> > Anton, I don't think there is a top-level operation field. The Java
> > Snapshot class tracks the operation as top-level, but it is always stored
> > in the summary. I think this is consistent with the spec.
> >
> > For the REST spec, I think that it should be strictly optional to

Re: [DISCUSS] Discrepancy Between Iceberg Spec and Java Implementation for Snapshot summary's 'operation' key

2024-10-21 Thread Kevin Liu

> No. They were introduced at the same time.

Great! Since the `summary` field and the `operation` key were introduced
together, we should enforce the rule that the `summary` field must always
have an accompanying `operation` key. This has been addressed in PR 11354
[1].

> I am strongly against this. The REST spec should be independent of the
> table versions.

That makes sense. For the REST spec to support both V1 and V2 tables, it
should "accept" the least common denominator between the two versions. For
example, the `Snapshot` `summary` field is optional in V1 but required in
V2. Therefore, the REST spec definition should mark the `summary` field as
optional to support both versions. However, the current REST spec leans
towards the V2 table spec; fields that are optional in V1 and required in
V2 are marked as required in the spec, such as `TableMetadata.table-uuid`
[2][3] and `Snapshot.summary` [4][5].

Would love to get other people's thoughts on this.

Best,
Kevin Liu

[1] https://github.com/apache/iceberg/pull/11354
[2]
https://github.com/apache/iceberg/blob/8e743a5b5209569f84b6bace36e1106c67e1eab3/open-api/rest-catalog-open-api.yaml#L2414
[3] https://iceberg.apache.org/spec/#table-metadata-fields
[4]
https://github.com/apache/iceberg/blob/8e743a5b5209569f84b6bace36e1106c67e1eab3/open-api/rest-catalog-open-api.yaml#L2325
[5] https://iceberg.apache.org/spec/#snapshots

On Sun, Oct 20, 2024 at 11:24 AM rdb...@gmail.com  wrote:

> Was it ever valid to have a summary field without the operation key?
>
> No. They were introduced at the same time.
>
> Would it be helpful to create alternative versions of the REST spec
> specifically for referencing V1 and V2 tables?
>
> I am strongly against this. The REST spec should be independent of the
> table versions. Any table format version can be passed and the table format
> should be the canonical reference for what is allowed. We want to avoid
> cases where there are discrepancies. The table spec is canonical for table
> metadata, and the REST spec allows passing it.
>
> On Sun, Oct 20, 2024 at 11:18 AM Kevin Liu  wrote:
>
>> Hey folks,
>>
>> Thanks, everyone for the discussion, and thanks Ryan for providing the
>> historical context.
>>
>> Enforce the `operation` key in Snapshot’s `summary` field
>>
>> When deserializing the `Snapshot` object from JSON, the Java implementation
>> does not enforce that the `summary` field must contain an `operation` key.
>> In the V1 spec, the `summary` field is optional, while in the V2 spec, it
>> is required. However, in both versions, if a `summary` field is present, it
>> must include an `operation` key. Any `summary` field lacking an `operation`
>> key should be considered invalid.
>>
>> I’ve addressed this issue in PR 11354 [1] by adding this constraint when
>> parsing JSON.
>>
>> > We initially did not have the snapshot summary or operation. When I
>> added the summary, the operation was intended to be required in cases where
>> the summary is present. It should always be there if the summary is and the
>> summary should always be there unless you wrote the metadata.json file
>> way back in 2017 or 2018.
>>
>> @Ryan, does this constraint also apply to `metadata.json` files from
>> 2017/2018? Was it ever valid to have a `summary` field without the
>> `operation` key?
>>
>> > Well, the spec says nothing about a top-level `operation` field in JSON
>> [1]. Yet the Java implementation produces it [2] and removes the operation
>> from the summary map. This seems inconsistent?
>>
>> @Anton, the Java `Snapshot` object includes both the `summary` and
>> `operation` fields. When serializing to JSON, the `operation` field is
>> included in the `summary` map [2], rather than as a top-level field. During
>> deserialization from JSON, the `operation` field is extracted from the
>> `summary` map [3].
>>
>> I believe this is consistent with the table spec, which defines the JSON
>> output, not how the `Snapshot` object is implemented in Java.
>>
>> On REST spec and Table spec
>>
>> Thanks, Yufei, for highlighting the difference between the REST spec and
>> the table spec. I mistakenly used the REST spec
>> (`rest-catalog-open-api.yaml` [4]) as the source of truth for V2 tables.
>>
>> Looking at the REST spec file, it can be challenging to determine how a
>> REST server should handle V1 versus V2 tables. Even for V2 tables, the
>> current version of the file combines features from V2, along with
>> additional changes made in preparation for the upcoming V3 spec.
>>
>> Would it be helpful to create alternative versions of the REST spec
>> specifically for referencing V1 an

Re: [Discuss] Replace Hadoop Catalog Examples with JDBC Catalog in Documentation

2024-10-16 Thread Kevin Liu

Hey folks,

Thanks for the discussions.

It seems everyone is in favor of replacing the Hadoop catalog example, and
the question now is whether to replace it with the JDBC catalog or the REST
catalog.

I originally proposed the JDBC catalog as a replacement primarily due to
its ease of use. Users can quickly set up a JDBC catalog backed by an
in-memory or file-based datastore without needing additional
infrastructure. It also aligns with the quick-start ethos of "it just
works." That said, I agree that an example of setting up the REST catalog
should be part of the getting-started guide since it’s the catalog the
community has aligned on.

Here's what I propose as a middle ground:

   1. We replace the Hadoop catalog example with a JDBC catalog backed by
   an in-memory datastore. This allows users to get started without needing
   additional infrastructure, which was one of the main benefits of the Hadoop
   catalog.
   2. We add a new section describing the REST catalog, its benefits, and
   how to set one up. We can use the REST catalog adapter [1], with the
   adapter using the JDBC catalog as its internal catalog.

This approach gives users a way to quickly prototype while also guiding
them toward the REST catalog for production use cases.
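
To make item 1 a bit more concrete, here is a small sketch of the Spark settings involved, written as a Python helper that renders `--conf` flags. The catalog name `local`, the SQLite path, and the warehouse path are placeholder assumptions, it uses a file-based SQLite URI rather than an in-memory one, and it assumes a SQLite JDBC driver is on the classpath:

```python
def jdbc_catalog_conf(name: str, uri: str, warehouse: str) -> dict[str, str]:
    """Build Spark properties for an Iceberg catalog of type `jdbc`."""
    prefix = f"spark.sql.catalog.{name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "jdbc",
        f"{prefix}.uri": uri,              # placeholder JDBC URI
        f"{prefix}.warehouse": warehouse,  # placeholder warehouse location
    }


def to_cli_flags(conf: dict[str, str]) -> list[str]:
    """Render the properties as `--conf key=value` flags for spark-shell or spark-sql."""
    return [f"--conf {key}={value}" for key, value in conf.items()]


conf = jdbc_catalog_conf("local", "jdbc:sqlite:/tmp/iceberg_catalog.db", "/tmp/warehouse")
for flag in to_cli_flags(conf):
    print(flag)
```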

Looking forward to hearing more from you all.

Best,

Kevin Liu

[1] https://lists.apache.org/thread/xl1cwq7vmnh6zgfd2vck2nq7dfd33ncq

On Thu, Oct 10, 2024 at 3:44 AM Eduard Tudenhöfner 
wrote:

> I would prefer to advocate for the REST catalog in those examples/docs
> (similar to how the Spark quickstart example
> <https://iceberg.apache.org/spark-quickstart/> uses the REST catalog).
> The docs could then refer to the quickstart example to indicate what's
> required in terms of services to be started before a user can spawn a spark
> shell.
>
> On Thu, Oct 10, 2024 at 12:15 PM Jean-Baptiste Onofré 
> wrote:
>
>> Hi
>>
>> As we are talking about "documentation" (quick start/readme), I would
>> rather propose to use the REST catalog here instead of JDBC.
>>
>> As it's the catalog we "promote", I think it would be valuable for
>> users to start with the "right thing".
>>
>> JDBC Catalog is interesting for a quick test/getting-started guide, but we
>> know how it goes: it will be heavily used (see what happened with the
>> HadoopCatalog used in production whereas it should not :) ).
>>
>> Regards
>> JB
>>
>> On Tue, Oct 8, 2024 at 12:18 PM Kevin Liu  wrote:
>> >
>> > Hi all,
>> >
>> > I wanted to bring up a suggestion regarding our current documentation.
>> The existing examples for Iceberg often use the Hadoop catalog, as seen in:
>> >
>> > Adding a Catalog - Spark Quickstart [1]
>> > Adding Catalogs - Spark Getting Started [2]
>> >
>> > Since we generally advise against using Hadoop catalogs in production
>> environments, I believe it would be beneficial to replace these examples
>> with ones that use the JDBC catalog. The JDBC catalog, configured with a
>> local SQLite database file, offers similar convenience but aligns better
>> with production best practices.
>> >
>> > I've created an issue [3] and a PR [4] to address this. Please take a
>> look, and I'd love to hear your thoughts on whether this is a direction we
>> want to pursue.
>> >
>> > Best,
>> > Kevin Liu
>> >
>> > [1] https://iceberg.apache.org/spark-quickstart/#adding-a-catalog
>> > [2]
>> https://iceberg.apache.org/docs/nightly/spark-getting-started/#adding-catalogs
>> > [3] https://github.com/apache/iceberg/issues/11284
>> > [4] https://github.com/apache/iceberg/pull/11285
>> >
>>
>

Re: [DISCUSS] [PyIceberg] Use of asserts to "programming the negative space"

2024-10-16 Thread Kevin Liu

Thanks for starting this discussion! I think the defensive programming
approach is useful to maintain assumptions, especially in some
public-facing APIs. Here is an example I recently encountered [1]: we
currently disallow using the `add_files` API for Parquet files with
field IDs. However, I'm not sure where to draw the line between using
assert/pre-conditions and using if/raise clauses.
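
To illustrate the two styles side by side, here is a minimal sketch. The function names are hypothetical and this is not the PyIceberg code linked in [1]; `check_argument` is only loosely modeled on the Java `checkArgument` mentioned in this thread:

```python
def check_argument(condition: bool, message: str) -> None:
    """Always-on precondition check: raises regardless of interpreter flags."""
    if not condition:
        raise ValueError(message)


def add_with_assert(field_ids: list[int]) -> str:
    # `assert` statements are removed when Python runs with -O,
    # so this check can silently disappear in production.
    assert not field_ids, "files with field IDs are not supported"
    return "added"


def add_with_raise(field_ids: list[int]) -> str:
    # The if/raise inside check_argument runs with or without -O.
    check_argument(not field_ids, "files with field IDs are not supported")
    return "added"
```

Both functions reject the same input under a default interpreter; only the second one keeps doing so under `python -O`.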

Best,
Kevin Liu

[1]
https://github.com/apache/iceberg-python/blob/7cf0c225c3cdb32ac5e390de06b7b0e4fe7de92e/pyiceberg/io/pyarrow.py#L2515-L2518


On Wed, Oct 16, 2024 at 12:02 AM Piotr Findeisen 
wrote:

> Hi Andre,
>
> My Python skills aren't up to date, so I will abstain from recommending a
> particular solution.
> Writing a precondition module sounds like a fun task, but perhaps we could
> research alternatives first.
> For example quick google search brought me to
> https://pypi.org/project/preconditions/
> https://pypi.org/project/guava-preconditions/
>
> quick gpt chat brought me to
> https://pypi.org/project/PyContracts/
> https://docs.pydantic.dev/latest
> https://docs.python-cerberus.org/
> https://www.attrs.org/en/stable/
>
> These are not my recommendations to use. These are only my recommendations
> for deeper research if we are about to roll something on our own. It sounds
> unlikely that such a fundamental need is not addressed in Python ecosystem.
>
> Best
> Piotr
>
>
>
>
> On Tue, 15 Oct 2024 at 01:53, André Luis Anastácio
>  wrote:
>
>> Thank you Piotr Findeisen and Sung Yun, for your insights.
>>
>> I did a quick search and didn’t find anything more "pythonic." We could
>> just use an if statement with raise, but I have some mixed feelings about
>> that.
>>
>> Maybe we could create a precondition module with a function or a
>> decorator. What do you think?
>>
>> Preconditions in the Java Iceberg implementation:
>> https://github.com/search?q=repo%3Aapache%2Ficeberg%20Preconditions&type=code
>>
>> Best regards,
>> André Anastácio
>>
>> On Saturday, October 12th, 2024 at 5:40 PM, Sung Yun 
>> wrote:
>>
>> Hi André,
>>
>> Thank you for starting off this discussion! This is a fun topic, so I’m
>> keen on seeing what the rest of the folks in the PyIceberg community think
>> as well :)
>>
>> I’m of the opinion that ‘assert’ should only be used within test suites,
>> because setting the optimize flag (-O) in the Python interpreter can
>> disable asserts. And I agree with Piotr that having two separate flows
>> available within Production code that can be triggered with the flag will
>> make it more difficult for the community to debug specific behaviors.
>>
>> For this reason, the Ruff linter also checks for the usage of assert
>> statements in S101:
>> https://docs.astral.sh/ruff/rules/#flake8-bandit-s
>>
>> Sung
>>
>> On Sat, Oct 12, 2024 at 2:40 PM Piotr Findeisen <
>> piotr.findei...@gmail.com> wrote:
>>
>>> Hi Andre,
>>>
>>> I am not very familiar with PyIceberg, but I am always for ensuring that
>>> assumptions in our code are validated.
>>>
>>> I am not quite sure that assert is the way to go though.
>>> In Java, one typically does not use `assert`, which can be enabled or
>>> disabled.
>>> checkState / checkArgument are preferred, because they are always on.
>>> There are two important reasons:
>>> 1. Validating assumptions in test (non-production) code is usually
>>> useless. The test outcomes mostly already provide the necessary coverage.
>>> It's much more useful to validate assumptions always.
>>> 2. Having asserts that are disabled or enabled means there are two flows
>>> of the code (an assert expression can have side effects!). How do you test
>>> for that? Having one flow is a simplification.
>>> This is reasoning for a Java codebase, but I don't think the arguments are
>>> language-specific, so I believe the same can be argued about Python code too.
>>>
>>> Best
>>> Piotr
>>>
>>>
>>>
>>>
>>>
>>> On Thu, 10 Oct 2024 at 05:26, André Luis Anastácio
>>>  wrote:
>>>
>>>> Hello Everyone,
>>>>
>>>> I would like to open a discussion about using "assert" in some
>>>> functions to promote a more defensive programming approach, ensuring that
>>>> certain assumptions in our code are always validated.
>>>>
>>>> The intention here is to propose a recommendation, not a strict rule.
>>>> What are your thoughts on this?
>>>>
>>>> In the Java implementation repository, we have some code that follows
>>>> this approach in Scala code [1]. I'm not very familiar with Scala, so I’m
>>>> not sure if this is a common pattern, but I believe we could improve the
>>>> quality of our Python code by adopting a similar approach.
>>>>
>>>> You can find a reference discussing this approach here
>>>> https://ratfactor.com/cards/tiger-style
>>>>
>>>> [1]
>>>> https://github.com/search?q=repo%3Aapache%2Ficeberg+assert++language%3AScala&type=code&l=Scala
>>>>
>>>> Best regards,
>>>>
>>>> André Anastácio
>>>>
>>>
>>

[DISCUSS] Discrepancy Between Iceberg Spec and Java Implementation for Snapshot summary's 'operation' key

2024-10-16 Thread Kevin Liu

Hey folks,

I’ve noticed a discrepancy between the Iceberg specification and the Java
implementation regarding the `operation` key in the `Snapshot` `summary`
field.

The `Snapshot` object's `summary` dictionary includes a *required* key
named `operation`, as outlined in the spec describing Table Metadata and
Snapshots [1] and the generated OpenAPI YAML [2]. However, in the Java
implementation [3], `operation` is treated as optional. In contrast, it
remains a required field in the Python implementation [4].
I also found that Java tests for `SnapshotParser` assert that the
`operation` field is null. [5]

Due to this discrepancy, a user reported [6] that the `metadata.json` file
generated for an Iceberg table could not be read by PyIceberg, though it is
readable using the Iceberg Java library.

How should we proceed from here? Should the Java library enforce this
requirement? Additionally, how should we handle existing `metadata.json`
files that were generated without this field?

Best,
Kevin Liu

[1] https://iceberg.apache.org/spec/#table-metadata-and-snapshots
[2]
https://github.com/apache/iceberg/blob/8e2eb9ac2e33ce4bac8956d4e2f099444d03c0e3/open-api/rest-catalog-open-api.yaml#L2057-L2060
[3]
https://github.com/apache/iceberg/blob/64b36999d7ff716ae2534fb0972fcc10d22a64c2/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L124
[4]
https://github.com/apache/iceberg-python/blob/7cf0c225c3cdb32ac5e390de06b7b0e4fe7de92e/pyiceberg/table/snapshots.py#L182
[5]
https://github.com/apache/iceberg/blob/22a6b19c2e226eacc0aa78c1f2ffbdbb168b13be/core/src/test/java/org/apache/iceberg/TestSnapshotJson.java#L52
[6] https://github.com/apache/iceberg-python/issues/1106

Re: [DISCUSS] Discrepancy Between Iceberg Spec and Java Implementation for Snapshot summary's 'operation' key

2024-10-27 Thread Kevin Liu

Hi Ryan,

I've created a revert PR [1]. I agree that we should take a more permissive
approach when reading a table, allowing for reading non-compliant table
metadata, especially for an opportunity to "fix" the metadata. However, I
think we still need a way to enforce the table specification to ensure that
other operations interact with a compliant table.

Perhaps we could permit `SnapshotParser` to read the metadata but enforce
the `operation` field at a different location to guarantee the table’s
compliance with the specification.
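
A rough sketch of that split, in Python rather than Java and with hypothetical names (`Snapshot`, `parse_snapshot`, `spec_violations`); this is not how `SnapshotParser` is implemented, only an illustration of reading permissively and checking compliance in a separate step:

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snapshot:
    snapshot_id: int
    summary: Optional[dict] = None


def parse_snapshot(raw: str) -> Snapshot:
    """Permissive read: accept a summary even when `operation` is missing."""
    data = json.loads(raw)
    return Snapshot(snapshot_id=data["snapshot-id"], summary=data.get("summary"))


def spec_violations(snapshot: Snapshot) -> list[str]:
    """Separate compliance check, run before operations that need a compliant table."""
    problems = []
    if snapshot.summary is not None and "operation" not in snapshot.summary:
        problems.append(f"snapshot {snapshot.snapshot_id}: summary is missing 'operation'")
    return problems
```

With this shape, a reader can still load non-compliant metadata (for example, to repair it), while writers can refuse to proceed when `spec_violations` returns anything.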

Best,
Kevin Liu

[1] https://github.com/apache/iceberg/pull/11409

On Sat, Oct 26, 2024 at 11:05 AM rdb...@gmail.com  wrote:

> I see it's been merged, but I don't think it is a good idea to enforce
> this. The spec can and should require the `operation` but we want to be
> careful about creating situations where bad metadata can needlessly break a
> table. I would be much more permissive here, which is why this probably
> wasn't enforced in the first place.
>
> On Fri, Oct 25, 2024 at 2:36 PM Kevin Liu  wrote:
>
>> Thanks, everyone! The PR[1] has been merged
>>
>> Best,
>> Kevin Liu
>>
>> [1] https://github.com/apache/iceberg/pull/11354
>>
>>
>> On Fri, Oct 25, 2024 at 1:02 PM Kevin Liu  wrote:
>>
>>> Thanks, Ryan! That makes sense.
>>>
>>> I want to follow up on the original issue. I've made a PR [1] to enforce
>>> that the Snapshot `summary` map must have an `operation` key. Please take a
>>> look. Thank you @nastra for the comments and reviews.
>>>
>>> Best,
>>> Kevin Liu
>>>
>>> [1] https://github.com/apache/iceberg/pull/11354
>>>
>>>
>>>
>>> On Tue, Oct 22, 2024 at 4:06 PM rdb...@gmail.com 
>>> wrote:
>>>
>>>> > For example, the `Snapshot` `summary` field is optional in V1 but
>>>> required in V2. Therefore, the REST spec definition should mark the
>>>> `summary` field as optional to support both versions.
>>>>
>>>> Yeah, this is technically true. But as I said in my first email, unless
>>>> you have tables that are 5 years old, it's unlikely that this is going to
>>>> be a problem. A failure here is more likely with newer implementations that
>>>> have a bug. So I'd argue there's value in leaving it as required.
>>>>
>>>> On Mon, Oct 21, 2024 at 9:41 AM Kevin Liu 
>>>> wrote:
>>>>
>>>>> > No. They were introduced at the same time.
>>>>> Great! Since the `summary` field and the `operation` key were
>>>>> introduced together, we should enforce the rule that the `summary`
>>>>> field must always have an accompanying `operation` key. This has been
>>>>> addressed in PR 11354 [1].
>>>>>
>>>>> > I am strongly against this. The REST spec should be independent of
>>>>> the table versions.
>>>>> That makes sense. For the REST spec to support both V1 and V2 tables,
>>>>> it should "accept" the least common denominator between the two versions.
>>>>> For example, the `Snapshot` `summary` field is optional in V1 but required
>>>>> in V2. Therefore, the REST spec definition should mark the `summary` field
>>>>> as optional to support both versions. However, the current REST spec leans
>>>>> towards the V2 table spec; fields that are optional in V1 and required in
>>>>> V2 are marked as required in the spec, such as `TableMetadata.table-uuid`
>>>>> [2][3] and `Snapshot.summary` [4][5].
>>>>>
>>>>> Would love to get other people's thoughts on this.
>>>>>
>>>>> Best,
>>>>> Kevin Liu
>>>>>
>>>>> [1] https://github.com/apache/iceberg/pull/11354
>>>>> [2]
>>>>> https://github.com/apache/iceberg/blob/8e743a5b5209569f84b6bace36e1106c67e1eab3/open-api/rest-catalog-open-api.yaml#L2414
>>>>> [3] https://iceberg.apache.org/spec/#table-metadata-fields
>>>>> [4]
>>>>> https://github.com/apache/iceberg/blob/8e743a5b5209569f84b6bace36e1106c67e1eab3/open-api/rest-catalog-open-api.yaml#L2325
>>>>> [5] https://iceberg.apache.org/spec/#snapshots
>>>>>
>>>>> On Sun, Oct 20, 2024 at 11:24 AM rdb...@gmail.com 
>>>>> wrote:
>>>>>
>>>>>> Was it ever valid to have a summary field without the operation key?
>>>>>>
>>>>>> No. They

Re: [VOTE][Go] Release Apache Iceberg Go v0.1.0 RC0

2024-11-11 Thread Kevin Liu

Hi Matt,

Thanks for the release candidate! +1 (non-binding). I was able to download,
verify checksums and signatures, and run the unit tests successfully after
making a few changes locally.

I tried to follow the verification steps outlined in
https://github.com/apache/iceberg-go/blob/main/dev/release/README.md#verify
and ran into a couple of issues.

On the `main` branch, I ran `dev/release/verify_rc.sh 0.1.0 0`. The script
failed with
```
+ fetch_archive
+ download_rc_file apache-iceberg-go-0.1.0.tar.gz
+ '[' 1 -gt 0 ']'
+ download
https://github.com/apache/iceberg-go/releases/download/v0.1.0-rc0/apache-iceberg-go-0.1.0.tar.gz
+ curl --fail --location --remote-name --show-error --silent
https://github.com/apache/iceberg-go/releases/download/v0.1.0-rc0/apache-iceberg-go-0.1.0.tar.gz
curl: (22) The requested URL returned error: 404
```
I think the issue is with this line:
https://github.com/apache/iceberg-go/blob/adc8193de3299b04c9763c2fba529a7b94d080ce/dev/release/verify_rc.sh#L102
which expects the file name to be in the form of
`apache-iceberg-go-${VERSION}`
(https://github.com/apache/iceberg-go/releases/download/v0.1.0-rc0/apache-iceberg-go-0.1.0.tar.gz).
However, the actual file produced on GitHub is in the form of
`apache-iceberg-go-0.1.0-rc0.tar.gz`; notice the extra `rc0`. See the
assets at https://github.com/apache/iceberg-go/releases/v0.1.0-rc0
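
The mismatch can be reproduced with a small sketch that builds both URLs. This is Python, not the actual `verify_rc.sh` logic, and `archive_url` is a hypothetical helper:

```python
def archive_url(version: str, rc: int, include_rc_suffix: bool) -> str:
    """Build the GitHub release download URL for the source archive."""
    tag = f"v{version}-rc{rc}"
    base_name = f"apache-iceberg-go-{version}"
    if include_rc_suffix:
        # Matches the asset name actually published for this RC.
        base_name += f"-rc{rc}"
    return (
        "https://github.com/apache/iceberg-go/releases/download/"
        f"{tag}/{base_name}.tar.gz"
    )


# URL the script requested (404) versus the published asset name.
print(archive_url("0.1.0", 0, include_rc_suffix=False))
print(archive_url("0.1.0", 0, include_rc_suffix=True))
```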

After making a change locally,
```
ARCHIVE_BASE_NAME="apache-iceberg-go-${VERSION}-rc${RC}"
```
I was able to download the artifacts. Running `dev/release/verify_rc.sh
0.1.0 0` again, I got this error
```
gpg: Signature made Mon Nov 11 07:58:21 2024 PST
gpg:                using RSA key 74EE211E32BF1DF9D984FA394B86A1E5E59C8B81
gpg: Can't check signature: No public key
```
It looks like that key is only in
https://dist.apache.org/repos/dist/release/iceberg/KEYS but not in
https://dist.apache.org/repos/dist/dev/iceberg/KEYS which the script uses.

After making the change locally,
```
ICEBERG_DIST_BASE_URL="https://dist.apache.org/repos/dist/release/iceberg"
```
I was able to run `dev/release/verify_rc.sh 0.1.0 0` successfully.

```
+ VERIFY_SUCCESS=yes
+ echo 'RC looks good!'
RC looks good!
```

Should we make the necessary changes in `verify_rc.sh` and also upload the
KEYS to https://dist.apache.org/repos/dist/dev/iceberg/KEYS?

Best,
Kevin Liu

On Mon, Nov 11, 2024 at 2:12 PM Matt Topol  wrote:

> Hi,
>
> I would like to propose the following release candidate (RC0) of Apache
> Iceberg Go version v0.1.0.
>
> This release candidate is based on
> commit: adc8193de3299b04c9763c2fba529a7b94d080ce [1]
>
> The source release rc0 is hosted at [2].
>
> Please download, verify checksums and signatures, run the unit tests, and
> vote on the release. See [3] for how to validate a release candidate.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Release this as Apache Iceberg Go v0.1.0
> [ ] +0
> [ ] -1 Do not release this as Apache Iceberg Go v0.1.0 because...
>
> Thanks!
> --Matt
>
> [1]:
> https://github.com/apache/iceberg-go/tree/adc8193de3299b04c9763c2fba529a7b94d080ce
> [2]: https://github.com/apache/iceberg-go/releases/v0.1.0-rc0
> [3]:
> https://github.com/apache/iceberg-go/blob/main/dev/release/README.md#verify
>

Re: [VOTE][Go] Release Apache Iceberg Go v0.1.0 RC0

2024-11-11 Thread Kevin Liu

BTW for folks verifying this RC, these are the changes I made locally for
the `dev/release/verify_rc.sh` script to work.
https://github.com/apache/iceberg-go/pull/199/files

Best,
Kevin Liu

On Mon, Nov 11, 2024 at 3:03 PM Kevin Liu  wrote:

> Hi Matt,
>
> Thanks for the release candidate! +1 (non-binding). I was able to download,
> verify checksums and signatures, and run the unit tests successfully after
> making a few changes locally.
>
>
> I tried to follow the verification steps outlined in
> https://github.com/apache/iceberg-go/blob/main/dev/release/README.md#verify
> and ran into a couple of issues.
>
> On the `main` branch, I ran `dev/release/verify_rc.sh 0.1.0 0`. The script
> failed with
> ```
> + fetch_archive
> + download_rc_file apache-iceberg-go-0.1.0.tar.gz
> + '[' 1 -gt 0 ']'
> + download
> https://github.com/apache/iceberg-go/releases/download/v0.1.0-rc0/apache-iceberg-go-0.1.0.tar.gz
> + curl --fail --location --remote-name --show-error --silent
> https://github.com/apache/iceberg-go/releases/download/v0.1.0-rc0/apache-iceberg-go-0.1.0.tar.gz
> curl: (22) The requested URL returned error: 404
> ```
> I think the issue is with this line:
> https://github.com/apache/iceberg-go/blob/adc8193de3299b04c9763c2fba529a7b94d080ce/dev/release/verify_rc.sh#L102
> which expects the file name to be in the form of
> `apache-iceberg-go-${VERSION}`
> (https://github.com/apache/iceberg-go/releases/download/v0.1.0-rc0/apache-iceberg-go-0.1.0.tar.gz).
> However, the actual file produced on GitHub is in the form of
> `apache-iceberg-go-0.1.0-rc0.tar.gz`; notice the extra `rc0`. See the
> assets at https://github.com/apache/iceberg-go/releases/v0.1.0-rc0
>
> After making a change locally,
> ```
> ARCHIVE_BASE_NAME="apache-iceberg-go-${VERSION}-rc${RC}"
> ```
> I was able to download the artifacts. Running `dev/release/verify_rc.sh
> 0.1.0 0` again, I got this error
> ```
> gpg: Signature made Mon Nov 11 07:58:21 2024 PST
> gpg:                using RSA key 74EE211E32BF1DF9D984FA394B86A1E5E59C8B81
> gpg: Can't check signature: No public key
> ```
> It looks like that KEY is only in
> https://dist.apache.org/repos/dist/release/iceberg/KEYS but not in
> https://dist.apache.org/repos/dist/dev/iceberg/KEYS which the script
> uses.
>
> After making the change locally,
> ```
> ICEBERG_DIST_BASE_URL="https://dist.apache.org/repos/dist/release/iceberg"
> ```
> I was able to run `dev/release/verify_rc.sh 0.1.0 0` successfully.
>
> ```
> + VERIFY_SUCCESS=yes
> + echo 'RC looks good!'
> RC looks good!
> ```
>
> Should we make the necessary changes in `verify_rc.sh` and also upload the
> KEYS to https://dist.apache.org/repos/dist/dev/iceberg/KEYS?
>
> Best,
> Kevin Liu
>
>
> On Mon, Nov 11, 2024 at 2:12 PM Matt Topol  wrote:
>
>> Hi,
>>
>> I would like to propose the following release candidate (RC0) of Apache
>> Iceberg Go version v0.1.0.
>>
>> This release candidate is based on
>> commit: adc8193de3299b04c9763c2fba529a7b94d080ce [1]
>>
>> The source release rc0 is hosted at [2].
>>
>> Please download, verify checksums and signatures, run the unit tests, and
>> vote on the release. See [3] for how to validate a release candidate.
>>
>> The vote will be open for at least 72 hours.
>>
>> [ ] +1 Release this as Apache Iceberg Go v0.1.0
>> [ ] +0
>> [ ] -1 Do not release this as Apache Iceberg Go v0.1.0 because...
>>
>> Thanks!
>> --Matt
>>
>> [1]:
>> https://github.com/apache/iceberg-go/tree/adc8193de3299b04c9763c2fba529a7b94d080ce
>> [2]: https://github.com/apache/iceberg-go/releases/v0.1.0-rc0
>> [3]:
>> https://github.com/apache/iceberg-go/blob/main/dev/release/README.md#verify
>>
>

Re: Greater Seattle Iceberg Meetup

2024-11-11 Thread Kevin Liu

Bumping this thread one last time.
Looking forward to seeing everyone this Wednesday! If you have not done so,
RSVP at https://lu.ma/kxi04g2m

We have 2 awesome presentations:
1. Jeremy Song (Principal Engineer, AWS Glue) will discuss Glue's Iceberg
table optimizations—covering topics like compaction, snapshot expiration,
and orphan file deletion. He’ll share recent improvements to the OSS
Iceberg maintenance jobs and highlight areas for future collaboration with
the community.
2. I will be speaking about PyIceberg! Python + Iceberg: Developer's Guide
to PyIceberg. The talk will cover the overview of the `iceberg-python`
project and explore some powerful features that can be integrated into your
current data stack.

Talks will be recorded and uploaded to YouTube (
https://www.youtube.com/@IcebergMeetup)

Best,
Kevin Liu

On Fri, Oct 18, 2024 at 4:45 PM Kevin Liu  wrote:

> Thanks, Jonathan. Looking forward to seeing everyone!
>
> Please let us know if you would like to present by filling out this form
>
> https://docs.google.com/forms/d/1vic-6nUYbUTsf_WmyQ0kB_7BrSdpT97Rfh2cL3cs2kw/edit
>
> Best,
> Kevin Liu
>
> On Fri, Oct 18, 2024 at 4:16 PM Jonathan Leang 
> wrote:
>
>> Hi everyone,
>>
>> We're continuing to do community meetups in the Seattle area! Details
>> are below:
>>
>> Connect with fellow enthusiasts, share insights, and dive into the latest
>> developments in the Apache Iceberg ecosystem! Whether you're a seasoned pro
>> or new to Apache Iceberg, this meetup is the perfect place to exchange
>> ideas and spark innovation.
>>
>> We will be hosting the event on November 13th from 5:00 PM to 8:30 PM in
>> Bellevue. Please RSVP using this link: <https://lu.ma/kxi04g2m>
>>
>> In this meetup we are looking to host a couple talks, so if you're
>> working on something or want to share an idea please respond to the call
>> for talks in the registration list above!
>>
>> See you there!
>> Jonathan Leang
>>
>

Re: [DISCUSS][Go] First release of iceberg-go

2024-11-08 Thread Kevin Liu

Oh this looks great! Very well documented.

I just went through the release process for PyIceberg. I should have the
proper permissions (the KEYS took me a while to set up).
Happy to help run these commands :)

Best,
Kevin Liu

On Fri, Nov 8, 2024 at 9:46 AM Matt Topol  wrote:

> Thanks guys!
>
> @Kevin: the release process is already all documented at
> https://github.com/apache/iceberg-go/tree/main/dev/release :)
>
> --Matt
>
> On Fri, Nov 8, 2024, 6:33 PM Kevin Liu  wrote:
>
>> Hi Matt,
>>
>> Happy to be a reviewer too. I don't know enough about the Go ecosystem to
>> be a release manager. I hope we can document the process for others in the
>> future.
>>
>> Best,
>> Kevin Liu
>>
>> On Fri, Nov 8, 2024 at 6:47 AM Jean-Baptiste Onofré 
>> wrote:
>>
>>> Hi Matt
>>>
>>> It sounds good to me. I will be happy to review the first release :)
>>>
>>> Regards
>>> JB
>>>
>>> On Fri, Nov 8, 2024 at 2:14 PM Matt Topol 
>>> wrote:
>>> >
>>> > Hey all,
>>> >
>>> > With the merging of basic read support [1] among other features, I
>>> propose we've hit a minimum threshold that it makes sense to do a v0.1.0
>>> release of the Go implementation of Iceberg.
>>> >
>>> > Would anyone be opposed to this idea? Since I'm not a committer,
>>> someone else (likely Fokko or Eduard) would have to use the release scripts
>>> to create the RC.
>>> >
>>> > Hopefully we can get a consensus opinion on this and get the first
>>> official release of the iceberg-go library! :)
>>> >
>>> > Thanks everyone,
>>> > --Matt
>>> >
>>> >
>>> >
>>> > [1]:
>>> https://github.com/apache/iceberg-go/commit/ac5c84d0a6e79ad66978e3e661eedb6f49edffda
>>>
>>

Re: [DISCUSS][Go] First release of iceberg-go

2024-11-08 Thread Kevin Liu

Hi Matt,

Happy to be a reviewer too. I don't know enough about the Go ecosystem to
be a release manager. I hope we can document the process for others in the
future.

Best,
Kevin Liu

On Fri, Nov 8, 2024 at 6:47 AM Jean-Baptiste Onofré  wrote:

> Hi Matt
>
> It sounds good to me. I will be happy to review the first release :)
>
> Regards
> JB
>
> On Fri, Nov 8, 2024 at 2:14 PM Matt Topol  wrote:
> >
> > Hey all,
> >
> > With the merging of basic read support [1] among other features, I
> propose we've hit a minimum threshold that it makes sense to do a v0.1.0
> release of the Go implementation of Iceberg.
> >
> > Would anyone be opposed to this idea? Since I'm not a committer, someone
> else (likely Fokko or Eduard) would have to use the release scripts to
> create the RC.
> >
> > Hopefully we can get a consensus opinion on this and get the first
> official release of the iceberg-go library! :)
> >
> > Thanks everyone,
> > --Matt
> >
> >
> >
> > [1]:
> https://github.com/apache/iceberg-go/commit/ac5c84d0a6e79ad66978e3e661eedb6f49edffda
>

Re: [DISCUSS] Add a implementation status page for iceberg

2024-11-08 Thread Kevin Liu

Hi Renjie,

I absolutely love this idea! I wanted to do something similar while working
on the PyIceberg roadmap. It would make sense to include this for all
projects.

Some benefits that I see:
* Easily and quickly answer questions for "Is X supported in <> project?"
* Roadmap for feature parity
* Keep all the libraries in sync in terms of core features
* As we add features to one library, we can track support in other
libraries as well

+1 to having it as a link in https://iceberg.apache.org/. I like that Arrow
uses https://arrow.apache.org/docs/status.html

Looking forward to making this happen! And happy to help in any way.

Best,
Kevin Liu

On Fri, Nov 8, 2024 at 6:33 AM Russell Spitzer 
wrote:

> Sounds like a great idea to me
>
> On Fri, Nov 8, 2024 at 7:58 AM Renjie Liu  wrote:
>
>> Hi:
>>
>> As iceberg evolved to a multi-lang project, I would like to propose to
>> maintain a status page for iceberg. For more details, please refer to this
>> doc
>> <https://docs.google.com/document/d/1sRsTatGQJJNiBiQZNUW4VwQDCV1e75BHM6cSPla4vBU/edit?usp=sharing>.
>> Welcome to join the discussion and comment on it!
>>
>>
>>

Re: [DISCUSS] Duplicate KEYS files

2024-11-11 Thread Kevin Liu

+1 (non-binding)
Here are some places in the code base we would need to update.
https://grep.app/search?q=dist.apache.org/repos/dist/.%2A/iceberg/KEYS&regexp=true
I also double-checked against a broader "/KEYS" search, and it looks like
the query above captures all the necessary changes:
https://grep.app/search?q=/KEYS&case=true&filter[repo.pattern][0]=iceberg
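
For anyone who wants to repeat this check on a local checkout, here is a rough sketch of the same search in Python. The function name and the sample text are made up for illustration, and the pattern is a slightly tighter variant of the grep.app query (one path segment instead of `.*`), so treat it as an approximation of the search rather than the exact one.

```python
import re

# Same idea as the grep.app query, with the dots escaped and the wildcard
# narrowed to a single path segment such as "dev" or "release".
KEYS_URL = re.compile(r"dist\.apache\.org/repos/dist/[^/\s]+/iceberg/KEYS")


def find_keys_references(text: str) -> list[tuple[int, str]]:
    """Return (line_number, matched_fragment) for every KEYS reference."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in KEYS_URL.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


# Hypothetical documentation snippet, only here to exercise the function.
sample = """\
curl https://dist.apache.org/repos/dist/dev/iceberg/KEYS -o KEYS
curl https://downloads.apache.org/iceberg/KEYS -o KEYS
curl https://dist.apache.org/repos/dist/release/iceberg/KEYS -o KEYS
"""

for lineno, fragment in find_keys_references(sample):
    print(lineno, fragment)
```

Only the first and third sample lines are reported; the downloads.apache.org link is the one we want to keep, so it is deliberately not matched.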

Best,
Kevin Liu

On Mon, Nov 11, 2024 at 9:45 AM Matt Topol  wrote:

> +1 (non-binding) for merging, I can update the docs on the iceberg Go
> release README after it's done!
>
> On Mon, Nov 11, 2024, 12:20 PM Yufei Gu  wrote:
>
>> +1 merging sounds good. It should still work for previous releases.
>>
>> Yufei
>>
>>
>> On Mon, Nov 11, 2024 at 7:46 AM Xuanwo  wrote:
>>
>>> Hi
>>>
>>> Thank you, Fokko, for proposing this. Here is my +1, non-binding.
>>>
>>> I'd also like to mention that as part of the ASF release policy, we must
>>> refer to "https://downloads.apache.org/iceberg/KEYS" for KEYS; other
>>> links are not allowed.
>>>
>>> Ref: https://infra.apache.org/release-download-pages.html#links
>>>
>>> On Mon, Nov 11, 2024, at 23:32, Russell Spitzer wrote:
>>>
>>> Sounds good to me, although I guess it's really just up to the Rust and
>>> GO maintainers to converge
>>>
>>> On Mon, Nov 11, 2024 at 9:13 AM Fokko Driesprong 
>>> wrote:
>>>
>>> Hi everyone,
>>>
>>> While looking at the release steps for iceberg-go
>>> <https://github.com/apache/iceberg-go/tree/main/dev/release>, I noticed
>>> that we have two KEYS files:
>>>
>>>- https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>>>- https://dist.apache.org/repos/dist/release/iceberg/KEYS (Also
>>>available through https://downloads.apache.org/iceberg/KEYS)
>>>
>>> The first one is referenced by Java
>>> <https://iceberg.apache.org/how-to-release/#setup> and Python
>>> <https://py.iceberg.apache.org/verify-release/#verifying-signatures>,
>>> and the last one by Rust <https://rust.iceberg.apache.org/release.html>.
>>> As mentioned earlier, Go references them both. Should we consolidate these?
>>> My suggestion would be to merge the `/dev/` ones into the `release` ones,
>>> and get rid of the one in `dev`. Thoughts?
>>>
>>> Kind regards,
>>> Fokko
>>>
>>> Xuanwo
>>>
>>> https://xuanwo.io/
>>>
>>>

Re: [DISCUSS] Duplicate KEYS files

2024-11-11 Thread Kevin Liu

Does anyone know how to edit the KEYS file at
https://downloads.apache.org/iceberg/KEYS?
The previous instruction [1] uses `svn` which doesn't work with the above
URL.
```
svn co https://dist.apache.org/repos/dist/dev/iceberg icebergsvn # works
svn co https://downloads.apache.org/iceberg icebergsvn # fails
```

Best,
Kevin Liu

[1]
https://github.com/apache/iceberg-python/pull/1315/files#diff-4e59fab01a772c9c96d23c5ca8a2cce6e1500a71bdf32ad4cb8bec6319d0bdf0L88

On Mon, Nov 11, 2024 at 5:47 PM Renjie Liu  wrote:

> +1 (binding) for merging.
>
> On Tue, Nov 12, 2024 at 1:56 AM Kevin Liu  wrote:
>
>> +1 (non-binding)
>> Here are some places in the code base we would need to update.
>> https://grep.app/search?q=dist.apache.org/repos/dist/.%2A/iceberg/KEYS&regexp=true
>> I also double-checked against the "/KEYS" search, seems like we've
>> captured all the necessary changes above
>> https://grep.app/search?q=/KEYS&case=true&filter[repo.pattern][0]=iceberg
>>
>> Best,
>> Kevin Liu
>>
>> On Mon, Nov 11, 2024 at 9:45 AM Matt Topol 
>> wrote:
>>
>>> +1 (non-binding) for merging, I can update the docs on the iceberg Go
>>> release README after it's done!
>>>
>>> On Mon, Nov 11, 2024, 12:20 PM Yufei Gu  wrote:
>>>
>>>> +1 merging sounds good. It should still work for previous releases.
>>>>
>>>> Yufei
>>>>
>>>>
>>>> On Mon, Nov 11, 2024 at 7:46 AM Xuanwo  wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> Thank you, Fokko, for proposing this. Here is my +1, non-binding.
>>>>>
>>>>> I'd also like to mention that as part of the ASF release policy, we
>>>>> must refer to "https://downloads.apache.org/iceberg/KEYS" for KEYS;
>>>>> other links are not allowed.
>>>>>
>>>>> Ref: https://infra.apache.org/release-download-pages.html#links
>>>>>
>>>>> On Mon, Nov 11, 2024, at 23:32, Russell Spitzer wrote:
>>>>>
>>>>> Sounds good to me, although I guess it's really just up to the Rust
>>>>> and GO maintainers to converge
>>>>>
>>>>> On Mon, Nov 11, 2024 at 9:13 AM Fokko Driesprong 
>>>>> wrote:
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> While looking at the release steps for iceberg-go
>>>>> <https://github.com/apache/iceberg-go/tree/main/dev/release>, I
>>>>> noticed that we have two KEYS files:
>>>>>
>>>>>- https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>>>>>- https://dist.apache.org/repos/dist/release/iceberg/KEYS (Also
>>>>>available through https://downloads.apache.org/iceberg/KEYS)
>>>>>
>>>>> The first one is referenced by Java
>>>>> <https://iceberg.apache.org/how-to-release/#setup> and Python
>>>>> <https://py.iceberg.apache.org/verify-release/#verifying-signatures>,
>>>>> and the last one by Rust
>>>>> <https://rust.iceberg.apache.org/release.html>. As mentioned earlier,
>>>>> Go references them both. Should we consolidate these? My suggestion would
>>>>> be to merge the `/dev/` ones into the `release` ones, and get rid of the
>>>>> one in `dev`. Thoughts?
>>>>>
>>>>> Kind regards,
>>>>> Fokko
>>>>>
>>>>> Xuanwo
>>>>>
>>>>> https://xuanwo.io/
>>>>>
>>>>>

Re: [DISCUSS] Duplicate KEYS files

2024-11-11 Thread Kevin Liu

Oh yeah, thank you!
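
For the archives, the mapping Xuanwo describes can be written down as a tiny helper. This is only a sketch: `svn_url_for` is a name I made up, and it assumes every path under downloads.apache.org mirrors dist/release the same way the iceberg directory does.

```python
from urllib.parse import urlsplit

DOWNLOADS_HOST = "downloads.apache.org"
SVN_RELEASE_ROOT = "https://dist.apache.org/repos/dist/release"


def svn_url_for(download_url: str) -> str:
    """Map a downloads.apache.org URL to the dist/release SVN location."""
    parts = urlsplit(download_url)
    if parts.hostname != DOWNLOADS_HOST:
        raise ValueError(f"not a {DOWNLOADS_HOST} URL: {download_url}")
    # Keep the path as-is; only the host and the repository prefix change.
    return SVN_RELEASE_ROOT + parts.path.rstrip("/")


print(svn_url_for("https://downloads.apache.org/iceberg"))
print(svn_url_for("https://downloads.apache.org/iceberg/KEYS"))
```

The first printed URL is the one to hand to `svn co`.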

On Mon, Nov 11, 2024 at 6:20 PM Xuanwo  wrote:

> Hi, Kevin
>
> https://downloads.apache.org/iceberg points to
> https://dist.apache.org/repos/dist/release/iceberg
> <https://dist.apache.org/repos/dist/release/iceberg/KEYS> so we don't
> need to edit it by hand.
>
> On Tue, Nov 12, 2024, at 10:16, Kevin Liu wrote:
>
> Does anyone know how to edit the KEYS file at
> https://downloads.apache.org/iceberg/KEYS?
> The previous instruction [1] uses `svn` which doesn't work with the above
> URL.
> ```
> svn co https://dist.apache.org/repos/dist/dev/iceberg icebergsvn # works
> svn co https://downloads.apache.org/iceberg icebergsvn # fails
> ```
>
> Best,
> Kevin Liu
>
> [1]
> https://github.com/apache/iceberg-python/pull/1315/files#diff-4e59fab01a772c9c96d23c5ca8a2cce6e1500a71bdf32ad4cb8bec6319d0bdf0L88
>
> On Mon, Nov 11, 2024 at 5:47 PM Renjie Liu 
> wrote:
>
> +1 (binding) for merging.
>
> On Tue, Nov 12, 2024 at 1:56 AM Kevin Liu  wrote:
>
> +1 (non-binding)
> Here are some places in the code base we would need to update.
> https://grep.app/search?q=dist.apache.org/repos/dist/.%2A/iceberg/KEYS&regexp=true
> I also double-checked against the "/KEYS" search, seems like we've
> captured all the necessary changes above
> https://grep.app/search?q=/KEYS&case=true&filter[repo.pattern][0]=iceberg
>
> Best,
> Kevin Liu
>
> On Mon, Nov 11, 2024 at 9:45 AM Matt Topol  wrote:
>
> +1 (non-binding) for merging, I can update the docs on the iceberg Go
> release README after it's done!
>
> On Mon, Nov 11, 2024, 12:20 PM Yufei Gu  wrote:
>
> +1 merging sounds good. It should still work for previous releases.
>
> Yufei
>
>
> On Mon, Nov 11, 2024 at 7:46 AM Xuanwo  wrote:
>
>
> Hi
>
> Thank you, Fokko, for proposing this. Here is my +1, non-binding.
>
> I'd also like to mention that as part of the ASF release policy, we must
> refer to "https://downloads.apache.org/iceberg/KEYS" for KEYS; other
> links are not allowed.
>
> Ref: https://infra.apache.org/release-download-pages.html#links
>
> On Mon, Nov 11, 2024, at 23:32, Russell Spitzer wrote:
>
> Sounds good to me, although I guess it's really just up to the Rust and GO
> maintainers to converge
>
> On Mon, Nov 11, 2024 at 9:13 AM Fokko Driesprong  wrote:
>
> Hi everyone,
>
> While looking at the release steps for iceberg-go
> <https://github.com/apache/iceberg-go/tree/main/dev/release>, I noticed
> that we have two KEYS files:
>
>- https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>- https://dist.apache.org/repos/dist/release/iceberg/KEYS (Also
>available through https://downloads.apache.org/iceberg/KEYS)
>
> The first one is referenced by Java
> <https://iceberg.apache.org/how-to-release/#setup> and Python
> <https://py.iceberg.apache.org/verify-release/#verifying-signatures>, and
> the last one by Rust <https://rust.iceberg.apache.org/release.html>. As
> mentioned earlier, Go references them both. Should we consolidate these? My
> suggestion would be to merge the `/dev/` ones into the `release` ones, and
> get rid of the one in `dev`. Thoughts?
>
> Kind regards,
> Fokko
>
> Xuanwo
>
> https://xuanwo.io/
>
> Xuanwo
>
> https://xuanwo.io/
>
>

Re: [DISCUSS] Duplicate KEYS files

2024-11-12 Thread Kevin Liu

> https://downloads.apache.org/iceberg points to
> https://dist.apache.org/repos/dist/release/iceberg so we don't need to edit
> it by hand.

It looks like the two files are different. For example, "Matt Topol"
appears in the `dist/release` file but not in the `downloads` one:
https://downloads.apache.org/iceberg/KEYS
https://dist.apache.org/repos/dist/release/iceberg/KEYS

Best,
Kevin Liu


On Tue, Nov 12, 2024 at 9:02 AM Kevin Liu  wrote:

> Hey folks,
>
> As mentioned in the previous email. Here are the references to
> `dist/dev/iceberg/KEYS`
> https://grep.app/search?q=dist.apache.org/repos/dist/.%2A/iceberg/KEYS&regexp=true
>
> There are 3 repos mentioned. Here are the corresponding PRs:
> - iceberg-python https://github.com/apache/iceberg-python/pull/1315
> - iceberg-go https://github.com/apache/iceberg-go/pull/200
> - iceberg https://github.com/apache/iceberg/pull/11526
>
> We will probably want to merge and remove the dev KEYS first.
>
> Thanks,
> Kevin Liu
>
> On Mon, Nov 11, 2024 at 11:52 PM Jean-Baptiste Onofré 
> wrote:
>
>> Hi Fokko
>>
>> As we discussed about that together on Slack, I'm fine merging and
>> removing the dev located KEYS file.
>>
>> Regards
>> JB
>>
>> On Mon, Nov 11, 2024 at 4:13 PM Fokko Driesprong 
>> wrote:
>> >
>> > Hi everyone,
>> >
>> > While looking at the release steps for iceberg-go, I noticed that we
>> have two KEYS files:
>> >
>> > https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>> > https://dist.apache.org/repos/dist/release/iceberg/KEYS (Also
>> available through https://downloads.apache.org/iceberg/KEYS)
>> >
>> > The first one is referenced by Java and Python, and the last one by
>> Rust. As mentioned earlier, Go references them both. Should we consolidate
>> these? My suggestion would be to merge the `/dev/` ones into the `release`
>> ones, and get rid of the one in `dev`. Thoughts?
>> >
>> > Kind regards,
>> > Fokko
>>
>

Re: [DISCUSS] Duplicate KEYS files

2024-11-12 Thread Kevin Liu

Hey folks,

As mentioned in the previous email, here are the references to
`dist/dev/iceberg/KEYS`:
https://grep.app/search?q=dist.apache.org/repos/dist/.%2A/iceberg/KEYS&regexp=true

There are 3 repos mentioned. Here are the corresponding PRs:
- iceberg-python https://github.com/apache/iceberg-python/pull/1315
- iceberg-go https://github.com/apache/iceberg-go/pull/200
- iceberg https://github.com/apache/iceberg/pull/11526

We will probably want to merge and remove the dev KEYS first.

Thanks,
Kevin Liu

On Mon, Nov 11, 2024 at 11:52 PM Jean-Baptiste Onofré 
wrote:

> Hi Fokko
>
> As we discussed about that together on Slack, I'm fine merging and
> removing the dev located KEYS file.
>
> Regards
> JB
>
> On Mon, Nov 11, 2024 at 4:13 PM Fokko Driesprong  wrote:
> >
> > Hi everyone,
> >
> > While looking at the release steps for iceberg-go, I noticed that we
> have two KEYS files:
> >
> > https://dist.apache.org/repos/dist/dev/iceberg/KEYS
> > https://dist.apache.org/repos/dist/release/iceberg/KEYS (Also available
> through https://downloads.apache.org/iceberg/KEYS)
> >
> > The first one is referenced by Java and Python, and the last one by
> Rust. As mentioned earlier, Go references them both. Should we consolidate
> these? My suggestion would be to merge the `/dev/` ones into the `release`
> ones, and get rid of the one in `dev`. Thoughts?
> >
> > Kind regards,
> > Fokko
>

Re: [DISCUSS] Duplicate KEYS files

2024-11-12 Thread Kevin Liu

Huh, super weird. I also see it using curl
```
curl https://downloads.apache.org/iceberg/KEYS | grep Topol
```
but using my web browser directly doesn't show that key for some reason...
[image: Screenshot 2024-11-12 at 9.40.06 AM.jpg]

The hash is the same for the two files. So I guess they are the same.
```
➜  ~ curl -s https://downloads.apache.org/iceberg/KEYS | md5sum
905987ebcc39a70ebcbce89f1939fe26  -
➜  ~ curl -s https://dist.apache.org/repos/dist/release/iceberg/KEYS | md5sum
905987ebcc39a70ebcbce89f1939fe26  -
```
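
If it helps, here is a minimal Python sketch of the same comparison. The helper names are hypothetical and the byte strings below are placeholders; in practice the payloads would be the two downloaded KEYS files, and the fetch is left out so the snippet runs offline.

```python
import hashlib


def md5_hex(payload: bytes) -> str:
    """Hex MD5 digest, i.e. what `md5sum` prints for the same bytes."""
    # usedforsecurity=False (Python 3.9+) marks this as a plain checksum.
    return hashlib.md5(payload, usedforsecurity=False).hexdigest()


def same_content(a: bytes, b: bytes) -> bool:
    """True when both payloads hash to the same MD5 digest."""
    return md5_hex(a) == md5_hex(b)


# Placeholder payloads standing in for the two downloaded KEYS files.
release_copy = b"-----BEGIN PGP PUBLIC KEY BLOCK-----\nplaceholder\n"
cdn_copy = bytes(release_copy)
stale_copy = b"older contents\n"

print(same_content(release_copy, cdn_copy))
print(same_content(release_copy, stale_copy))
```

MD5 is used here only to mirror `md5sum`; it says the bytes match, not that the file is trustworthy.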

Best,
Kevin Liu




On Tue, Nov 12, 2024 at 9:36 AM Russell Spitzer 
wrote:

> I see it in downloads?
>
> ➜  icebergsvnrelease git:(master) ✗ curl
> https://downloads.apache.org/iceberg/KEYS | grep Topol
> uid   [ultimate] Matt Topol 
> sig 34B86A1E5E59C8B81 2024-10-10  Matt Topol  >
> uid   [ultimate] Matthew Topol 
> sig 34B86A1E5E59C8B81 2023-06-12  Matt Topol  >
> sig  4B86A1E5E59C8B81 2023-06-12  Matt Topol  >
>
>
>
> ➜  icebergsvnrelease git:(master) ✗ grep Topol KEYS
> uid   [ultimate] Matt Topol 
> sig 34B86A1E5E59C8B81 2024-10-10  Matt Topol  >
> uid   [ultimate] Matthew Topol 
> sig 34B86A1E5E59C8B81 2023-06-12  Matt Topol  >
> sig  4B86A1E5E59C8B81 2023-06-12  Matt Topol  >
>
> On Tue, Nov 12, 2024 at 11:31 AM Kevin Liu  wrote:
>
>> > https://downloads.apache.org/iceberg points to
>> https://dist.apache.org/repos/dist/release/iceberg so we don't need to
>> edit it by hand.
>>
>> It looks like the two files are different. For example, search for "Matt
>> Topol", it only appears in the `dist/release` but not in `downloads`.
>> https://downloads.apache.org/iceberg/KEYS
>> https://dist.apache.org/repos/dist/release/iceberg/KEYS
>>
>> Best,
>> Kevin Liu
>>
>>
>> On Tue, Nov 12, 2024 at 9:02 AM Kevin Liu  wrote:
>>
>>> Hey folks,
>>>
>>> As mentioned in the previous email. Here are the references to
>>> `dist/dev/iceberg/KEYS`
>>> https://grep.app/search?q=dist.apache.org/repos/dist/.%2A/iceberg/KEYS&regexp=true
>>>
>>> There are 3 repos mentioned. Here are the corresponding PRs:
>>> - iceberg-python https://github.com/apache/iceberg-python/pull/1315
>>> - iceberg-go https://github.com/apache/iceberg-go/pull/200
>>> - iceberg https://github.com/apache/iceberg/pull/11526
>>>
>>> We will probably want to merge and remove the dev KEYS first.
>>>
>>> Thanks,
>>> Kevin Liu
>>>
>>> On Mon, Nov 11, 2024 at 11:52 PM Jean-Baptiste Onofré 
>>> wrote:
>>>
>>>> Hi Fokko
>>>>
>>>> As we discussed about that together on Slack, I'm fine merging and
>>>> removing the dev located KEYS file.
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On Mon, Nov 11, 2024 at 4:13 PM Fokko Driesprong 
>>>> wrote:
>>>> >
>>>> > Hi everyone,
>>>> >
>>>> > While looking at the release steps for iceberg-go, I noticed that we
>>>> have two KEYS files:
>>>> >
>>>> > https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>>>> > https://dist.apache.org/repos/dist/release/iceberg/KEYS (Also
>>>> available through https://downloads.apache.org/iceberg/KEYS)
>>>> >
>>>> > The first one is referenced by Java and Python, and the last one by
>>>> Rust. As mentioned earlier, Go references them both. Should we consolidate
>>>> these? My suggestion would be to merge the `/dev/` ones into the `release`
>>>> ones, and get rid of the one in `dev`. Thoughts?
>>>> >
>>>> > Kind regards,
>>>> > Fokko
>>>>
>>>

Re: [DISCUSS] Duplicate KEYS files

2024-11-12 Thread Kevin Liu

> Kevin, can you flush the caches of your browser? Maybe that's the
> problem. I'm seeing Matt in there (added him yesterday :).

Yep, flushing the cache worked. Thanks!

> downloads.a.o is behind a CDN, so it might take a while for changes from
> SVN to take effect.

It's updated now :)
```
➜  ~ curl -s https://dist.apache.org/repos/dist/release/iceberg/KEYS | md5sum
dff3353c6998d0897742dcc5c04662d3  -
➜  ~ curl -s https://downloads.apache.org/iceberg/KEYS | md5sum
dff3353c6998d0897742dcc5c04662d3  -
```

Thanks, Russell, for merging the KEYS, and Fokko, Xuanwo, and Matt for the
review.
These 3 PRs are all merged:
- iceberg-python https://github.com/apache/iceberg-python/pull/1315
- iceberg-go https://github.com/apache/iceberg-go/pull/200
- iceberg https://github.com/apache/iceberg/pull/11526

I double-checked with a GitHub search and it looks like all the references
are updated:
https://github.com/search?q=%22dist.apache.org%2Frepos%2Fdist%2Fdev%2Ficeberg%2FKEYS%22+org%3Aapache&type=code

I also double-checked that the dev KEYS are removed (had to flush the cache
again)
https://dist.apache.org/repos/dist/dev/iceberg/KEYS

Best,
Kevin Liu

On Tue, Nov 12, 2024 at 11:32 AM Russell Spitzer 
wrote:

> Yep sorry I did the merge already of the KEY files
>
> On Tue, Nov 12, 2024 at 1:22 PM Fokko Driesprong  wrote:
>
>> Looks like dev/KEYS ⊂ release/KEYS: https://www.diffchecker.com/4oxGhphl/
>>
>> I've removed the old one. Thanks, everyone, and thanks Kevin for raising
>> the PRs, let's get those in!
>>
>> Kind regards,
>> Fokko
>>
>>
>> Op di 12 nov 2024 om 20:03 schreef Fokko Driesprong :
>>
>>> There is a consensus on merging them. Let me go ahead and do this.
>>>
>>> Kevin, can you flush the caches of your browser? Maybe that's
>>> the problem. I'm seeing Matt in there (added him yesterday :).
>>>
>>> Cheers, Fokko
>>>
>>>
>>>
>>> Op di 12 nov 2024 om 19:48 schreef Xuanwo :
>>>
>>>> Hi,
>>>>
>>>> downloads.a.o is behind a CDN, so it might take a while for changes
>>>> from SVN to take effect.
>>>>
>>>> On Wed, Nov 13, 2024, at 01:43, Kevin Liu wrote:
>>>>
>>>> Huh, super weird. I also see it using curl
>>>> ```
>>>> curl https://downloads.apache.org/iceberg/KEYS | grep Topol
>>>> ```
>>>> but using my web browser directly doesn't show that key for some
>>>> reason...
>>>> [image: Screenshot 2024-11-12 at 9.40.06 AM.jpg]
>>>>
>>>> The hash is the same for the two files. So I guess they are the same.
>>>> ```
>>>> ➜  ~ curl -s https://downloads.apache.org/iceberg/KEYS | md5sum
>>>> 905987ebcc39a70ebcbce89f1939fe26  -
>>>> ➜  ~ curl -s https://dist.apache.org/repos/dist/release/iceberg/KEYS |
>>>> md5sum
>>>> 905987ebcc39a70ebcbce89f1939fe26  -
>>>> ```
>>>>
>>>> Best,
>>>> Kevin Liu
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Nov 12, 2024 at 9:36 AM Russell Spitzer <
>>>> russell.spit...@gmail.com> wrote:
>>>>
>>>> I see it in downloads?
>>>>
>>>> ➜  icebergsvnrelease git:(master) ✗ curl
>>>> https://downloads.apache.org/iceberg/KEYS | grep Topol
>>>> uid   [ultimate] Matt Topol 
>>>> sig 34B86A1E5E59C8B81 2024-10-10  Matt Topol <
>>>> zerosh...@apache.org>
>>>> uid   [ultimate] Matthew Topol 
>>>> sig 34B86A1E5E59C8B81 2023-06-12  Matt Topol <
>>>> zerosh...@apache.org>
>>>> sig  4B86A1E5E59C8B81 2023-06-12  Matt Topol <
>>>> zerosh...@apache.org>
>>>>
>>>>
>>>>
>>>> ➜  icebergsvnrelease git:(master) ✗ grep Topol KEYS
>>>> uid   [ultimate] Matt Topol 
>>>> sig 3    4B86A1E5E59C8B81 2024-10-10  Matt Topol <
>>>> zerosh...@apache.org>
>>>> uid   [ultimate] Matthew Topol 
>>>> sig 34B86A1E5E59C8B81 2023-06-12  Matt Topol <
>>>> zerosh...@apache.org>
>>>> sig  4B86A1E5E59C8B81 2023-06-12  Matt Topol <
>>>> zerosh...@apache.org>
>>>>
>>>> On Tue, Nov 12, 2024 at 11:31 AM Kevin Liu 
>>>> wrote:
>>>>
>>>> > https://downloads.apache.org/iceberg points to
>>>> https://dist.apache.org/repos/dist/release/iceberg so we don't need to
>>>>

Re: [VOTE][Go] Release Apache Iceberg Go v0.1.0 RC1

2024-11-13 Thread Kevin Liu

Hey Matt,

-1 (non binding) We need a new RC due to the issues below.

I found a few issues while verifying the RC.

The first one is minor. The "[2]:" text mentions rc1 but links to rc0.

The second is blocking. While running the verification script, I found that
the sha256 file [1] has the wrong content: the file name recorded in it has
an extra `-rc1`:
```
750f4f6593368d6ea4d79559d1491c0dcea1ab0ca6130bdd3401e607d7f837b0  apache-iceberg-go-0.1.0-rc1-rc1.tar.gz
```
I believe this was introduced by a recent PR [2]
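
To make the failure concrete, here is a small sketch of the kind of check that trips on this file. It is not the project's verification script; the helper names are hypothetical, and the expected tarball name `apache-iceberg-go-0.1.0-rc1.tar.gz` is my assumption based on the extra `-rc1` noted above.

```python
HEX_DIGITS = set("0123456789abcdef")


def parse_checksum_line(line: str) -> tuple[str, str]:
    """Split a `sha256sum`-style line into (hex_digest, file_name)."""
    digest, _, rest = line.strip().partition(" ")
    # After the digest comes a space plus either a second space (text mode)
    # or "*" (binary mode), followed by the file name.
    name = rest.lstrip(" *")
    if len(digest) != 64 or not set(digest.lower()) <= HEX_DIGITS:
        raise ValueError(f"not a SHA-256 hex digest: {digest!r}")
    if not name:
        raise ValueError("checksum line has no file name")
    return digest, name


def names_match(checksum_line: str, artifact_name: str) -> bool:
    """True when the checksum line refers to the given artifact."""
    _, recorded = parse_checksum_line(checksum_line)
    return recorded == artifact_name


# The line from the rc1 sha256 file, as quoted in this email.
line = (
    "750f4f6593368d6ea4d79559d1491c0dcea1ab0ca6130bdd3401e607d7f837b0"
    "  apache-iceberg-go-0.1.0-rc1-rc1.tar.gz"
)

# Assumed name of the tarball that was actually uploaded.
print(names_match(line, "apache-iceberg-go-0.1.0-rc1.tar.gz"))
```

Tools like `sha256sum -c` look up the file by the recorded name, so a mismatch of this kind makes the check fail even when the digest itself is right.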

Here's a PR to fix the issue [3]; I've verified it using my own forked repo.
Sorry for the churn here, I should have tested the previous PR more closely.

Best,
Kevin Liu

[1] https://github.com/apache/iceberg-go/releases/tag/v0.1.0-rc1
[2]
https://github.com/apache/iceberg-go/commit/8ca0bb7bdf7868bac4aca83f50db047d9bf0277c#diff-2ff8a59693d8d05dd67560f01c2a49d9e05f4b1ccefeb91f8486744532214e44R94
[3] https://github.com/apache/iceberg-go/pull/202

On Wed, Nov 13, 2024 at 8:46 AM Matt Topol  wrote:

> Slight fix to the previous email: the text should say "The source release
> *rc1* is hosted at [2]"
>
> The link is correct, it's only the text that needed to be updated.
>
> On Wed, Nov 13, 2024 at 11:43 AM Matt Topol 
> wrote:
>
>> Hi,
>>
>> I would like to propose the following release candidate (RC1) of Apache
>> Iceberg Go version v0.1.0.
>>
>> This release candidate is based on commit:
>> 8ca0bb7bdf7868bac4aca83f50db047d9bf0277c [1]
>>
>> The source release rc0 is hosted at [2].
>>
>> Please download, verify checksums and signatures, run the unit tests, and
>> vote on the release. See [3] for how to validate a release candidate.
>>
>> The vote will be open for at least 72 hours.
>>
>> [ ] +1 Release this as Apache Iceberg Go v0.1.0
>> [ ] +0
>> [ ] -1 Do not release this as Apache Iceberg Go v0.1.0 because...
>>
>> Thanks!
>> --Matt
>>
>> [1]:
>> https://github.com/apache/iceberg-go/commit/8ca0bb7bdf7868bac4aca83f50db047d9bf0277c
>> [2]: https://github.com/apache/iceberg-go/releases/v0.1.0-rc1
>> <https://github.com/apache/iceberg-go/releases/v0.1.0-rc0>
>> [3]:
>> https://github.com/apache/iceberg-go/blob/main/dev/release/README.md#verify
>>
>

Re: [ANNOUNCE] Apache Iceberg release 1.7.0

2024-11-08 Thread Kevin Liu

Woot! Thank you Russell for driving the release!

Best,
Kevin Liu

On Fri, Nov 8, 2024 at 8:43 AM Russell Spitzer 
wrote:

> Yes that would be part of the docs that I am still in the process of
> updating
>
> On Fri, Nov 8, 2024 at 10:22 AM Rodrigo Meneses 
> wrote:
>
>> Thanks so much Russell for driving this. I just checked
>> https://iceberg.apache.org/releases/ and it looks like the 1.7 release
>> info is missing from there:
>> Currently it still says "The latest version of Iceberg is 1.6.1
>> <https://github.com/apache/iceberg/releases/tag/apache-iceberg-1.6.1>."
>>
>>
>> On Fri, Nov 8, 2024 at 7:33 AM Russell Spitzer 
>> wrote:
>>
>>> I'm pleased to announce the release of Apache Iceberg 1.7.0!
>>>
>>> Apache Iceberg is an open table format for huge analytic datasets.
>>> Iceberg
>>> delivers high query performance for tables with tens of petabytes of
>>> data,
>>> along with atomic commits, concurrent writes, and SQL-compatible table
>>> evolution.
>>>
>>> This release can be downloaded from:
>>> https://dlcdn.apache.org/iceberg/apache-iceberg-1.7.0/apache-iceberg-1.7.0.tar.gz
>>>
>>> Doc Release notes: https://iceberg.apache.org/releases/#170-release
>>>
>>> Github Release Notes:
>>> https://github.com/apache/iceberg/releases/tag/apache-iceberg-1.7.0
>>>
>>> Java artifacts are available from Maven Central.
>>>
>>> Thanks to everyone for contributing!
>>>
>>

Re: [VOTE] Drop Python3.8 Support in PyIceberg 0.8.0

2024-09-23 Thread Kevin Liu

+1 non-binding. Thanks for starting this conversation!


On Fri, Sep 20, 2024 at 2:02 PM Sung Yun  wrote:

> Hi folks,
>
> I'd like to start this thread to vote on dropping the support for
> Python3.8 in the upcoming 0.8.0 PyIceberg release.
>
> Python3.8 will be End-Of-Life in October 2024, and some of our
> dependencies have already dropped support for Python3.8 prebuilt
> wheels which makes our dependency management more complicated than
> needed.
>
> https://devguide.python.org/versions/
>
> Sung
>

Re: [Notice] Update to catalog sync meeting timezone 2

2024-09-24 Thread Kevin Liu

https://docs.google.com/document/d/1iPGVCIcr-M0XtAiudOguWAvmqIdVgpYN5vz5ohO8PKw/edit
This doc includes the calendar for the catalog sync and notes from past
syncs.

Best,
Kevin Liu

On Tue, Sep 24, 2024 at 8:15 AM Sung Yun  wrote:

> Hi Jack Ye, thank you for the update !
>
> This may be a silly question, but where can I find the Catalog sync
> calendar and its meeting details? Does it share the same Google Meet link
> as the Iceberg community sync?
>
> Sung
>
> On 2024/09/24 15:11:10 Jack Ye wrote:
> > Hi everyone,
> >
> > Due to the low attendance of the time zone 2 meeting for Iceberg catalog
> > syncs, after discussion with people during the last sync, as well as
> > talking to people who were interested in alternative timezone, I am
> > updating the timezone 2 meeting back to 9am pacific time on Wednesday.
> >
> > This means that going forward, we will have a catalog sync at 9am pacific
> > time Wednesday for 2 weeks, followed by the Iceberg community sync the
> next
> > week, in a 3 week cycle.
> >
> > If you are in a different time zone that is difficult to join the
> meeting,
> > and you have important topics that you would like to discuss, feel free
> to
> > raise it in the devlist, or I can also help with coordinating a specific
> > meeting for discussion at a different time.
> >
> > Looking forward to seeing everyone!
> >
> > Best,
> > Jack Ye
> >
>

Clarification on DayTransform Result Type

2024-09-26 Thread Kevin Liu

Hey folks,

While reviewing a PR to fix DayTransform in PyIceberg (#1208
<https://github.com/apache/iceberg-python/pull/1208>), we found an
inconsistency between the spec and the Java Iceberg library.

According to the spec
<https://iceberg.apache.org/spec/#partition-transforms>, the result type
for the "day partition transform" should be `int`, similar to other
time-based partition transforms (year/month/hour). However, in the Java
Iceberg library, the result type for day partition transform is `DateType`
(source
<https://github.com/apache/iceberg/blob/dddb5f423b353d961b8a08eb2cb4371d453c2959/api/src/main/java/org/apache/iceberg/transforms/Days.java#L47>).
This seems to be a discrepancy from the spec, as the day partition
transform is the only time-based transform with a non-int result
type, whereas the others use IntegerType (source).

Could someone confirm if my understanding is correct? If so, is there any
historical context for this difference? Lastly, how should we approach
resolving this moving forward?

Best,
Kevin

Re: [DISCUSS] Iceberg Summit 2025 ?

2024-09-30 Thread Kevin Liu

+1 to hybrid event with an in-person element.

Things I'd like to see:
* Real-world experience from companies running Iceberg at scale
* Iceberg catalogs and how they're used
* Integrations with open-source projects in the broader ecosystem
* Forward-looking statements for the direction of the Iceberg ecosystem
* Workshops, both for introduction to Iceberg and how to contribute

Looking forward to the summit! And happy to help in any way.

Best,
Kevin

On Mon, Sep 30, 2024 at 9:38 AM Yufei Gu  wrote:

> Thank you, JB, for taking the initiative to get the conversation started
> for the next Iceberg Summit!
>
> I’m really excited to see the community considering a hybrid event for
> 2025. Having the option for in-person interaction would definitely enhance
> the sense of connection among contributors and users, but maintaining the
> virtual option is essential for inclusivity. A hybrid format seems like a
> great way to reach as many people as possible.
>
> I echo the points on broadening the range of talks, especially around user
> stories and practitioners working with Iceberg in production. Hearing about
> real-world implementations and use cases is always insightful. Workshops
> would be a fantastic addition too, particularly for onboarding new users
> and contributors.
>
> I agree we should stay focused on Iceberg and its ecosystem, making the
> event as vendor-neutral and transparent as possible.
>
> Looking forward to seeing how this evolves, and happy to help with
> anything along the way!
>
>
> Yufei
>
>
> On Sun, Sep 29, 2024 at 11:56 PM Piotr Findeisen <
> piotr.findei...@gmail.com> wrote:
>
>> Hi
>>
>> Meeting in person is always the best, but online is much more inclusive.
>> So +1 for a hybrid event.
>>
>> Best
>> Piotr
>>
>>
>> On Mon, 30 Sept 2024 at 08:27, Eduard Tudenhöfner <
>> etudenhoef...@apache.org> wrote:
>>
>>> +1 for a hybrid event
>>>
>>> On Sun, Sep 29, 2024 at 4:51 AM Steven Wu  wrote:
>>>
>>>> +1 for hybrid with in-person elements.
>>>>
>>>> On Sat, Sep 28, 2024 at 4:23 PM Matt Topol
>>>> wrote:
>>>>
> +1 from me as well, I would love to attend an in person/hybrid iceberg
> summit. Workshops seem like a perfect way to help the community.
>
> On Sat, Sep 28, 2024, 7:11 PM Honah J.  wrote:
>
>> +1 on hosting another Iceberg Summit in 2025! We had many great talks
>> last time, and I think it will be even better if we can have a hybrid
>> mode this time, as the in-person element can add value for deeper
>> engagement and networking.
>>
>> I’m particularly interested in incorporating workshops. We could
>> offer them at different levels (introductory, intermediate, and
>> advanced) so participants can attend sessions that best fit their
>> background and interests. Covering topics like basic usage, ecosystem
>> integrations, advanced features, and different language implementations
>> would help participants explore various aspects of the Iceberg project
>> in a hands-on way.
>>
>> Looking forward to more ideas from the community and happy to help
>> where needed!
>>
>> Best regards,
>> Honah
>>
>> On Fri, Sep 27, 2024 at 11:02 AM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>>> I am really excited about the prospect of another Summit and also
>>> had a great time last year. I think we had a great selection of talks
>>> and I'm hoping we can do so again.
>>>
>>> I'm very much in support of having an in-person element; I would
>>> love to have a chance to talk face to face with other members of the
>>> community. I do think we should preserve online viewing as well, since
>>> I know not everyone has the ability to travel.
>>>
>>> I do hope that we can have more talks about users with Iceberg in
>>> production as well. I think we did a really good job of covering Iceberg
>>> development last time but didn't have as many practitioner discussions
>>> as I would have liked. I also think it would be great if we had a
>>> section that was purely just "ideas for Iceberg" where folks can pitch
>>> their features and proposals to a much broader audience.
>>>
>>> I also would love to have some workshops this time as well,
>>> showing folks how to use the project, how to make their first tables,
>>> and how to contribute to the Iceberg project.
>>>
>>> Things I'd like to avoid: Sales pitches, Talks not focused on
>>> Iceberg or its ecosystem (Personally I don't really want to hear
>>> anything about AI or LLMs but I know that might not be everyone).
>>> Ideally I would like this to be a vendor neutral event where planning
>>> is as transparent as possible for the community.
>>>
>>> I'd love to hear what other folks are thinking,
>>> Russ
>>>
>>> On Fri, Sep 27, 2024 at 12:51 PM Jea

Re: Clarification on DayTransform Result Type

2024-09-30 Thread Kevin Liu

Thank you both for the insights and context.

As Russell pointed out, the "day partition transform" result is indeed of int
type. The Types.DateType
<https://github.com/apache/iceberg/blob/dddb5f423b353d961b8a08eb2cb4371d453c2959/api/src/main/java/org/apache/iceberg/transforms/Days.java#L47>
corresponds to TypeID.DATE
<https://github.com/apache/iceberg/blob/09370ddbc39fc3920fb8cbd3dff11b377dd37e40/api/src/main/java/org/apache/iceberg/types/Types.java#L181>,
which is also an Integer type
<https://github.com/apache/iceberg/blob/113c6e7d62e53d3e3cb15b1712f3a1db473ca940/api/src/main/java/org/apache/iceberg/types/Type.java#L37>.
So, this behavior conforms to the spec.

The issue with DayTransform in PyIceberg (#1208
<https://github.com/apache/iceberg-python/pull/1208>) is due to the changes
in the PR. The problem arises from how the partition value is displayed in
the partition metadata table. As Ryan mentioned, Spark displays the
partition value as `date`. However, the PR removes `DateType` as the
`result_type`, which causes PyIceberg to display the partition value as
an `int` (the number of days since the epoch).
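
To illustrate why the two representations are interchangeable, here is a minimal sketch in plain Python. It is not PyIceberg's implementation; the function names are made up, and it only covers `date` inputs.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def day_transform(value: date) -> int:
    """Days from the Unix epoch, i.e. the stored partition value."""
    return (value - EPOCH).days


def day_ordinal_to_date(days: int) -> date:
    """Interpret a stored day ordinal as a date for display."""
    return EPOCH + timedelta(days=days)


ordinal = day_transform(date(2024, 9, 27))
print(ordinal, day_ordinal_to_date(ordinal).isoformat())
# prints: 19993 2024-09-27
```

The stored value is the same integer either way; declaring the result type as a date only changes how an engine chooses to render it.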

> if we just change the type to `date`, engines could correctly display the
> value

I found a related discussion in apache/iceberg/#279
<https://github.com/apache/iceberg/issues/279#issuecomment-521322801>,
specifically: "That will cause the partition tuple's field type to be a
date, which should also cause the metadata table to display formatted dates
instead of the day ordinal in Spark." I want to confirm my understanding:
is this behavior due to the Iceberg-to-Spark DateType conversion in
`TypeToSparkType`
<https://github.com/apache/iceberg/blob/09370ddbc39fc3920fb8cbd3dff11b377dd37e40/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/TypeToSparkType.java#L103-L104>?

Best,
Kevin

On Fri, Sep 27, 2024 at 1:52 PM rdb...@gmail.com  wrote:

> The background is that the result of the day function and dates are
> basically the same: the number of days from the Unix epoch. When we started
> using metadata tables, we realized that a lot of people use the day
> function but then get a weird ordinal value out, but if we just change the
> type to `date`, engines could correctly display the value. This isn't
> required by the spec, it's just a convenience.
>
> On Fri, Sep 27, 2024 at 8:30 AM Russell Spitzer 
> wrote:
>
>> Good thing DateType is an Integer :)
>> https://github.com/apache/iceberg/blob/113c6e7d62e53d3e3cb15b1712f3a1db473ca940/api/src/main/java/org/apache/iceberg/types/Type.java#L37
>>
>> On Thu, Sep 26, 2024 at 8:38 PM Kevin Liu  wrote:
>>
>>> Hey folks,
>>>
>>> While reviewing a PR to fix DayTransform in PyIceberg (#1208
>>> <https://github.com/apache/iceberg-python/pull/1208>), we found an
>>> inconsistency between the spec and the Java Iceberg library.
>>>
>>> According to the spec
>>> <https://iceberg.apache.org/spec/#partition-transforms>, the result
>>> type for the "day partition transform" should be `int`, similar to other
>>> time-based partition transforms (year/month/hour). However, in the Java
>>> Iceberg library, the result type for day partition transform is `DateType` (
>>> source
>>> <https://github.com/apache/iceberg/blob/dddb5f423b353d961b8a08eb2cb4371d453c2959/api/src/main/java/org/apache/iceberg/transforms/Days.java#L47>).
>>> This seems to be a discrepancy from the spec, as the day partition
>>> transform is the only time-based transform with a non-int result
>>> type—whereas the others use IntegerType (source
>>> <https://grep.app/search?q=getResultType&filter[repo][0]=apache/iceberg&filter[path][0]=api/src/main/java/org/apache/iceberg/>
>>> ).
>>>
>>> Could someone confirm if my understanding is correct? If so, is there
>>> any historical context for this difference? Lastly, how should we approach
>>> resolving this moving forward?
>>>
>>> Best,
>>> Kevin
>>>
>>>

Re: [VOTE] Release Apache Iceberg 1.7.0 RC1

2024-11-07 Thread Kevin Liu

+1 non-binding

Verified signatures, checksums, and license

Ran build and tests with JDK17


Best,

Kevin Liu

On Thu, Nov 7, 2024 at 7:24 AM Prashant Singh 
wrote:

> Thank you Russell !
>
> +1 (non-binding)
> - Verified signature, checksum, license, build.
> - ran our internal services iceberg integration tests with JDK 17
> - manually tested spark-sql
>
> Thanks,
> Prashant Singh
>
> On Thu, Nov 7, 2024 at 12:32 AM Fokko Driesprong  wrote:
>
>> Thanks Russell for running this release!
>>
>> +1 (binding)
>>
>> Checked signatures, checksum, licenses and did some local testing.
>>
>> Kind regards,
>> Fokko
>>
>> On Thu, Nov 7, 2024 at 08:35 Eduard Tudenhöfner <
>> etudenhoef...@apache.org> wrote:
>>
>>> +1 (binding)
>>>
>>> Verified signature/checksum/license and build/test with JDK17
>>>
>>> On Thu, Nov 7, 2024 at 3:11 AM Daniel Weeks  wrote:
>>>
>>>> +1 (binding)
>>>>
>>>> Verified sigs/sums/license/build/test (Java 17)
>>>>
>>>> -Dan
>>>>
>>>> On Wed, Nov 6, 2024 at 3:23 PM Jack Ye  wrote:
>>>>
>>>>> +1 (binding)
>>>>>
>>>>> - Verified signature, checksum, license
>>>>> - Ran build and test with JDK 11 and 17
>>>>> - Ran AWS integration tests
>>>>> - Ran on Spark 3.5 with some manual tests
>>>>>
>>>>> Best,
>>>>> Jack Ye
>>>>>
>>>>> On Wed, Nov 6, 2024 at 9:01 AM Amogh Jahagirdar <2am...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> +1 binding
>>>>>>
>>>>>> Verified signatures/checksums/license and ran build/tests with JDK17.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Amogh Jahagirdar
>>>>>>
>>>>>> On Tue, Nov 5, 2024 at 10:35 PM Yufei Gu 
>>>>>> wrote:
>>>>>>
>>>>>>> +1 (binding)
>>>>>>>
>>>>>>>
>>>>>>> Verified signature, checksum, license, build.
>>>>>>>
>>>>>>> Successfully tested the following Spark SQL commands on Polaris,
>>>>>>> using Spark 3.5.3 with the binary artifacts Iceberg 1.7.0 jar. All
>>>>>>> operations worked as expected.
>>>>>>>
>>>>>>> create database db1;
>>>>>>> show databases;
>>>>>>> create table db1.t1 (id int, name string);
>>>>>>> insert into db1.t1 values (1, 'a');
>>>>>>> select * from db1.t1;
>>>>>>> insert into db1.t1 values (2, 'b');
>>>>>>> call polaris.system.expire_snapshots('db1.t1', timestamp '2024-11-11');
>>>>>>> select * from db1.t1.snapshots;
>>>>>>>
>>>>>>> Notably, the snapshot summary shows the latest Iceberg version:
>>>>>>>
>>>>>>> 2024-11-05 18:31:10.92  2780504056765263301  5332711584219924798  append
>>>>>>> file:/tmp/polaris/db1/t1/metadata/snap-2780504056765263301-1-634b033c-45dd-40d6-8cb4-468fe6015ba4.avro
>>>>>>> {
>>>>>>>   "added-data-files": "1",
>>>>>>>   "added-files-size": "611",
>>>>>>>   "added-records": "1",
>>>>>>>   "app-id": "local-1730860071427",
>>>>>>>   "changed-partition-count": "1",
>>>>>>>   "engine-name": "spark",
>>>>>>>   "engine-version": "3.5.3",
>>>>>>>   "iceberg-version": "Apache Iceberg 1.7.0 (commit 5f7c992ca673bf41df1d37543b24d646c24568a9)",
>>>>>>>   "spark.app.id": "local-1730860071427",
>>>>>>>   "total-data-files": "2",
>>>>>>>   "total-delete-files": "0",
>>>>>>>   "total-equality-deletes": "0",
>>>>>>>   "total-files-size": "1222",
>>>>>>>   "total-position-deletes": "0",
>>>>>>>   "total-records": "2"
>>>>>>> }
>>>>>>>

[VOTE] Release Apache PyIceberg 0.8.0rc1

2024-11-07 Thread Kevin Liu

Hi Everyone,

I propose that we release the following RC as the official PyIceberg 0.8.0
release.

The commit ID is 0eaadb9
<https://github.com/apache/iceberg-python/commit/0eaadb9e61c7c9373eddaafd723c3be9fd66ab42>

   - This corresponds to the tag: pyiceberg-0.8.0rc1
   (ac00f5354c2c12ed8f465295a3a626e0db9c1689)
   - https://github.com/apache/iceberg-python/releases/tag/pyiceberg-0.8.0rc1
   - https://github.com/apache/iceberg-python/tree/0eaadb9e61c7c9373eddaafd723c3be9fd66ab42

The release tarball, signature, and checksums are here:

   - https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.8.0rc1/

You can find the KEYS file here:

   - https://dist.apache.org/repos/dist/dev/iceberg/KEYS

Convenience binary artifacts are staged on pypi:

https://pypi.org/project/pyiceberg/0.8.0rc1/

And can be installed using: pip3 install pyiceberg==0.8.0rc1

Instructions for verifying a release can be found here:

   - https://py.iceberg.apache.org/verify-release/

Please download, verify, and test.
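
For the checksum part of verification, a rough Python sketch is below. It is not the
official procedure (the verify-release page linked above is the reference). It
assumes the `.sha512` file starts with the hex digest, optionally followed by the
file name. The function names are made up for illustration, and you need to pass in
the actual artifact paths you downloaded:

```python
import hashlib
from pathlib import Path


def sha512_hex(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-512 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(artifact: Path, checksum_file: Path) -> bool:
    """Compare an artifact against its checksum file.

    Assumption: the checksum file's first whitespace-separated token is the
    hex digest (the common `sha512sum`-style layout).
    """
    expected = checksum_file.read_text().split()[0].lower()
    return sha512_hex(artifact) == expected
```

This only covers the checksum. Signature verification against the KEYS file still
needs gpg.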

High-level Summary

   - 176 new commits
   - 18 new first-time contributors
   - Deprecation Notice
      - Deprecated configuration properties: profile_name, region_name,
      aws_access_key_id, aws_secret_access_key, and aws_session_token
      - Deprecated functions: to_requested_schema in
      pyiceberg/io/pyarrow.py and add_snapshot and set_ref_snapshot in
      pyiceberg/table/__init__.py
   - Find a detailed list of PRs at
   https://github.com/apache/iceberg-python/releases/tag/pyiceberg-0.8.0rc1
   - Highlights
      - Documentation improvements
         - Improve docstrings, configuration, etc
         - Improve the release process; updated “How to Release” and
         “Verify Release” documentation
      - General
         - Add support for Python 3.12; drop support for Python 3.8;
         exclude Python 3.9.7
         - Bump PyArrow to 18.0.0, remove numpy as a hard dependency
         - Bump up Iceberg version to 1.6.0 in integration tests
      - Features
         - Add metadata tables for data_files and delete_files
         - Add list_views and drop_view to Rest catalog
         - Add partition MonthTransform
         - Support manifest file caching
         - Support Hive Metastore High Availability mode
         - Add properties to allow configuring small/large pyarrow type on
         read
         - Deprecate redundant catalog identifiers in TableIdentifier and
         row_filter expressions
         - Update metadata-log for non-rest catalogs
         - Add support for boolean expressions and quoted columns in
         row_filter expressions
         - Support setting ARN Role and Session name in S3 and Glue
         - Support bi-directional union of types (int <> long, float <>
         double)
         - Support passing table-token to commit endpoint
         - Allow setting write.parquet.row-group-limit and
         write.parquet.page-row-limit
         - Deprecate rest.authorization-url in favor of oauth2-server-uri
         - Support s3.signer.endpoint
         - Add support to configure access delegation header,
         X-Iceberg-Access-Delegation
         - Remove initial_change usage in TableUpdates
         - Prevent adding duplicate files in the add_files API
         - Support fields with . in name
      - Bug Fix
         - Abort the whole table transaction if any updates in the
         transaction have failed
         - Use appropriate partition spec for delete
         - Use self.table_metadata when in transaction
         - Accept empty arrays in struct field lookup
         - List namespace response in rest catalog with fully qualified
         namespace
         - list_tables method in glue catalog now only returns tables,
         instead of views+tables
         - Glue and Hive catalog return only Iceberg tables, instead of
         hive+iceberg tables
         - Invert case_sensitive logic in StructType
         - Fix table_exists behavior in the REST catalog
         - Fix bug where reading with to_arrow_batch_reader returns more
         than the limit
         - PyArrow: Pass in null-mask for StructField
         - Fix overwrite when filtering all the data
         - Use the correct spec when rewriting existing manifests
         - Use historical partition field name
         - Fix Position Deletes + row_filter yields less data when the
         DataFile is large
         - Allow for missing operation in Snapshot metadata
         - Fix tracing existing entries when there are deletes
         - Handle Empty RecordBatch within _task_to_record_batches


Please vote in the next 72 hours.
[ ] +1 Release this as PyIceberg 0.8.0
[ ] +0
[ ] -1 Do not release this because...



Best,

Kevin Liu

Re: [DISCUSS] Discrepancy Between Iceberg Spec and Java Implementation for Snapshot summary's 'operation' key

2024-10-25 Thread Kevin Liu

Thanks, Ryan! That makes sense.

I want to follow up on the original issue. I've made a PR [1] to enforce
that the Snapshot `summary` map must have an `operation` key. Please take a
look. Thank you @nastra for the comments and reviews.
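
For readers who want the rule in one place, here is a rough Python sketch of the
constraint as discussed in this thread: `summary` is optional in V1 and required in
V2, and `operation` is required whenever `summary` is present. It is not the code
from the PR. `validate_summary` is a made-up name, and treating versions above 2
the same as V2 is my assumption:

```python
from typing import Mapping, Optional


def validate_summary(summary: Optional[Mapping[str, str]], format_version: int) -> None:
    """Raise ValueError if a snapshot summary violates the rule discussed above."""
    if summary is None:
        # V1 allows a snapshot without a summary; V2 does not.
        # Assumption: later format versions behave like V2 here.
        if format_version >= 2:
            raise ValueError("summary is required for format version 2 and later")
        return
    # In every version, a summary that is present must name its operation.
    if "operation" not in summary:
        raise ValueError("summary is present but has no 'operation' key")
```

For example, `validate_summary({"operation": "append"}, 2)` passes, while
`validate_summary({"total-records": "2"}, 1)` raises.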

Best,
Kevin Liu

[1] https://github.com/apache/iceberg/pull/11354



On Tue, Oct 22, 2024 at 4:06 PM rdb...@gmail.com  wrote:

> > For example, the `Snapshot` `summary` field is optional in V1 but
> required in V2. Therefore, the REST spec definition should mark the
> `summary` field as optional to support both versions.
>
> Yeah, this is technically true. But as I said in my first email, unless
> you have tables that are 5 years old, it's unlikely that this is going to
> be a problem. A failure here is more likely with newer implementations that
> have a bug. So I'd argue there's value in leaving it as required.
>
> On Mon, Oct 21, 2024 at 9:41 AM Kevin Liu  wrote:
>
>> > No. They were introduced at the same time.
>> Great! Since the `summary` field and the `operation` key were introduced
>> together, we should enforce the rule that the `summary` field must
>> always have an accompanying `operation` key. This has been addressed in
>> PR 11354 [1].
>>
>> > I am strongly against this. The REST spec should be independent of the
>> table versions.
>> That makes sense. For the REST spec to support both V1 and V2 tables, it
>> should "accept" the least common denominator between the two versions. For
>> example, the `Snapshot` `summary` field is optional in V1 but required in
>> V2. Therefore, the REST spec definition should mark the `summary` field as
>> optional to support both versions. However, the current REST spec leans
>> towards the V2 table spec; fields that are optional in V1 and required in
>> V2 are marked as required in the spec, such as `TableMetadata.table-uuid`
>> [2][3] and `Snapshot.summary` [4][5].
>>
>> Would love to get other people's thoughts on this.
>>
>> Best,
>> Kevin Liu
>>
>> [1] https://github.com/apache/iceberg/pull/11354
>> [2]
>> https://github.com/apache/iceberg/blob/8e743a5b5209569f84b6bace36e1106c67e1eab3/open-api/rest-catalog-open-api.yaml#L2414
>> [3] https://iceberg.apache.org/spec/#table-metadata-fields
>> [4]
>> https://github.com/apache/iceberg/blob/8e743a5b5209569f84b6bace36e1106c67e1eab3/open-api/rest-catalog-open-api.yaml#L2325
>> [5] https://iceberg.apache.org/spec/#snapshots
>>
>> On Sun, Oct 20, 2024 at 11:24 AM rdb...@gmail.com 
>> wrote:
>>
>>> Was it ever valid to have a summary field without the operation key?
>>>
>>> No. They were introduced at the same time.
>>>
>>> Would it be helpful to create alternative versions of the REST spec
>>> specifically for referencing V1 and V2 tables?
>>>
>>> I am strongly against this. The REST spec should be independent of the
>>> table versions. Any table format version can be passed and the table format
>>> should be the canonical reference for what is allowed. We want to avoid
>>> cases where there are discrepancies. The table spec is canonical for table
>>> metadata, and the REST spec allows passing it.
>>>
>>> On Sun, Oct 20, 2024 at 11:18 AM Kevin Liu 
>>> wrote:
>>>
>>>> Hey folks,
>>>>
>>>> Thanks, everyone for the discussion, and thanks Ryan for providing the
>>>> historical context.
>>>> Enforce the `operation` key in Snapshot’s `summary` field
>>>>
>>>> When deserializing the `Snapshot` object from JSON, the Java
>>>> implementation does not enforce that the `summary` field must contain an
>>>> `operation` key. In the V1 spec, the `summary` field is optional, while in
>>>> the V2 spec, it is required. However, in both versions, if a `summary`
>>>> field is present, it must include an `operation` key. Any `summary` field
>>>> lacking an `operation` key should be considered invalid.
>>>>
>>>> I’ve addressed this issue in PR 11354 [1] by adding this constraint
>>>> when parsing JSON.
>>>>
>>>> > We initially did not have the snapshot summary or operation. When I
>>>> added the summary, the operation was intended to be required in cases where
>>>> the summary is present. It should always be there if the summary is and the
>>>> summary should always be there unless you wrote the metadata.json file
>>>> way back in 2017 or 2018.
>>>>
>>>> @Ryan, does this constraint also apply to `metadata.json` file

Re: [DISCUSS] Discrepancy Between Iceberg Spec and Java Implementation for Snapshot summary's 'operation' key

2024-10-25 Thread Kevin Liu

Thanks, everyone! The PR[1] has been merged

Best,
Kevin Liu

[1] https://github.com/apache/iceberg/pull/11354


On Fri, Oct 25, 2024 at 1:02 PM Kevin Liu  wrote:

> Thanks, Ryan! That makes sense.
>
> I want to follow up on the original issue. I've made a PR [1] to enforce
> that the Snapshot `summary` map must have an `operation` key. Please take a
> look. Thank you @nastra for the comments and reviews.
>
> Best,
> Kevin Liu
>
> [1] https://github.com/apache/iceberg/pull/11354
>
>
>
> On Tue, Oct 22, 2024 at 4:06 PM rdb...@gmail.com  wrote:
>
>> > For example, the `Snapshot` `summary` field is optional in V1 but
>> required in V2. Therefore, the REST spec definition should mark the
>> `summary` field as optional to support both versions.
>>
>> Yeah, this is technically true. But as I said in my first email, unless
>> you have tables that are 5 years old, it's unlikely that this is going to
>> be a problem. A failure here is more likely with newer implementations that
>> have a bug. So I'd argue there's value in leaving it as required.
>>
>> On Mon, Oct 21, 2024 at 9:41 AM Kevin Liu  wrote:
>>
>>> > No. They were introduced at the same time.
>>> Great! Since the `summary` field and the `operation` key were
>>> introduced together, we should enforce the rule that the `summary`
>>> field must always have an accompanying `operation` key. This has been
>>> addressed in PR 11354 [1].
>>>
>>> > I am strongly against this. The REST spec should be independent of the
>>> table versions.
>>> That makes sense. For the REST spec to support both V1 and V2 tables, it
>>> should "accept" the least common denominator between the two versions. For
>>> example, the `Snapshot` `summary` field is optional in V1 but required in
>>> V2. Therefore, the REST spec definition should mark the `summary` field as
>>> optional to support both versions. However, the current REST spec leans
>>> towards the V2 table spec; fields that are optional in V1 and required in
>>> V2 are marked as required in the spec, such as `TableMetadata.table-uuid`
>>> [2][3] and `Snapshot.summary` [4][5].
>>>
>>> Would love to get other people's thoughts on this.
>>>
>>> Best,
>>> Kevin Liu
>>>
>>> [1] https://github.com/apache/iceberg/pull/11354
>>> [2]
>>> https://github.com/apache/iceberg/blob/8e743a5b5209569f84b6bace36e1106c67e1eab3/open-api/rest-catalog-open-api.yaml#L2414
>>> [3] https://iceberg.apache.org/spec/#table-metadata-fields
>>> [4]
>>> https://github.com/apache/iceberg/blob/8e743a5b5209569f84b6bace36e1106c67e1eab3/open-api/rest-catalog-open-api.yaml#L2325
>>> [5] https://iceberg.apache.org/spec/#snapshots
>>>
>>> On Sun, Oct 20, 2024 at 11:24 AM rdb...@gmail.com 
>>> wrote:
>>>
>>>> Was it ever valid to have a summary field without the operation key?
>>>>
>>>> No. They were introduced at the same time.
>>>>
>>>> Would it be helpful to create alternative versions of the REST spec
>>>> specifically for referencing V1 and V2 tables?
>>>>
>>>> I am strongly against this. The REST spec should be independent of the
>>>> table versions. Any table format version can be passed and the table format
>>>> should be the canonical reference for what is allowed. We want to avoid
>>>> cases where there are discrepancies. The table spec is canonical for table
>>>> metadata, and the REST spec allows passing it.
>>>>
>>>> On Sun, Oct 20, 2024 at 11:18 AM Kevin Liu 
>>>> wrote:
>>>>
>>>>> Hey folks,
>>>>>
>>>>> Thanks, everyone for the discussion, and thanks Ryan for providing the
>>>>> historical context.
>>>>> Enforce the `operation` key in Snapshot’s `summary` field
>>>>>
>>>>> When deserializing the `Snapshot` object from JSON, the Java
>>>>> implementation does not enforce that the `summary` field must contain an
>>>>> `operation` key. In the V1 spec, the `summary` field is optional, while in
>>>>> the V2 spec, it is required. However, in both versions, if a `summary`
>>>>> field is present, it must include an `operation` key. Any `summary` field
>>>>> lacking an `operation` key should be considered invalid.
>>>>>
>>>>> I’ve addressed this issue in PR 11354 [1] by adding this constraint
>>>>> when parsin

Re: [DISCUSS] Discrepancy Between Iceberg Spec and Java Implementation for Snapshot summary's 'operation' key

2024-10-17 Thread Kevin Liu

> Based on the example metadata, that looks like it is not to spec, so it's
reasonable that python would reject it.  If the java implementation is
allowing for that, it's likely that we're being too relaxed (possibly a
holdover from v1 parsing).
I believe the Java implementation is relaxing the constraint. I'll create a
PR with test cases and the necessary changes.

> Do you know what produced the metadata?
It was created by Snowflake [1]. After verifying this, I'll look into
raising the issue with them.

As a side note, the `rest-catalog-open-api.yaml` file [2] in the Iceberg
repo contains the latest version of the spec. As we're continuing to evolve
the spec for V3, would it be helpful to create a frozen version representing
both the V1 and V2 specs for reference, possibly as a separate file?

Best,
Kevin Liu

[1]
https://github.com/apache/iceberg-python/issues/1106#issuecomment-2312108455
[2]
https://github.com/apache/iceberg/blob/8e2eb9ac2e33ce4bac8956d4e2f099444d03c0e3/open-api/rest-catalog-open-api.yaml

On Thu, Oct 17, 2024 at 9:20 AM Daniel Weeks  wrote:

> Sung,
>
> I was thinking of v1, so you're right that manifest-list and summary are
> required as of v2.  The REST Spec seems to follow the v2 definition, so I
> think we're somewhat implicitly requiring those fields via REST.
>
> Kevin,
>
> Based on the example metadata, that looks like it is not to spec, so it's
> reasonable that python would reject it.  If the java implementation is
> allowing for that, it's likely that we're being too relaxed (possibly a
> holdover from v1 parsing).
>
> Do you know what produced the metadata?
>
> -Dan
>
> On Thu, Oct 17, 2024 at 9:02 AM Kevin Liu  wrote:
>
>> Thanks for the additional context.
>>
>> My understanding is that if a Snapshot has a `summary` field, it must
>> also have a corresponding `operation` key in the summary map. Is that
>> correct? Based on the `SnapshotParser`, this is not enforced [1].
>>
>> The underlying issue in #1106 [2] is the missing `operation` field when
>> the `summary` field is present.
>> For example,
>> ```
>> "summary" : {
>>   "manifests-created" : "8",
>>   "total-records" : "26508666891",
>>   "added-files-size" : "3927895626752",
>>   "manifests-kept" : "0",
>>   "total-files-size" : "3927895626752",
>>   "added-records" : "26508666891",
>>   "added-data-files" : "231513",
>>   "manifests-replaced" : "0",
>>   "total-data-files" : "231513"
>> }
>> ```
>>
>> It could be the case that this particular `metadata.json` was generated
>> not according to the spec.
>>
>> Best,
>> Kevin Liu
>>
>>
>> [1]
>> https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L124-L142
>> [2] https://github.com/apache/iceberg-python/issues/1106
>>
>>
>> On Thu, Oct 17, 2024 at 8:47 AM Sung Yun  wrote:
>>
>>> Thank you for the clarification Daniel, and thank you Kevin for raising
>>> this issue!
>>>
>>> Does that mean that we are creating component schemas that are the
>>> superset of the V1 and V2 schemas? And if so, should we remove summary and
>>> manifest-list from the required properties, and add manifests optional
>>> property to the Snapshot schema to support both V1 and V2 Summary specs?
>>> https://iceberg.apache.org/spec/#snapshots
>>>
>>> Or would creating separate component schemas for V1/V2 be a cleaner way
>>> to align the REST spec with the table spec?
>>>
>>> Sung
>>>
>>> On 2024/10/17 15:19:23 Daniel Weeks wrote:
>>> > I'm not convinced this is incorrect behavior (table spec or
>>> > implementation), but it does lend to some confusion.  The 'summary' field
>>> > is optional, which means that if a summary is not provided, you do not have
>>> > an associated 'operation' field.  The 'operation' field is only required in
>>> > the context of the summary, so it's actually possible for the
>>> > implementation (i.e. the tests you reference) to not have an operation.
>>> >
>>> > I think what is wrong here is that the REST spec marked the summary as
>>> > required
>>> > <
>>> https://github.com/apache/ic

Re: [DISCUSS] Discrepancy Between Iceberg Spec and Java Implementation for Snapshot summary's 'operation' key

2024-10-17 Thread Kevin Liu

Thanks for the additional context.

My understanding is that if a Snapshot has a `summary` field, it must also
have a corresponding `operation` key in the summary map. Is that correct?
Based on the `SnapshotParser`, this is not enforced [1].

The underlying issue in #1106 [2] is the missing `operation` field when the
`summary` field is present.
For example,
```
"summary" : {
  "manifests-created" : "8",
  "total-records" : "26508666891",
  "added-files-size" : "3927895626752",
  "manifests-kept" : "0",
  "total-files-size" : "3927895626752",
  "added-records" : "26508666891",
  "added-data-files" : "231513",
  "manifests-replaced" : "0",
  "total-data-files" : "231513"
}
```

It could be the case that this particular `metadata.json` was generated not
according to the spec.
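
A quick way to check a given `metadata.json` for this condition is sketched below in
plain Python. The `snapshots`, `snapshot-id`, and `summary` field names come from
the table spec. The function name and the trimmed example document (including the
snapshot id `1`) are made up for illustration:

```python
import json
from typing import Any, Dict, List


def snapshots_missing_operation(metadata: Dict[str, Any]) -> List[int]:
    """Return ids of snapshots whose summary is present but lacks 'operation'."""
    missing = []
    for snapshot in metadata.get("snapshots", []):
        summary = snapshot.get("summary")
        # A snapshot with no summary at all is a separate (V1) case, not flagged here.
        if summary is not None and "operation" not in summary:
            missing.append(snapshot["snapshot-id"])
    return missing


# Hypothetical, heavily trimmed metadata with a summary like the one above.
example = json.loads("""
{
  "snapshots": [
    {
      "snapshot-id": 1,
      "summary": {"manifests-created": "8", "total-data-files": "231513"}
    }
  ]
}
""")
print(snapshots_missing_operation(example))
```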

Best,
Kevin Liu


[1]
https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L124-L142
[2] https://github.com/apache/iceberg-python/issues/1106


On Thu, Oct 17, 2024 at 8:47 AM Sung Yun  wrote:

> Thank you for the clarification Daniel, and thank you Kevin for raising
> this issue!
>
> Does that mean that we are creating component schemas that are the
> superset of the V1 and V2 schemas? And if so, should we remove summary and
> manifest-list from the required properties, and add manifests optional
> property to the Snapshot schema to support both V1 and V2 Summary specs?
> https://iceberg.apache.org/spec/#snapshots
>
> Or would creating separate component schemas for V1/V2 be a cleaner way to
> align the REST spec with the table spec?
>
> Sung
>
> On 2024/10/17 15:19:23 Daniel Weeks wrote:
> > I'm not convinced this is incorrect behavior (table spec or
> > implementation), but it does lend to some confusion.  The 'summary' field
> > is optional, which means that if a summary is not provided, you do not have
> > an associated 'operation' field.  The 'operation' field is only required in
> > the context of the summary, so it's actually possible for the
> > implementation (i.e. the tests you reference) to not have an operation.
> >
> > I think what is wrong here is that the REST spec marked the summary as
> > required
> > <https://github.com/apache/iceberg/blob/8e2eb9ac2e33ce4bac8956d4e2f099444d03c0e3/open-api/rest-catalog-open-api.yaml#L2040>,
> > which is inconsistent with the table spec.
> >
> > On Wed, Oct 16, 2024 at 3:52 PM Anton Okolnychyi 
> > wrote:
> >
> > > Based on [1], we never persisted the operation in the summary map.
> > > Instead, we persisted it as a top-level field in Java, which is actually
> > > NOT what the spec says. Does anyone remember cases when the operation was
> > > unknown? I personally don't.
> > >
> > > [1] -
> > > https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L63
> > >
> > >
> > > On Wed, Oct 16, 2024 at 12:42 Kevin Liu wrote:
> > >
> > >> Hey folks,
> > >>
> > >> I’ve noticed a discrepancy between the Iceberg specification and the Java
> > >> implementation regarding the `operation` key in the `Snapshot` `summary`
> > >> field.
> > >>
> > >> The `Snapshot` object's `summary` dictionary includes a *required* key
> > >> named `operation`, as outlined in the spec describing Table Metadata and
> > >> Snapshots [1] and the generated OpenAPI YAML [2]. However, in the Java
> > >> implementation [3], `operation` is treated as optional. In contrast, it
> > >> remains a required field in the Python implementation [4].
> > >> I also found that Java tests for `SnapshotParser` assert that the
> > >> `operation` field is null. [5]
> > >>
> > >> Due to this discrepancy, a user reported [6] that the `metadata.json`
> > >> file generated for an Iceberg table could not be read by PyIceberg, though
> > >> it is readable using the Iceberg Java library.
> > >>
> > >> How should we proceed from here? Should the Java library enforce this
> > >> requirement? Additionally, how should we handle existing `metadata.json`
> > >> files that were generated without this field?
> > >>
> > >> Best,
> > >> Kevin Liu
> > >>
> > >> [1] https://iceberg.apache.org/spec/#table-metadata-and-snapshots
> > >> [2]
> > >> https://github.com/apache/iceberg/blob/8e2eb9ac2e33ce4bac8956d4e2f099444d03c0e3/open-api/rest-catalog-open-api.yaml#L2057-L2060
> > >> [3]
> > >> https://github.com/apache/iceberg/blob/64b36999d7ff716ae2534fb0972fcc10d22a64c2/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L124
> > >> [4]
> > >> https://github.com/apache/iceberg-python/blob/7cf0c225c3cdb32ac5e390de06b7b0e4fe7de92e/pyiceberg/table/snapshots.py#L182
> > >> [5]
> > >> https://github.com/apache/iceberg/blob/22a6b19c2e226eacc0aa78c1f2ffbdbb168b13be/core/src/test/java/org/apache/iceberg/TestSnapshotJson.java#L52
> > >> [6] https://github.com/apache/iceberg-python/issues/1106
> > >>
> > >>
> >
>

Re: [DISCUSS] PyIceberg 0.8.1 release

2024-11-25 Thread Kevin Liu

Hey folks,

I started working on the 0.8.1 release, using the updated "how to release"
docs
(https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/how-to-release.md).
Here are the 9 commits I propose to include in this next release:
https://github.com/apache/iceberg-python/pull/1369

Please let me know what you think.

Best,
Kevin Liu


On Thu, Nov 21, 2024 at 10:05 AM Kevin Liu  wrote:

> Thanks for starting this thread!
>
> Along with the 2 issues listed above, I propose this issue as well
> * Ignore tables with missing table_type parameter in HMS and Glue (#1331
> <https://github.com/apache/iceberg-python/issues/1331>)
>
> Best,
> Kevin Liu
>
> On Thu, Nov 21, 2024 at 5:18 AM Jean-Baptiste Onofré 
> wrote:
>
>> Hi Fokko
>>
>> It makes sense to me.
>>
>> Regards
>> JB
>>
>> On Thu, Nov 21, 2024 at 9:14 AM Fokko Driesprong 
>> wrote:
>> >
>> > Hi everyone,
>> >
>> > I suggest following up on the PyIceberg 0.8.0 release with a patch
>> release.
>> >
>> > Currently, we have two candidate bugfixes to be included:
>> >
>> > An issue where it falsely emits a warning when loading a table.
>> > Another issue is when trying to add a parquet file to a table, that
>> doesn't have column statistics for at least one column.
>> >
>> > Feel free to chime in on this thread if you want to include bug fixes
>> in the 0.8.1 milestone.
>> >
>> > Kind regards,
>> > Fokko
>>
>

Re: [ACTION REQUIRED] Removal of v3 artifact actions on December 5th

2024-11-25 Thread Kevin Liu

Hey folks,

I did a code search for both `actions/upload-artifact` and
`actions/download-artifact` in the related iceberg repos.
* https://grep.app/search?q=actions/upload-artifact%40v3&filter[repo.pattern][0]=apache/iceberg
* https://grep.app/search?q=actions/download-artifact&filter[repo.pattern][0]=apache/iceberg

Only iceberg-python is affected. Here's the PR to update the relevant
action, https://github.com/apache/iceberg-python/pull/1371
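
If you'd rather check a local clone than rely on a code-search index, a small Python
sketch like the one below can scan the workflow files. It assumes workflows live
under `.github/workflows` and treats any `@v3` or `@v3.x` tag of the two artifact
actions as affected. The function name is made up for illustration:

```python
import re
from pathlib import Path
from typing import List, Tuple

# Matches upload/download artifact actions pinned to v3 (including v3.x.y tags).
V3_ARTIFACT = re.compile(r"actions/(?:upload|download)-artifact@v3\b")


def find_v3_artifact_actions(repo_root: Path) -> List[Tuple[Path, int, str]]:
    """Return (file, line number, line) for each v3 artifact action reference."""
    hits: List[Tuple[Path, int, str]] = []
    workflows = repo_root / ".github" / "workflows"
    if not workflows.is_dir():
        return hits
    # "*.y*ml" covers both .yml and .yaml workflow files.
    for path in sorted(workflows.glob("*.y*ml")):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if V3_ARTIFACT.search(line):
                hits.append((path, lineno, line.strip()))
    return hits


if __name__ == "__main__":
    for path, lineno, line in find_v3_artifact_actions(Path(".")):
        print(f"{path}:{lineno}: {line}")
```

Run it from the root of each repo clone. Empty output means no v3 references were
found in that checkout.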

Best,
Kevin Liu

On Mon, Nov 25, 2024 at 10:36 AM Jacob Wujciak 
wrote:

> Hello Everyone!
>
> I am writing to inform you of the imminent removal of the v3 artifact
> actions that was announced in [1]. Both actions/upload-artifact@v3*
> and actions/download-artifact@v3* will stop working in 10 days, on
> December 5, 2024! According to a quick code search this project is
> using one of the actions with a v3 tag in at least one of its repos.
>
> There are breaking changes in the usage of the upload action that will
> likely require changes other than bumping the version, please see [2].
> Make sure to update your workflows in time to avoid disruptions!
>
> If you have any questions or need help with the transition I'd
> recommend bui...@apache.org as the place to look for help.
>
> Regards
> Jacob Wujciak-Jens (assignUser)
>
> [1]:
> https://github.blog/changelog/2024-04-16-deprecation-notice-v3-of-the-artifact-actions/
> [2]:
> https://github.com/actions/upload-artifact/blob/main/docs/MIGRATION.md
>

Re: [ACTION REQUIRED] Removal of v3 artifact actions on December 5th

2024-11-27 Thread Kevin Liu

Thanks Sung. I assumed grep.app would continuously index all GitHub repos,
but it seems to be missing a few.

For completeness, I went through the GitHub search feature, using
`org:apache` with both `upload-artifact@v3` and `download-artifact@v3`.
* https://github.com/search?q=org%3Aapache%20upload-artifact%40v3&type=code
* https://github.com/search?q=org%3Aapache+download-artifact%40v3&type=code

Looks like `iceberg-rust` is the only place we missed.

Best,
Kevin Liu

On Wed, Nov 27, 2024 at 5:45 AM Sung Yun  wrote:

> Hi JB and Kevin, thank you for jumping on the chore.
>
> Here's one more PR to bump up the version in iceberg-rust:
> https://github.com/apache/iceberg-rust/pull/725
>
> I assume this didn't show up in the grep.app search since it was recently
> merged
>
> On 2024/11/26 22:22:36 Kevin Liu wrote:
> > We merged the PR[1] to upgrade `upload-artifact` to V4. Thanks, Fokko for
> > the review.
> >
> > Best,
> > Kevin Liu
> >
> > [1] https://github.com/apache/iceberg-python/pull/1371
> >
> >
> > On Mon, Nov 25, 2024 at 10:36 PM Jean-Baptiste Onofré 
> > wrote:
> >
> > > Hi Kevin
> > >
> > > I did a quick search and I have the same feedback as you: only
> > > iceberg-python is impacted.
> > >
> > > Thanks for the PR !
> > >
> > > Regards
> > > JB
> > >
> > > On Mon, Nov 25, 2024 at 9:03 PM Kevin Liu wrote:
> > > >
> > > > Hey folks,
> > > >
> > > > I did a code search for both `actions/upload-artifact` and
> > > > `actions/download-artifact` in the related iceberg repos.
> > > > * https://grep.app/search?q=actions/upload-artifact%40v3&filter[repo.pattern][0]=apache/iceberg
> > > > * https://grep.app/search?q=actions/download-artifact&filter[repo.pattern][0]=apache/iceberg
> > > >
> > > > Only iceberg-python is affected. Here's the PR to update the relevant
> > > > action, https://github.com/apache/iceberg-python/pull/1371
> > > >
> > > > Best,
> > > > Kevin Liu
> > > >
> > > > On Mon, Nov 25, 2024 at 10:36 AM Jacob Wujciak <assignu...@apache.org> wrote:
> > > >>
> > > >> Hello Everyone!
> > > >>
> > > >> I am writing to inform you of the imminent removal of the v3 artifact
> > > >> actions that was announced in [1]. Both actions/upload-artifact@v3*
> > > >> and actions/download-artifact@v3* will stop working in 10 days, on
> > > >> December 5, 2024! According to a quick code search this project is
> > > >> using one of the actions with a v3 tag in at least one of its repos.
> > > >>
> > > >> There are breaking changes in the usage of the upload action that will
> > > >> likely require changes other than bumping the version, please see [2].
> > > >> Make sure to update your workflows in time to avoid disruptions!
> > > >>
> > > >> If you have any questions or need help with the transition I'd
> > > >> recommend bui...@apache.org as the place to look for help.
> > > >>
> > > >> Regards
> > > >> Jacob Wujciak-Jens (assignUser)
> > > >>
> > > >> [1]: https://github.blog/changelog/2024-04-16-deprecation-notice-v3-of-the-artifact-actions/
> > > >> [2]: https://github.com/actions/upload-artifact/blob/main/docs/MIGRATION.md
> > >
> >
>

Re: [DISCUSS] iceberg rust 0.4.0 and iceberg pyiceberg_core 0.1.0 release

2024-11-27 Thread Kevin Liu

Thanks for driving this, Sung! I'm +1 to release both iceberg-rust and
pyiceberg_core. It's very exciting to see pyiceberg_core and its
integration with PyIceberg.
It makes sense to decouple pyiceberg_core from iceberg-rust since the two
"projects" are on different tracks. We'd want to release pyiceberg_core
features independent of iceberg-rust features.

Please let me know if there's anything I can do to help.

Best,
Kevin Liu

On Wed, Nov 27, 2024 at 6:13 AM Sung Yun  wrote:

> Hi folks, it's been some time since we've done an Iceberg Rust release,
> and we've finally set up the ghactions workflow[1] that will allow us to
> build and publish an abi3 compatible wheel to Pypi.
>
> If we are still +1 for the release (both iceberg-rust and pyiceberg_core),
> I think it'll be awesome to get this release out soon as it will help the
> PyIceberg community test out the pyiceberg_core binding in preparation for
> the next release.
>
> Another option would be to introduce a workflow_dispatch trigger to the
> python_release.yml and run a decoupled, release for pyiceberg_core[2]
>
> I'd be happy to help run the release, if no one has started looking into
> it already.
>
> Sung
>
> [1] https://github.com/apache/iceberg-rust/pull/705
> [2] https://lists.apache.org/thread/j22o7yktrlddrgkcy7gl88o23nyrgooc
>
> On 2024/09/05 14:06:10 xianjin wrote:
> > +1 for this pyiceberg_core as well.
> >
> > Two cents about the iceberg-rust release schedule: it seems too
> > aggressive to release by 2 weeks, monthly(4 weeks) release would be a
> > nice fit.
> >
> > Sent from my iPhone
> >
> > > On Sep 5, 2024, at 8:25 PM, Sung Yun  wrote:
> > >
> > > Thank you for driving this Xuanwo!
> > >
> > > +1 as well, as noted the 0.1.0 pyiceberg_core release will allow
> > > PyIceberg to begin integrating with the rust based core and introduce
> > > a new feature that the community is looking for.
> > >
> > > On Thu, Sep 5, 2024 at 6:05 AM Renjie Liu
> > > <liurenjie2...@gmail.com> wrote:
> > >
> > >> +1 for this release.
> > >>
> > >> As iceberg-rust is under fast development, a shorter release (3-4
> > >> weeks) schedule would benefit users so that they don't need to rely
> > >> on a snapshot version.
> > >>
> > >> On Thu, Sep 5, 2024 at 3:26 PM Xuanwo <xua...@apache.org> wrote:
> > >>
> > >>> Hello, everyone
> > >>>
> > >>> I'm starting this thread to discuss the release of iceberg rust
> > >>> 0.4.0 and iceberg pyiceberg_core 0.1.0.
> > >>>
> > >>> There is no specific reason for this release. I just want to align
> > >>> with the two- to three-week release schedule of iceberg rust so
> > >>> users don't have to wait long or encounter too many breaking
> > >>> changes at once.
> > >>>
> > >>> Additionally, the pyiceberg team is awaiting our first release of
> > >>> pyiceberg_core 0.1.0 so they can integrate with it, see how it
> > >>> works, and explore ways to improve collaboration.
> > >>>
> > >>> What do you think?
> > >>>
> > >>> Xuanwo
> > >>>
> > >>> <https://xuanwo.io/>
> > >>>
> >
>

Re: [Discuss] Document Snapshot Summary Optional Fields for Standardization

2024-11-27 Thread Kevin Liu

Thanks for driving this Honah!

It's important to have a consistent naming scheme so that we don't need to
worry about edge cases when using multiple engines, and possibly have to
deal with migrations.

Also, since users can store arbitrary key/value pairs in the summary
property, it's good to document the currently used properties to avoid
collision.
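
To make the collision concern concrete, here is a minimal Python sketch. The key set is only a small, non-exhaustive sample of summary properties that I believe the Java implementation writes today, and `colliding_keys` is a hypothetical helper, not an existing API in any Iceberg library.

```python
# Small, non-exhaustive sample of summary keys (an assumption for
# illustration; the proposed "snapshot summary" table would be the
# authoritative list).
DOCUMENTED_SUMMARY_KEYS = frozenset(
    {
        "operation",
        "added-data-files",
        "deleted-data-files",
        "added-records",
        "deleted-records",
        "total-records",
        "total-data-files",
    }
)


def colliding_keys(custom_properties, documented=DOCUMENTED_SUMMARY_KEYS):
    """Return the user-supplied summary keys that shadow documented ones, sorted."""
    return sorted(set(custom_properties) & documented)


# Hypothetical user-supplied summary properties.
custom = {"total-records": "10", "my-app.job-id": "42"}
print(colliding_keys(custom))
```

With a documented table, an engine or user could run a check like this before attaching custom properties to a commit.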

I like the proposal to document all properties in a "snapshot summary"
table. This will ensure a centralized place to view all possible key/value
pairs, similar to how FileIO configuration is handled in iceberg-python
<https://py.iceberg.apache.org/configuration/#s3>. Other
implementations can use this table as a reference.

> This approach offers flexibility, as new fields can be added through
> documentation updates without requiring specification changes.
This will save a lot of effort since specification changes require
greater scrutiny.

> summary details would not be located near the Snapshot section, which
> explains the summary field.
We can link the table to the Snapshot section.


Would love to hear others' thoughts on this.

Best,
Kevin Liu

On Tue, Nov 26, 2024 at 2:50 PM Honah J.  wrote:

> Hi everyone,
>
> I’d like to propose an addition to the table specification to document
> optional fields in the snapshot summary.
>
> Currently, the snapshot summary includes a required operation field and
> various optional fields. While these optional fields—such as metrics and
> partition-level summaries—are supported by Java
> <https://github.com/apache/iceberg/blob/549674b3fc0cdb18d6cad3e2d6320236fba8c562/core/src/main/java/org/apache/iceberg/SnapshotSummary.java#L32-L64>
> and Python
> <https://github.com/HonahX/iceberg-python/blob/45d611fe351f6f3847bf329aa053d890d810e2b6/pyiceberg/table/snapshots.py#L36-L60>
> implementations, they are not officially documented. This creates risks of
> inconsistency as other implementations and engines adopt and interact with
> these fields.
>
> I propose adding a new section to the table specification to document
> these optional fields, ensuring consistent naming conventions and reducing
> ambiguity across implementations. While this is the primary proposal, it
> may also be worth discussing whether documenting these fields separately in
> Docs/Table would provide additional flexibility for future updates.
>
> I’d love to hear your thoughts, suggestions, or concerns about this
> proposal.
>
> Looking forward to the discussion!
>
> Links
>
>- GitHub tracking issue: https://github.com/apache/iceberg/issues/11659
>- Proposal:
>
> https://docs.google.com/document/d/1Gt1ZOXVXK60IGdlmt4QlyRzaZ1iCVyYUBfMJCsiz14I/edit?usp=sharing
>- PR: https://github.com/apache/iceberg/pull/11660
>
>
> Best regards,
> Honah
>

Re: [VOTE] Release Apache PyIceberg 0.8.1rc1

2024-11-27 Thread Kevin Liu

Hey Sung,

Good point. For context, I accidentally built and uploaded a `0.8.1`
version to PyPI instead of `0.8.1rc1`. Fokko helped me yank that
version: https://pypi.org/project/pyiceberg/0.8.1/

If this RC passes, we can un-yank and reuse the currently uploaded version.
Otherwise, I can create a new patch version using `0.8.2`. How does that
sound?

Additionally, I created a PR to prevent this from happening again.
https://github.com/apache/iceberg-python/pull/1386
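
To illustrate the kind of guard that helps here (this is not the logic in the PR above, just a sketch under my own assumptions), a release script could refuse to upload anything whose version is not shaped like `X.Y.ZrcN`. The names `RC_VERSION` and `is_release_candidate` are hypothetical.

```python
import re

# Accepts versions such as "0.8.1rc1"; rejects final versions such as "0.8.1".
RC_VERSION = re.compile(r"\d+\.\d+\.\d+rc\d+")


def is_release_candidate(version):
    """Return True only when the whole string looks like X.Y.ZrcN."""
    return RC_VERSION.fullmatch(version) is not None


for candidate in ("0.8.1rc1", "0.8.1"):
    print(candidate, is_release_candidate(candidate))
```

A check like this would run before the upload step, so a final version number cannot reach PyPI during the RC phase.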

Best,
Kevin Liu

On Wed, Nov 27, 2024 at 5:07 PM Sung Yun  wrote:

> Hi Kevin,
>
> Thank you so much for working on this release!
>
> I noticed this morning that PyIceberg 0.8.1 was released and yanked[1]
> this morning. Similar to how we had handled it when this had happened last
> time, I think this would mean that we would need to now move on to the next
> version and publish it as a PyIceberg 0.8.2 release instead. Hence, I think
> it would make sense to start a new vote thread with the incremented version.
>
> Sung
>
> [1] https://pypi.org/project/pyiceberg/
>
> On Wed, Nov 27, 2024 at 7:55 PM Kevin Liu  wrote:
>
>> Hi Everyone,
>>
>> I propose that we release the following RC as the official PyIceberg
>> 0.8.1 release.
>>
>> The commit ID is a051584a3684392d2db6556449eb299145d47d15
>>
>> * This corresponds to the tag: pyiceberg-0.8.1rc1
>> (17124779c5294cb928f3807ed539f427f9b4bd2e)
>> * https://github.com/apache/iceberg-python/releases/tag/pyiceberg-0.8.1rc1
>> * https://github.com/apache/iceberg-python/tree/a051584a3684392d2db6556449eb299145d47d15
>>
>> The release tarball, signature, and checksums are here:
>>
>> * https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.8.1rc1/
>>
>> You can find the KEYS file here:
>>
>> * https://downloads.apache.org/iceberg/KEYS
>>
>> Convenience binary artifacts are staged on pypi:
>>
>> https://pypi.org/project/pyiceberg/0.8.1rc1/
>>
>> And can be installed using: pip3 install pyiceberg==0.8.1rc1
>>
>> Instructions for verifying a release can be found here:
>>
>> * https://py.iceberg.apache.org/verify-release/
>>
>> High-Level Summary
>> *Breaking Changes*
>> * The `Table.name` method now returns the table name *without the
>> catalog name*, as part of a broader effort to remove catalog references
>> in PyIceberg.
>>   * Replace usages of `Table.identifier` with `Table.name` in the codebase
>>   * Replace usages of the deprecated function
>> (`identifier_to_tuple_without_catalog`) in the codebase which removes
>> unnecessary warnings
>>
>>
>> *Bug fixes*
>> * Fix `add_files` for parquet files missing column statistics
>> * Allow leading underscore in column name used in row filter
>> * Ignore Glue and Hive tables missing the `table_type` property
>> * Write `null` in manifest list metadata when there is no
>> `parent-snapshot-id`
>>
>>
>> *Dependency Updates*
>> * Removed upper-bound restrictions on dependencies;
>> allow early testing of new versions:
>>   * Remove Python library version upper bound restriction; allow Python
>> 3.13
>>   * Remove fsspec library version upper bound restriction
>>
>>
>> *Documentation Updates*
>> * Improve “how to release” documentation
>> * Included post-release steps for version 0.8.0
>> * Included documentation updates in this patch release to reflect these
>> changes in https://py.iceberg.apache.org/
>>
>> *Commit Summary*
>> * [36 new commits since the `0.8.0` release](
>> https://github.com/apache/iceberg-python/compare/pyiceberg-0.8.0...acbd071375ac4cc2053435346737a3b1a64cce2e).
>>
>> * 12 new commits will be included in 0.8.1
>>   * 11 commits cherry-picked as bug fixes (listed below)
>>   * 1 [commit](
>> https://github.com/apache/iceberg-python/commit/58389dfe5cf5f6ef6ea16c47cd11408c642fafd1)
>> to bump version to `0.8.1`
>>
>> *Detailed Commits*
>> * acbd071 Write `null` when there is no parent-snapshot-id (#1383)
>> * bb078cf Add instruction for patch release (#1373)
>> * ab43c6c fix `KeyError` raised by `add_files` when parquet file doe not
>> have column stats (#1354)
>> * cc1ab2c Improve documentation for "how to release" (#1359)
>> * 64dc6fe Remove Python 3.13 upper bound restriction (#1355)
>> * d86ab6e Allow leading underscore in column name used in row filter
>> (#1358)
>> * 7a4734e Replace reference of `Table.identifier` with `Table.name`
>> (#1346)
>> * a66ddc0 Ignore tables without `table_type` from Glue and Hive (#1332)
>> * 2cbc77d Drop upper bounds for fsspec and it's implementations (#1341)
>> * 7660a5b 0.8.0 post release steps (#1334)
>> * b2f0a9e use the non-deprecated func (#1326)
>>
>>
>> Please download, verify, and test.
>>
>> Please vote in the next 72 hours.
>> [ ] +1 Release this as PyIceberg 0.8.1
>> [ ] +0
>> [ ] -1 Do not release this because...
>>
>> Best,
>> Kevin Liu
>>
>

[VOTE] Release Apache PyIceberg 0.8.1rc1

2024-11-27 Thread Kevin Liu

Hi Everyone,

I propose that we release the following RC as the official PyIceberg 0.8.1
release.

The commit ID is a051584a3684392d2db6556449eb299145d47d15

* This corresponds to the tag: pyiceberg-0.8.1rc1
(17124779c5294cb928f3807ed539f427f9b4bd2e)
* https://github.com/apache/iceberg-python/releases/tag/pyiceberg-0.8.1rc1
* https://github.com/apache/iceberg-python/tree/a051584a3684392d2db6556449eb299145d47d15

The release tarball, signature, and checksums are here:

* https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.8.1rc1/

You can find the KEYS file here:

* https://downloads.apache.org/iceberg/KEYS

Convenience binary artifacts are staged on pypi:

https://pypi.org/project/pyiceberg/0.8.1rc1/

And can be installed using: pip3 install pyiceberg==0.8.1rc1

Instructions for verifying a release can be found here:

* https://py.iceberg.apache.org/verify-release/
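
As a rough illustration of the checksum part of verification (the instructions linked above are the authoritative steps), here is a minimal Python sketch. It assumes a SHA-512 checksum file whose first whitespace-separated token is the hex digest; `sha512_of` and `matches_checksum` are hypothetical helpers. Signature verification needs GPG and the KEYS file and is not covered here.

```python
import hashlib


def sha512_of(path, chunk_size=1 << 20):
    """Compute the SHA-512 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, checksum_line):
    """Compare a file against a '<hex digest>  <filename>' style line."""
    parts = checksum_line.split()
    if not parts:
        # An empty checksum line can never match.
        return False
    return sha512_of(path) == parts[0].lower()
```

Usage would be along the lines of `matches_checksum("pyiceberg-0.8.1.tar.gz", open("pyiceberg-0.8.1.tar.gz.sha512").read())`, with both file names being placeholders for whatever is downloaded from the dist directory.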

High-Level Summary
*Breaking Changes*
* The `Table.name` method now returns the table name *without the catalog
name*, as part of a broader effort to remove catalog references in
PyIceberg.
  * Replace usages of `Table.identifier` with `Table.name` in the codebase
  * Replace usages of the deprecated function
(`identifier_to_tuple_without_catalog`) in the codebase which removes
unnecessary warnings


*Bug fixes*
* Fix `add_files` for parquet files missing column statistics
* Allow leading underscore in column name used in row filter
* Ignore Glue and Hive tables missing the `table_type` property
* Write `null` in manifest list metadata when there is no
`parent-snapshot-id`


*Dependency Updates*
* Removed upper-bound restrictions on dependencies;
allow early testing of new versions:
  * Remove Python library version upper bound restriction; allow Python 3.13
  * Remove fsspec library version upper bound restriction


*Documentation Updates*
* Improve “how to release” documentation
* Included post-release steps for version 0.8.0
* Included documentation updates in this patch release to reflect these
changes in https://py.iceberg.apache.org/

*Commit Summary*
* [36 new commits since the `0.8.0` release](
https://github.com/apache/iceberg-python/compare/pyiceberg-0.8.0...acbd071375ac4cc2053435346737a3b1a64cce2e).

* 12 new commits will be included in 0.8.1
  * 11 commits cherry-picked as bug fixes (listed below)
  * 1 [commit](
https://github.com/apache/iceberg-python/commit/58389dfe5cf5f6ef6ea16c47cd11408c642fafd1)
to bump version to `0.8.1`

*Detailed Commits*
* acbd071 Write `null` when there is no parent-snapshot-id (#1383)
* bb078cf Add instruction for patch release (#1373)
* ab43c6c fix `KeyError` raised by `add_files` when parquet file doe not
have column stats (#1354)
* cc1ab2c Improve documentation for "how to release" (#1359)
* 64dc6fe Remove Python 3.13 upper bound restriction (#1355)
* d86ab6e Allow leading underscore in column name used in row filter (#1358)
* 7a4734e Replace reference of `Table.identifier` with `Table.name` (#1346)
* a66ddc0 Ignore tables without `table_type` from Glue and Hive (#1332)
* 2cbc77d Drop upper bounds for fsspec and it's implementations (#1341)
* 7660a5b 0.8.0 post release steps (#1334)
* b2f0a9e use the non-deprecated func (#1326)


Please download, verify, and test.

Please vote in the next 72 hours.
[ ] +1 Release this as PyIceberg 0.8.1
[ ] +0
[ ] -1 Do not release this because...

Best,
Kevin Liu

Re: [VOTE] Release Apache PyIceberg 0.8.0rc2

2024-11-18 Thread Kevin Liu

Thanks everyone for voting! The 72 hours have passed, and a minimum of 3
binding votes have been cast:

The vote passes with 3 non-binding +1 votes and 3 binding +1 votes and no
-1 votes:
non-binding: Kevin, Sung, Andre
binding: Fokko, Honah, Daniel
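
For reference, the tally above can be reproduced with a small Python sketch. It assumes the usual release-vote rule as I understand it (at least three binding +1 votes, and more binding +1 than binding -1 votes); the `Vote` class and both functions are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    voter: str
    value: int  # +1, 0, or -1
    binding: bool


def tally(votes):
    """Count +1 votes by kind and all -1 votes."""
    counts = {"binding +1": 0, "non-binding +1": 0, "-1": 0}
    for vote in votes:
        if vote.value > 0:
            counts["binding +1" if vote.binding else "non-binding +1"] += 1
        elif vote.value < 0:
            counts["-1"] += 1
    return counts


def passes(votes):
    """Assumed rule: >= 3 binding +1 and more binding +1 than binding -1."""
    binding_plus = sum(1 for v in votes if v.binding and v.value > 0)
    binding_minus = sum(1 for v in votes if v.binding and v.value < 0)
    return binding_plus >= 3 and binding_plus > binding_minus


votes = [
    Vote("Kevin", 1, False),
    Vote("Sung", 1, False),
    Vote("Andre", 1, False),
    Vote("Fokko", 1, True),
    Vote("Honah", 1, True),
    Vote("Daniel", 1, True),
]
print(tally(votes), passes(votes))
```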

The release candidate has been accepted as PyIceberg 0.8.0. Thanks
everyone, when all artifacts are published the announcement will be sent
out.

Best,
Kevin Liu

On Mon, Nov 18, 2024 at 10:07 AM Daniel Weeks  wrote:

> +1 (binding)
>
> Verified sigs/sums/license/tests+s3 (Python 3.11.9)
>
> -Dan
>
> On Sat, Nov 16, 2024 at 4:03 PM André Luis Anastácio
>  wrote:
>
>> +1 (non-binding)
>>
>> - verified signature and checksum
>> - verified license check
>> - ran install and some manual tests in python 3.11
>>
>> André Anastácio
>>
>> On Saturday, November 16th, 2024 at 4:08 AM, Honah J. 
>> wrote:
>>
>> +1 (binding)
>>
>> Thanks for running the release!
>>
>> - Verified signatures/checksum/license
>> - Ran tests "make test-coverage" in python 3.11
>>
>> Best regards,
>> Honah
>>
>> On Fri, Nov 15, 2024 at 7:46 AM Fokko Driesprong 
>> wrote:
>>
>>> +1 binding
>>>
>>> Thanks for running this release! Checked the signatures, checksums, and
>>> licenses.
>>>
>>> Kind regards,
>>> Fokko
>>>
>>> Op vr 15 nov 2024 om 14:52 schreef Sung Yun :
>>>
>>>> Hi Kevin,
>>>>
>>>> Thank you again for running this release!
>>>>
>>>> I've verified the License headers, checksums and signatures.
>>>>
>>>> Downloaded the RC from SVN and ran the tests.
>>>>
>>>> Downloaded the package from pypi and ran sanity checks.
>>>>
>>>> +1 (non-binding)
>>>>
>>>> Sung
>>>>
>>>> On 2024/11/14 20:56:44 Kevin Liu wrote:
>>>> > Hi Everyone,
>>>> >
>>>> > I propose that we release the following RC as the official PyIceberg
>>>> 0.8.0
>>>> > release.
>>>> >
>>>> > The commit ID is 3ccdc44735d70bd3ef6ed18b60b3eba43c4b3b44
>>>> > <https://github.com/apache/iceberg-python/commit/3ccdc44735d70bd3ef6ed18b60b3eba43c4b3b44>
>>>> >
>>>> > - This corresponds to the tag: pyiceberg-0.8.0rc2
>>>> >   (4a7abd0478996547ee68a5ee1847130bc0a45c10)
>>>> > - https://github.com/apache/iceberg-python/releases/tag/pyiceberg-0.8.0rc2
>>>> > - https://github.com/apache/iceberg-python/tree/3ccdc44735d70bd3ef6ed18b60b3eba43c4b3b44
>>>> >
>>>> > The release tarball, signature, and checksums are here:
>>>> >
>>>> > - https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.8.0rc2/
>>>> >
>>>> > You can find the KEYS file here:
>>>> >
>>>> > - https://downloads.apache.org/iceberg/KEYS
>>>> >
>>>> > Convenience binary artifacts are staged on pypi:
>>>> >
>>>> > https://pypi.org/project/pyiceberg/0.8.0rc2/
>>>> >
>>>> > And can be installed using: pip3 install pyiceberg==0.8.0rc2
>>>> >
>>>> > Instructions for verifying a release can be found here:
>>>> >
>>>> > - https://py.iceberg.apache.org/verify-release/
>>>> >
>>>> > Please download, verify, and test.
>>>> >
>>>> > High-level Summary
>>>> >
>>>> > - 185 new commits
>>>> >   <https://github.com/apache/iceberg-python/compare/pyiceberg-0.7.1...pyiceberg-0.8.0rc2>
>>>> > - 18 new first-time contributors
>>>> > - Deprecation Notice
>>>> >   - Deprecated configuration properties: profile_name, region_name,
>>>> >     aws_access_key_id, aws_secret_access_key, and aws_session_token
>>>> > -
>>>> >

[ANNOUNCE] Apache PyIceberg release 0.8.0

2024-11-18 Thread Kevin Liu

Hi everyone,

I'm pleased to announce the release of Apache PyIceberg 0.8.0!

Apache Iceberg is an open table format for huge analytic datasets. Iceberg
delivers high query performance for tables with tens of petabytes of data,
along with atomic commits, concurrent writes, and SQL-compatible table
evolution.

This Python release can be downloaded from:
https://pypi.org/project/pyiceberg/0.8.0/

Thanks to everyone for contributing!

Best,
Kevin Liu

Re: [ANNOUNCE] Apache Iceberg Go release v0.1.0

2024-11-18 Thread Kevin Liu

Excited to see the first official release of the Apache Iceberg Go library!
Thanks everyone for contributing!  And thanks Matt & Fokko for working on
the release.

Cheers,
Kevin Liu

On Mon, Nov 18, 2024 at 11:10 AM Matt Topol  wrote:

> Hi everyone,
>
> I'm pleased to announce the release of Apache Iceberg Go v0.1.0!
>
> Apache Iceberg is an open table format for huge analytic datasets, Iceberg
> delivers high query performance for tables with tens of petabytes of data,
> along with atomic commits, concurrent writes, and SQL-compatible table
> evolution.
>
> This Go release can be installed via `go get
> github.com/apache/iceberg@v0.1.0`
> <http://github.com/apache/iceberg@v0.1.0> and the documentation can be
> found at https://pkg.go.dev/github.com/apache/iceberg-go@v0.1.0
>
> Thanks to everyone for contributing!!
>
> --Matt
>

Re: [DISCUSS] Deprecate embedded manifests

2024-11-19 Thread Kevin Liu

+1

On Tue, Nov 19, 2024 at 9:23 AM Bryan Keller  wrote:

> +1 to deprecate
>
> On Nov 19, 2024, at 3:32 AM, Fokko Driesprong  wrote:
>
> Hi everyone,
>
> I would like to propose to deprecate embedded manifests. This has been
> used before the manifest-list was introduced, but I don't think they are
> used since the project has been open-sourced, and it would be good to
> officially deprecate them from the spec. It is only supported by Iceberg
> Java today, and I haven't seen any requests for PyIceberg to add support
> for this.
>
> Any questions or concerns about deprecating the embedded manifests?
>
> Kind regards,
> Fokko Driesprong
>
>
>

Re: [VOTE] Deprecate and remove last-column-id

2024-11-19 Thread Kevin Liu

+1 (non-binding). The spec and code deprecation schedule looks good to me.

Best,
Kevin Liu

On Tue, Nov 19, 2024 at 8:42 AM Christian Thiel
 wrote:

> +1 (non-binding) – looks like we are going in the right direction in rust!
>
>
> Christian
>
>
> On 19. Nov 2024, at 16:13, Jack Ye  wrote:
>
> +1
>
> -Jack
>
> On Tue, Nov 19, 2024 at 7:45 AM Russell Spitzer 
> wrote:
>
>> +1
>>
>> On Tue, Nov 19, 2024 at 4:11 AM Fokko Driesprong 
>> wrote:
>>
>>> Hey Manu,
>>>
>>> That's an excellent question. I took the following rationale:
>>>
>>>- For the code, the iceberg-core module, a minor release deprecation
>>>cycle is required
>>><https://iceberg.apache.org/contribute/#semantic-versioning>.
>>>- For the spec, I noticed that the deprecation of the getToken endpoint
>>>
>>> <https://github.com/apache/iceberg/blob/7af519ad5df13256fda480cc31e975e63dd8763b/open-api/rest-catalog-open-api.yaml#L186-L187>
>>>was set for removal in 2.0, so that's what I took for the last-column-id PR
>>>as well.
>>>
>>> To conclude: both spec and code will be deprecated in the following
>>> minor release (1.8.x), the removal from the code is staged for the next
>>> minor (1.9.x) or major release (2.x.x), and the removal from the spec is
>>> planned for (2.x.x). Hope this clarifies.
>>>
>>> Kind regards,
>>> Fokko Driesprong
>>>
>>> Op di 19 nov 2024 om 10:45 schreef Manu Zhang :
>>>
>>>> Thanks Fokko.
>>>>
>>>> To be clear, are you proposing to deprecate last-column-id in 1.8.0 and
>>>> remove in 1.9.0+?
>>>>
>>>> On Tue, Nov 19, 2024 at 4:18 PM Fokko Driesprong 
>>>> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> Based on the positive feedback on the [DISCUSS] thread
>>>>> <https://lists.apache.org/thread/jz5s7pm2bhbm87ft495d6yrsh3bqvtb9> and
>>>>> the pull-request on GitHub
>>>>> <https://github.com/apache/iceberg/pull/11514/>, I would like to
>>>>> raise a vote to deprecate and remove the last-column-id field from the
>>>>> spec. Since this is a spec change, please vote in the next 72 hours:
>>>>>
>>>>> [ ] +1, commit the proposed spec changes
>>>>> [ ] 0
>>>>> [ ] -1, do not make these changes because...
>>>>>
>>>>> Kind regards,
>>>>> Fokko
>>>>>
>>>>
>

Re: Changing Ownership and Cadence for Catalog Community Sync

2024-11-19 Thread Kevin Liu

Thanks, Honah, I received the new GCal invite. I double-checked the links
for Google Meets and meeting notes, everything seems to be correct.

Best,
Kevin Liu

On Tue, Nov 19, 2024 at 11:48 AM Honah J.  wrote:

> Thanks Jack!
>
> Hi everyone, I am very happy to help host the meeting series. Just a quick
> heads up that tomorrow's Iceberg Catalog Community Sync (Nov 20 9:00 am -
> 10:00am PST) meeting will proceed as usual. I will create a new event
> series in the same calendar soon.
>
> Best regards,
> Honah
>
> On Tue, Nov 19, 2024 at 8:26 AM Jack Ye  wrote:
>
>> Hi everyone,
>>
>> We have been doing the catalog community sync for quite a few months now
>> and have made quite some good progress on the REST catalog development
>> front.
>>
>> I personally have some plans for travelling in the next few months and
>> would likely not be able to host or join the meeting series. But luckily I
>> happened to talk to Honah a few days ago and he is willing to take it over.
>> Because so far I have been using my personal Google One plan for scheduling
>> the meetings, I will go ahead to cancel my current calendar invites and
>> transfer the ownership to Honah for now. If there is anyone else willing to
>> help host the meeting series, definitely let us know and we could rotate.
>>
>> During the last sync, we also discussed the possibility to reduce the
>> cadence of the catalog sync meeting to once every 3 weeks, rather than
>> twice every 3 weeks, given the participation rate and amount of topics we
>> have on the discussion schedule. Please let us know if there are any
>> thoughts about this proposal, and we can coordinate accordingly.
>>
>> Best,
>> Jack Ye
>>
>

Re: [VOTE] Release Apache PyIceberg 0.8.0rc1

2024-11-14 Thread Kevin Liu

Thanks for testing, Sung! I agree, let's include this fix and cut an RC2.
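
For readers skimming the thread, the issue Sung describes below boils down to treating an optional response field as required. A minimal sketch of the intended behavior, using a stdlib dataclass as a stand-in (PyIceberg's real model is a pydantic class, and the names `StagedTableResponse` and `parse_table_response` are hypothetical):

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class StagedTableResponse:
    metadata: Dict[str, Any]
    # Optional per the REST catalog spec as described in this thread, so a
    # missing key must not fail parsing.
    metadata_location: Optional[str] = None


def parse_table_response(payload):
    """Build a response object, tolerating a missing `metadata-location`."""
    return StagedTableResponse(
        metadata=payload["metadata"],
        metadata_location=payload.get("metadata-location"),
    )


# A staged table: the server omits `metadata-location` entirely.
staged = parse_table_response({"metadata": {"format-version": 2}})
print(staged.metadata_location)
```

The key point is that the absent key maps to `None` instead of raising, which is what the fix restores for staged table transactions.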

> Noted a couple small things that I don't think are blockers:
Thanks, Dan. I've created GitHub issues to track both of the issues you
mentioned:
https://github.com/apache/iceberg-python/issues/1317
https://github.com/apache/iceberg-python/issues/1318

Best,
Kevin Liu

On Wed, Nov 13, 2024 at 7:37 PM Sung Yun  wrote:

> Hi folks,
>
> While testing out the Rest Catalog Adapter docker image that Ajantha has
> been working on, I ran into an issue when parsing the TableResponse
> of a staged table. While the metadata-location is an optional field
> according to the Iceberg Rest Catalog Spec, the field is being handled as a
> required field in PyIceberg, due to peculiarities in how the pydantic model
> needs to be defined in order to allow for the field to truly be optional in
> the provided response.
>
> The implication of not fixing this would be that PyIceberg 0.8.0 would
> not be able to support staged table transactions against REST Catalog
> Servers that omit the `metadata-location` field in the TableResponse.
>
> @ Kevin - what are your thoughts on cutting out a second RC that includes
> this fix?
>
> Here's the PR to resolve this issue, that explains the issue in more
> detail: https://github.com/apache/iceberg-python/pull/1321
>
> On 2024/11/13 13:11:46 Jean-Baptiste Onofré wrote:
> > +1 (non binding)
> >
> > I checked:
> > - Signature and hash are OK
> > - ASF header present
> > - LICENSE and NOTICE look good
> >
> > Thanks !
> > Regards
> > JB
> >
> > On Thu, Nov 7, 2024 at 10:57 PM Kevin Liu 
> wrote:
> > >
> > > Hi Everyone,
> > >
> > > I propose that we release the following RC as the official PyIceberg
> 0.8.0 release.
> > >
> > > The commit ID is 0eaadb9
> > >
> > > This corresponds to the tag: pyiceberg-0.8.0rc1
> (ac00f5354c2c12ed8f465295a3a626e0db9c1689)
> > >
> https://github.com/apache/iceberg-python/releases/tag/pyiceberg-0.8.0rc1
> > >
> https://github.com/apache/iceberg-python/tree/0eaadb9e61c7c9373eddaafd723c3be9fd66ab42
> > >
> > > The release tarball, signature, and checksums are here:
> > >
> > > https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.8.0rc1/
> > >
> > > You can find the KEYS file here:
> > >
> > > https://dist.apache.org/repos/dist/dev/iceberg/KEYS
> > >
> > > Convenience binary artifacts are staged on pypi:
> > >
> > > https://pypi.org/project/pyiceberg/0.8.0rc1/
> > >
> > > And can be installed using: pip3 install pyiceberg==0.8.0rc1
> > >
> > > Instructions for verifying a release can be found here:
> > >
> > > https://py.iceberg.apache.org/verify-release/
> > >
> > > Please download, verify, and test.
> > >
> > > High-level Summary
> > >
> > > 176 new commits
> > > 18 new first-time contributors
> > > Deprecation Notice
> > >
> > > Deprecated configuration properties: profile_name, region_name,
> aws_access_key_id, aws_secret_access_key, and aws_session_token
> > > Deprecated functions: to_requested_schema in pyiceberg/io/pyarrow.py
> and add_snapshot and set_ref_snapshot in pyiceberg/table/__init__.py
> > >
> > > Find a detailed list of PRs at
> https://github.com/apache/iceberg-python/releases/tag/pyiceberg-0.8.0rc1
> > > Highlights
> > >
> > > Documentation improvements
> > >
> > > Improve docstrings, configuration, etc
> > > Improve the release process; updated “How to Release” and “Verify
> Release” documentation
> > >
> > > General
> > >
> > > Add support for Python 3.12; drop support for Python 3.8; exclude
> Python 3.9.7
> > > Bump PyArrow to 18.0.0, remove numpy as a hard dependency
> > > Bump up Iceberg version to 1.6.0 in integration tests
> > >
> > > Features
> > >
> > > Add metadata tables for data_files and delete_files
> > > Add list_views and drop_view to Rest catalog
> > > Add partition MonthTransform
> > > Support manifest file caching
> > > Support Hive Metastore High Availability mode
> > > Add properties to allow configuring small/large pyarrow type on read
> > > Deprecate redundant catalog identifiers in TableIdentifier and
> row_filter expressions
> > > Update metadata-log for non-rest catalogs
> > > Add support for boolean expressions and quoted columns in row_filter
> expressions

[VOTE] Release Apache PyIceberg 0.8.0rc2

2024-11-14 Thread Kevin Liu

iting existing manifests
 - Use historical partition field name
 - Fix Position Deletes + row_filter yields less data when the
   DataFile is large
 - Allow for missing operation in Snapshot metadata
 - Fix tracing existing entries when there are deletes
 - Handle Empty RecordBatch within _task_to_record_batches

Please vote in the next 72 hours.
[ ] +1 Release this as PyIceberg 0.8.0
[ ] +0

[ ] -1 Do not release this because...

Best,

Kevin Liu

Re: [PROPOSAL] Create Iceberg DockerHub repository

2024-11-15 Thread Kevin Liu

+1 to the Iceberg REST TCK docker image. Thanks, JB, for driving this and
Ajantha for setting up the docker image.
We already found a bug in PyIceberg [1] from integrating with the TCK
docker image. It would be great to have a nightly build; perhaps we can set
up a GitHub Action to automate the docker image publishing.

Best,
Kevin Liu


[1] https://github.com/apache/iceberg-python/pull/1321

On Fri, Nov 15, 2024 at 1:36 AM Fokko Driesprong  wrote:

> +1 — excited to see this happen!
>
> For the TCK, I think we can release this with the Java together, and have
> a nightly build (tag the container with nightly Dockerhub). This way we can
> already test out (and start implementing) the new features in the related
> projects. Thoughts on that?
>
> Regarding the Kafka Connect Docker image, I believe that if we maintain
>> it, we could also manage other integration images, such as those for Spark
>> and Trino with Iceberg. We should have a separate discussion on which
>> integration images Iceberg should officially support.
>
>
> Let's split out that discussion. My take on that is that we want to defer
> that to the query engines. In an ideal situation, the Iceberg integration
> should be part of the project itself (e.g. with Hive 4 where it is
> maintained by Hive itself). For Spark itself, it only requires a runtime to
> be added through the packages argument, and would love to see if we can
> avoid maintaining images for that.
>
> Kind regards,
> Fokko
>
>
> Op do 14 nov 2024 om 18:16 schreef Christian Thiel
> :
>
>> +1 for this as well – for us especially the REST TCK image would be nice.
>>
>>
>>
>> *From: *Bryan Keller 
>> *Date: *Thursday, 14. November 2024 at 17:13
>> *To: *dev@iceberg.apache.org 
>> *Subject: *Re: [PROPOSAL] Create Iceberg DockerHub repository
>>
>> +1 this would be great! Thanks JB.
>>
>>
>>
>> -Bryan
>>
>>
>>
>> On Nov 14, 2024, at 8:30 AM, Ajantha Bhat  wrote:
>>
>>
>>
>> +1 for setting up the DockerHub repo,
>>
>> We discussed this already in
>> https://www.mail-archive.com/dev@iceberg.apache.org/msg07888.html
>>
>> Now that the Docker image PR is ready for the REST catalog adapter, we
>> can proceed with setting up the DockerHub repository.
>>
>> Regarding the Kafka Connect Docker image, I believe that if we maintain
>> it, we could also manage other integration images, such as those for Spark
>> and Trino with Iceberg. We should have a separate discussion on which
>> integration images Iceberg should officially support.
>>
>> For now, maintaining the REST catalog adapter image has already been
>> approved in earlier discussions, so let’s start with that.
>>
>> - Ajantha
>>
>>
>>
>> On Thu, Nov 14, 2024 at 9:45 PM Sung Yun  wrote:
>>
>> Hi JB,
>>
>> That sounds great!!
>>
>> The REST TCK /adapter docker image will be super useful for the Iceberg
>> subprojects as it will ensure that they have access to a light-weight REST
>> Catalog Server image with the latest features to run integration tests
>> against.
>>
>> Sung
>>
>> On 2024/11/14 15:41:04 Jean-Baptiste Onofré wrote:
>> > Hi folks,
>> >
>> > While reviewing https://github.com/apache/iceberg/pull/11283, we
>> > discussed having a DockerHub repository for Iceberg.
>> >
>> > I can create this repository, similar to other Apache projects (like
>> > for example https://hub.docker.com/r/apache/activemq-classic,
>> > https://hub.docker.com/r/apache/airflow, etc).
>> > I can create an iceberg group (on DockerHub), and committers can ask
>> > to join (in order to be able to push docker images).
>> >
>> > For now, the purpose of this DockerHub repo is to host:
>> > - Iceberg REST TCK docker images
>> > - Iceberg Kafka Connect docker images
>> >
>> > Thoughts ?
>> >
>> > Regards
>> > JB
>> >
>>
>>
>>
>

Re: [VOTE][Go] Release Apache Iceberg Go v0.1.0 RC2

2024-11-14 Thread Kevin Liu

+1 (non-binding)

Verified checksum, signature, tests

I modified the verify_rc script to use artifacts from the apache dist
https://github.com/apache/iceberg-go/pull/205
```
dev/release/verify_rc.sh 0.1.0 2
```

Best,
Kevin Liu

On Thu, Nov 14, 2024 at 11:45 AM Jean-Baptiste Onofré 
wrote:

> +1 (non binding)
>
> I checked:
> - sign and hash are good (NB: sha256 is not required anymore, sha512 is
> enough)
> - LICENSE and NOTICE are OK
> - ASF header is present
> - no binary file found in the source distribution
> - build OK on my machine
>
> Thanks !
> Regards
> JB
>
> On Thu, Nov 14, 2024 at 8:20 PM Matt Topol  wrote:
> >
> > RC 2 has been uploaded! Sorry about that!
> >
> >
> > On Thu, Nov 14, 2024, 1:00 PM Jean-Baptiste Onofré 
> wrote:
> >>
> >> Hi Matt
> >>
> >> Can you please update the source distribution on dist.apache.org
> >> (https://dist.apache.org/repos/dist/dev/iceberg/) ?
> >> It's still the RC1 here.
> >> From an ASF standpoint, that's the only strictly required artifact.
> >>
> >> Thanks !
> >> Regards
> >> JB
> >>
> >> On Thu, Nov 14, 2024 at 12:06 AM Matt Topol 
> wrote:
> >> >
> >> > Hi,
> >> >
> >> > I would like to propose the following release candidate (RC2) of
> Apache Iceberg Go version v0.1.0.
> >> >
> >> > This release candidate is based on commit:
> 0921b84b53e3184a1867481bf1e1a22f5a059b5c [1]
> >> >
> >> > The source release rc2 is hosted at [2].
> >> >
> >> > Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [3] for how to validate a release candidate.
> >> >
> >> > The vote will be open for at least 72 hours.
> >> >
> >> > [ ] +1 Release this as Apache Iceberg Go v0.1.0
> >> > [ ] +0
> >> > [ ] -1 Do not release this as Apache Iceberg Go v0.1.0 because...
> >> >
> >> > Thanks!
> >> > --Matt
> >> >
> >> > [1]:
> https://github.com/apache/iceberg-go/commit/0921b84b53e3184a1867481bf1e1a22f5a059b5c
> >> > [2]: https://github.com/apache/iceberg-go/releases/v0.1.0-rc2
> >> > [3]:
> https://github.com/apache/iceberg-go/blob/main/dev/release/README.md#verify
>

Re: [ACTION REQUIRED] Removal of v3 artifact actions on December 5th

2024-11-26 Thread Kevin Liu

We merged the PR [1] to upgrade `upload-artifact` to v4. Thanks, Fokko, for
the review.

Best,
Kevin Liu

[1] https://github.com/apache/iceberg-python/pull/1371


On Mon, Nov 25, 2024 at 10:36 PM Jean-Baptiste Onofré 
wrote:

> Hi Kevin
>
> I did a quick search and I have the same feedback as you: only
> iceberg-python is impacted.
>
> Thanks for the PR !
>
> Regards
> JB
>
> On Mon, Nov 25, 2024 at 9:03 PM Kevin Liu  wrote:
> >
> > Hey folks,
> >
> > I did a code search for both `actions/upload-artifact` and
> `actions/download-artifact` in the related iceberg repos.
> > *
> https://grep.app/search?q=actions/upload-artifact%40v3&filter[repo.pattern][0]=apache/iceberg
> > *
> https://grep.app/search?q=actions/download-artifact&filter[repo.pattern][0]=apache/iceberg
> >
> > Only iceberg-python is affected. Here's the PR to update the relevant
> action, https://github.com/apache/iceberg-python/pull/1371
> >
> > Best,
> > Kevin Liu
> >
> > On Mon, Nov 25, 2024 at 10:36 AM Jacob Wujciak 
> wrote:
> >>
> >> Hello Everyone!
> >>
> >> I am writing to inform you of the imminent removal of the v3 artifact
> >> actions that was announced in [1]. Both actions/upload-artifact@v3*
> >> and actions/download-artifact@v3* will stop working in 10 days, on
> >> December 5, 2024! According to a quick code search this project is
> >> using one of the actions with a v3 tag in at least one of its repos.
> >>
> >> There are breaking changes in the usage of the upload action that will
> >> likely require changes other than bumping the version, please see [2].
> >> Make sure to update your workflows in time to avoid disruptions!
> >>
> >> If you have any questions or need help with the transition I'd
> >> recommend bui...@apache.org as the place to look for help.
> >>
> >> Regards
> >> Jacob Wujciak-Jens (assignUser)
> >>
> >> [1]:
> https://github.blog/changelog/2024-04-16-deprecation-notice-v3-of-the-artifact-actions/
> >> [2]:
> https://github.com/actions/upload-artifact/blob/main/docs/MIGRATION.md
>

Re: [DISCUSS] Apache Iceberg Summit 2025 - Selection Committee

2024-11-26 Thread Kevin Liu

Very excited about this. Happy to help!

Best,
Kevin Liu

On Tue, Nov 26, 2024 at 9:32 AM himadri pal  wrote:

> Great to see the planning for another big Iceberg event!!!
> Please consider me as a volunteer as well.
>
> Regards,
> Himadri Pal
>
>
> On Tue, Nov 26, 2024 at 9:27 AM rdb...@gmail.com  wrote:
>
>> I'd like to volunteer. Glad to see Iceberg Summit 2025 coming together!
>>
>> On Tue, Nov 26, 2024 at 1:42 AM Jean-Baptiste Onofré 
>> wrote:
>>
>>> Hi everyone,
>>>
>>> As you probably know, we've been having discussions about the Iceberg
>>> Summit 2025.
>>>
>>> The PMC pre-approved the Iceberg Summit proposal, and one of the first
>>> steps is to put together a selection committee that will be
>>> responsible for choosing talks and guiding the process.
>>> Once we have a selection committee, I will complete the concrete
>>> proposal for the ASF and the Iceberg PMC to request the ability to use
>>> the name Iceberg/Apache Iceberg.
>>>
>>> If you'd like to help and be part of the selection committee, please
>>> volunteer in a reply to this thread. Since we likely can't include
>>> everyone that volunteers, I propose that the PMC should choose the
>>> final committee from the set of people that volunteer.
>>>
>>> We'll leave this open up to Dec 10th to give people time (as
>>> Thanksgiving is this week).
>>>
>>> Thanks !
>>> Regards
>>> JB
>>>
>>
>
> --
> Regards,
> Himadri Pal
>

[Proposal] Automating the PyIceberg Release Process

2024-12-02 Thread Kevin Liu

Hi everyone,

As the release manager for PyIceberg 0.8.0 and the upcoming 0.8.1 release,
I’ve taken some time to reflect on ways we could improve the release
process. I drew inspiration from the iceberg-go release process and
documented my notes here
<https://github.com/apache/iceberg-python/issues/1306>. I’ve also updated
the release instructions here
<https://py.iceberg.apache.org/how-to-release/>.

Currently, the release process is manual and prone to errors. My goal is to
automate it as much as possible, ideally transforming it into a
single-click process.

I’d like to gather your thoughts on two key ideas:

   1. Automating the release process to reduce manual steps and errors.
   2. Introducing nightly builds to PyPI once automation is in place (issue
   #872 <https://github.com/apache/iceberg-python/issues/872>).

The PyIceberg release process can be summarized in these steps:

   - Create a Release Candidate (RC)
   - Vote on the devlist
   - Promote the RC to a Final Release

I believe the "Create a Release Candidate" step can benefit the most
from automation. Here’s a breakdown of the current steps:

   - Create a tag for the Release Candidate (e.g., `0.8.1rc1`).
   - Generate artifacts (currently done using GitHub Actions).
   - Generate SHA-512 checksums and GPG signatures, then upload the
   artifacts to SVN.
   - Upload the artifacts to PyPI.
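
As a rough illustration of the checksum step above, here is a minimal Python
sketch that writes a SHA-512 checksum file next to an artifact. It is not
taken from the existing release tooling; the function names and the
`<artifact>.sha512` naming are assumptions made for the example.

```python
import hashlib
from pathlib import Path

CHUNK_SIZE = 1024 * 1024  # read 1 MiB at a time so large artifacts are not loaded whole


def sha512_hex(path: Path) -> str:
    """Return the hex SHA-512 digest of the file at `path`."""
    digest = hashlib.sha512()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum_file(artifact: Path) -> Path:
    """Write `<artifact>.sha512` (hypothetical naming) in `sha512sum` text format."""
    checksum_path = artifact.with_name(artifact.name + ".sha512")
    checksum_path.write_text(f"{sha512_hex(artifact)}  {artifact.name}\n")
    return checksum_path
```

GPG signing and the SVN/PyPI uploads are left out on purpose, since those
depend on credentials that still need to be sorted out with ASF Infra.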

To automate these steps via GitHub Actions, we’d need to address the
following:

   - *GPG Signing*: GitHub Actions require a `GPG_PRIVATE_KEY` secret. I’ve
   tested this with my own key, but it would be better to create a new key
   (possibly owned by ASF) for signing files.
   - *SVN Uploads*: Uploading artifacts to SVN requires credentials. I
   haven’t tested this step yet, but we should aim to use credentials provided
   by ASF Infra instead of personal ones.
   - *PyPI Uploads*: Similarly, uploading to PyPI requires an API token,
   which should ideally be provided by ASF Infra.

I’ve begun automating the artifact generation process (PR #1391
<https://github.com/apache/iceberg-python/pull/1391>). However, the release
manager currently still needs to manually download and upload artifacts to
both SVN and PyPI.

Once the "Create a Release Candidate" step is automated, we can create a
GitHub Action to manually build and upload a nightly version to PyPI.


*Is this the direction we want to take for the release process? If so,
what’s the best way to coordinate with ASF Infra to create the necessary
credentials?*

I’d love to hear your thoughts and any additional suggestions.
Best,
Kevin Liu

Re: [DISCUSS] Removal of last-column-id of public API

2024-11-15 Thread Kevin Liu

Thanks for bringing this up Fokko.
It makes sense to hide `last-column-id` from the public API, as it is an
implementation detail.

As mentioned in the PR, I checked references to `last-column-id`
<https://grep.app/search?current=4&q=last-column-id&filter[lang][0]=Python&filter[lang][1]=Rust&filter[lang][2]=Java&filter[lang][3]=Go&filter[lang][4]=C%2B%2B>
 and `last_column_id` <https://grep.app/search?q=last_column_id> and didn't
find anything that would break due to this change.

We would likely need to deprecate this in PyIceberg as well:
https://github.com/apache/iceberg-python/blob/b2f0a9e5cd7dd548e19cdcdd7f9205f03454369a/pyiceberg/table/update/__init__.py#L90-L91

Best,
Kevin Liu

On Thu, Nov 14, 2024 at 1:13 AM Fokko Driesprong  wrote:

> Hi everyone,
>
> While reviewing the TableMetadataBuilder PR on Iceberg-Rust
> <https://github.com/apache/iceberg-rust/pull/587#discussion_r1834400220>
> the other day, I noticed that it exposes the last-column-id to the public
> API, but I believe there is no need for it. This field is used to determine
> the next field-id when adding new fields to a schema. The last-column-id was
> added to the REST spec <https://github.com/apache/iceberg/pull/7445> a
> while ago, to make the spec in line with the reference implementation, but
> in hindsight, it should have been the other way around.
>
> My suggestion is to deprecate and remove this field from the spec and code
> <https://github.com/apache/iceberg/pull/11514/>, as I can't think of any
> use case where you want to make jumps in the last-column-id (it has to be
> monotonically increasing). This will help clean up the APIs and the
> reference implementation.
>
> Would love to hear everyone's thoughts on this!
>
> Kind regards,
> Fokko
>
>
>

Re: [DISCUSS] Add a implementation status page for iceberg

2024-11-15 Thread Kevin Liu

Thanks, Renjie! Happy to review and help fill out the matrix! :)

Best,
Kevin Liu

On Wed, Nov 13, 2024 at 10:51 PM Renjie Liu  wrote:

> Hi:
>
> Thanks for everyone's comments. I think we reached agreement on the
> design, and I'll send a pr for it.
>
> On Tue, Nov 12, 2024 at 1:13 AM Yufei Gu  wrote:
>
>> LGTM. Thanks Renjie!
>> Yufei
>>
>>
>> On Mon, Nov 11, 2024 at 5:38 AM Renjie Liu 
>> wrote:
>>
>>> Hi:
>>>
>>> > One minor suggestion: adding a table spec version label along with the
>>> feature in the support matrix. That doesn't apply to REST spec though.
>>>
>>> Updated the doc, please take a look.
>>>
>>> > My only comment is probably to use versions instead of check marks,
>>> but all good :)
>>>
>>> In current approach we will write the version of each library in the
>>> beginning of the page, which seems easier to maintain than per version per
>>> feature. What do you think?
>>>
>>> On Sat, Nov 9, 2024 at 5:12 PM Jean-Baptiste Onofré 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I like the idea. My only comment is probably to use versions instead
>>>> of check marks, but all good :)
>>>>
>>>> Thanks !
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On Fri, Nov 8, 2024 at 3:33 PM Russell Spitzer
>>>>  wrote:
>>>> >
>>>> > Sounds like a great idea to me
>>>> >
>>>> > On Fri, Nov 8, 2024 at 7:58 AM Renjie Liu 
>>>> wrote:
>>>> >>
>>>> >> Hi:
>>>> >>
>>>> >> As iceberg evolved to a multi-lang project, I would like to propose
>>>> to maintain a status page for iceberg. For more details, please refer to
>>>> this doc.  Welcome to join the discussion and comment on it!
>>>> >>
>>>> >>
>>>>
>>>

Re: Retry ValidationException with concurrent writes to the same partition

2024-12-03 Thread Kevin Liu

Hi Ha,

Thanks for the question! Typically, when concurrent writes happen to the
same partition, the writer must retry the operation. This is because the
state of the table (and the partition) has changed, and the current write
cannot be safely applied to the updated state. For example, if the
operation involves overwriting or deleting data, it needs to be applied to
the new table state. In such cases, the client will need to regenerate both
the data files and the metadata files before retrying the commit.

However, there's an optimization available for certain types of write
operations that are independent of the new table state, such as appending
new data to a partition. In these cases, retrying the commit will only
require regenerating the metadata files, not the data files themselves.
This is what the proposal Yufei mentioned is referring to. Clients can
submit file-level changes, and the catalog server will generate the
necessary metadata before committing the new files.
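
To make the append case concrete, here is a toy Python sketch of an
optimistic-concurrency retry loop. It does not use any Iceberg API:
`ToyTable`, `CommitConflict`, and `append_with_retry` are hypothetical names
invented for the example, and the "table" is just a version counter plus a
list of file names.

```python
from dataclasses import dataclass, field


class CommitConflict(Exception):
    """Raised when the table changed since the writer last read it."""


@dataclass
class ToyTable:
    """In-memory stand-in for a table: a version counter plus a list of data files."""
    version: int = 0
    files: list = field(default_factory=list)

    def commit_append(self, base_version: int, new_files: list) -> None:
        # Optimistic concurrency: a commit only succeeds against the version it was built on.
        if base_version != self.version:
            raise CommitConflict(f"expected version {base_version}, found {self.version}")
        self.files.extend(new_files)
        self.version += 1


def append_with_retry(table, new_files, max_attempts=3, before_commit=None):
    """Append already-written data files; on conflict, re-read only the table version."""
    for attempt in range(1, max_attempts + 1):
        base_version = table.version  # refresh metadata; the data files are reused as-is
        if before_commit is not None:
            before_commit(attempt)  # hook used only to simulate a concurrent writer
        try:
            table.commit_append(base_version, new_files)
            return attempt
        except CommitConflict:
            if attempt == max_attempts:
                raise
```

The point of the sketch is that `new_files` is never regenerated between
attempts. An overwrite or delete would instead have to be recomputed against
the refreshed table state before retrying.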

Another option is to reduce concurrent write conflicts by creating
partition-level isolation. Iceberg supports multi-level partitioning, so
you can partition by multiple fields (e.g., partition by date and by
cluster). This can help isolate concurrent writes to different partitions.
However, this approach can introduce challenges such as the small-files
problem.

Hope this helps! Feel free to reach out if you'd like more detailed
guidance on implementing any of these approaches, or if there are other
specifics you'd like to discuss.

Best,
Kevin Liu

On Tue, Dec 3, 2024 at 3:12 PM Yufei Gu  wrote:

> If you’re looking for finer-grained isolation beyond the snapshot level,
> the closest feature currently *WIP* is *Fine-Grained Commit* in the REST
> catalog. You can find more details here: Fine-Grained Commit Design
> Document
> <https://docs.google.com/document/d/1n-cEE4-vFreTLnUTPgo7U8ih44MFo1ZHjyk4NCxxalc/edit?usp=sharing>
> .
>
> Yufei
>
>
> On Tue, Dec 3, 2024 at 2:41 PM Ha Cao  wrote:
>
>> Hello,
>>
>>
>>
>> I have some concurrent writes to the same partition and they overlap in
>> data files, and the isolation level is snapshot. Expectedly, I get this
>> ValidationException thrown from this line
>> <https://sourcegraph.com/github.com/linkedin/iceberg@78a31ff699dd019ed760ef70c8cbf8acbd74bed6/-/blob/core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java?L266-271>.
>> I can certainly retry the write from the beginning, rewrite the metadata
>> and data files and commit again, but wonder if there is something internal
>> in Iceberg that can help me with that instead.
>>
>> Thank you!
>>
>> Best,
>>
>> Ha
>>
>

Re: [Discuss] Replace Hadoop Catalog Examples with JDBC Catalog in Documentation

2025-01-07 Thread Kevin Liu

Hey folks,

Happy new year! I want to bump this thread with the refreshed PR #11845
<https://github.com/apache/iceberg/pull/11845>. I've applied the
recommendations from this thread.
The PR replaces the Hadoop catalog examples in the Getting Started pages
with the JDBC catalog, and adds an example of configuring the REST
catalog.
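
For readers who have not seen a JDBC catalog configuration, here is a small
Python sketch that assembles Spark properties for a SQLite-backed JDBC
catalog. It is an illustration only and is not copied from the PR: the
catalog name, file paths, and helper names are assumptions, and the SQLite
JDBC driver would also need to be on the classpath.

```python
def jdbc_catalog_conf(name: str, sqlite_path: str, warehouse: str) -> dict:
    """Build Spark properties for an Iceberg JDBC catalog backed by a SQLite file."""
    prefix = f"spark.sql.catalog.{name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.jdbc.JdbcCatalog",
        f"{prefix}.uri": f"jdbc:sqlite:{sqlite_path}",
        f"{prefix}.warehouse": warehouse,
    }


def as_conf_flags(conf: dict) -> list:
    """Render the properties as `--conf key=value` arguments for spark-sql or spark-shell."""
    flags = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


# Example values only; "local" and the /tmp paths are placeholders.
conf = jdbc_catalog_conf("local", "/tmp/iceberg_catalog.db", "/tmp/warehouse")
print(" ".join(as_conf_flags(conf)))
```

A file-based SQLite database, as Marc suggested, keeps tables visible in the
catalog across restarts.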

Please take a look and let me know what you think.

Best,
Kevin Liu

On Thu, Oct 17, 2024 at 6:10 AM Marc Cenac 
wrote:

> Hey Kevin,
>
> This approach sounds good to me and thanks for your work to improve
> the getting started docs!  I would consider using the file-based sqlite
> rather than in-memory since I've seen some users surprised when they
> realize their tables disappear from the catalog upon restart, but
> either way is a welcome change from the Hadoop catalog.
>
> Thanks!
> -Marc
>
> On Wed, Oct 16, 2024 at 1:42 PM Kevin Liu  wrote:
>
>> Hey folks,
>>
>>
>> Thanks for the discussions.
>>
>>
>> It seems everyone is in favor of replacing the Hadoop catalog example,
>> and the question now is whether to replace it with the JDBC catalog or the
>> REST catalog.
>>
>>
>> I originally proposed the JDBC catalog as a replacement primarily due to
>> its ease of use. Users can quickly set up a JDBC catalog backed by an
>> in-memory or file-based datastore without needing additional
>> infrastructure. It also aligns with the quick-start ethos of "it just
>> works." That said, I agree that an example of setting up the REST catalog
>> should be part of the getting-started guide since it’s the catalog the
>> community has aligned on.
>>
>>
>> Here's what I propose as a middle-ground.
>>
>>    1. We replace the Hadoop catalog example with a JDBC catalog backed
>>    by an in-memory datastore. This allows users to get started without
>>    needing additional infrastructure, which was one of the main benefits
>>    of the Hadoop catalog.
>>    2. We add a new section describing the REST catalog, its benefits,
>>    and how to set one up. We can use the REST catalog adapter [1], with
>>    the adapter using the JDBC catalog as its internal catalog.
>>
>>
>> This approach gives users a way to quickly prototype while also guiding
>> them toward the REST catalog for production use cases.
>>
>>
>> Looking forward to hearing more from you all.
>>
>>
>> Best,
>>
>> Kevin Liu
>>
>>
>> [1] https://lists.apache.org/thread/xl1cwq7vmnh6zgfd2vck2nq7dfd33ncq
>>
>>
>>
>> On Thu, Oct 10, 2024 at 3:44 AM Eduard Tudenhöfner <
>> etudenhoef...@apache.org> wrote:
>>
>>> I would prefer to advocate for the REST catalog in those examples/docs
>>> (similar to how the Spark quickstart example
>>> <https://iceberg.apache.org/spark-quickstart/> uses the REST catalog).
>>> The docs could then refer to the quickstart example to indicate what's
>>> required in terms of services to be started before a user can spawn a spark
>>> shell.
>>>
>>> On Thu, Oct 10, 2024 at 12:15 PM Jean-Baptiste Onofré 
>>> wrote:
>>>
>>>> Hi
>>>>
>>>> As we are talking about "documentation" (quick start/readme), I would
>>>> rather propose to use the REST catalog here instead of JDBC.
>>>>
>>>> As it's the catalog we "promote", I think it would be valuable for
>>>> users to start with the "right thing".
>>>>
>>>> JDBC Catalog is interesting for quick test/started guide, but we know
>>>> how it goes: it will be heavily used (see what happened with the
>>>> HadoopCatalog used in production whereas it should not :) ).
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On Tue, Oct 8, 2024 at 12:18 PM Kevin Liu 
>>>> wrote:
>>>> >
>>>> > Hi all,
>>>> >
>>>> > I wanted to bring up a suggestion regarding our current
>>>> documentation. The existing examples for Iceberg often use the Hadoop
>>>> catalog, as seen in:
>>>> >
>>>> > Adding a Catalog - Spark Quickstart [1]
>>>> > Adding Catalogs - Spark Getting Started [2]
>>>> >
>>>> > Since we generally advise against using Hadoop catalogs in production
>>>> environments, I believe it would be beneficial to replace these examples
>>>> with ones that use the JDBC catalog. The JDBC catalog, configured with a
>>>> local SQLite database file, offers similar convenience but aligns better
>>>> with production best practices.
>>>> >
>>>> > I've created an issue [3] and a PR [4] to address this. Please take a
>>>> look, and I'd love to hear your thoughts on whether this is a direction we
>>>> want to pursue.
>>>> >
>>>> > Best,
>>>> > Kevin Liu
>>>> >
>>>> > [1] https://iceberg.apache.org/spark-quickstart/#adding-a-catalog
>>>> > [2]
>>>> https://iceberg.apache.org/docs/nightly/spark-getting-started/#adding-catalogs
>>>> > [3] https://github.com/apache/iceberg/issues/11284
>>>> > [4] https://github.com/apache/iceberg/pull/11285
>>>> >
>>>>
>>>

Re: [ANN] Apache Iceberg Summit 2025, dates, venue and CFP

2025-01-07 Thread Kevin Liu

Thanks for putting this together everyone. Looking forward to the event and
meeting in person!

Best,
Kevin Liu

On Tue, Jan 7, 2025 at 2:15 AM Jean-Baptiste Onofré  wrote:

> Hi everyone,
>
> With this new year comes a new announcement: Apache Iceberg Summit 2025 !
>
> Iceberg Summit 2025 is a hybrid event sanctioned by The Apache
> Software Foundation and organized by Dremio, Snowflake, and Microsoft.
> The summit aims to promote Apache Iceberg education and
> knowledge-sharing among data engineers, developers, architects and
> contributors.
>
> The event will take place at the Hyatt Regency SOMA in San Francisco,
> USA, on April 8 in person and Virtual on April 9 via the Bizzabo event
> platform. Featuring real-world talks from data practitioners and
> developers leveraging Apache Iceberg as their table format.
>
> The CFP is now open, so please, submit your talks here:
> https://sessionize.com/iceberg-summit-2025/
>
> The Apache Iceberg PMC settled the Selection Committee, responsible
> for selecting the talks for the Summit.
>
> If you are interested in sponsoring the event, please reach Russell
> (russell.spit...@gmail.com) or myself (jbono...@apache.org). We can
> share a prospectus and introduce you to the sponsors committee.
>
> We are working on the website for the event, I will share details soon.
>
> I would like to thank again the PMC members, and especially Russell,
> for their help and approval.
>
> I'm looking forward to the event and I'm sure we will have great talks ;)
>
> Regards
> JB
>

Re: [discuss] Allow 200 responses for HEAD requests in REST API

2025-01-07 Thread Kevin Liu

Hey folks,

Thanks for the feedback on the proposal.

I believe it’s best to retain the current 204 response code for HEAD
requests in the REST API. Existing client and server implementations that
adhere to the spec expect only 204 responses. Introducing an additional 200
response code would create backward compatibility issues and require all
existing clients to be updated.

While the MDN Web Docs' page on HEAD requests
<https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods/HEAD> and S3’s
HeadObject documentation
<https://docs.aws.amazon.com/AmazonS3/latest/API/API_HeadObject.html> suggest
using 200 responses for HEAD requests, it’s better to stick with 204 since
that’s what most servers and clients already support.

We can treat 200 responses as exceptions rather than the rule. Clients can
optionally handle 200 responses as a workaround while servers transition to
sending 204 responses to fully adhere to the spec.
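
A minimal sketch of that client-side handling, in Python. The function name
and the `accept_200` switch are hypothetical and only show the idea; this is
not the PyIceberg implementation.

```python
def resource_exists(status_code: int, accept_200: bool = True) -> bool:
    """Interpret the status code of a HEAD existence check.

    204 is what the spec defines for success. 200 is tolerated as a
    workaround when `accept_200` is True. 404 means the resource is absent.
    """
    if status_code == 204:
        return True
    if status_code == 200 and accept_200:
        return True
    if status_code == 404:
        return False
    raise ValueError(f"unexpected status code for HEAD request: {status_code}")
```

With `accept_200=False` the client is strict and treats a 200 as an error,
which is one way to notice servers that have not yet moved to 204.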

Thanks all for the discussion!

Best,
Kevin Liu

On Wed, Dec 18, 2024 at 1:00 AM Xuanwo  wrote:

> Hi,
>
> From my initial understanding of HTTP semantics, the HEAD request should
> be treated like a GET request without a response body. Therefore, returning
> a 204 for a HEAD request does not align with the concept held by most
> developers. I support the idea of allowing a 200 response instead.
>
> For example, AWS S3 HeadObject also returns a 200 status code when the
> file exists.
>
> Ref: https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods/HEAD
>
> On Wed, Dec 18, 2024, at 16:36, Fokko Driesprong wrote:
>
> Hey Kevin,
>
> I also agree with Yufei. For PyIceberg we had a long list of issues around
> the head request (#1363
> <https://github.com/apache/iceberg-python/issues/1363> gives a nice
> overview) to check if the table is there (and that also has just been
> added to Java <https://github.com/apache/iceberg/pull/10999>). Allowing
> 200 to unblock users quickly, but implementations should adhere to the
> spec, and we should be reluctant with this kind of fixes.
>
> Kind regards,
> Fokko
>
> Op wo 18 dec 2024 om 07:56 schreef Eduard Tudenhöfner <
> etudenhoef...@apache.org>:
>
> I agree with Yufei's observation. Changing the return code in the spec
> from 204 to 200 will just cause additional downstream work that doesn't
> seem worth it. Returning 204 makes the API also very explicit in telling
> that the request succeeded but that there's no content in the response that
> the client needs to care about.
>
> Eduard
>
> On Tue, Dec 17, 2024 at 10:20 PM Yufei Gu  wrote:
>
> The distinction between 200 and 204 is subtle enough that I'm comfortable
> using them interchangeably in this context. My main concern is that, if we
> make this change, all clients—except for PyIceberg—will need to be updated
> to support both 200 and 204, since a server could return either status
> code. It might not be worth it.
>
> Yufei
>
>
> On Tue, Dec 17, 2024 at 12:52 PM Kevin Liu  wrote:
>
> Hey folks,
>
> I’d like to propose adding status code 200 as a valid response for HEAD
> requests in the Catalog REST API. Currently, the following HEAD requests
> return status code 204 for a successful response:
> * namespaceExists
> <https://github.com/apache/iceberg/blob/540d6a6251e31b232fe6ed2413680621454d107a/open-api/rest-catalog-open-api.yaml#L372-L402>
> * tableExists
> <https://github.com/apache/iceberg/blob/540d6a6251e31b232fe6ed2413680621454d107a/open-api/rest-catalog-open-api.yaml#L1129-L1160>
> * viewExists
> <https://github.com/apache/iceberg/blob/540d6a6251e31b232fe6ed2413680621454d107a/open-api/rest-catalog-open-api.yaml#L1691-L1712>
>
> In PyIceberg, support for status code 200 has already been implemented for
> table_exists
> <https://github.com/apache/iceberg-python/pull/1389/files#diff-3bda7391ebd8aa3dcfd6703d8d2764830b9d9c35fa854188a37d69611274bd3dR890>
>  and namespace_exists
> <https://github.com/apache/iceberg-python/pull/1434/files#diff-3bda7391ebd8aa3dcfd6703d8d2764830b9d9c35fa854188a37d69611274bd3dR882>.
> The motivation for this change is to enable more intuitive and
> user-friendly integrations with catalogs, as Fokko highlighted here
> <https://github.com/apache/iceberg-python/issues/1363#issuecomment-2497462825>.
> Standardizing this behavior in the Catalog REST spec would promote
> consistency across implementations and make interactions easier for users
> and client developers.
> Would love to hear your thoughts on this proposal!
> Best,
> Kevin Liu
>
> Xuanwo
>
> https://xuanwo.io/
>
>

Re: [VOTE] Add Geometry and Geography types for V3

2025-02-07 Thread Kevin Liu

+1 (non-binding)
It's great to see support for more data types in both parquet and Iceberg!

Best,
Kevin Liu

On Fri, Feb 7, 2025 at 12:11 PM huaxin gao  wrote:

> +1 (non-binding)
>
> On Fri, Feb 7, 2025 at 12:03 PM Honah J.  wrote:
>
>> +1
>>
>> Best regards,
>> Honah
>>
>> On Fri, Feb 7, 2025 at 10:45 AM Aihua Xu  wrote:
>>
>>> +1 (non-binding).
>>>
>>> On Fri, Feb 7, 2025 at 8:12 AM Jean-Baptiste Onofré 
>>> wrote:
>>>
>>>> +1
>>>>
>>>> That's a great progress ! Thanks !
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On Thu, Feb 6, 2025 at 9:01 PM Szehon Ho 
>>>> wrote:
>>>> >
>>>> > Hi everyone
>>>> >
>>>> > We would like to add Geometry and Geography types to the Iceberg V3
>>>> spec:
>>>> >
>>>> > https://github.com/apache/iceberg/pull/10981
>>>> >
>>>> > This is proposed together with Apache Parquet format change to
>>>> support geospatial data.
>>>> >
>>>> > https://github.com/apache/parquet-format/pull/240
>>>> >
>>>> > This vote will be open for at least 72 hours.
>>>> >
>>>> > [ ] +1 Add these types to the format specification
>>>> > [ ] +0
>>>> > [ ] -1 Do not add these types to the format specification because...
>>>> >
>>>> > Thanks,
>>>> > Szehon
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>>
>>>

Re: [Announce] Apache Iceberg Europe Community Meetup

2025-02-07 Thread Kevin Liu

Excited to see this happening!
I just want to mention that the Call for Proposals (CFP) is open until March
9th. The CFP link is here
<https://docs.google.com/forms/d/e/1FAIpQLSfuz7An-qZ2s-qbR4NxcTTcfjgHp7ZZ60e996KZPFy9DzDDOA/viewform>
and also on the luma page <https://lu.ma/ewx2kuis>. Talks will be recorded and
made available on the YouTube channel (@IcebergMeetup
<https://www.youtube.com/@IcebergMeetup>)!

Best,
Kevin Liu

On Wed, Feb 5, 2025 at 2:05 AM Christian Thiel 
wrote:

> Oh sorry - here is the correct link: https://lu.ma/ewx2kuis
>
> On Wed, 5 Feb 2025 at 10:11, Raúl Cumplido  wrote:
>
>> Thanks Christian,
>>
>> Good news! The link sent seems to be for an old Singapore Apache Iceberg
>> meetup.
>>
>>
>>
>> El mié, 5 feb 2025 a las 10:06, Christian Thiel (<
>> christian.t.b...@gmail.com>) escribió:
>>
>>> Hey everyone,
>>>
>>> Iceberg Meetups are coming to Europe!
>>> We are planning to regularly organize meetups across Europe, the first
>>> Meetup will be on the 2nd of April in Amsterdam starting at 17:00.
>>>
>>> Please check our luma page for signup and call for speakers:
>>> https://lu.ma/79xk5w5t
>>>
>>> Best,
>>> Christian
>>>
>>

Re: Welcome Huaxin Gao as a committer!

2025-02-06 Thread Kevin Liu

Congratulations Huaxin!! Looking forward to working together 🎉
<https://emojipedia.org/party-popper>

Best,
Kevin Liu

On Thu, Feb 6, 2025 at 7:30 AM Prashant Singh 
wrote:

> Congratulations Huaxin !
>
> Best,
> Prashant Singh
>
>
> On Thu, Feb 6, 2025 at 7:25 AM himadri pal  wrote:
>
>> Congratulations Huaxin.
>>
>> On Thu, Feb 6, 2025 at 6:45 AM Sung Yun  wrote:
>>
>>> That's fantastic news Huaxin. Congratulations!
>>>
>>> On 2025/02/06 13:40:09 Rodrigo Meneses wrote:
>>> > Congrats and best wishes !!!
>>> >
>>> > On Thu, Feb 6, 2025 at 5:04 AM Gidon Gershinsky 
>>> wrote:
>>> >
>>> > > Congrats Huaxin!
>>> > >
>>> > > Cheers, Gidon
>>> > >
>>> > >
>>> > > On Thu, Feb 6, 2025 at 2:46 PM Tushar Choudhary <
>>> > > tushar.choudhary...@gmail.com> wrote:
>>> > >
>>> > >> Congratulations Huaxin!
>>> > >>
>>> > >> Cheers,
>>> > >> Tushar Choudhary
>>> > >>
>>> > >>
>>> > >> On Thu, 6 Feb 2025 at 6:15 PM, xianjin  wrote:
>>> > >>
>>> > >>> Congrats huaxin!
>>> > >>> Sent from my iPhone
>>> > >>>
>>> > >>> On Feb 6, 2025, at 7:35 PM, Fokko Driesprong 
>>> wrote:
>>> > >>>
>>> > >>> 
>>> > >>>
>>> > >>> Congratulations Huaxin!
>>> > >>>
>>> > >>> On Thu, Feb 6, 2025 at 12:21, Russell Spitzer <
>>> > >>> russell.spit...@gmail.com> wrote:
>>> > >>>
>>> > >>>> Congratulations!
>>> > >>>>
>>> > >>>> On Thu, Feb 6, 2025 at 11:35 AM Péter Váry <
>>> peter.vary.apa...@gmail.com>
>>> > >>>> wrote:
>>> > >>>>
>>> > >>>>> Congratulations!
>>> > >>>>>
>>> > >>>>> On Thu, Feb 6, 2025 at 10:40, Matt Topol wrote:
>>> > >>>>>
>>> > >>>>>> Congrats! Welcome!
>>> > >>>>>>
>>> > >>>>>> On Thu, Feb 6, 2025, 10:19 AM Raúl Cumplido 
>>> > >>>>>> wrote:
>>> > >>>>>>
>>> > >>>>>>> Congrats Huaxin!
>>> > >>>>>>>
>>> > >>>>>>> On Thu, Feb 6, 2025 at 10:16, Gang Wu wrote:
>>> > >>>>>>>
>>> > >>>>>>>> Congrats Huaxin!
>>> > >>>>>>>>
>>> > >>>>>>>> Best,
>>> > >>>>>>>> Gang
>>> > >>>>>>>>
>>> > >>>>>>>> On Thu, Feb 6, 2025 at 5:10 PM Szehon Ho <
>>> szehon.apa...@gmail.com>
>>> > >>>>>>>> wrote:
>>> > >>>>>>>>
>>> > >>>>>>>>> Hi everyone,
>>> > >>>>>>>>>
>>> > >>>>>>>>> The Project Management Committee (PMC) for Apache Iceberg has
>>> > >>>>>>>>> invited Huaxin Gao to become a committer, and I am happy to
>>> > >>>>>>>>> announce that she has accepted.  Huaxin has done a lot
>>> > >>>>>>>>> of impressive work in areas such as Iceberg-Spark integration
>>> > >>>>>>>>> and recently Iceberg-Comet integrations.  Thanks Huaxin for
>>> > >>>>>>>>> all your hard work!
>>> > >>>>>>>>>
>>> > >>>>>>>>> Please join us in welcoming her!
>>> > >>>>>>>>>
>>> > >>>>>>>>> Thanks,
>>> > >>>>>>>>> Szehon
>>> > >>>>>>>>> On behalf of the Iceberg PMC
>>> > >>>>>>>>>
>>> > >>>>>>>>
>>> >
>>>
>>
>>
>> --
>> Regards,
>> Himadri Pal
>>
>

Re: [VOTE] Release Apache Iceberg 1.8.0 RC0

2025-02-11 Thread Kevin Liu

+1 (non binding)

Checked signature, checksum, license, and tests.

Had a few flaky tests running on an M1 Mac, listed below. I reran the tests on
Ubuntu using GitHub runners
<https://github.com/kevinjqliu/iceberg-python/actions/runs/13251043031/job/36988731699>
and they completed successfully.
I also tested against PyIceberg's integration tests on my fork
<https://github.com/kevinjqliu/iceberg-python/pull/9>.

Thanks for running the release!

Best,
Kevin Liu


Flaky tests:
```
> Task :iceberg-aws:test

TestS3FileIO > testDeleteFilesSingleBatchWithRemainder() FAILED
    org.apache.iceberg.io.BulkDeletionFailureException: Failed to delete 18 files
        at app//org.apache.iceberg.aws.s3.S3FileIO.deleteFiles(S3FileIO.java:240)
        at app//org.apache.iceberg.aws.s3.TestS3FileIO.testBatchDelete(TestS3FileIO.java:231)
        at app//org.apache.iceberg.aws.s3.TestS3FileIO.testDeleteFilesSingleBatchWithRemainder(TestS3FileIO.java:186)

TestS3FileIO > testDeleteFilesMultipleBatches() FAILED
    org.apache.iceberg.io.BulkDeletionFailureException: Failed to delete 30 files
        at app//org.apache.iceberg.aws.s3.S3FileIO.deleteFiles(S3FileIO.java:240)
        at app//org.apache.iceberg.aws.s3.TestS3FileIO.testBatchDelete(TestS3FileIO.java:231)
        at app//org.apache.iceberg.aws.s3.TestS3FileIO.testDeleteFilesMultipleBatches(TestS3FileIO.java:176)

TestS3FileIO > testDeleteFilesLessThanBatchSize() FAILED
    org.apache.iceberg.io.BulkDeletionFailureException: Failed to delete 12 files
        at app//org.apache.iceberg.aws.s3.S3FileIO.deleteFiles(S3FileIO.java:240)
        at app//org.apache.iceberg.aws.s3.TestS3FileIO.testBatchDelete(TestS3FileIO.java:231)
        at app//org.apache.iceberg.aws.s3.TestS3FileIO.testDeleteFilesLessThanBatchSize(TestS3FileIO.java:181)

> Task :iceberg-core:test

TestHadoopCommits > testConcurrentFastAppends(File) FAILED
    org.awaitility.core.ConditionTimeoutException: Condition with Lambda expression in org.apache.iceberg.hadoop.TestHadoopCommits was not fulfilled within 10 seconds.
        at app//org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:167)
        at app//org.awaitility.core.CallableCondition.await(CallableCondition.java:78)
        at app//org.awaitility.core.CallableCondition.await(CallableCondition.java:26)
        at app//org.awaitility.core.ConditionFactory.until(ConditionFactory.java:1006)
        at app//org.awaitility.core.ConditionFactory.until(ConditionFactory.java:975)
        at app//org.apache.iceberg.hadoop.TestHadoopCommits.lambda$testConcurrentFastAppends$3(TestHadoopCommits.java:462)
```

On Mon, Feb 10, 2025 at 11:45 AM Anurag Mantripragada <
anuragmantr...@gmail.com> wrote:

> +1
>
> I verified signature, checksums, license, built and ran tests locally.
>
> Thanks for taking care of the release, Amogh!
>
> Thanks,
> Anurag
>
> On Sun, Feb 9, 2025 at 10:39 PM Amogh Jahagirdar <2am...@gmail.com> wrote:
>
>> Hi Everyone,
>>
>> I propose that we release the following RC as the official Apache Iceberg
>> 1.8.0 release.
>>
>> The commit ID is c277c2014a1b37fe755cfe37f173b6465bb8cb73
>> * This corresponds to the tag: apache-iceberg-1.8.0-rc0
>> * https://github.com/apache/iceberg/commits/apache-iceberg-1.8.0-rc0
>> *
>> https://github.com/apache/iceberg/tree/c277c2014a1b37fe755cfe37f173b6465bb8cb73
>>
>> The release tarball, signature, and checksums are here:
>> * https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-1.8.0-rc0
>>
>> You can find the KEYS file here:
>> * https://downloads.apache.org/iceberg/KEYS
>>
>> Convenience binary artifacts are staged on Nexus. The Maven repository
>> URL is:
>> *
>> https://repository.apache.org/content/repositories/orgapacheiceberg-1182/
>>
>> Please download, verify, and test.
>>
>> Please vote in the next 72 hours.
>>
>> [ ] +1 Release this as Apache Iceberg 1.8.0
>> [ ] +0
>> [ ] -1 Do not release this because...
>>
>> Only PMC members have binding votes, but other community members are
>> encouraged to cast
>> non-binding votes. This vote will pass if there are 3 binding +1 votes
>> and more binding
>> +1 votes than -1 votes.
>>
>
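
The passing rule quoted above (at least 3 binding +1 votes, and more binding +1
votes than -1 votes) can be sketched as a small check. The function name is
made up for illustration and is not part of any release tooling; the sketch
assumes the -1 count refers to binding votes:

```python
def vote_passes(binding_plus_one: int, binding_minus_one: int) -> bool:
    """Return True if a release vote passes under the quoted rule.

    Hypothetical helper: only binding votes are counted, so non-binding
    votes must be filtered out before calling this.
    """
    return binding_plus_one >= 3 and binding_plus_one > binding_minus_one
```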

Re: [Announce] Singapore Apache Iceberg Community Meetup

2024-12-16 Thread Kevin Liu

Hey Denny,

Here's one with a better resolution,
https://drive.google.com/file/d/1H2scgq70fJU8AMLXzOadOdKfVPjHLDH9/view?usp=sharing

Best,
Kevin Liu

On Mon, Dec 16, 2024 at 10:35 AM Denny Lee  wrote:

> Hey Kevin,
>
> Do you have a bigger image for this event so we can help promote  the
> event via social media?
>
> Thanks!
> Denny
>
>
> On Fri, Dec 13, 2024 at 3:06 PM Kevin Liu  wrote:
>
>> Hey everyone,
>>
>> The Iceberg community meetup has expanded to Singapore! The next
>> Singapore meetup will be on Wednesday, December 18 from 5:00 PM to 8:00
>> PM at 16 Collyer Quay, level 12
>> Singapore.
>>
>> Here's the luma page to sign up for the event, https://lu.ma/79xk5w5t
>>
>> There will be 3 presentations from community members.
>> * Rayner Chen (Tech VP at VeloDB) - Building Fast Data Lake Analysis on
>> top of Apache Iceberg & Apache Doris
>> * Xinyu Zhou (Co-Founder & CTO at AutoMQ) - AutoMQ Table Topic: Bridging
>> Streaming and Analytics with Iceberg
>> * Jay Chia (Co-Founder at Eventual (Daft)) - Iceberg in Python
>>
>> Presentations will be recorded and uploaded to the Iceberg meetup YouTube
>> channel (https://www.youtube.com/@IcebergMeetup)
>>
>> Best,
>> Kevin Liu
>>
>>

[discuss] Allow 200 responses for HEAD requests in REST API

2024-12-17 Thread Kevin Liu

Hey folks,

I’d like to propose adding status code 200 as a valid response for HEAD
requests in the Catalog REST API. Currently, the following HEAD requests
return status code 204 for a successful response:
* namespaceExists
<https://github.com/apache/iceberg/blob/540d6a6251e31b232fe6ed2413680621454d107a/open-api/rest-catalog-open-api.yaml#L372-L402>
* tableExists
<https://github.com/apache/iceberg/blob/540d6a6251e31b232fe6ed2413680621454d107a/open-api/rest-catalog-open-api.yaml#L1129-L1160>
* viewExists
<https://github.com/apache/iceberg/blob/540d6a6251e31b232fe6ed2413680621454d107a/open-api/rest-catalog-open-api.yaml#L1691-L1712>

In PyIceberg, support for status code 200 has already been implemented for
table_exists
<https://github.com/apache/iceberg-python/pull/1389/files#diff-3bda7391ebd8aa3dcfd6703d8d2764830b9d9c35fa854188a37d69611274bd3dR890>
 and namespace_exists
<https://github.com/apache/iceberg-python/pull/1434/files#diff-3bda7391ebd8aa3dcfd6703d8d2764830b9d9c35fa854188a37d69611274bd3dR882>.
The motivation for this change is to enable more intuitive and
user-friendly integrations with catalogs, as Fokko highlighted here
<https://github.com/apache/iceberg-python/issues/1363#issuecomment-2497462825>.
Standardizing this behavior in the Catalog REST spec would promote
consistency across implementations and make interactions easier for users
and client developers.
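
For illustration, a minimal client-side sketch of the proposed behavior,
assuming a catalog may answer a HEAD existence check with either 204 or 200 and
answers 404 when the resource is missing. The helper name
`exists_from_head_status` is hypothetical and is not taken from PyIceberg:

```python
from http import HTTPStatus


def exists_from_head_status(status_code: int) -> bool:
    """Interpret the status code returned by a HEAD existence check.

    Hypothetical helper: 200 and 204 both mean "exists", 404 means
    "does not exist", and anything else is surfaced as an error.
    """
    if status_code in (HTTPStatus.OK, HTTPStatus.NO_CONTENT):
        return True
    if status_code == HTTPStatus.NOT_FOUND:
        return False
    raise ValueError(f"unexpected status code for HEAD request: {status_code}")
```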

Would love to hear your thoughts on this proposal!

Best,
Kevin Liu

Re: [DISCUSS] Remove snapshot-id from IRC SetStatisticsUpdate

2024-12-17 Thread Kevin Liu

Hey Christian,

Thanks for bringing this up! We also noticed this issue while implementing
table statistics in Python [1].
I'm in favor of removing the outer field. Since this is a spec change, we
would need to follow the proper deprecation and removal path, similar to
what we did for `last-column-id` in #11514
<https://github.com/apache/iceberg/pull/11514>.
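
To illustrate the redundancy, here is a rough sketch of how a client could
read the snapshot id from the nested statistics file alone while the outer
field is being deprecated. The payload keys are assumed to follow the REST
spec naming discussed in this thread (`snapshot-id`, `statistics`); the helper
name and the example values are made up:

```python
from typing import Any, Dict


def statistics_snapshot_id(update: Dict[str, Any]) -> int:
    """Return the snapshot id of a set-statistics update.

    Hypothetical helper: the nested statistics file is treated as the
    source of truth. The outer field is optional here, and if it is
    present it must agree with the nested value.
    """
    inner = update["statistics"]["snapshot-id"]
    outer = update.get("snapshot-id")
    if outer is not None and outer != inner:
        raise ValueError(f"snapshot-id mismatch: outer={outer}, inner={inner}")
    return inner


# Example payload with both ids present (values are illustrative only).
example_update = {
    "action": "set-statistics",
    "snapshot-id": 123,
    "statistics": {"snapshot-id": 123, "statistics-path": "s3://bucket/stats.puffin"},
}
```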

Best,
Kevin Liu

[1] https://github.com/apache/iceberg-python/pull/1285

On Mon, Dec 16, 2024 at 5:30 AM Fokko Driesprong  wrote:

> Hey Christian,
>
> Great catch, I would also be in favor of removing the outer one. I don't
> see any value in having them both.
>
> Kind regards,
> Fokko
>
> On Mon, Dec 16, 2024 at 14:26, Jean-Baptiste Onofré wrote:
>
>> Hi,
>>
>> I saw the discussion on Slack. Yeah, it's redundant.
>> I know some catalogs only consider the snapshot id in SetStatisticsUpdate.
>>
>> Regards
>> JB
>>
>> On Fri, Dec 13, 2024 at 8:03 PM Christian 
>> wrote:
>> >
>> > Dear all,
>> >
>> > I believe we currently have a redundancy in the IRC SetStatisticsUpdate
>> [1].
>> > SetStatisticsUpdate has a required field `snapshot-id` but also a
>> `StatisticsFile` which in turn contains the `snapshot-id` as required. The
>> redundant information is used in Java only for an assertion to check if the
>> ids are identical [2].
>> >
>> > Are there any good reasons to keep both `snapshot-id`s? If not I would
>> propose to deprecate the outer `snapshot-id`.
>> > To remove redundancy in libraries, I am using a custom serializer /
>> deserializer to handle this in Rust [3].
>> >
>> > Let me know what you think!
>> >
>> > Thanks,
>> > Christian
>> >
>> > [1]:
>> https://github.com/apache/iceberg/blob/540d6a6251e31b232fe6ed2413680621454d107a/open-api/rest-catalog-open-api.yaml#L2902
>> > [2]:
>> https://github.com/apache/iceberg/blob/540d6a6251e31b232fe6ed2413680621454d107a/core/src/main/java/org/apache/iceberg/TableMetadata.java#L1314
>> > [3]: https://github.com/apache/iceberg-rust/pull/799
>> >
>> >
>>
>

Re: [VOTE] Release Apache Iceberg Rust 0.4.0 RC1

2024-12-17 Thread Kevin Liu

Hey Sung,

Thanks for working on the 0.4.0 release! I went through a few steps to
verify this release and ran into an issue verifying the signature.

Cannot check the signature:
```
➜  curl https://downloads.apache.org/iceberg/KEYS -o KEYS
gpg --import KEYS
➜ gpg --verify apache-iceberg-rust-0.4.0-src.tar.gz.asc apache-iceberg-rust-0.4.0-src.tar.gz
gpg: WARNING: unsafe permissions on homedir '/Users/kevinliu/.gnupg'
gpg: Signature made Tue Dec 17 13:23:11 2024 PST
gpg:                using RSA key D41D8CC8DED1FD6495077949B6847531A1883DA4
gpg: Can't check signature: No public key
```

Checksum is OK
```
➜  shasum -a 512 --check apache-iceberg-rust-0.4.0-src.tar.gz.sha512
apache-iceberg-rust-0.4.0-src.tar.gz: OK
```
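
For anyone without `shasum` at hand, the same checksum comparison can be
sketched in Python. This assumes the `.sha512` file uses the usual
`<hex digest>  <filename>` layout; the function name is made up for
illustration:

```python
import hashlib
from pathlib import Path


def sha512_matches(artifact: Path, checksum_file: Path) -> bool:
    """Compare an artifact's SHA-512 digest with the one in a checksum file.

    Assumes the first whitespace-separated token of the checksum file is
    the expected hex digest.
    """
    expected = checksum_file.read_text().split()[0].lower()
    digest = hashlib.sha512()
    with artifact.open("rb") as f:
        # Read in 1 MiB chunks so large tarballs are not loaded at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected
```

Usage would be along the lines of
`sha512_matches(Path("apache-iceberg-rust-0.4.0-src.tar.gz"), Path("apache-iceberg-rust-0.4.0-src.tar.gz.sha512"))`.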

The verify script is not executable as shipped and needs `chmod +x` first, but this is not a blocker.
```
chmod +x ./scripts/verify.py
```

Best,
Kevin Liu

On Tue, Dec 17, 2024 at 1:50 PM Sung Yun  wrote:

> Hello, Apache Iceberg Rust Community,
>
> This is a call for a vote to release Apache Iceberg rust version
> v0.4.0-rc.1.
>
> The tag to be voted on is v0.4.0-rc.1.
>
> The release candidate:
>
>
> https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-rust-0.4.0-rc.1/
>
> Keys to verify the release candidate:
>
> https://downloads.apache.org/iceberg/KEYS
>
> Git tag for the release:
>
> https://github.com/apache/iceberg-rust/releases/tag/v0.4.0-rc.1
>
> The associated convenience artifact for pyiceberg_core can be
> downloaded by running the following command:
>
> `pip install -i https://test.pypi.org/simple/ pyiceberg-core`
>
> All notable features and fixes introduced in this release are
> documented in the changelog:
>
> https://github.com/apache/iceberg-rust/blob/main/CHANGELOG.md
>
> Please download, verify, and test.
>
> The VOTE will be open for at least 72 hours and until the necessary
> number of votes are reached.
>
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove with the reason
>
> To learn more about Apache Iceberg, please see
> https://rust.iceberg.apache.org/
>
> Checklist for reference:
>
> [ ] Download links are valid.
> [ ] Checksums and signatures.
> [ ] LICENSE/NOTICE files exist
> [ ] No unexpected binary files
> [ ] All source files have ASF headers
> [ ] Can compile from source
>
> More detailed checklist please refer to:
> https://github.com/apache/iceberg-rust/tree/main/scripts
>
> To compile from source, please refer to:
> https://github.com/apache/iceberg-rust/blob/main/CONTRIBUTING.md
>
> Here is a Python script in release to help you verify the release
> candidate:
>
> ./scripts/verify.py
>
> Thank you!
>
> Sung
>

Re: [VOTE] Release Apache Iceberg Rust 0.4.0 RC2

2024-12-18 Thread Kevin Liu

Hey Sung,

Thanks for the new RC. I've run the following verification steps.
[x] Download links are valid.
[x] Checksums and signatures.
[x] LICENSE/NOTICE files exist
[x] No unexpected binary files
[x] All source files have ASF headers
[x] Can compile from source
[x] `./scripts/verify.py` (with `chmod +x`)
[ ] `make test`

I'm having trouble running the tests successfully. See the error log below.
Running a single test works, e.g. `cargo test -p iceberg --test
file_io_gcs_test`.

Are others running into the same issue?

Error log:
```
test tests::test_file_io_s3_output ... FAILED
test tests::test_file_io_s3_exists ... FAILED
test tests::test_file_io_s3_input ... FAILED

failures:

---- tests::test_file_io_s3_output stdout ----
thread 'tests::test_file_io_s3_output' panicked at
crates/iceberg/tests/file_io_s3_test.rs:80:67:
called `Result::unwrap()` on an `Err` value: Unexpected => Failure in doing
io operation

Source: Unexpected (persistent) at stat, context: { url:
http://172.21.0.2:9000/bucket1/test_output, called:
http_util::Client::send, service: s3, path: test_output } => send http
request, source: error sending request for url (
http://172.21.0.2:9000/bucket1/test_output): error sending request for url (
http://172.21.0.2:9000/bucket1/test_output): client error (Connect): tcp
connect error: Operation timed out (os error 60): Operation timed out (os
error 60)


---- tests::test_file_io_s3_exists stdout ----
thread 'tests::test_file_io_s3_exists' panicked at
crates/iceberg/tests/file_io_s3_test.rs:73:59:
called `Result::unwrap()` on an `Err` value: Unexpected => Failure in doing
io operation

Source: Unexpected (persistent) at stat, context: { url:
http://172.21.0.2:9000/bucket2/any, called: http_util::Client::send,
service: s3, path: any } => send http request, source: error sending
request for url (http://172.21.0.2:9000/bucket2/any): error sending request
for url (http://172.21.0.2:9000/bucket2/any): client error (Connect): tcp
connect error: Operation timed out (os error 60): Operation timed out (os
error 60)


---- tests::test_file_io_s3_input stdout ----
thread 'tests::test_file_io_s3_input' panicked at
crates/iceberg/tests/file_io_s3_test.rs:93:58:
called `Result::unwrap()` on an `Err` value: Unexpected => Failure in doing
io operation

Source: Unexpected (persistent) at Writer::close, context: { url:
http://172.21.0.2:9000/bucket1/test_input, called: http_util::Client::send,
service: s3, path: test_input, written: 10 } => send http request, source:
error sending request for url (http://172.21.0.2:9000/bucket1/test_input):
error sending request for url (http://172.21.0.2:9000/bucket1/test_input):
client error (Connect): tcp connect error: Operation timed out (os error
60): Operation timed out (os error 60)

note: run with `RUST_BACKTRACE=1` environment variable to display a
backtrace


failures:
tests::test_file_io_s3_exists
tests::test_file_io_s3_input
tests::test_file_io_s3_output

test result: FAILED. 0 passed; 3 failed; 0 ignored; 0 measured; 0 filtered
out; finished in 307.08s
```

Best,
Kevin Liu


On Tue, Dec 17, 2024 at 11:58 PM Renjie Liu  wrote:

> +1 binding.
>
> Did following verification:
>
> [*] Download links are valid.
> [*] Checksums and signatures.
> [*] LICENSE/NOTICE files exist
> [*] No unexpected binary files
> [*] All source files have ASF headers
> [*] Can compile from source
>
> Ran `make test` on the following platforms and it works!
> - macos + m4 + orbstack(drop in replacement for docker)
> - ubuntu 22.04 + docker
>
> On Wed, Dec 18, 2024 at 2:27 PM Xuanwo  wrote:
>
>> +1 non-binding
>>
>> Thank you for carrying this release, seems nice!
>>
>> [x] Download links are valid.
>> [x] Checksums and signatures.
>>
>> :) for i in *.tar.gz; do
>>  gpg --verify $i.asc $i
>>  sha512sum -c $i.sha512
>> done
>> gpg: Signature made Wed 18 Dec 2024 09:01:45 AM CST
>> gpg:using RSA key 736A14A51AA5E56B580312A59816959ADEB8F9E6
>> gpg: checking the trustdb
>> gpg: Note: ultimately trusted key 71751399FB39CB84 expired
>> gpg: Note: ultimately trusted key 0C69C1EF41181E13 expired
>> gpg: Note: ultimately trusted key 9476842D24B7C885 expired
>> gpg: marginals needed: 3  completes needed: 1  trust model: pgp
>> gpg: depth: 0  valid:  30  signed:   2  trust: 0-, 0q, 0n, 0m, 0f, 30u
>> gpg: depth: 1  valid:   2  signed:   1  trust: 2-, 0q, 0n, 0m, 0f, 0u
>> gpg: next trustdb check due at 2026-10-27
>> gpg: Good signature from "Sung Yun (CODE SIGNING KEY) "
>> [ultimate]
>> apache-iceberg-rust-0.4.0-src.tar.gz: OK
>>
>> [x] LICENSE/NOTICE files exist
>> [x] No unexpected binary files
>> [x] All source files have ASF headers
>> [x] Can compile from source
>>
>>

Re: [VOTE] Release Apache Iceberg Rust 0.4.0 RC2

2024-12-18 Thread Kevin Liu

Hey folks,

Following up on my previous email: it turns out there are some limitations
with running the tests on macOS with native Docker. This is already documented
at
https://github.com/apache/iceberg-rust/blob/main/CONTRIBUTING.md#install-docker-or-podman
After installing OrbStack as suggested, I was able to run the tests
successfully. I see that Fokko also ran into this issue in
apache/iceberg-rust#748 <https://github.com/apache/iceberg-rust/pull/748>.
```
cargo test --features storage-gcs --test file_io_gcs_test
make test
```
Making a note here so others can unblock themselves.
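
The failures in my earlier log were TCP connect timeouts to the container
address (`172.21.0.2:9000`). A quick way to check whether a container IP is
reachable from the host before running the full suite is a plain socket
probe. This is only a diagnostic sketch; the function name is made up and it
is not part of the iceberg-rust tooling:

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

For example, `can_connect("172.21.0.2", 9000)` returning `False` would point to
the networking limitation rather than to a problem with the release itself.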

Regarding this RC, +1 (non-binding). I've verified the following:
[x] Download links are valid.
[x] Checksums and signatures.
[x] LICENSE/NOTICE files exist
[x] No unexpected binary files
[x] All source files have ASF headers
[x] Can compile from source
[x] `./scripts/verify.py` (with `chmod +x`)
[x] `make test` macos m1 + orbstack
[x] Built and ran tests for `pyiceberg_core` following
https://github.com/apache/iceberg-rust/tree/main/bindings/python
[x] Installed and ran an example using the
`pyiceberg_core.transform.bucket` function
```
import pyarrow as pa
import pyiceberg_core

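# Sample input: six integers to assign to buckets
data = [1, 2, 3, 4, 5, 6]
pyarrow_array = pa.array(data)
num_buckets = 2
# Apply the bucket transform to the whole Arrow array
bucketed_result = pyiceberg_core.transform.bucket(pyarrow_array, num_buckets)

print(f"Bucketed Result: {bucketed_result}")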
```
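
For context, the Iceberg spec defines the bucket transform as a 32-bit Murmur3
hash of the value, masked to be non-negative, modulo the bucket count, with
integers hashed as 8-byte little-endian values. Below is a pure-Python sketch
under those assumptions. It is not the `pyiceberg_core` implementation and has
not been checked against it; the function names are made up:

```python
def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """Return the unsigned 32-bit Murmur3 (x86 variant) hash of data."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    mask = 0xFFFFFFFF

    def rotl(x: int, r: int) -> int:
        return ((x << r) | (x >> (32 - r))) & mask

    h = seed & mask
    n_blocks = len(data) // 4
    # Body: mix each full 4-byte little-endian block into the state.
    for i in range(n_blocks):
        k = int.from_bytes(data[4 * i : 4 * i + 4], "little")
        k = rotl((k * c1) & mask, 15)
        k = (k * c2) & mask
        h = rotl(h ^ k, 13)
        h = (h * 5 + 0xE6546B64) & mask

    # Tail: mix the remaining 1 to 3 bytes, if any.
    tail = data[4 * n_blocks :]
    if tail:
        k = int.from_bytes(tail, "little")
        k = rotl((k * c1) & mask, 15)
        k = (k * c2) & mask
        h ^= k

    # Finalization: fold in the length and avalanche the bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h


def bucket(value: int, num_buckets: int) -> int:
    """Assign an integer to a bucket in [0, num_buckets)."""
    hashed = murmur3_x86_32(value.to_bytes(8, "little", signed=True))
    return (hashed & 0x7FFFFFFF) % num_buckets
```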

Best,
Kevin Liu



On Wed, Dec 18, 2024 at 8:41 AM Fokko Driesprong  wrote:

> Hey Kevin,
>
> Ran into the same thing :) Currently, the tests don't support Docker
> <https://github.com/apache/iceberg-rust/pull/748>, I've switched to Podman
> <https://podman.io/> and it works like a charm.
>
> Kind regards,
> Fokko
>
> On Wed, Dec 18, 2024 at 16:46, Kevin Liu wrote:
>
>> Hey Sung,
>>
>> Thanks for the new RC. I've run the following verification steps.
>> [x] Download links are valid.
>> [x] Checksums and signatures.
>> [x] LICENSE/NOTICE files exist
>> [x] No unexpected binary files
>> [x] All source files have ASF headers
>> [x] Can compile from source
>> [x] `./scripts/verify.py` (with `chmod +x`)
>> [ ] `make test`
>>
>> I'm having trouble running the tests successfully. See the error log
>> below. Running a single test works, i.e. `cargo test -p iceberg --test
>> file_io_gcs_test`.
>>
>> Are others running into the same issue?
>>
>> Error log:
>> ```
>> test tests::test_file_io_s3_output ... FAILED
>> test tests::test_file_io_s3_exists ... FAILED
>> test tests::test_file_io_s3_input ... FAILED
>>
>> failures:
>>
>> ---- tests::test_file_io_s3_output stdout ----
>> thread 'tests::test_file_io_s3_output' panicked at
>> crates/iceberg/tests/file_io_s3_test.rs:80:67:
>> called `Result::unwrap()` on an `Err` value: Unexpected => Failure in
>> doing io operation
>>
>> Source: Unexpected (persistent) at stat, context: { url:
>> http://172.21.0.2:9000/bucket1/test_output, called:
>> http_util::Client::send, service: s3, path: test_output } => send http
>> request, source: error sending request for url (
>> http://172.21.0.2:9000/bucket1/test_output): error sending request for
>> url (http://172.21.0.2:9000/bucket1/test_output): client error
>> (Connect): tcp connect error: Operation timed out (os error 60): Operation
>> timed out (os error 60)
>>
>>
>> ---- tests::test_file_io_s3_exists stdout ----
>> thread 'tests::test_file_io_s3_exists' panicked at
>> crates/iceberg/tests/file_io_s3_test.rs:73:59:
>> called `Result::unwrap()` on an `Err` value: Unexpected => Failure in
>> doing io operation
>>
>> Source: Unexpected (persistent) at stat, context: { url:
>> http://172.21.0.2:9000/bucket2/any, called: http_util::Client::send,
>> service: s3, path: any } => send http request, source: error sending
>> request for url (http://172.21.0.2:9000/bucket2/any): error sending
>> request for url (http://172.21.0.2:9000/bucket2/any): client error
>> (Connect): tcp connect error: Operation timed out (os error 60): Operation
>> timed out (os error 60)
>>
>>
>> ---- tests::test_file_io_s3_input stdout ----
>> thread 'tests::test_file_io_s3_input' panicked at
>> crates/iceberg/tests/file_io_s3_test.rs:93:58:
>> called `Result::unwrap()` on an `Err` value: Unexpected => Failure in
>> doing io operation
>>
>> Source: Unexpected (persistent) at Writer::close, context: { url:
>> http://172.21.0.2:9000/bucket1/test_input, called:
>> http_util::Client::send, service: s3, path: test_input, written: 10 } =>
>> send http request, source: error sending

Re: [ANNOUNCE] Apache Iceberg release 1.7.1

2024-12-13 Thread Kevin Liu

Thanks for driving the release Bryan!

Best,
Kevin Liu

On Mon, Dec 9, 2024 at 10:45 PM Eduard Tudenhöfner 
wrote:

> Thanks Bryan and everyone else for making this release happen.
>
> On Tue, Dec 10, 2024 at 12:27 AM Yuya Ebihara  wrote:
>
>> Thank you Bryan! The Trino project has been waiting for 1.7.1, which fixes
>> the namespace regression.
>>
>

Re: [PROPOSAL] Create Iceberg DockerHub repository

2024-12-13 Thread Kevin Liu

Hey folks,

I want to add a few references here to close the loop.

The docker image is available on docker hub under
`apache/iceberg-rest-fixture`,
https://hub.docker.com/r/apache/iceberg-rest-fixture
A few Iceberg subprojects are already using the image:
* iceberg-rust
<https://github.com/apache/iceberg-rust/blob/2e0b64646fcfbd909788236a251a3a374a193542/crates/integration_tests/testdata/docker-compose.yaml#L23>
* iceberg-go
<https://github.com/apache/iceberg-go/blob/88bbae37af6b24998fc334831f4d63cd444aac1e/dev/docker-compose.yml#L42>
* iceberg-python
<https://github.com/apache/iceberg-python/blob/a97d13c17cd03f86252b9df2c65532ec45fb05da/dev/docker-compose-integration.yml#L44>

Thanks everyone for making this happen!

Best,
Kevin Liu

On Mon, Dec 9, 2024 at 12:15 AM Jean-Baptiste Onofré 
wrote:

> Hi Piotr
>
> That's a good point.
>
> As DockerHub is managed by The ASF, I think it's worth it to have
> docker images hosted there at least. That said, I don't see a problem
> with publishing on GH Packages.
>
> Regards
> JB
>
> On Thu, Dec 5, 2024 at 2:57 PM Piotr Findeisen
>  wrote:
> >
> > Hi,
> >
> > Sorry for coming late here.
> > Did we consider GitHub packages as a home of the Apache docker images?
> > We already use GitHub for development and GitHub packages are better
> integrated with GitHub.
> > In my personal opinion github packages are also less likely to be rate
> limited.
> >
> > Best
> > Piotr
> >
> >
> >
> >
> > On Fri, 22 Nov 2024 at 19:03, Jean-Baptiste Onofré 
> wrote:
> >>
> >> Hi
> >>
> >> That's correct: in Sung's PR, I can see the secret.DOCKERHUB_USER and
> >> secret.DOCKERHUB_TOKEN.
> >> So, we should be able to publish docker images via this GitHub action ;)
> >>
> >> Regards
> >> JB
> >>
> >> On Fri, Nov 22, 2024 at 6:16 PM Fokko Driesprong 
> wrote:
> >> >
> >> > I think Sung beat you to it:
> https://github.com/apache/iceberg/pull/11632
> >> >
> >> > As mentioned earlier it would be awesome if we could have a nightly
> build so we can test all the different languages against the nightly. In
> this case, when there are changes or new features, we can test/implement
> them right away.
> >> >
> >> > Kind regards,
> >> > Fokko
> >> >
> >> > On Fri, Nov 22, 2024 at 18:11, Kevin Liu wrote:
> >> >>
> >> >> Thanks for setting this up, JB! It looks like PR #11283 is close to
> being merged.
> >> >>
> >> >> What is the deployment strategy for the Docker image? Ideally, this
> process could be fully automated using GitHub and GitHub Actions.
> >> >>
> >> >> I’d love to hear everyone’s thoughts on this!
> >> >>
> >> >> Best regards,
> >> >> Kevin Liu
> >> >>
> >> >>
> >> >> On Fri, Nov 22, 2024 at 6:06 AM Jean-Baptiste Onofré <
> j...@nanthrax.net> wrote:
> >> >>>
> >> >>> Hi folks,
> >> >>>
> >> >>> I created the iceberg repo on DockerHub (in the Apache org):
> >> >>>
> >> >>> https://hub.docker.com/r/apache/iceberg
> >> >>>
> >> >>> I created an "Iceberg team" on DockerHub.
> >> >>>
> >> >>> I created DOCKERHUB_USER and DOCKERHUB_TOKEN credentials for the
> >> >>> Iceberg repo. That will allow us to directly push on DockerHub repo
> >> >>> from GitHub Action.
> >> >>> I also added Fokko to the repo.
> >> >>>
> >> >>> If you are a committer and you want to get permission on the Iceberg
> >> >>> DockerHub repo, please let me know, I will add your DockerHub
> account
> >> >>> to the "iceberg team".
> >> >>>
> >> >>> Thanks !
> >> >>>
> >> >>> Regards
> >> >>> JB
> >> >>>
> >> >>> On Fri, Nov 15, 2024 at 7:39 PM Kevin Liu 
> wrote:
> >> >>> >
> >> >>> > +1 to Iceberg REST TCK docker image. Thanks, JB for driving this
> and Ajantha for setting up the docker image.
> >> >>> > We already found a bug in PyIceberg [1] from integrating with the
> TCK docker image. It would be great to have a nightly build, perhaps we can
> set up a Github Action to automate the docker image publishing

Re: New committer: Scott Donnelly

2024-12-13 Thread Kevin Liu

Congrats Scott! Looking forward to collaborating with you in the future.

Best,
Kevin Liu

On Wed, Dec 11, 2024 at 9:26 PM NOTME ZE  wrote:

> Congratulations Scott!
>
> On Thu, Dec 12, 2024 at 06:12, huaxin gao wrote:
>
>> Congratulations Scott!
>>
>> On Wed, Dec 11, 2024 at 9:07 AM Steve Zhang
>>  wrote:
>>
>>> Congratulations Scott!
>>>
>>> Thanks,
>>> Steve Zhang
>>>
>>>
>>>
>>> On Dec 11, 2024, at 4:47 AM, Fokko Driesprong  wrote:
>>>
>>> Congratulations Scott!
>>>
>>>
>>>

Re: New committer: Matt Topol

2024-12-13 Thread Kevin Liu

Congrats Matt!!

On Wed, Dec 11, 2024 at 2:12 PM huaxin gao  wrote:

> Congratulations, Matt!
>
> On Tue, Dec 10, 2024 at 11:13 PM Eduard Tudenhöfner <
> etudenhoef...@apache.org> wrote:
>
>> Congrats Matt!
>>
>> On Wed, Dec 11, 2024 at 6:41 AM Honah J.  wrote:
>>
>>> Congratulations, Matt!
>>>
>>> On Tue, Dec 10, 2024 at 7:51 PM Fenil Jain  wrote:
>>>
>>>> Congratulations Matt!
>>>>
>>>> On Tue, Dec 10, 2024 at 3:56 PM Fokko Driesprong
>>>> wrote:
>>>> >
>>>> > Hey everyone,
>>>> >
>>>> > The Project Management Committee (PMC) for Apache Iceberg has invited
>>>> > Matt Topol to become a committer. Matt has done amazing work at
>>>> > kickstarting Iceberg-Go, and we are pleased to announce that he has
>>>> > accepted.
>>>> >
>>>> > Please join us in welcoming Matt to their new role and responsibility
>>>> > in our project community.
>>>> >
>>>> > Fokko Driesprong
>>>> > On behalf of the Iceberg PMC

>>>

[Announce] Singapore Apache Iceberg Community Meetup

2024-12-13 Thread Kevin Liu

Hey everyone,

The Iceberg community meetup has expanded to Singapore! The next Singapore
meetup will be on Wednesday, December 18 from 5:00 PM to 8:00 PM at 16
Collyer Quay, level 12
Singapore.

Here's the luma page to sign up for the event, https://lu.ma/79xk5w5t

There will be 3 presentations from community members.
* Rayner Chen (Tech VP at VeloDB) - Building Fast Data Lake Analysis on top
of Apache Iceberg & Apache Doris
* Xinyu Zhou (Co-Founder & CTO at AutoMQ) - AutoMQ Table Topic: Bridging
Streaming and Analytics with Iceberg
* Jay Chia (Co-Founder at Eventual (Daft)) - Iceberg in Python

Presentations will be recorded and uploaded to the Iceberg meetup YouTube
channel (https://www.youtube.com/@IcebergMeetup)

Best,
Kevin Liu

Re: [DISCUSS] December board report

2024-12-13 Thread Kevin Liu

Hi Ryan,

Thanks for putting together the report!

I have a couple of items that might be helpful to include.

PyIceberg
* Updated the release process and documentation
* Updated integration tests to use the TCK REST catalog docker image
(`apache/iceberg-rest-fixture`) built using the apache/iceberg repo
* Added manifest files caching
* Added support for all metadata tables via the Table Inspect API
* Added support for High Availability mode for Hive Metastore
* Removed `numpy` as a hard dependency
* PyIceberg crossed 100k daily downloads on PyPI (
https://pypistats.org/packages/pyiceberg)

Community
* Several recurring community meetups have started in
Seattle/SF/Singapore and more are planned
* Meetup presentations are recorded and available on the Apache Iceberg
Meetup YouTube channel (https://www.youtube.com/@IcebergMeetup)

Best,
Kevin Liu

On Wed, Dec 11, 2024 at 10:30 PM Péter Váry 
wrote:

> Hi Ryan,
> Thanks for putting this together!
> For Java/Flink we could mention that ExpireSnapshots TableMaintenance is
> available now.
>
> On Thu, Dec 12, 2024, 04:47 Ajantha Bhat  wrote:
>
>> At Java side, I would add
>>
>> - Core util to compute partition stats has been merged.
>> https://github.com/apache/iceberg/pull/11146
>>
>> - REST catalog TCK has been merged and docker image is published under
>> `apache/iceberg-rest-fixture`
>>
>> Also,
>>
>>>  Spark: Removed Spark 3.3 support
>>
>> We just deprecated it after 1.7.0, so there will be one last release in
>> 1.8.0.
>> Maybe we can reword as Deprecated instead of removed as the code is still
>> there.
>>
>> - Ajantha
>>
>> On Thu, Dec 12, 2024 at 7:25 AM Renjie Liu 
>> wrote:
>>
>>> For Rust, we have added support for the Parquet data file writer, and
>>> support for other writers is underway.
>>>
>>>
>>> On Thu, Dec 12, 2024 at 9:26 AM Gang Wu  wrote:
>>>
>>>> For C++, I think it is aimed for a full featured C++ library (not for
>>>> puffin implementation only).
>>>>
>>>> On Thu, Dec 12, 2024 at 6:14 AM rdb...@gmail.com 
>>>> wrote:
>>>>
>>>>> I'll update it. Thanks!
>>>>>
>>>>> (By the way, the Avro default value support was in the Java section)
>>>>>
>>>>> On Wed, Dec 11, 2024 at 2:00 PM Matt Topol 
>>>>> wrote:
>>>>>
>>>>>> For the Go release, can we please point out that it supports reading
>>>>>> the data too, not just metadata?
>>>>>>
>>>>>> It produces a stream of Arrow record batches.
>>>>>>
>>>>>> On Wed, Dec 11, 2024, 4:22 PM Walaa Eldin Moustafa <
>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Ryan,
>>>>>>>
>>>>>>> For Table Format V3, we could point out that the default value
>>>>>>> support for Avro has been merged and support for other formats is 
>>>>>>> ongoing.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Walaa.
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Dec 11, 2024 at 12:51 PM rdb...@gmail.com 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> It’s time to report to the board again. Great to see all the
>>>>>>>> progress here, and awesome to have our first go release this quarter!
>>>>>>>>
>>>>>>>> My draft is below. Please reply if there’s anything you’d like to
>>>>>>>> add or change. Thanks!
>>>>>>>>
>>>>>>>> Ryan
>>>>>>>> Description:
>>>>>>>>
>>>>>>>> Apache Iceberg is a table format for huge analytic datasets that is
>>>>>>>> designed
>>>>>>>> for high performance and ease of use.
>>>>>>>> Project Status:
>>>>>>>>
>>>>>>>> Current project status: Ongoing
>>>>>>>> Issues for the board: None
>>>>>>>> Membership Data:
>>>>>>>>
>>>>>>>> Apache Iceberg was founded 2020-05-19 (5 years ago)
>>>>>>>> There are currently 32 committers and 21 PMC members in this
>>>>>>>> project.
>>>>>>&

Re: [PROPOSAL] Create Iceberg DockerHub repository

2024-11-22 Thread Kevin Liu

Awesome! Thanks, Sung! :)

On Fri, Nov 22, 2024 at 9:16 AM Fokko Driesprong  wrote:

> I think Sung beat you to it: https://github.com/apache/iceberg/pull/11632
>
> As mentioned earlier it would be awesome if we could have a nightly build
> so we can test all the different languages against the nightly. In this
> case, when there are changes or new features, we can test/implement them
> right away.
>
> Kind regards,
> Fokko
>
> Op vr 22 nov 2024 om 18:11 schreef Kevin Liu :
>
>> Thanks for setting this up, JB! It looks like PR #11283
>> <https://github.com/apache/iceberg/pull/11283> is close to being merged.
>>
>> What is the deployment strategy for the Docker image? Ideally, this
>> process could be fully automated using GitHub and GitHub Actions.
>>
>> I’d love to hear everyone’s thoughts on this!
>>
>> Best regards,
>> Kevin Liu
>>
>>
>> On Fri, Nov 22, 2024 at 6:06 AM Jean-Baptiste Onofré 
>> wrote:
>>
>>> Hi folks,
>>>
>>> I created the iceberg repo on DockerHub (in the Apache org):
>>>
>>> https://hub.docker.com/r/apache/iceberg
>>>
>>> I created an "Iceberg team" on DockerHub.
>>>
>>> I created DOCKERHUB_USER and DOCKERHUB_TOKEN credentials for the
>>> Iceberg repo. That will allow us to directly push on DockerHub repo
>>> from GitHub Action.
>>> I also added Fokko to the repo.
>>>
>>> If you are a committer and you want to get permission on the Iceberg
>>> DockerHub repo, please let me know, I will add your DockerHub account
>>> to the "iceberg team".
>>>
>>> Thanks !
>>>
>>> Regards
>>> JB
>>>
>>> On Fri, Nov 15, 2024 at 7:39 PM Kevin Liu 
>>> wrote:
>>> >
>>> > +1 to Iceberg REST TCK docker image. Thanks, JB for driving this and
>>> Ajantha for setting up the docker image.
>>> > We already found a bug in PyIceberg [1] from integrating with the TCK
>>> docker image. It would be great to have a nightly build, perhaps we can set
>>> up a Github Action to automate the docker image publishing.
>>> >
>>> > Best,
>>> > Kevin Liu
>>> >
>>> >
>>> > [1] https://github.com/apache/iceberg-python/pull/1321
>>> >
>>> > On Fri, Nov 15, 2024 at 1:36 AM Fokko Driesprong 
>>> wrote:
>>> >>
>>> >> +1 — excited to see this happen!
>>> >>
>>> >> For the TCK, I think we can release this with the Java together, and
>>> have a nightly build (tag the container with nightly Dockerhub). This way
>>> we can already test out (and start implementing) the new features in the
>>> related projects. Thoughts on that?
>>> >>
>>> >>> Regarding the Kafka Connect Docker image, I believe that if we
>>> maintain it, we could also manage other integration images, such as those
>>> for Spark and Trino with Iceberg. We should have a separate discussion on
>>> which integration images Iceberg should officially support.
>>> >>
>>> >>
>>> >> Let's split out that discussion. My take on that is that we want to
>>> defer that to the query engines. In an ideal situation, the Iceberg
>>> integration should be part of the project itself (e.g. with Hive 4 where it
>>> is maintained by Hive itself). For Spark itself, it only requires a runtime
>>> to be added through the packages argument, and would love to see if we can
>>> avoid maintaining images for that.
>>> >>
>>> >> Kind regards,
>>> >> Fokko
>>> >>
>>> >>
>>> >> Op do 14 nov 2024 om 18:16 schreef Christian Thiel <
>>> christ...@hansetag.com.invalid>:
>>> >>>
>>> >>> +1 for this as well – for us especially the REST TCK image would be
>>> nice.
>>> >>>
>>> >>>
>>> >>>
>>> >>> From: Bryan Keller 
>>> >>> Date: Thursday, 14. November 2024 at 17:13
>>> >>> To: dev@iceberg.apache.org 
>>> >>> Subject: Re: [PROPOSAL] Create Iceberg DockerHub repository
>>> >>>
>>> >>> +1 this would be great! Thanks JB.
>>> >>>
>>> >>>
>>> >>>
>>> >>> -Bryan
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Nov 14, 202

Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-22 Thread Kevin Liu

> Should add, my personal preference is probably not to change the existing
behavior for this part

+1. I realized that this is not a new behavior. The `loadTable`
implementation has this problem too.
It would be good to have a test case specifically for this edge case and
maybe call this out in the documentation.

Thanks,
Kevin Liu

On Fri, Nov 22, 2024 at 11:57 AM Szehon Ho  wrote:

> Should add, my personal preference is probably not to change the existing
> behavior for this part (false, if exists a Hive table with same name) at
> the moment, just adding another possibility for consideration.
>
> Thanks
> Szehon
>
> On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho  wrote:
>
>> Thanks Kevin and Gabor, this is an interesting discussion.  I guess a
>> third option instead of returning true/false in this case, is to change it
>> to throw a NoSuchIcebergTableException if it's a non-Iceberg table, which I
>> think is actually what this pr does?
>>
>> Thanks
>> Szehon
>>
>> On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab
>>  wrote:
>>
>>> Hey,
>>>
>>> I think what Kevin says makes sense. However, it would then confuse the
>>> opposite use case of this function. Let's assume that we change the
>>> implementation of tableExists() to not load the table internally:
>>>
>>> if (tableExists(table_name)) {
>>> table = loadTable(table_name);
>>> }
>>>
>>> Here, you find that the table exists but when you try to load it it
>>> fails because it's not an Iceberg table. I don't think that any of these 2
>>> are intuitive. I think the question here is how much an API of the Iceberg
>>> table format should know about the existence of tables in other formats.
>>>
>>> If `tableExists` is meant to check for conflicting entries in the HMS
>>>
>>> Another interpretation of calling Catalog.tableExists() on an Iceberg
>>> API is instead "is there such an Iceberg table". TBH, not sure if any of
>>> the 2 approaches are better than the other, I just wanted to show that
>>> there is another side of the coin :)
>>>
>>> Regards,
>>> Gabor
>>>
>>> On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu 
>>> wrote:
>>>
>>>> Hi Steve,
>>>>
>>>> This makes sense to me. The semantics of `tableExists` focus on whether
>>>> a table's name exists in the catalog. For the Hive catalog, checking the
>>>> HMS entry should be sufficient.
>>>>
>>>> I do have a question about usage, though. Typically, I would use
>>>> `tableExists` like this:
>>>>
>>>> ```
>>>> if (!tableExists(table_name)) {
>>>>     table = createTable(table_name);
>>>> }
>>>> ```
>>>> What happens when a Hive table with the same name already exists in the
>>>> catalog? In the current implementation, `tableExists` would return `false`
>>>> because `HiveOperationsBase.validateTableIsIceberg` throws a
>>>> `NoSuchTableException`.
>>>> This would cause the code above to attempt to create the table, only to
>>>> fail since the name already exists in the HMS.
>>>> If `tableExists` is meant to check for conflicting entries in the HMS,
>>>> perhaps it should return true even when a Hive table with the same name
>>>> exists.
>>>>
>>>> I’d love to hear your thoughts on this.
>>>>
>>>> Best,
>>>> Kevin Liu
>>>>
>>>> On Thu, Nov 21, 2024 at 5:22 PM Szehon Ho 
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> It's a good performance find and improvement.   Left some comment on
>>>>> the PR.
>>>>>
>>>>> IMO, the behavior actually more matches the API javadoc ("Check
>>>>> whether table exists"), not whether it is corrupted or not, so I'm
>>>>> supportive of it.
>>>>>
>>>>> Thanks
>>>>> Szehon
>>>>>
>>>>> On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang
>>>>>  wrote:
>>>>>
>>>>>> Hi Iceberger,
>>>>>>
>>>>>>   I have a proposal to simplify the tableExists API in the Hive
>>>>>> catalog, which involves a behavior change, and I’d like to hear your
>>>>>> thoughts.
>>>>>>
>>>>>>   Currently, in our catalog interface[1],

Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-21 Thread Kevin Liu

Hi Steve,

This makes sense to me. The semantics of `tableExists` focus on whether a
table's name exists in the catalog. For the Hive catalog, checking the HMS
entry should be sufficient.

I do have a question about usage, though. Typically, I would use
`tableExists` like this:

```
if (!tableExists(table_name)) {
    table = createTable(table_name);
}
```
What happens when a Hive table with the same name already exists in the
catalog? In the current implementation, `tableExists` would return `false`
because `HiveOperationsBase.validateTableIsIceberg` throws a
`NoSuchTableException`.
This would cause the code above to attempt to create the table, only to
fail since the name already exists in the HMS.
If `tableExists` is meant to check for conflicting entries in the HMS,
perhaps it should return true even when a Hive table with the same name
exists.
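
To make the scenario above concrete, here is a minimal, self-contained sketch. It does not use the Iceberg or Hive APIs: `TableExistsSketch`, `EntryType`, `register`, `tableExistsViaLoad`, and `nameExists` are hypothetical names, and the in-memory map only stands in for the HMS. It assumes a load-based check that reports `false` for a non-Iceberg entry, as described above, and contrasts it with a name-only check.

```java
import java.util.HashMap;
import java.util.Map;

public class TableExistsSketch {
    // Hypothetical entry kinds; a toy model, not the real HMS schema.
    public enum EntryType { ICEBERG, HIVE }

    // In-memory stand-in for the metastore: table name -> kind of entry.
    private final Map<String, EntryType> entries = new HashMap<>();

    public void register(String name, EntryType type) {
        entries.put(name, type);
    }

    // Models the load-based check: true only if the entry exists
    // and is an Iceberg table (a Hive table fails the "load").
    public boolean tableExistsViaLoad(String name) {
        return entries.get(name) == EntryType.ICEBERG;
    }

    // Models a name-only check: true for any entry with that name.
    public boolean nameExists(String name) {
        return entries.containsKey(name);
    }

    // Creation fails when the name is taken, whatever the entry kind.
    public void createTable(String name) {
        if (entries.containsKey(name)) {
            throw new IllegalStateException("name already taken: " + name);
        }
        entries.put(name, EntryType.ICEBERG);
    }

    public static void main(String[] args) {
        TableExistsSketch catalog = new TableExistsSketch();
        catalog.register("db.events", EntryType.HIVE);

        // The usage pattern from above: the check passes, the create fails.
        if (!catalog.tableExistsViaLoad("db.events")) {
            try {
                catalog.createTable("db.events");
            } catch (IllegalStateException e) {
                System.out.println("create failed: " + e.getMessage());
            }
        }
        System.out.println("name-only check: " + catalog.nameExists("db.events"));
    }
}
```

With the name-only check, the same `if` block would skip the create call for a pre-existing Hive table, at the cost of reporting a table that cannot be loaded as Iceberg.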

I’d love to hear your thoughts on this.

Best,
Kevin Liu

On Thu, Nov 21, 2024 at 5:22 PM Szehon Ho  wrote:

> Hi,
>
> It's a good performance find and improvement.   Left some comment on the
> PR.
>
> IMO, the behavior actually more matches the API javadoc ("Check whether
> table exists"), not whether it is corrupted or not, so I'm supportive of it.
>
> Thanks
> Szehon
>
> On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang
>  wrote:
>
>> Hi Iceberger,
>>
>>   I have a proposal to simplify the tableExists API in the Hive catalog,
>> which involves a behavior change, and I’d like to hear your thoughts.
>>
>>   Currently, in our catalog interface[1], the tableExists method is
>> implemented as a default API by invoking the loadTable method. It
>> returns true if the table can be loaded without exceptions. This
>> behavior implies two checks:
>>
>>1. The table entry exists in the catalog.
>>2. The latest metadata.json for the table is not corrupted.
>>
>>   The behavior change I’m proposing focuses only on the first
>> condition—checking if the table entry exists in the catalog. This separates
>> the concerns of table existence and table health (e.g., metadata not
>> corrupted). Such a change could improve the performance of existence
>> checks, especially for the REST catalog, where table existence is abstracted as
>> an HTTP HEAD request [2].
>>
>> I also reviewed the current usage of the tableExists API in the Iceberg
>> codebase to ensure that this optimization would not have any negative
>> impact.
>>
>> I’d love to hear everyone’s feedback on this! If there’s consensus, I can
>> follow up with a similar optimization for the viewExists method in the
>> Hive catalog.
>>
>> [1]: https://github.com/apache/iceberg/pull/11597
>>
>> [2]:
>> https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/open-api/rest-catalog-open-api.yaml#L1129-L1133
>>
>>
>>
>> Best regards,
>> Steve Zhang
>>
>>
>>
>>
