date:20241127

Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-27 Thread Péter Váry

+1 from my side too.

I wanted to make sure that the community is aware of this change which will
bring behavioral difference compared to other catalogs. This is why I have
asked Steve to start this thread.


On Thu, Nov 28, 2024, 02:10 Szehon Ho  wrote:

> Yea, I think that part is definitely kept.
>
> Thanks
> Szehon
>
> On Wed, Nov 27, 2024 at 12:02 PM rdb...@gmail.com 
> wrote:
>
>> I'd support changing the behavior if we still have a way to match the
>> intent, which is to return true if the table exists in Hive and is an
>> Iceberg table.
>>
>> On Wed, Nov 27, 2024 at 11:26 AM Szehon Ho 
>> wrote:
>>
>>> Hm I think the thread got a bit sidetracked by the other question.
>>>
>>> The initial proposal by Steve is a performance improvement for
>>> HiveCatalog's tableExists().  Currently it loads both Hive and Iceberg
>>> table metadata, and if successful returns true.  The proposal is to load
>>> from Hive only, and return true if Hive metadata identifies that an Iceberg
>>> table exists with this name.
>>>
>>> Checking corruption of Iceberg's table metadata.json is a side-effect of
>>> the current behavior, but would not anymore with the proposed change.
>>> That's the question of the original thread, and so far there's agreement
>>> that it is not necessarily part of this scope of HiveCatalog's
>>> tableExists().
>>>
>>> At least this is my understanding.
>>> Thanks,
>>> Szehon
>>>
>>> On Wed, Nov 27, 2024 at 10:56 AM rdb...@gmail.com 
>>> wrote:
>>>
>>>> What kind of corruption are you referring to? I would expect corruption
>>>> to result in an exception when loading the table, but that the table should
>>>> still exist. The problem is likely that we determine if a table exists by
>>>> attempting to load it. We could fix that by not attempting to load the
>>>> table. I think that's a reasonable solution.
>>>>
>>>> On Wed, Nov 27, 2024 at 12:45 AM Manu Zhang wrote:
>>>>
>>>>>> The current behavior's intent is not to check whether the metadata is
>>>>>> valid, it is to detect whether the table is an Iceberg table.
>>>>>
>>>>>
>>>>> Is there a way to detect this from HiveCatalog without loading the
>>>>> table?
>>>>>
>>>>>
>>>>> On Wed, Nov 27, 2024 at 2:01 PM Péter Váry <
>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>
>>>>>> I think we have an agreement, not to change the behavior wrt existing
>>>>>> non-Iceberg tables, and throw an exception.
>>>>>>
>>>>>> Are we also in agreement with the original proposal to return true
>>>>>> when the table exists but the metadata is somehow corrupted? Note: this is
>>>>>> the proposed change of behavior why the thread was originally started.
>>>>>>
>>>>>> On Tue, Nov 26, 2024, 21:30 rdb...@gmail.com wrote:
>>>>>>
>>>>>>> I'd argue against changing this. The current behavior's intent is
>>>>>>> not to check whether the metadata is valid, it is to detect whether the
>>>>>>> table is an Iceberg table. It ignores non-Iceberg tables. Changing that
>>>>>>> behavior would be surprising, especially if we started throwing exceptions.
>>>>>>>
>>>>>>> On Fri, Nov 22, 2024 at 2:01 PM Kevin Liu wrote:
>>>>>>>
>>>>>>>> > Should add, my personal preference is probably not to change the
>>>>>>>> > existing behavior for this part
>>>>>>>>
>>>>>>>> +1. I realized that this is not a new behavior. The `loadTable`
>>>>>>>> implementation has this problem too.
>>>>>>>> It would be good to have a test case specifically for this edge
>>>>>>>> case and maybe call this out in the documentation.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Kevin Liu
>>>>>>>>
>>>>>>>> On Fri, Nov 22, 2024 at 11:57 AM Szehon Ho wrote:
>>>>>>>>
>>>>>>>>> Should add, my personal preference is probably not to change the
>>>>>>>>> existing behavior for this part (false, if exists a Hive table with same
>>>>>>>>> name) at the moment, just adding another possibility for consideration.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Szehon
>>>>>>>>>
>>>>>>>>> On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Kevin and Gabor, this is an interesting discussion.  I
>>>>>>>>>> guess a third option instead of returning true/false in this case, is to
>>>>>>>>>> change it to throw an NoSuchIcebergTableException if its a non-Iceberg
>>>>>>>>>> table, which I think is actually what this pr does?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Szehon
>>>>>>>>>>
>>>>>>>>>> On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey,
>>>>>>>>>>>
>>>>>>>>>>> I think what Kevin says makes sense. However, it would then
>>>>>>>>>>> confuse the opposite use case of this function. Let's assume that we change
>>>>>>>>>>> the implementation of tableExists() to not load the table internally:
>>>>>>>>>>>
>>>>>>>>>>> if (tableExists(table_name)) {
>>>>>>>>>>>   table = loadTable(table_name);
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> Here, you find that the table exists but when you t

Re: [DISCUSS] Hive Support

2024-11-27 Thread Péter Váry

Given that the Hive folks are also leaning towards keeping the hive-runtime
code in the Hive repo, I think we should move forward as Cheng Pan
suggested:
- Upgrade to Hive 4
- Remove hive-runtime code and tests
- Make sure that a nightly build is available, so Hive folks could run
integration tests, and could raise an issue if something breaks with the
integration

Thanks, Peter

On Thu, Nov 28, 2024, 06:46 Ajantha Bhat  wrote:

> +1 to remove support for both Hive2 and Hive3 in the latest Iceberg
> release as it has reached EOL.
>
> Hive4 is natively managing Iceberg integration, similar to how Trino
> handles its Iceberg integration. Therefore, in my opinion, it would be
> better for engines to manage the integration aspect, allowing the Iceberg
> community to focus on the specification and table format.
>
> - Ajantha
>
> On Thu, Nov 28, 2024 at 12:47 AM Fokko Driesprong 
> wrote:
>
>> Hey Cheng,
>>
>> Thanks for the suggestion. The nightly snapshots are available:
>> https://repository.apache.org/content/groups/snapshots/org/apache/iceberg/iceberg-core/,
>> which might help when working on features that are not released yet (eg
>> Nanosecond timestamps). Besides that, we should run RCs against Hive to
>> check if everything works as expected.
>>
>> I'm leaning toward removing Hive 2 and 3 as well.
>>
>> Kind regards,
>> Fokko
>>
>> Op wo 27 nov 2024 om 20:05 schreef rdb...@gmail.com :
>>
>>> I think that we should remove Hive 2 and Hive 3. We already agreed to
>>> remove Hive 2, but Hive 3 is not compatible with the project anymore and is
>>> already EOL and will not see a release to update it so that it can be
>>> compatible. Anyone using the existing Hive 3 support should be able to
>>> continue using older releases.
>>>
>>> In general, I think it's a good idea to let people use older releases
>>> when these situations happen. It is difficult for the project to continue
>>> to support libraries that are EOL and I don't think there's a great
>>> justification for it, considering Iceberg support in Hive 4 is native and
>>> much better!
>>>
>>> On Wed, Nov 27, 2024 at 7:12 AM Cheng Pan  wrote:
>>>
>>>> That said, it would be helpful if they continue running
>>>> tests against the latest stable Hive releases to ensure that any
>>>> changes don’t unintentionally break something for Hive, which would be
>>>> beyond our control.
>>>>
>>>>
>>>> I believe we should continue maintaining a Hive Iceberg runtime test
>>>> suite with the latest version of Hive in the Iceberg repository.
>>>>
>>>>
>>>> i think we can keep some basic Hive4 tests in iceberg repo
>>>>
>>>>
>>>> Instead of running basic tests on the Iceberg repo, maybe let Iceberg
>>>> publish daily snapshot jars to Nexus, and have a daily CI in Hive to
>>>> consume those jars and run full Iceberg tests makes more sense?
>>>>
>>>> Thanks,
>>>> Cheng Pan

Re: [DISCUSS] Apache Iceberg Summit 2025 - Selection Committee

2024-11-27 Thread Eduard Tudenhöfner

Thanks for organizing this and I'd like to volunteer to help out where I
can.

On Wed, Nov 27, 2024 at 9:16 AM Christian Thiel
 wrote:

> Hey JB,
>
> happy to help any way I can. Thanks for organizing this!
>
> Best,
> Christian
>
> On 27. Nov 2024, at 07:52, Fokko Driesprong  wrote:
>
> Hey JB,
>
> Thanks for organizing this. Happy to help!
>
> Kind regards,
> Fokko
>
> Op wo 27 nov 2024 om 06:23 schreef karuppayya :
>
>> Hi JB, I am happy to help with this.
>> - Karuppayya
>>
>> On Tue, Nov 26, 2024 at 8:55 PM Renjie Liu 
>> wrote:
>>
>>> Hi, JB:
>>>
>>> Thanks for driving this. Happy to help!
>>>
>>> On Wed, Nov 27, 2024 at 9:13 AM Bill Zhang 
>>> wrote:
>>>
>>>> Hi JB,
>>>>
>>>> Happy to help.
>>>>
>>>> Bill
>>>>
>>>> > On Nov 26, 2024, at 4:42 AM, Jean-Baptiste Onofré wrote:
>>>> >
>>>> > Hi everyone,
>>>> >
>>>> > As you probably know, we've been having discussions about the Iceberg
>>>> > Summit 2025.
>>>> >
>>>> > The PMC pre-approved the Iceberg Summit proposal, and one of the first
>>>> > steps is to put together a selection committee that will be
>>>> > responsible for choosing talks and guiding the process.
>>>> > Once we have a selection committee, I will complete the concrete
>>>> > proposal for the ASF and the Iceberg PMC to request the ability to use
>>>> > the name Iceberg/Apache Iceberg.
>>>> >
>>>> > If you'd like to help and be part of the selection committee, please
>>>> > volunteer in a reply to this thread. Since we likely can't include
>>>> > everyone that volunteers, I propose that the PMC should choose the
>>>> > final committee from the set of people that volunteer.
>>>> >
>>>> > We'll leave this open up to Dec 10th to give people time (as
>>>> > Thanksgiving is this week).
>>>> >
>>>> > Thanks !
>>>> > Regards
>>>> > JB

>>>
>

Re: [DISCUSS] Enforce table properties at catalog level

2024-11-27 Thread Pucheng Yang

I think the naming of the property should be fixed as it only applies for
any new table creation.

On Wed, Nov 27, 2024 at 2:21 AM Manu Zhang  wrote:

> Hi all,
>
> Currently, we can *enforce default table properties* at catalog level
> with configs like
> spark.sql.catalog.*catalog-name*.table-override.*propertyKey*[1].  It
> prevents users from overriding those properties when creating a table.
> However, users can still override later through altering the table.
> The Spark doc is inconsistent saying that the table-override property
> can't be overridden by user. Which one is expected?
>
>
> [1] https://iceberg.apache.org/docs/nightly/spark-configuration/#catalog-configuration
>
>
> Thanks,
> Manu
>
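
For readers skimming the archive, the behaviour Manu describes can be modelled in a few lines. This is only a sketch under stated assumptions, not Iceberg code: the helper names `effective_create_properties` and `alter_table` are hypothetical, the precedence order (catalog defaults, then user properties, then catalog overrides) is the one implied by the question, and `write.format.default` is used purely as an example property key.

```python
def effective_create_properties(catalog_defaults, user_props, catalog_overrides):
    """Merge properties at table creation: defaults < user-supplied < overrides."""
    merged = dict(catalog_defaults)
    merged.update(user_props)
    # Catalog-level overrides are applied last, so they win at creation time.
    merged.update(catalog_overrides)
    return merged


def alter_table(current_props, updates):
    """Model of a later property change: no override check is applied here."""
    altered = dict(current_props)
    altered.update(updates)
    return altered


# Hypothetical catalog configuration and user input.
overrides = {"write.format.default": "parquet"}
created = effective_create_properties({}, {"write.format.default": "orc"}, overrides)
altered = alter_table(created, {"write.format.default": "orc"})
```

In this model the override wins when the table is created, but a later alter replaces it, which is the gap between behaviour and documentation that the question is about.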

Re: Storing catalog directly on object store

2024-11-27 Thread Steve Loughran

There's a PR up from Amazon to add this to the s3a connector
https://github.com/apache/hadoop/pull/7011

targeting a 3.4.2 release early next year, though they've not updated the
PR as requested yet.


   1. It doesn't give you the same semantics as a POSIX create-no-overwrite
   call: you only get the error after the upload, not in create(). You should
   only be writing a very small file as part of your commit protocol, not
   something big.
   2. Most (all) third-party stores *do not* support this, but they don't
   fail with any errors. The only way to probe for the behaviour is actually
   to attempt to do it and see if overwrites are rejected.

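
The probing idea in point 2 can be sketched against a toy in-memory store. Everything here is an assumption made for illustration: `ConditionalStore`, `IgnoringStore`, `PreconditionFailed` and `supports_no_overwrite` are hypothetical names, and no real S3 or s3a API is used. The toy `put` rejects the write only when the upload completes, mirroring point 1.

```python
import uuid


class PreconditionFailed(Exception):
    """Raised by the toy store when a create-no-overwrite write hits an existing key."""


class ConditionalStore:
    """Toy store that honours the create-no-overwrite condition."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data, if_none_match=False):
        # The check happens when the upload completes, not when the write is opened.
        if if_none_match and key in self.objects:
            raise PreconditionFailed(key)
        self.objects[key] = data


class IgnoringStore(ConditionalStore):
    """Toy store that silently ignores the condition and overwrites anyway."""

    def put(self, key, data, if_none_match=False):
        self.objects[key] = data


def supports_no_overwrite(store):
    """Probe by writing the same small object twice and checking for a rejection."""
    key = f"_probe/{uuid.uuid4()}"
    store.put(key, b"first", if_none_match=True)
    try:
        store.put(key, b"second", if_none_match=True)
    except PreconditionFailed:
        return True
    finally:
        # Clean up the probe object whichever way the second write went.
        store.objects.pop(key, None)
    return False
```

A commit protocol built on this would run the probe once per store and refuse to rely on conditional writes when it returns False.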




On Tue, 26 Nov 2024 at 17:36, Nikhil Benesch 
wrote:

> Hi all,
>
> With Amazon S3 announcing support for the If-Match header yesterday [0],
> all the
> major object store implementations now support a compare-and-swap
> operation.
>
> As far as I can tell, this opens up the possibility of storing Iceberg
> catalogs directly on object storage, without the need for a separate
> metastore,
> and without violating any of Iceberg's ACID guarantees.
>
> It seems the immediate next step is to build an independent Java or REST
> catalog
> backend to prove this concept out. Long term, though, the ideal would be to
> have such a catalog backend be a first class citizen in the Iceberg
> project.
>
> Is anyone else in the Iceberg community barking up this tree? I'm a long
> term
> Iceberg enthusiast, but new to the community. I'd very much appreciate any
> pointers to current or past discussions on the topic. So far all I've been
> able to turn up is some light chatter from myself and others on Bluesky and
> Hacker News ([1][2][3]).
>
> Cheers,
> Nikhil
>
> [0]:
> https://aws.amazon.com/about-aws/whats-new/2024/11/amazon-s3-functionality-conditional-writes/
> [1]: https://bsky.app/profile/benesch.bsky.social/post/3lauesxg3ic2c
> [2]: https://bsky.app/profile/eatonphil.bsky.social/post/3lbskq3jwk22e
> [3]: https://news.ycombinator.com/item?id=42240370
>

Re: [DISCUSS] Hive Support

2024-11-27 Thread Ayush Saxena

> Let me know if the above doesn't make any sense, though!

To be honest, it doesn’t. The email feels accusatory, unfairly blaming
the Hive community for wrongdoing while portraying the Iceberg folks
as "worse" and insinuating misconduct on their part. This kind of tone
does nothing to foster consensus in an open-source community.

In the past, when issues arose, we discussed them, highlighted the
problems, proposed solutions, reached agreements, and moved forward.
Now, there’s another problem, and once again, some folks, acting in
good faith, have shown their willingness to negotiate and find a
solution that works for both Hive and Iceberg. This is how meaningful
collaboration happens and how consensus is built—at least according to
my limited experience with the Apache opensource community.

Anyone genuinely invested in resolving these challenges should work
towards solutions that are practical and acceptable to both sides, as
we—the Hive contributors—already did. If someone prefers to use this
thread to express individual frustrations instead of contributing
constructively, that’s their prerogative, but it should be clearly
stated as such, maybe better to have a separate thread for that with a
clear note, that this isn’t “our” opinion but their individual
opinions.

All of the current Hive-Iceberg developers have already participated
in this thread. It’s now up to the Iceberg community to consider the
points raised. As always, we remain available to collaborate and
assist in finding workable solutions.

-Ayush

> On Wed, 27 Nov 2024 at 19:38, Denys Kuzmenko  wrote:
> >
> > Hi Gabor,
> >
> > It's a bit odd to get the following feedback from the Impala folks:
> > "I'd like to understand the motivation why this whole replication of code 
> > happened between Iceberg and Hive."
> > when you know exactly why.
> >
> > FYI, we've raised our concerns multiple times to the iceberg community, for 
> > example:
> > https://lists.apache.org/thread/kb543hmpxllgq16zgh0zwf03q4w78yop
> >
> > Regards,
> > Denys

Re: [DISCUSS] iceberg rust 0.4.0 and iceberg pyiceberg_core 0.1.0 release

2024-11-27 Thread Sung Yun

Hi folks, it's been some time since we've done an Iceberg Rust release, and 
we've finally set up the GitHub Actions workflow[1] that will allow us to build and 
publish an abi3 compatible wheel to PyPI.

If we are still +1 for the release (both iceberg-rust and pyiceberg_core), I 
think it'll be awesome to get this release out soon as it will help the 
PyIceberg community test out the pyiceberg_core binding in preparation for the 
next release.

Another option would be to introduce a workflow_dispatch trigger to the 
python_release.yml and run a decoupled release for pyiceberg_core[2].

I'd be happy to help run the release, if no one has started looking into it 
already.

Sung

[1] https://github.com/apache/iceberg-rust/pull/705
[2] https://lists.apache.org/thread/j22o7yktrlddrgkcy7gl88o23nyrgooc

On 2024/09/05 14:06:10 xianjin wrote:
> +1 for this pyiceberg_core as well.
>
> Two cents about the iceberg-rust release schedule: it seems too aggressive to
> release by 2 weeks, monthly(4 weeks) release would be a nice fit.
>
> Sent from my iPhone
>
> > On Sep 5, 2024, at 8:25 PM, Sung Yun wrote:
> >
> > Thank you for driving this Xuanwo!
> >
> > +1 as well, as noted the 0.1.0 pyiceberg_core release will allow PyIceberg
> > to begin integrating with the rust based core and introduce a new feature
> > that the community is looking for.
> >
> > On Thu, Sep 5, 2024 at 6:05 AM Renjie Liu
> > <liurenjie2...@gmail.com> wrote:
> >
> >> +1 for this release.
> >>
> >> As iceberg-rust is under fast development, a shorter release (3-4 weeks)
> >> schedule would benefit users so that they don't need to rely on a snapshot
> >> version.
> >>
> >> On Thu, Sep 5, 2024 at 3:26 PM Xuanwo
> >> <xua...@apache.org> wrote:
> >>
> >>> Hello, everyone
> >>>
> >>> I'm starting this thread to discuss the release of iceberg rust 0.4.0 and
> >>> iceberg pyiceberg_core 0.1.0.
> >>>
> >>> There is no specific reason for this release. I just want to align with the
> >>> two- to three-week release schedule of iceberg rust so users don't have to
> >>> wait long or encounter too many breaking changes at once.
> >>>
> >>> Additionally, the pyiceberg team is awaiting our first release of
> >>> pyiceberg_core 0.1.0 so they can integrate with it, see how it works, and
> >>> explore ways to improve collaboration.
> >>>
> >>> What do you think?
> >>>
> >>> Xuanwo

Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-27 Thread rdb...@gmail.com

I'd support changing the behavior if we still have a way to match the
intent, which is to return true if the table exists in Hive and is an
Iceberg table.
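
To make the two behaviours in this thread concrete, here is a minimal sketch. It is not the HiveCatalog implementation: the metastore and object store are plain dicts, all function and exception names are hypothetical, and the only real-world detail assumed is that Iceberg tables registered in Hive carry a `table_type` parameter whose value is `ICEBERG`, compared case-insensitively. In this toy model a corrupt metadata file surfaces as an exception from the load.

```python
class NoSuchTableError(Exception):
    """No metastore entry, or the entry is not an Iceberg table."""


class CorruptMetadataError(Exception):
    """The Iceberg metadata file behind a table cannot be read."""


# Toy metastore: table name -> Hive table parameters.
METASTORE = {
    "db.good": {"table_type": "ICEBERG", "metadata_location": "s3://bucket/good/v1.json"},
    "db.broken": {"table_type": "iceberg", "metadata_location": "s3://bucket/broken/v1.json"},
    "db.plain_hive": {"table_type": "EXTERNAL_TABLE"},
}

# Toy object store: only the readable metadata file is present.
METADATA_FILES = {"s3://bucket/good/v1.json": {"format-version": 2}}


def is_iceberg(params):
    return params.get("table_type", "").lower() == "iceberg"


def load_table(name):
    params = METASTORE.get(name)
    if params is None or not is_iceberg(params):
        raise NoSuchTableError(name)
    location = params["metadata_location"]
    if location not in METADATA_FILES:
        raise CorruptMetadataError(location)
    return METADATA_FILES[location]


def table_exists_by_loading(name):
    """Existence check that loads both Hive and Iceberg metadata."""
    try:
        load_table(name)
        return True
    except NoSuchTableError:
        return False


def table_exists_hive_only(name):
    """Proposed check: consult Hive metadata only, still requiring an Iceberg table."""
    params = METASTORE.get(name)
    return params is not None and is_iceberg(params)
```

Both checks ignore the plain Hive table. They differ only for `db.broken`: the Hive-only check reports that the table exists, while the loading check fails on the unreadable metadata file.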

On Wed, Nov 27, 2024 at 11:26 AM Szehon Ho  wrote:

> Hm I think the thread got a bit sidetracked by the other question.
>
> The initial proposal by Steve is a performance improvement for
> HiveCatalog's tableExists().  Currently it loads both Hive and Iceberg
> table metadata, and if successful returns true.  The proposal is to load
> from Hive only, and return true if Hive metadata identifies that an Iceberg
> table exists with this name.
>
> Checking corruption of Iceberg's table metadata.json is a side-effect of
> the current behavior, but would not anymore with the proposed change.
> That's the question of the original thread, and so far there's agreement
> that it is not necessarily part of this scope of HiveCatalog's
> tableExists().
>
> At least this is my understanding.
> Thanks,
> Szehon
>
> On Wed, Nov 27, 2024 at 10:56 AM rdb...@gmail.com 
> wrote:
>
>> What kind of corruption are you referring to? I would expect corruption
>> to result in an exception when loading the table, but that the table should
>> still exist. The problem is likely that we determine if a table exists by
>> attempting to load it. We could fix that by not attempting to load the
>> table. I think that's a reasonable solution.
>>
>> On Wed, Nov 27, 2024 at 12:45 AM Manu Zhang 
>> wrote:
>>
>>>> The current behavior's intent is not to check whether the metadata is
>>>> valid, it is to detect whether the table is an Iceberg table.
>>>
>>>
>>> Is there a way to detect this from HiveCatalog without loading the
>>> table?
>>>
>>>
>>> On Wed, Nov 27, 2024 at 2:01 PM Péter Váry wrote:
>>>
>>>> I think we have an agreement, not to change the behavior wrt existing
>>>> non-Iceberg tables, and throw an exception.
>>>>
>>>> Are we also in agreement with the original proposal to return true when
>>>> the table exists but the metadata is somehow corrupted? Note: this is the
>>>> proposed change of behavior why the thread was originally started.
>>>>
>>>> On Tue, Nov 26, 2024, 21:30 rdb...@gmail.com wrote:
>>>>
>>>>> I'd argue against changing this. The current behavior's intent is not
>>>>> to check whether the metadata is valid, it is to detect whether the table
>>>>> is an Iceberg table. It ignores non-Iceberg tables. Changing that behavior
>>>>> would be surprising, especially if we started throwing exceptions.
>>>>>
>>>>> On Fri, Nov 22, 2024 at 2:01 PM Kevin Liu wrote:
>>>>>
>>>>>> > Should add, my personal preference is probably not to change the
>>>>>> > existing behavior for this part
>>>>>>
>>>>>> +1. I realized that this is not a new behavior. The `loadTable`
>>>>>> implementation has this problem too.
>>>>>> It would be good to have a test case specifically for this edge case
>>>>>> and maybe call this out in the documentation.
>>>>>>
>>>>>> Thanks,
>>>>>> Kevin Liu
>>>>>>
>>>>>> On Fri, Nov 22, 2024 at 11:57 AM Szehon Ho wrote:
>>>>>>
>>>>>>> Should add, my personal preference is probably not to change the
>>>>>>> existing behavior for this part (false, if exists a Hive table with same
>>>>>>> name) at the moment, just adding another possibility for consideration.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Szehon
>>>>>>>
>>>>>>> On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho wrote:
>>>>>>>
>>>>>>>> Thanks Kevin and Gabor, this is an interesting discussion.  I guess
>>>>>>>> a third option instead of returning true/false in this case, is to change
>>>>>>>> it to throw an NoSuchIcebergTableException if its a non-Iceberg table,
>>>>>>>> which I think is actually what this pr does?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Szehon
>>>>>>>>
>>>>>>>> On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab wrote:
>>>>>>>>
>>>>>>>>> Hey,
>>>>>>>>>
>>>>>>>>> I think what Kevin says makes sense. However, it would then
>>>>>>>>> confuse the opposite use case of this function. Let's assume that we change
>>>>>>>>> the implementation of tableExists() to not load the table internally:
>>>>>>>>>
>>>>>>>>> if (tableExists(table_name)) {
>>>>>>>>>   table = loadTable(table_name);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> Here, you find that the table exists but when you try to load it
>>>>>>>>> it fails because it's not an Iceberg table. I don't think that any of these
>>>>>>>>> 2 are intuitive. I think the question here is how much an API of the
>>>>>>>>> Iceberg table format should know about the existence of tables in other
>>>>>>>>> formats.
>>>>>>>>>
>>>>>>>>>> If `tableExists` is meant to check for conflicting entries in the HMS
>>>>>>>>>
>>>>>>>>> Another interpretation of calling Catalog.tableExists() on an
>>>>>>>>> Iceberg API is instead "is there such an Iceberg table". TBH, not sure if
>>>>>>>>> any of the 2 approaches are better than the other, I just wanted to show
>>>>>>>>> that there is another side of the c

Re: [ACTION REQUIRED] Removal of v3 artifact actions on December 5th

2024-11-27 Thread Sung Yun

Hi JB and Kevin, thank you for jumping on the chore.

Here's one more PR to bump up the version in iceberg-rust: 
https://github.com/apache/iceberg-rust/pull/725

I assume this didn't show up in the grep.app search since it was recently merged

On 2024/11/26 22:22:36 Kevin Liu wrote:
> We merged the PR[1] to upgrade `upload-artifact` to V4. Thanks, Fokko for
> the review.
> 
> Best,
> Kevin Liu
> 
> [1] https://github.com/apache/iceberg-python/pull/1371
> 
> 
> On Mon, Nov 25, 2024 at 10:36 PM Jean-Baptiste Onofré 
> wrote:
> 
> > Hi Kevin
> >
> > I did a quick search and I have the same feedback as you: only
> > iceberg-python is impacted.
> >
> > Thanks for the PR !
> >
> > Regards
> > JB
> >
> > On Mon, Nov 25, 2024 at 9:03 PM Kevin Liu  wrote:
> > >
> > > Hey folks,
> > >
> > > I did a code search for both `actions/upload-artifact` and
> > `actions/download-artifact` in the related iceberg repos.
> > > *
> > https://grep.app/search?q=actions/upload-artifact%40v3&filter[repo.pattern][0]=apache/iceberg
> > > *
> > https://grep.app/search?q=actions/download-artifact&filter[repo.pattern][0]=apache/iceberg
> > >
> > > Only iceberg-python is affected. Here's the PR to update the relevant
> > action, https://github.com/apache/iceberg-python/pull/1371
> > >
> > > Best,
> > > Kevin Liu
> > >
> > > On Mon, Nov 25, 2024 at 10:36 AM Jacob Wujciak 
> > wrote:
> > >>
> > >> Hello Everyone!
> > >>
> > >> I am writing to inform you of the imminent removal of the v3 artifact
> > >> actions that was announced in [1]. Both actions/upload-artifact@v3*
> > >> and actions/download-artifact@v3* will stop working in 10 days, on
> > >> December 5, 2024! According to a quick code search this project is
> > >> using one of the actions with a v3 tag in at least one of its repos.
> > >>
> > >> There are breaking changes in the usage of the upload action that will
> > >> likely require changes other than bumping the version, please see [2].
> > >> Make sure to update your workflows in time to avoid disruptions!
> > >>
> > >> If you have any questions or need help with the transition I'd
> > >> recommend bui...@apache.org as the place to look for help.
> > >>
> > >> Regards
> > >> Jacob Wujciak-Jens (assignUser)
> > >>
> > >> [1]:
> > https://github.blog/changelog/2024-04-16-deprecation-notice-v3-of-the-artifact-actions/
> > >> [2]:
> > https://github.com/actions/upload-artifact/blob/main/docs/MIGRATION.md
> >
>

Re: Storing catalog directly on object store

2024-11-27 Thread Alex Merced

This is just a quick thought to put out there: If there will be a new
reimagining of a file system catalog, would it be worth adding a
multi-table layer on top?

*As a rough example:*

- At the TOP is a JSON file that is just a mapping of the table name to the
directory where VERSION-HINT would be found (this is so the file is only
updated when tables are created or dropped)
- Then Engine finds the directory and uses the VERSION-HINT like normal to
discover metadata and plan the scan

This way, you have a listing of all your tables, so you don't have to
re-register each table with each tool but still can avoid having to run a
full service on top for basic application

*Governance in this Type of Catalog:*

- You can group different tables into different JSON files/catalogs
- Then file access controls on the JSON file can be used as a simple way to
control user access to groups of tables
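
A rough sketch of the lookup path in the example above, assuming an in-memory dict in place of the object store. The file name `catalog.json`, the table name, the directory layout and the function `resolve_metadata` are all hypothetical; the `metadata/version-hint.text` and `v<N>.metadata.json` names follow the layout used by Hadoop tables, which is an assumption about what VERSION-HINT refers to here.

```python
import json

# Toy object store: path -> file contents.
FILES = {
    # Top-level JSON file: table name -> table directory.
    "catalog.json": json.dumps({"sales.orders": "warehouse/sales/orders"}),
    # Per-table version hint, updated on every commit to that table.
    "warehouse/sales/orders/metadata/version-hint.text": "3\n",
}


def list_tables(files, catalog_path):
    """Table names come straight from the top-level mapping."""
    return sorted(json.loads(files[catalog_path]))


def resolve_metadata(files, catalog_path, table_name):
    """Follow catalog JSON -> table directory -> version hint -> metadata file path."""
    mapping = json.loads(files[catalog_path])
    table_dir = mapping[table_name]
    version = int(files[f"{table_dir}/metadata/version-hint.text"].strip())
    return f"{table_dir}/metadata/v{version}.metadata.json"
```

The top-level file is only read on lookup and only rewritten on create or drop, so ordinary commits touch nothing but the per-table version hint.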


On Wed, Nov 27, 2024 at 8:27 AM Manu Zhang  wrote:

> I think one major issue with current HadoopCatalog is that there's no way
> to manage tables by name. If adding one metadata layer on top of it, we
> need to handle more consistency challenges.
>
> Manu
>
> On Wed, Nov 27, 2024 at 8:03 PM Gabor Kaszab 
> wrote:
>
>> Hi All,
>>
>> Xuanwo, I recall the reasoning against HadoopCatalog was the other way
>> around: even though it is safe to use on HDFS, it is unsafe on object
>> storage. I believe that this gap of functionalities of object stores seems
>> to go away, so for me HadoopCatalog would even make more sense now than
>> before. The name might not be straightforward as it's not just for Hadoop.
>>
>> Regards,
>> Gabor
>>
>>
>> On Wed, Nov 27, 2024 at 9:02 AM Xuanwo  wrote:
>>
>>> Hi
>>>
>>> I believe we still need to deprecate HadoopCatalog since the operation
>>> is still not safe on Hadoop. As raised by Jack Ye before, I suggest we
>>> consider having a StorageCatalog or ObjectStorageCatalog that can only be
>>> used with storage services supporting conditional writes. That would be a
>>> good approach.
>>>
>>> On Wed, Nov 27, 2024, at 15:47, Nikhil Benesch wrote:
>>> > Makes sense! I'd be eager to chat more about this but I'm afraid I
>>> won't be at
>>> > re:Invent. Maybe we plan to circle back after re:Invent, once we see
>>> what AWS
>>> > announces?
>>> >
>>> > On Tue, Nov 26, 2024 at 2:58 PM Jean-Baptiste Onofré 
>>> wrote:
>>> >>
>>> >> Hi Nikhil
>>> >>
>>> >> Thanks for your message, very interesting.
>>> >>
>>> >> I think it would be great to involve the Polaris project here as well,
>>> >> as a REST Catalog implementation.
>>> >> The Polaris community is discussing storage/backend right now, so it
>>> >> would be the perfect timing to consider leveraging S3 conditional
>>> >> writes (as a plugin for instance first).
>>> >>
>>> >> I would be happy to connect and know more about your perspective
>>> about that.
>>> >>
>>> >> Thanks,
>>> >> Regards
>>> >> JB
>>> >>
>>> >> PS: I will be at AWS re:Invent next week, so maybe we can connect
>>> there.
>>> >>
>>> >> On Tue, Nov 26, 2024 at 6:35 PM Nikhil Benesch <
>>> nikhil.bene...@gmail.com> wrote:
>>> >> >
>>> >> > Hi all,
>>> >> >
>>> >> > With Amazon S3 announcing support for the If-Match header yesterday
>>> [0], all the
>>> >> > major object store implementations now support a compare-and-swap
>>> operation.
>>> >> >
>>> >> > As far as I can tell, this opens up the possibility of storing
>>> Iceberg
>>> >> > catalogs directly on object storage, without the need for a
>>> separate metastore,
>>> >> > and without violating any of Iceberg's ACID guarantees.
>>> >> >
>>> >> > It seems the immediate next step is to build an independent Java or
>>> REST catalog
>>> >> > backend to prove this concept out. Long term, though, the ideal
>>> would be to
>>> >> > have such a catalog backend be a first class citizen in the Iceberg
>>> project.
>>> >> >
>>> >> > Is anyone else in the Iceberg community barking up this tree? I'm a
>>> long term
>>> >> > Iceberg enthusiast, but new to the community. I'd very much
>>> appreciate any
>>> >> > pointers to current or past discussions on the topic. So far all
>>> I've been
>>> >> > able to turn up is some light chatter from myself and others on
>>> Bluesky and
>>> >> > Hacker News ([1][2][3]).
>>> >> >
>>> >> > Cheers,
>>> >> > Nikhil
>>> >> >
>>> >> > [0]:
>>> https://aws.amazon.com/about-aws/whats-new/2024/11/amazon-s3-functionality-conditional-writes/
>>> >> > [1]:
>>> https://bsky.app/profile/benesch.bsky.social/post/3lauesxg3ic2c
>>> >> > [2]:
>>> https://bsky.app/profile/eatonphil.bsky.social/post/3lbskq3jwk22e
>>> >> > [3]: https://news.ycombinator.com/item?id=42240370
>>>
>>> --
>>> Xuanwo
>>>
>>> https://xuanwo.io/
>>>
>>

-- 

Alex Merced
Senior Tech Evangelist, Dremio

Re: Storing catalog directly on object store

2024-11-27 Thread Alex Merced

Ignore the last email, just re-read the proposal earlier in the email chain

On Wed, Nov 27, 2024 at 11:37 AM Alex Merced  wrote:

> This is just a quick thought to put out there: If there will be a new
> reimagining of a file system catalog, would it be worth adding a
> multi-table layer on top?
>
> *As a rough example:*
>
> - At the TOP is a JSON file that is just a mapping of the table name to
> the directory where VERSION-HINT would be found (this is so the file is
> only updated when tables are created or dropped)
> - Then Engine finds the directory and uses the VERSION-HINT like normal to
> discover metadata and plan the scan
>
> This way, you have a listing of all your tables, so you don't have to
> re-register each table with each tool but still can avoid having to run a
> full service on top for basic application
>
> *Governance in this Type of Catalog:*
>
> - You can group different tables into different JSON files/catalogs
> - Then file access controls on the JSON file can be used as a simple way
> to control user access to groups of tables
>
>
> On Wed, Nov 27, 2024 at 8:27 AM Manu Zhang 
> wrote:
>
>> I think one major issue with current HadoopCatalog is that there's no way
>> to manage tables by name. If adding one metadata layer on top of it, we
>> need to handle more consistency challenges.
>>
>> Manu
>>
>> On Wed, Nov 27, 2024 at 8:03 PM Gabor Kaszab 
>> wrote:
>>
>>> Hi All,
>>>
>>> Xuanwo, I recall the reasoning against HadoopCatalog was the other way
>>> around: even though it is safe to use on HDFS, it is unsafe on object
>>> storage. I believe that this gap of functionalities of object stores seems
>>> to go away, so for me HadoopCatalog would even make more sense now than
>>> before. The name might not be straightforward as it's not just for Hadoop.
>>>
>>> Regards,
>>> Gabor
>>>
>>>
>>> On Wed, Nov 27, 2024 at 9:02 AM Xuanwo  wrote:
>>>
 Hi

 I believe we still need to deprecate HadoopCatalog since the operation
 is still not safe on Hadoop. As raised by Jack Ye before, I suggest we
 consider having a StorageCatalog or ObjectStorageCatalog that can only be
 used with storage services supporting conditional writes. That would be a
 good approach.

 On Wed, Nov 27, 2024, at 15:47, Nikhil Benesch wrote:
 > Makes sense! I'd be eager to chat more about this but I'm afraid I
 won't be at
 > re:Invent. Maybe we plan to circle back after re:Invent, once we see
 what AWS
 > announces?
 >
 > On Tue, Nov 26, 2024 at 2:58 PM Jean-Baptiste Onofré 
 wrote:
 >>
 >> Hi Nikhil
 >>
 >> Thanks for your message, very interesting.
 >>
 >> I think it would be great to involve the Polaris project here as
 well,
 >> as a REST Catalog implementation.
 >> The Polaris community is discussing storage/backend right now, so it
 >> would be the perfect timing to consider leveraging S3 conditional
 >> writes (as a plugin for instance first).
 >>
 >> I would be happy to connect and know more about your perspective
 about that.
 >>
 >> Thanks,
 >> Regards
 >> JB
 >>
 >> PS: I will be at AWS re:Invent next week, so maybe we can connect
 there.
 >>
 >> On Tue, Nov 26, 2024 at 6:35 PM Nikhil Benesch <
 nikhil.bene...@gmail.com> wrote:
 >> >
 >> > Hi all,
 >> >
 >> > With Amazon S3 announcing support for the If-Match header
 yesterday [0], all the
 >> > major object store implementations now support a compare-and-swap
 operation.
 >> >
 >> > As far as I can tell, this opens up the possibility of storing
 Iceberg
 >> > catalogs directly on object storage, without the need for a
 separate metastore,
 >> > and without violating any of Iceberg's ACID guarantees.
 >> >
 >> > It seems the immediate next step is to build an independent Java
 or REST catalog
 >> > backend to prove this concept out. Long term, though, the ideal
 would be to
 >> > have such a catalog backend be a first class citizen in the
 Iceberg project.
 >> >
 >> > Is anyone else in the Iceberg community barking up this tree? I'm
 a long term
 >> > Iceberg enthusiast, but new to the community. I'd very much
 appreciate any
 >> > pointers to current or past discussions on the topic. So far all
 I've been
 >> > able to turn up is some light chatter from myself and others on
 Bluesky and
 >> > Hacker News ([1][2][3]).
 >> >
 >> > Cheers,
 >> > Nikhil
 >> >
 >> > [0]:
 https://aws.amazon.com/about-aws/whats-new/2024/11/amazon-s3-functionality-conditional-writes/
 >> > [1]:
 https://bsky.app/profile/benesch.bsky.social/post/3lauesxg3ic2c
 >> > [2]:
 https://bsky.app/profile/eatonphil.bsky.social/post/3lbskq3jwk22e
 >> > [3]: https://news.ycombinator.com/item?id=42240370

 --
 Xuanwo

 https://xuanwo.io/

>>>

Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-27 Thread rdb...@gmail.com

What kind of corruption are you referring to? I would expect corruption to
result in an exception when loading the table, but that the table should
still exist. The problem is likely that we determine if a table exists by
attempting to load it. We could fix that by not attempting to load the
table. I think that's a reasonable solution.
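
To make the distinction concrete, here is a minimal sketch of an existence check that works by loading. It does not use the Iceberg API: the toy catalog, `loadTable`, `tableExistsByLoading`, `probe`, and both exception classes are hypothetical stand-ins written for this illustration. Assuming only a "no such table" failure is translated to `false`, a table with unreadable metadata shows up as an error rather than as an existing table.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: these types are hypothetical and are not Iceberg classes.
final class LoadBasedExistsSketch {
  static final class NoSuchTableException extends RuntimeException {
    NoSuchTableException(String name) {
      super("No such table: " + name);
    }
  }

  static final class CorruptMetadataException extends RuntimeException {
    CorruptMetadataException(String name) {
      super("Cannot read metadata for: " + name);
    }
  }

  // Toy catalog: table name -> whether its metadata file is readable.
  static Map<String, Boolean> demoCatalog() {
    Map<String, Boolean> catalog = new HashMap<>();
    catalog.put("db.good", true);
    catalog.put("db.corrupt", false);
    return catalog;
  }

  static String loadTable(Map<String, Boolean> catalog, String name) {
    Boolean readable = catalog.get(name);
    if (readable == null) {
      throw new NoSuchTableException(name);
    }
    if (!readable) {
      throw new CorruptMetadataException(name);
    }
    return "metadata for " + name;
  }

  // Existence answered by loading: only "no such table" maps to false,
  // so any other load failure escapes to the caller.
  static boolean tableExistsByLoading(Map<String, Boolean> catalog, String name) {
    try {
      loadTable(catalog, name);
      return true;
    } catch (NoSuchTableException e) {
      return false;
    }
  }

  // Summarizes the outcome as a string so the three cases are easy to compare.
  static String probe(String name) {
    try {
      return tableExistsByLoading(demoCatalog(), name) ? "exists" : "missing";
    } catch (CorruptMetadataException e) {
      return "error";
    }
  }

  public static void main(String[] args) {
    for (String name : new String[] {"db.good", "db.absent", "db.corrupt"}) {
      System.out.println(name + " -> " + probe(name));
    }
  }
}
```

In this toy model the corrupt table is neither reported as existing nor as missing, which is the ambiguity that goes away once the check stops loading the table.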

On Wed, Nov 27, 2024 at 12:45 AM Manu Zhang  wrote:

> The current behavior's intent is not to check whether the metadata is
>> valid, it is to detect whether the table is an Iceberg table.
>
>
> Is there a way to detect this from HiveCatalog without loading the table?
>
>
> On Wed, Nov 27, 2024 at 2:01 PM Péter Váry 
> wrote:
>
>> I think we have an agreement, not to change the behavior wrt existing
>> non-Iceberg tables, and throw an exception.
>>
>> Are we also in agreement with the original proposal to return true when
>> the table exists but the metadata is somehow corrupted? Note: this is the
>> proposed change of behavior why the thread was originally started.
>>
>> On Tue, Nov 26, 2024, 21:30 rdb...@gmail.com  wrote:
>>
>>> I'd argue against changing this. The current behavior's intent is not to
>>> check whether the metadata is valid, it is to detect whether the table is
>>> an Iceberg table. It ignores non-Iceberg tables. Changing that behavior
>>> would be surprising, especially if we started throwing exceptions.
>>>
>>> On Fri, Nov 22, 2024 at 2:01 PM Kevin Liu  wrote:
>>>
 > Should add, my personal preference is probably not to change the
 existing behavior for this part

 +1. I realized that this is not a new behavior. The `loadTable`
 implementation has this problem too.
 It would be good to have a test case specifically for this edge case
 and maybe call this out in the documentation.

 Thanks,
 Kevin Liu

 On Fri, Nov 22, 2024 at 11:57 AM Szehon Ho 
 wrote:

> Should add, my personal preference is probably not to change the
> existing behavior for this part (false, if exists a Hive table with same
> name) at the moment, just adding another possibility for consideration.
>
> Thanks
> Szehon
>
> On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho 
> wrote:
>
>> Thanks Kevin and Gabor, this is an interesting discussion.  I guess a
>> third option, instead of returning true/false in this case, is to change
>> it to throw a NoSuchIcebergTableException if it's a non-Iceberg table,
>> which I think is actually what this PR does?
>>
>> Thanks
>> Szehon
>>
>> On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab
>>  wrote:
>>
>>> Hey,
>>>
>>> I think what Kevin says makes sense. However, it would then confuse
>>> the opposite use case of this function. Let's assume that we change the
>>> implementation of tableExists() to not load the table internally:
>>>
>>> if (tableExists(table_name)) {
>>> table = loadTable(table_name);
>>> }
>>>
>>> Here, you find that the table exists, but when you try to load it, it
>>> fails because it's not an Iceberg table. I don't think that either of
>>> these two is intuitive. I think the question here is how much an API of
>>> the Iceberg table format should know about the existence of tables in
>>> other formats.
>>>
>>> If `tableExists` is meant to check for conflicting entries in the HMS
>>>
>>> Another interpretation of calling Catalog.tableExists() on an
>>> Iceberg API is instead "is there such an Iceberg table". TBH, I'm not
>>> sure if either of the two approaches is better than the other; I just
>>> wanted to show that there is another side of the coin :)
>>>
>>> Regards,
>>> Gabor
>>>
>>> On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu 
>>> wrote:
>>>
 Hi Steve,

 This makes sense to me. The semantics of `tableExists` focus on
 whether a table's name exists in the catalog. For the Hive catalog,
 checking the HMS entry should be sufficient.

 I do have a question about usage, though. Typically, I would use `
 tableExists` like this:

 ```
 if (!tableExists(table_name)) {
 table = createTable(table_name);
 }
 ```
 What happens when a Hive table with the same name already exists in
 the catalog? In the current implementation, `tableExists` would return
 `false` because `HiveOperationsBase.validateTableIsIceberg` throws a
 `NoSuchTableException`.
 This would cause the code above to attempt to create the table,
 only to fail since the name already exists in the HMS.
 If `tableExists` is meant to check for conflicting entries in the
 HMS, perhaps it should return true even when a Hive table with the same
 name exists.

 I’d love to hear your thoughts on this.

 Bes

Re: [DISCUSS] Hive Support

2024-11-27 Thread Fokko Driesprong

Hey Cheng,

Thanks for the suggestion. The nightly snapshots are available:
https://repository.apache.org/content/groups/snapshots/org/apache/iceberg/iceberg-core/,
which might help when working on features that are not released yet (e.g.
nanosecond timestamps). Besides that, we should run RCs against Hive to
check if everything works as expected.

I'm leaning toward removing Hive 2 and 3 as well.

Kind regards,
Fokko

Op wo 27 nov 2024 om 20:05 schreef rdb...@gmail.com :

> I think that we should remove Hive 2 and Hive 3. We already agreed to
> remove Hive 2, but Hive 3 is not compatible with the project anymore and is
> already EOL and will not see a release to update it so that it can be
> compatible. Anyone using the existing Hive 3 support should be able to
> continue using older releases.
>
> In general, I think it's a good idea to let people use older releases when
> these situations happen. It is difficult for the project to continue to
> support libraries that are EOL and I don't think there's a great
> justification for it, considering Iceberg support in Hive 4 is native and
> much better!
>
> On Wed, Nov 27, 2024 at 7:12 AM Cheng Pan  wrote:
>
>> That said, it would be helpful if they continue running
>> tests against the latest stable Hive releases to ensure that any
>> changes don’t unintentionally break something for Hive, which would be
>> beyond our control.
>>
>>
>> I believe we should continue maintaining a Hive Iceberg runtime test
>> suite with the latest version of Hive in the Iceberg repository.
>>
>>
>> i think we can keep some basic Hive4 tests in iceberg repo
>>
>>
>> Instead of running basic tests in the Iceberg repo, would it make more
>> sense to have Iceberg publish daily snapshot jars to Nexus, and have a daily
>> CI job in Hive consume those jars and run the full Iceberg tests?
>>
>> Thanks,
>> Cheng Pan
>>
>>

Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-27 Thread Szehon Ho

Hm I think the thread got a bit sidetracked by the other question.

The initial proposal by Steve is a performance improvement for
HiveCatalog's tableExists().  Currently it loads both Hive and Iceberg
table metadata, and if successful returns true.  The proposal is to load
from Hive only, and return true if Hive metadata identifies that an Iceberg
table exists with this name.
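
As a rough illustration of the proposal (not the actual HiveCatalog code), the check could look like the sketch below. The in-memory map standing in for the Hive Metastore, the class name, and `demoHms` are hypothetical; the `table_type` parameter key with the value `iceberg` (compared case-insensitively) and the `metadata_location` key are assumed here to be how an Iceberg table is marked in Hive metadata. The point is that the metadata file is never opened.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a map of table name -> table parameters stands in for the HMS.
final class HmsOnlyExistsSketch {
  // Assumed marker that identifies an Iceberg table in Hive metadata.
  static final String TABLE_TYPE_PROP = "table_type";
  static final String ICEBERG_VALUE = "iceberg";

  // True only if the entry exists in Hive and is marked as an Iceberg table.
  // The file behind metadata_location is never read.
  static boolean tableExists(Map<String, Map<String, String>> hms, String name) {
    Map<String, String> params = hms.get(name);
    if (params == null) {
      return false;
    }
    return ICEBERG_VALUE.equalsIgnoreCase(params.get(TABLE_TYPE_PROP));
  }

  static Map<String, Map<String, String>> demoHms() {
    Map<String, Map<String, String>> hms = new HashMap<>();

    // Iceberg table whose metadata file may be missing or corrupt.
    Map<String, String> events = new HashMap<>();
    events.put(TABLE_TYPE_PROP, "ICEBERG");
    events.put("metadata_location", "s3://bucket/db/events/metadata/broken.metadata.json");
    hms.put("db.events", events);

    // Plain Hive table: present in the HMS but without the Iceberg marker.
    hms.put("db.legacy", new HashMap<>());
    return hms;
  }

  public static void main(String[] args) {
    Map<String, Map<String, String>> hms = demoHms();
    for (String name : new String[] {"db.events", "db.legacy", "db.absent"}) {
      System.out.println(name + " -> " + tableExists(hms, name));
    }
  }
}
```

Under these assumptions a plain Hive table still yields `false`, which keeps the existing behavior for non-Iceberg tables, while the Iceberg table yields `true` regardless of the state of its metadata file.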

Checking for corruption of Iceberg's table metadata.json is a side effect of
the current behavior, but it would no longer happen with the proposed change.
That's the question of the original thread, and so far there's agreement
that it is not necessarily part of the scope of HiveCatalog's
tableExists().

At least this is my understanding.
Thanks,
Szehon

On Wed, Nov 27, 2024 at 10:56 AM rdb...@gmail.com  wrote:

> What kind of corruption are you referring to? I would expect corruption to
> result in an exception when loading the table, but that the table should
> still exist. The problem is likely that we determine if a table exists by
> attempting to load it. We could fix that by not attempting to load the
> table. I think that's a reasonable solution.
>
> On Wed, Nov 27, 2024 at 12:45 AM Manu Zhang 
> wrote:
>
>> The current behavior's intent is not to check whether the metadata is
>>> valid, it is to detect whether the table is an Iceberg table.
>>
>>
>> Is there a way to detect this from HiveCatalog without loading the table?
>>
>>
>> On Wed, Nov 27, 2024 at 2:01 PM Péter Váry 
>> wrote:
>>
>>> I think we have an agreement, not to change the behavior wrt existing
>>> non-Iceberg tables, and throw an exception.
>>>
>>> Are we also in agreement with the original proposal to return true when
>>> the table exists but the metadata is somehow corrupted? Note: this is the
>>> proposed change of behavior why the thread was originally started.
>>>
>>> On Tue, Nov 26, 2024, 21:30 rdb...@gmail.com  wrote:
>>>
 I'd argue against changing this. The current behavior's intent is not
 to check whether the metadata is valid, it is to detect whether the table
 is an Iceberg table. It ignores non-Iceberg tables. Changing that behavior
 would be surprising, especially if we started throwing exceptions.

 On Fri, Nov 22, 2024 at 2:01 PM Kevin Liu 
 wrote:

> > Should add, my personal preference is probably not to change the
> existing behavior for this part
>
> +1. I realized that this is not a new behavior. The `loadTable`
> implementation has this problem too.
> It would be good to have a test case specifically for this edge case
> and maybe call this out in the documentation.
>
> Thanks,
> Kevin Liu
>
> On Fri, Nov 22, 2024 at 11:57 AM Szehon Ho 
> wrote:
>
>> Should add, my personal preference is probably not to change the
>> existing behavior for this part (false, if exists a Hive table with same
>> name) at the moment, just adding another possibility for consideration.
>>
>> Thanks
>> Szehon
>>
>> On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho 
>> wrote:
>>
>>> Thanks Kevin and Gabor, this is an interesting discussion.  I guess
>>> a third option, instead of returning true/false in this case, is to
>>> change it to throw a NoSuchIcebergTableException if it's a non-Iceberg
>>> table, which I think is actually what this PR does?
>>>
>>> Thanks
>>> Szehon
>>>
>>> On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab
>>>  wrote:
>>>
 Hey,

 I think what Kevin says makes sense. However, it would then confuse
 the opposite use case of this function. Let's assume that we change the
 implementation of tableExists() to not load the table internally:

 if (tableExists(table_name)) {
 table = loadTable(table_name);
 }

 Here, you find that the table exists, but when you try to load it, it
 fails because it's not an Iceberg table. I don't think that either of
 these two is intuitive. I think the question here is how much an API of
 the Iceberg table format should know about the existence of tables in
 other formats.

 If `tableExists` is meant to check for conflicting entries in the HMS

 Another interpretation of calling Catalog.tableExists() on an
 Iceberg API is instead "is there such an Iceberg table". TBH, I'm not
 sure if either of the two approaches is better than the other; I just
 wanted to show that there is another side of the coin :)

 Regards,
 Gabor

 On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu 
 wrote:

> Hi Steve,
>
> This makes sense to me. The semantics of `tableExists` focus on
> whether a table's name exists in the catalog. For the Hive catalog,
> checking the HMS entry should b

Re: [Discuss] Document Snapshot Summary Optional Fields for Standardization

2024-11-27 Thread Szehon Ho

This makes sense to me generally. I've tried a few times to search the
spec for a list of possible snapshot summary properties, and was a bit
surprised not to find them there.  So I think this would be a nice addition.

I'm curious if there's any historical reason it's not been included in the
spec.

Thanks
Szehon

On Wed, Nov 27, 2024 at 10:55 AM Kevin Liu  wrote:

> Thanks for driving this Honah!
>
> It's important to have a consistent naming scheme so that we don't need to
> worry about edge cases when using multiple engines, and possibly have to
> deal with migrations.
>
> Also, since users can store arbitrary key/value pairs in the summary
> property, it's good to document the currently used properties to avoid
> collision.
>
> I like the proposal to document all properties in a "snapshot summary"
> table, this will ensure a centralized place to view all possible key/value
> pairs, similar to how FileIO configuration is handled in iceberg-python.
> Other implementations can use this table as a reference.
>
>  > This approach offers flexibility, as new fields can be added through
> documentation updates without requiring specification changes.
> This will save a lot of effort since specification changes require
> greater scrutiny.
>
> > summary details would not be located near the Snapshot section, which
> explains the summary field.
> We can link the table to the Snapshot section.
>
>
> Would love to hear others' thoughts on this.
>
> Best,
> Kevin Liu
>
> On Tue, Nov 26, 2024 at 2:50 PM Honah J.  wrote:
>
>> Hi everyone,
>>
>> I’d like to propose an addition to the table specification to document
>> optional fields in the snapshot summary.
>>
>> Currently, the snapshot summary includes a required operation field and
>> various optional fields. While these optional fields, such as metrics and
>> partition-level summaries, are supported by the Java and Python
>> implementations, they are not officially documented. This creates risks of
>> inconsistency as other implementations and engines adopt and interact with
>> these fields.
>>
>> I propose adding a new section to the table specification to document
>> these optional fields, ensuring consistent naming conventions and reducing
>> ambiguity across implementations. While this is the primary proposal, it
>> may also be worth discussing whether documenting these fields separately in
>> Docs/Table would provide additional flexibility for future updates.
>>
>> I’d love to hear your thoughts, suggestions, or concerns about this
>> proposal.
>>
>> Looking forward to the discussion!
>>
>> Links
>>
>>- GitHub tracking issue:
>>https://github.com/apache/iceberg/issues/11659
>>- Proposal:
>>
>> https://docs.google.com/document/d/1Gt1ZOXVXK60IGdlmt4QlyRzaZ1iCVyYUBfMJCsiz14I/edit?usp=sharing
>>- PR: https://github.com/apache/iceberg/pull/11660
>>
>>
>> Best regards,
>> Honah
>>
>

Re: [DISCUSS] Deprecate embedded manifests

2024-11-27 Thread Fokko Driesprong

I'd say emit deprecation warnings for a reasonable amount of time (at least
until v2.0 of the Java implementation), as shown in the PR, and then remove
the code path at some point. If you still have snapshots around with
embedded manifests, then you should use an older version of the Java
implementation (PyIceberg, Rust, etc. don't support it anyway).

Kind regards,
Fokko

Op wo 27 nov 2024 om 19:00 schreef rdb...@gmail.com :

> I think it's reasonable to mark it deprecated in the spec, especially
> because we don't allow it in v2. But I'm not sure how that would allow us
> to remove code paths associated with it. If it is allowed by an older and
> supported version of the spec, then how can we safely remove the code paths
> that read it?
>
> On Fri, Nov 22, 2024 at 2:56 AM Fokko Driesprong  wrote:
>
>> Hey Ryan,
>>
>> The goal of the deprecation is to keep other implementations from
>> producing it. PyIceberg, for example, does not support this, and I think it
>> would be good to avoid having others (Rust, Go, etc.) support it. Regarding
>> the removal, Amogh expressed the same concern on the PR.
>>
>> In my quest to make the Java implementation follow the spec as closely as
>> possible, I noticed that we use a DummyFileIO to mimic a ManifestList. I
>> ran into this when turning `503: added_snapshot_id` into a required field.
>> So the value is in removing code paths, as Szehon pointed out. When removing
>> support for the embedded manifest list, we can remove all that logic and
>> keep the codebase nice and tidy.
>>
>> It would be good to start the discussion of deprecating support for older
>> formats at some point. However, for a V2 reader it is fairly easy to
>> project V1 metadata as V2, except when embedded manifests are being used.
>> Marking this kind of oddity as deprecated will, I think, enable readers to
>> support reading older versions for a longer time. My suggestion would be to
>> mark the field as deprecated and revisit the actual removal. I've marked it
>> for removal in Java 2.0 for now to give it enough time.
>>
>> Kind regards,
>> Fokko
>>
>>
>>
>> Op do 21 nov 2024 om 20:52 schreef rdb...@gmail.com :
>>
>>> Can we safely deprecate and remove this? The manifest list is required
>>> in v2, but the spec has stated for a long time that v1 tables can use
>>> manifests rather than a manifest list. It’s unlikely, but it would be
>>> valid for other implementations to produce it.
>>>
>>> I would understand if other implementations chose to fail tables that
>>> don’t have a manifest list to avoid adding code to handle manifests,
>>> but I don’t think that there’s much value in removing support from the Java
>>> implementation.
>>>
>>> Instead, what about discussing how to deprecate support for older format
>>> versions? That seems like the main issue here. Once the majority of
>>> implementations move to newer versions, we would like to deprecate the old
>>> ones.
>>>
>>> On Thu, Nov 21, 2024 at 11:01 AM Szehon Ho 
>>> wrote:
>>>
 +1, great to have less possible paths.

 Thanks
 Szehon

 On Thu, Nov 21, 2024 at 10:33 AM Steve Zhang
  wrote:

> +1 to deprecate
>
> Thanks,
> Steve Zhang
>
>
>
> On Nov 19, 2024, at 3:32 AM, Fokko Driesprong 
> wrote:
>
> Hi everyone,
>
> I would like to propose deprecating embedded manifests. These were used
> before the manifest list was introduced, but I don't think they have been
> used since the project was open-sourced, and it would be good to
> officially deprecate them in the spec. They are only supported by Iceberg
> Java today, and I haven't seen any requests for PyIceberg to add support
> for them.
>
> Any questions or concerns about deprecating the embedded manifests?
>
> Kind regards,
> Fokko Driesprong
>
>
>

Re: [ACTION REQUIRED] Removal of v3 artifact actions on December 5th

2024-11-27 Thread Kevin Liu

Thanks Sung. I assumed grep.app would continuously index all GitHub repos,
but it seems to be missing a few.

For completeness, I went through the GitHub search feature, using
`org:apache` with both `upload-artifact@v3` and `download-artifact@v3`.
* https://github.com/search?q=org%3Aapache%20upload-artifact%40v3&type=code
* https://github.com/search?q=org%3Aapache+download-artifact%40v3&type=code

Looks like `iceberg-rust` is the only place we missed.

Best,
Kevin Liu

On Wed, Nov 27, 2024 at 5:45 AM Sung Yun  wrote:

> Hi JB and Kevin, thank you for jumping on the chore.
>
> Here's one more PR to bump up the version in iceberg-rust:
> https://github.com/apache/iceberg-rust/pull/725
>
> I assume this didn't show up in the grep.app search since it was recently
> merged
>
> On 2024/11/26 22:22:36 Kevin Liu wrote:
> > We merged the PR[1] to upgrade `upload-artifact` to V4. Thanks, Fokko for
> > the review.
> >
> > Best,
> > Kevin Liu
> >
> > [1] https://github.com/apache/iceberg-python/pull/1371
> >
> >
> > On Mon, Nov 25, 2024 at 10:36 PM Jean-Baptiste Onofré 
> > wrote:
> >
> > > Hi Kevin
> > >
> > > I did a quick search and I have the same feedback as you: only
> > > iceberg-python is impacted.
> > >
> > > Thanks for the PR !
> > >
> > > Regards
> > > JB
> > >
> > > On Mon, Nov 25, 2024 at 9:03 PM Kevin Liu 
> wrote:
> > > >
> > > > Hey folks,
> > > >
> > > > I did a code search for both `actions/upload-artifact` and
> > > `actions/download-artifact` in the related iceberg repos.
> > > > *
> > >
> https://grep.app/search?q=actions/upload-artifact%40v3&filter[repo.pattern][0]=apache/iceberg
> > > > *
> > >
> https://grep.app/search?q=actions/download-artifact&filter[repo.pattern][0]=apache/iceberg
> > > >
> > > > Only iceberg-python is affected. Here's the PR to update the relevant
> > > action, https://github.com/apache/iceberg-python/pull/1371
> > > >
> > > > Best,
> > > > Kevin Liu
> > > >
> > > > On Mon, Nov 25, 2024 at 10:36 AM Jacob Wujciak <
> assignu...@apache.org>
> > > wrote:
> > > >>
> > > >> Hello Everyone!
> > > >>
> > > >> I am writing to inform you of the imminent removal of the v3
> artifact
> > > >> actions that was announced in [1]. Both actions/upload-artifact@v3*
> > > >> and actions/download-artifact@v3* will stop working in 10 days, on
> > > >> December 5, 2024! According to a quick code search this project is
> > > >> using one of the actions with a v3 tag in at least one of its repos.
> > > >>
> > > >> There are breaking changes in the usage of the upload action that
> will
> > > >> likely require changes other than bumping the version, please see
> [2].
> > > >> Make sure to update your workflows in time to avoid disruptions!
> > > >>
> > > >> If you have any questions or need help with the transition I'd
> > > >> recommend bui...@apache.org as the place to look for help.
> > > >>
> > > >> Regards
> > > >> Jacob Wujciak-Jens (assignUser)
> > > >>
> > > >> [1]:
> > >
> https://github.blog/changelog/2024-04-16-deprecation-notice-v3-of-the-artifact-actions/
> > > >> [2]:
> > > https://github.com/actions/upload-artifact/blob/main/docs/MIGRATION.md
> > >
> >
>

Re: [DISCUSS] iceberg rust 0.4.0 and iceberg pyiceberg_core 0.1.0 release

2024-11-27 Thread Kevin Liu

Thanks for driving this, Sung! I'm +1 to release both iceberg-rust and
pyiceberg_core. It's very exciting to see pyiceberg_core and its
integration with PyIceberg.
It makes sense to decouple pyiceberg_core from iceberg-rust since the two
"projects" are on different tracks. We'd want to release pyiceberg_core
features independent of iceberg-rust features.

Please let me know if there's anything I can do to help.

Best,
Kevin Liu

On Wed, Nov 27, 2024 at 6:13 AM Sung Yun  wrote:

> Hi folks, it's been some time since we've done an Iceberg Rust release,
> and we've finally set up the ghactions workflow[1] that will allow us to
> build and publish an abi3 compatible wheel to PyPI.
>
> If we are still +1 for the release (both iceberg-rust and pyiceberg_core),
> I think it'll be awesome to get this release out soon as it will help the
> PyIceberg community test out the pyiceberg_core binding in preparation for
> the next release.
>
> Another option would be to introduce a workflow_dispatch trigger to the
> python_release.yml and run a decoupled release for pyiceberg_core[2]
>
> I'd be happy to help run the release, if no one has started looking into
> it already.
>
> Sung
>
> [1] https://github.com/apache/iceberg-rust/pull/705
> [2] https://lists.apache.org/thread/j22o7yktrlddrgkcy7gl88o23nyrgooc
>
> On 2024/09/05 14:06:10 xianjin wrote:
> > +1 for this pyiceberg_core as well.
> >
> > Two cents about the iceberg-rust release schedule: it seems too
> > aggressive to release every 2 weeks; a monthly (4 weeks) release would
> > be a nice fit.
> >
> > > On Sep 5, 2024, at 8:25 PM, Sung Yun  wrote:
> > >
> > > Thank you for driving this Xuanwo!
> > >
> > > +1 as well, as noted the 0.1.0 pyiceberg_core release will allow
> > > PyIceberg to begin integrating with the rust based core and introduce
> > > a new feature that the community is looking for.
> > >
> > > On Thu, Sep 5, 2024 at 6:05 AM Renjie Liu <liurenjie2...@gmail.com>
> > > wrote:
> > >
> > >> +1 for this release.
> > >>
> > >> As iceberg-rust is under fast development, a shorter release (3-4
> > >> weeks) schedule would benefit users so that they don't need to rely
> > >> on a snapshot version.
> > >>
> > >> On Thu, Sep 5, 2024 at 3:26 PM Xuanwo <xua...@apache.org> wrote:
> > >>
> > >>> Hello, everyone
> > >>>
> > >>> I'm starting this thread to discuss the release of iceberg rust
> > >>> 0.4.0 and iceberg pyiceberg_core 0.1.0.
> > >>>
> > >>> There is no specific reason for this release. I just want to align
> > >>> with the two- to three-week release schedule of iceberg rust so
> > >>> users don't have to wait long or encounter too many breaking
> > >>> changes at once.
> > >>>
> > >>> Additionally, the pyiceberg team is awaiting our first release of
> > >>> pyiceberg_core 0.1.0 so they can integrate with it, see how it
> > >>> works, and explore ways to improve collaboration.
> > >>>
> > >>> What do you think?
> > >>>
> > >>> Xuanwo
> > >>>
>

Re: [DISCUSS] Hive Support

2024-11-27 Thread Cheng Pan

> That said, it would be helpful if they continue running
> tests against the latest stable Hive releases to ensure that any
> changes don’t unintentionally break something for Hive, which would be
> beyond our control.

> I believe we should continue maintaining a Hive Iceberg runtime test suite 
> with the latest version of Hive in the Iceberg repository.


> i think we can keep some basic Hive4 tests in iceberg repo


Instead of running basic tests in the Iceberg repo, would it make more sense
to have Iceberg publish daily snapshot jars to Nexus, and have a daily CI job
in Hive consume those jars and run the full Iceberg tests?

Thanks,
Cheng Pan

Re: Storing catalog directly on object store

2024-11-27 Thread rdb...@gmail.com

> We deprecated this recently and we don't have to deprecate it if object
stores support atomic operations like this.

I disagree because this misses many of the reasons for deprecation. It
isn't just that S3 didn't support a `putIfAbsent` operation. Other object
stores did and there are still several problems with this approach. The
fundamental issue is that it is attempting to solve problems at the wrong
level.

One of the reasons why Iceberg exists is that we saw people doing the same
thing with Parquet. People were trying to solve problems with their tables
by attempting to modify Parquet in wacky ways, like wanting to replace
the footer to make schema changes. Schema evolution needed to be solved at
the table level and in this community we've always tried to solve problems
more directly and elegantly by addressing them at the right layer of the
stack.

Iceberg tables scale up existing atomic operations to make transactional
guarantees on very large tables. Object stores and file systems aren't well
suited for this task. Just like they were not sufficient to provide
transactional guarantees across files and partitions, the primitives you
can use aren't sufficient for a database. Storage capabilities are also not
the right place to deliver other catalog features, like basic CRUD
operations.

The addition of `putIfAbsent` to S3 doesn't support transactions where you
need to modify multiple tables and it also doesn't address cases like the
need to atomically rename and delete tables. Schemes that use `putIfAbsent`
also rely either on consistent listing of a large prefix or on maintaining a
version-hint file. That version-hint file can be out of date, so even with
one you still need to list or iteratively attempt to read metadata files to
determine the latest.
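
For anyone who wants to see the mechanics, here is a minimal single-table sketch of such a scheme. It is an assumption-laden toy, not any catalog's real implementation: a `ConcurrentHashMap` stands in for an object store with a conditional create, and the class, its methods, and the `vN.metadata.json` / `version-hint.text` key layout are chosen only for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative only: an in-memory map plays the role of the object store.
final class PutIfAbsentCommitSketch {
  private final ConcurrentMap<String, String> store = new ConcurrentHashMap<>();

  private static String metadataKey(String table, int version) {
    return table + "/metadata/v" + version + ".metadata.json";
  }

  private static String hintKey(String table) {
    return table + "/metadata/version-hint.text";
  }

  // A commit wins only if nobody has created this version's file yet.
  boolean commit(String table, int newVersion, String metadata) {
    return store.putIfAbsent(metadataKey(table, newVersion), metadata) == null;
  }

  // The hint is a separate, unconditional write, so it can lag behind commits.
  void writeHint(String table, int version) {
    store.put(hintKey(table), Integer.toString(version));
  }

  // Start at the (possibly stale) hint and probe forward until a version is missing.
  int latestVersion(String table) {
    String hint = store.get(hintKey(table));
    int version = hint == null ? 0 : Integer.parseInt(hint);
    while (store.containsKey(metadataKey(table, version + 1))) {
      version++;
    }
    return version;
  }

  public static void main(String[] args) {
    PutIfAbsentCommitSketch catalog = new PutIfAbsentCommitSketch();
    catalog.commit("db/events", 1, "first");
    catalog.writeHint("db/events", 1);
    // Two writers race for version 2; the hint is not updated afterwards.
    System.out.println("writer A: " + catalog.commit("db/events", 2, "from A"));
    System.out.println("writer B: " + catalog.commit("db/events", 2, "from B"));
    System.out.println("latest: " + catalog.latestVersion("db/events"));
  }
}
```

Even in this toy, readers have to probe past the hint to find the latest version, and nothing here covers renames, drops, or commits that span more than one table.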

Getting a file-only scheme right is complicated and is specific to a
particular store (both commits and version-hint handling). Local file
systems would use an exclusive create operation to commit, Hadoop uses
atomic rename, and object stores use different `putIfAbsent` operations.
Making this work across languages and engines requires a lot of work to
specify requirements and document issues, only to get to single-table
functionality that doesn't deliver the catalog-level primitives like atomic
rename that are commonly used.

In the end, catalog problems are best solved at the catalog layer, not
through an elaborate scheme that uses storage-layer primitives, just as it
was not a good idea to deliver table behaviors using similar storage-layer
schemes. Adding `putIfAbsent` to S3 doesn't change that design principle.

I sympathize with the idea that it would be great if you didn't need a
catalog. Simpler infrastructure is generally better.

But trying to avoid a catalog limits the capabilities of this
infrastructure, while setting people up for later failure. When I talk with
people that have been trying to avoid having a catalog, they tend to have
tables scattered across buckets that they need to track down, they lack
observability to know what is being used, don't know if they are
deleting data in compliance with regulations, and they often lack simple
and usable access controls.

I think that the solution is to make it easier to run or use a catalog, not
to try to build without one.

And I'm also looking forward to what Jack is alluding to.

On Tue, Nov 26, 2024 at 11:05 PM Ajantha Bhat  wrote:

> Interesting.
>
> We already have file system tables [1] in Iceberg (HadoopCatalog
> implements this spec).
> We deprecated this recently and we don't have to deprecate it if object
> stores support atomic operations like this.
>
> [1] https://iceberg.apache.org/spec/#file-system-tables
>
> - Ajantha
>
> On Wed, Nov 27, 2024 at 2:53 AM Nikhil Benesch 
> wrote:
>
>> Ah, fascinating. Thanks very much for the pointer.
>>
>> Here's the thread introducing the proposal [0], for anyone else curious.
>>
>> [0]: https://lists.apache.org/thread/kh4n98w4z22sc8h2vot4q8n44vdtnltg
>>
>> On Tue, Nov 26, 2024 at 3:27 PM Jean-Baptiste Onofré 
>> wrote:
>> >
>> > Hi Vignesh
>> >
>> > Thanks for the reminder, I remember we quickly discussed this during a
>> > community meeting.
>> >
>> > I will take a new look at the doc.
>> >
>> > Regards
>> > JB
>> >
>> > On Tue, Nov 26, 2024 at 9:19 PM Vignesh  wrote:
>> > >
>> > > Hi,
>> > > There was a proposal along the same lines, for the read portion, a few
>> > > weeks back by Ashvin.
>> > >
>> > >
>> https://docs.google.com/document/d/1yzLXSOtzBXyaWHfeVsWsMu4xmOH8rV6QyM5ZAnJZjMQ/edit?usp=drivesdk
>> > >
>> > > Thanks,
>> > > Vignesh.
>> > >
>> > >
>> > > On Tue, Nov 26, 2024, 11:59 AM Jean-Baptiste Onofré 
>> wrote:
>> > >>
>> > >> Hi Nikhil
>> > >>
>> > >> Thanks for your message, very interesting.
>> > >>
>> > >> I think it would be great to involve the Polaris project here as
>> well,
>> > >> as a REST Catalog implementation.
>> > >> The Polaris community is discussing storage/backend right now, so it
>> > >> would be the perfect timi

Re: [Discuss] Document Snapshot Summary Optional Fields for Standardization

2024-11-27 Thread Kevin Liu

Thanks for driving this Honah!

It's important to have a consistent naming scheme so that we don't need to
worry about edge cases when using multiple engines, and possibly have to
deal with migrations.

Also, since users can store arbitrary key/value pairs in the summary
property, it's good to document the currently used properties to avoid
collision.
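
A small sketch of what a collision check could look like once the fields are documented. The key set below is only a few examples of summary fields that existing implementations write, not the full list from the proposal, and `build_summary` is a hypothetical helper.

```python
# Illustrative subset of documented summary keys (assumption: not exhaustive).
KNOWN_SUMMARY_KEYS = {
    "operation",
    "added-data-files",
    "added-records",
    "total-records",
}


def build_summary(operation: str, metrics: dict, custom: dict) -> dict:
    """Merge engine metrics and user properties, rejecting key collisions."""
    clashes = KNOWN_SUMMARY_KEYS.intersection(custom)
    if clashes:
        raise ValueError(
            f"custom keys collide with documented fields: {sorted(clashes)}"
        )
    summary = {"operation": operation}
    # Summary values are stored as strings.
    summary.update({key: str(value) for key, value in metrics.items()})
    summary.update(custom)
    return summary
```

With a shared table of names, every implementation could run the same kind of check against the same key set.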

I like the proposal to document all properties in a "snapshot summary"
table. This will ensure a centralized place to view all possible key/value
pairs, similar to how FileIO configuration is handled in iceberg-python.
Other implementations can use this table as a reference.

> This approach offers flexibility, as new fields can be added through
> documentation updates without requiring specification changes.
This will save a lot of effort since specification changes require
greater scrutiny.

> summary details would not be located near the Snapshot section, which
> explains the summary field.
We can link the table to the Snapshot section.


Would love to hear others' thoughts on this.

Best,
Kevin Liu

On Tue, Nov 26, 2024 at 2:50 PM Honah J.  wrote:

> Hi everyone,
>
> I’d like to propose an addition to the table specification to document
> optional fields in the snapshot summary.
>
> Currently, the snapshot summary includes a required operation field and
> various optional fields. While these optional fields, such as metrics and
> partition-level summaries, are supported by the Java and Python
> implementations, they are not officially documented. This creates risks of
> inconsistency as other implementations and engines adopt and interact with
> these fields.
>
> I propose adding a new section to the table specification to document
> these optional fields, ensuring consistent naming conventions and reducing
> ambiguity across implementations. While this is the primary proposal, it
> may also be worth discussing whether documenting these fields separately in
> Docs/Table would provide additional flexibility for future updates.
>
> I’d love to hear your thoughts, suggestions, or concerns about this
> proposal.
>
> Looking forward to the discussion!
>
> Links
>
> - GitHub tracking issue: https://github.com/apache/iceberg/issues/11659
> - Proposal:
>   https://docs.google.com/document/d/1Gt1ZOXVXK60IGdlmt4QlyRzaZ1iCVyYUBfMJCsiz14I/edit?usp=sharing
> - PR: https://github.com/apache/iceberg/pull/11660
>
>
> Best regards,
> Honah
>

Re: [DISCUSS] Enforce table properties at catalog level

2024-11-27 Thread rdb...@gmail.com

Manu, this is something that you can easily build into a REST catalog
implementation. I think that's probably the best way to solve it, rather
than trying to implement this behavior across all of the catalogs in the
project, right?
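
A minimal sketch of what catalog-side enforcement could look like, assuming the catalog service holds its own maps of default and override properties. The function names are hypothetical and not part of any REST catalog implementation.

```python
def create_table_properties(requested: dict, defaults: dict, overrides: dict) -> dict:
    """On create: defaults lose to user values, overrides always win."""
    return {**defaults, **requested, **overrides}


def alter_table_properties(current: dict, updates: dict, overrides: dict) -> dict:
    """On alter: re-apply overrides so a later update cannot change them."""
    return {**current, **updates, **overrides}
```

For example, with `overrides = {"write.format.default": "parquet"}`, a later request to set `write.format.default` to `orc` would be accepted for other keys but leave that property at `parquet`. Whether the service should silently re-apply the override or reject the request is a policy choice this sketch does not settle.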

On Wed, Nov 27, 2024 at 8:47 AM Pucheng Yang 
wrote:

> I think the naming of the property should be fixed, as it only applies to
> new table creation.
>
> On Wed, Nov 27, 2024 at 2:21 AM Manu Zhang 
> wrote:
>
>> Hi all,
>>
>> Currently, we can *enforce default table properties* at catalog level
>> with configs like
>> spark.sql.catalog.*catalog-name*.table-override.*propertyKey*[1].  It
>> prevents users from overriding those properties when creating a table.
>> However, users can still override them later by altering the table.
>> The Spark doc is inconsistent with this: it says that a table-override
>> property can't be overridden by users. Which behavior is expected?
>>
>>
>> [1] https://iceberg.apache.org/docs/nightly/spark-configuration/#catalog-configuration
>>
>>
>> Thanks,
>> Manu
>>
>

Re: [DISCUSS] iceberg rust 0.4.0 and iceberg pyiceberg_core 0.1.0 release

2024-11-27 Thread Fokko Driesprong

Hey Sung,

All for it, and happy to help as well. I'll add it to the agenda for
tomorrow's Rust sync. We'll make sure to publish the notes since it is on a
US holiday.

Kind regards,
Fokko

Op wo 27 nov 2024 om 19:30 schreef Kevin Liu :

> Thanks for driving this, Sung! I'm +1 to release both iceberg-rust and
> pyiceberg_core. It's very exciting to see pyiceberg_core and its
> integration with PyIceberg.
> It makes sense to decouple pyiceberg_core from iceberg-rust since the two
> "projects" are on different tracks. We'd want to release pyiceberg_core
> features independent of iceberg-rust features.
>
> Please let me know if there's anything I can do to help.
>
> Best,
> Kevin Liu
>
> On Wed, Nov 27, 2024 at 6:13 AM Sung Yun  wrote:
>
>> Hi folks, it's been some time since we've done an Iceberg Rust release,
>> and we've finally set up the GitHub Actions workflow [1] that will allow
>> us to build and publish an abi3-compatible wheel to PyPI.
>>
>> If we are still +1 for the release (both iceberg-rust and
>> pyiceberg_core), I think it'll be awesome to get this release out soon as
>> it will help the PyIceberg community test out the pyiceberg_core binding in
>> preparation for the next release.
>>
>> Another option would be to introduce a workflow_dispatch trigger to the
>> python_release.yml and run a decoupled release for pyiceberg_core [2].
>>
>> I'd be happy to help run the release, if no one has started looking into
>> it already.
>>
>> Sung
>>
>> [1] https://github.com/apache/iceberg-rust/pull/705
>> [2] https://lists.apache.org/thread/j22o7yktrlddrgkcy7gl88o23nyrgooc
>>
>> On 2024/09/05 14:06:10 xianjin wrote:
>> > +1 for this pyiceberg_core as well.
>> >
>> > Two cents about the iceberg-rust release schedule: it seems too
>> > aggressive to release by 2 weeks, monthly (4 weeks) release would be a
>> > nice fit.
>> >
>> > Sent from my iPhone
>> >
>> > > On Sep 5, 2024, at 8:25 PM, Sung Yun  wrote:
>> > >
>> > > Thank you for driving this Xuanwo!
>> > >
>> > > +1 as well, as noted the 0.1.0 pyiceberg_core release will allow
>> > > PyIceberg to begin integrating with the rust based core and introduce
>> > > a new feature that the community is looking for.
>> > >
>> > > On Thu, Sep 5, 2024 at 6:05 AM Renjie Liu <liurenjie2...@gmail.com>
>> > > wrote:
>> > >
>> > >> +1 for this release.
>> > >>
>> > >> As iceberg-rust is under fast development, a shorter release (3-4
>> > >> weeks) schedule would benefit users so that they don't need to rely
>> > >> on a snapshot version.
>> > >>
>> > >> On Thu, Sep 5, 2024 at 3:26 PM Xuanwo <xua...@apache.org> wrote:
>> > >>
>> > >>> Hello, everyone
>> > >>>
>> > >>> I'm starting this thread to discuss the release of iceberg rust
>> > >>> 0.4.0 and iceberg pyiceberg_core 0.1.0.
>> > >>>
>> > >>> There is no specific reason for this release. I just want to align
>> > >>> with the two- to three-week release schedule of iceberg rust so
>> > >>> users don't have to wait long or encounter too many breaking
>> > >>> changes at once.
>> > >>>
>> > >>> Additionally, the pyiceberg team is awaiting our first release of
>> > >>> pyiceberg_core 0.1.0 so they can integrate with it, see how it
>> > >>> works, and explore ways to improve collaboration.
>> > >>>
>> > >>> What do you think?
>> > >>>
>> > >>> Xuanwo
>> >
>>
>

Re: [DISCUSS] Hive Support

2024-11-27 Thread rdb...@gmail.com

I think that we should remove Hive 2 and Hive 3. We already agreed to
remove Hive 2, but Hive 3 is not compatible with the project anymore and is
already EOL and will not see a release to update it so that it can be
compatible. Anyone using the existing Hive 3 support should be able to
continue using older releases.

In general, I think it's a good idea to let people use older releases when
these situations happen. It is difficult for the project to continue to
support libraries that are EOL and I don't think there's a great
justification for it, considering Iceberg support in Hive 4 is native and
much better!

On Wed, Nov 27, 2024 at 7:12 AM Cheng Pan  wrote:

> That said, it would be helpful if they continue running
> tests against the latest stable Hive releases to ensure that any
> changes don’t unintentionally break something for Hive, which would be
> beyond our control.
>
>
> I believe we should continue maintaining a Hive Iceberg runtime test suite
> with the latest version of Hive in the Iceberg repository.
>
>
> i think we can keep some basic Hive4 tests in iceberg repo
>
>
> Instead of running basic tests in the Iceberg repo, maybe it makes more
> sense to let Iceberg publish daily snapshot jars to Nexus, and have a
> daily CI in Hive consume those jars and run the full Iceberg tests?
>
> Thanks,
> Cheng Pan
>
>

Re: [DISCUSS] Deprecate embedded manifests

2024-11-27 Thread rdb...@gmail.com

I think it's reasonable to mark it deprecated in the spec, especially
because we don't allow it in v2. But I'm not sure how that would allow us
to remove code paths associated with it. If it is allowed by an older and
supported version of the spec, then how can we safely remove the code paths
that read it?
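
To illustrate the code path in question, here is a rough sketch of how a reader that still supports v1 has to branch. It assumes the snapshot is parsed JSON with the `manifest-list` and `manifests` fields from the table spec; `manifest_source` is a hypothetical helper, not the Java implementation.

```python
def manifest_source(snapshot: dict, format_version: int):
    """Return ("manifest-list", path) or ("manifests", [paths]) for a snapshot."""
    if "manifest-list" in snapshot:
        return ("manifest-list", snapshot["manifest-list"])
    if format_version == 1 and "manifests" in snapshot:
        # Embedded manifests: only valid in v1 metadata.
        return ("manifests", list(snapshot["manifests"]))
    raise ValueError("snapshot does not reference a manifest list")
```

As long as v1 is a supported format version, the second branch has to stay, even if the spec marks the field deprecated.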

On Fri, Nov 22, 2024 at 2:56 AM Fokko Driesprong  wrote:

> Hey Ryan,
>
> The goal of the deprecation is to keep other implementations from producing
> it. PyIceberg, for example, does not support this, and I think it would be
> good to avoid having others (Rust, Go, etc.) support it. Regarding the
> removal, Amogh expressed the same concern on the PR.
>
> In my quest to make the Java implementation follow the spec as closely as
> possible, I noticed that we use a DummyFileIO to mimic a ManifestList. I
> ran into this when turning 503: added_snapshot_id into a required field.
> So the value is in removing paths, as Szehon pointed out. When removing
> support for the embedded manifest list, we can remove all that logic and
> keep the codebase nice and tidy.
>
> It would be good to start the discussion of deprecating support for older
> formats at some point. However, for a V2 reader it is fairly easy to
> project V1 metadata as V2, except when embedded manifests are being used.
> Marking these kinds of oddities as deprecated will, I think, enable readers
> to support reading older versions for a longer time. My suggestion would be
> to mark the field as deprecated and revisit the actual removal. I've marked
> it up for removal in Java 2.0 for now to give it enough time.
>
> Kind regards,
> Fokko
>
>
>
> Op do 21 nov 2024 om 20:52 schreef rdb...@gmail.com :
>
>> Can we safely deprecate and remove this? The manifest list is required in
>> v2, but the spec has stated for a long time that v1 tables can use
>> manifests rather than a manifest list. It’s unlikely, but it would be
>> valid for other implementations to produce it.
>>
>> I would understand if other implementations chose to fail tables that
>> don’t have a manifest list to avoid adding code to handle manifests, but
>> I don’t think that there’s much value in removing support from the Java
>> implementation.
>>
>> Instead, what about discussing how to deprecate support for older format
>> versions? That seems like the main issue here. Once the majority of
>> implementations move to newer versions, we would like to deprecate the old
>> ones.
>>
>> On Thu, Nov 21, 2024 at 11:01 AM Szehon Ho 
>> wrote:
>>
>>> +1, great to have fewer possible paths.
>>>
>>> Thanks
>>> Szehon
>>>
>>> On Thu, Nov 21, 2024 at 10:33 AM Steve Zhang
>>>  wrote:
>>>
 +1 to deprecate

 Thanks,
 Steve Zhang



 On Nov 19, 2024, at 3:32 AM, Fokko Driesprong  wrote:

 Hi everyone,

 I would like to propose to deprecate embedded manifests. They were used
 before the manifest-list was introduced, but I don't think they have been
 used since the project was open-sourced, and it would be good to
 officially deprecate them from the spec. They are only supported by Iceberg
 Java today, and I haven't seen any requests for PyIceberg to add support
 for them.

 Any questions or concerns about deprecating the embedded manifests?

 Kind regards,
 Fokko Driesprong

Re: [VOTE] Release Apache PyIceberg 0.8.1rc1

2024-11-27 Thread Sung Yun

Hi Kevin,

Yes, that approach sounds good to me as well. And thanks for the
explanation!

Sung

On Wed, Nov 27, 2024 at 8:17 PM Kevin Liu  wrote:

> Hey Sung,
>
> Good point. For context, I accidentally generated and uploaded to PyPI a
> version with `0.8.1` instead of `0.8.1rc1`. Fokko helped me yank that
> version. https://pypi.org/project/pyiceberg/0.8.1/
>
> If this RC passes, we can un-yank and reuse the currently uploaded
> version. Otherwise, I can create a new patch version using `0.8.2`. How
> does that sound?
>
> Additionally, I created a PR to prevent this from happening again.
> https://github.com/apache/iceberg-python/pull/1386
>
> Best,
> Kevin Liu
>
> On Wed, Nov 27, 2024 at 5:07 PM Sung Yun  wrote:
>
>> Hi Kevin,
>>
>> Thank you so much for working on this release!
>>
>> I noticed that PyIceberg 0.8.1 was released and yanked [1] this morning.
>> Similar to how we handled it when this happened last
>> time, I think this would mean that we would need to now move on to the next
>> version and publish it as a PyIceberg 0.8.2 release instead. Hence, I think
>> it would make sense to start a new vote thread with the incremented version.
>>
>> Sung
>>
>> [1] https://pypi.org/project/pyiceberg/
>>
>> On Wed, Nov 27, 2024 at 7:55 PM Kevin Liu  wrote:
>>
>>> Hi Everyone,
>>>
>>> I propose that we release the following RC as the official PyIceberg
>>> 0.8.1 release.
>>>
>>> The commit ID is a051584a3684392d2db6556449eb299145d47d15
>>>
>>> * This corresponds to the tag: pyiceberg-0.8.1rc1
>>> (17124779c5294cb928f3807ed539f427f9b4bd2e)
>>> *
>>> https://github.com/apache/iceberg-python/releases/tag/pyiceberg-0.8.1rc1
>>> *
>>> https://github.com/apache/iceberg-python/tree/a051584a3684392d2db6556449eb299145d47d15
>>>
>>> The release tarball, signature, and checksums are here:
>>>
>>> * https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.8.1rc1/
>>>
>>> You can find the KEYS file here:
>>>
>>> * https://downloads.apache.org/iceberg/KEYS
>>>
>>> Convenience binary artifacts are staged on pypi:
>>>
>>> https://pypi.org/project/pyiceberg/0.8.1rc1/
>>>
>>> And can be installed using: pip3 install pyiceberg==0.8.1rc1
>>>
>>> Instructions for verifying a release can be found here:
>>>
>>> * https://py.iceberg.apache.org/verify-release/
>>>
>>> High-Level Summary
>>> *Breaking Changes*
>>> * The `Table.name` method now returns the table name *without the
>>> catalog name*, as part of a broader effort to remove catalog references
>>> in PyIceberg.
>>>   * Replace usages of `Table.identifier` with `Table.name` in the
>>> codebase
>>>   * Replace usages of the deprecated function
>>> (`identifier_to_tuple_without_catalog`) in the codebase which removes
>>> unnecessary warnings
>>>
>>>
>>> *Bug fixes*
>>> * Fix `add_files` for parquet files missing column statistics
>>> * Allow leading underscore in column name used in row filter
>>> * Ignore Glue and Hive tables missing the `table_type` property
>>> * Write `null` in manifest list metadata when there is no
>>> `parent-snapshot-id`
>>>
>>>
>>> *Dependency Updates*
>>> * Removed upper-bound restrictions on dependencies;
>>> allow early testing of new versions:
>>>   * Remove Python library version upper bound restriction; allow Python
>>> 3.13
>>>   * Remove fsspec library version upper bound restriction
>>>
>>>
>>> *Documentation Updates*
>>> * Improve “how to release” documentation
>>> * Included post-release steps for version 0.8.0
>>> * Included documentation updates in this patch release to reflect these
>>> changes in https://py.iceberg.apache.org/
>>>
>>> *Commit Summary*
>>> * [36 new commits since the `0.8.0` release](
>>> https://github.com/apache/iceberg-python/compare/pyiceberg-0.8.0...acbd071375ac4cc2053435346737a3b1a64cce2e).
>>>
>>> * 12 new commits will be included in 0.8.1
>>>   * 11 commits cherry-picked as bug fixes (listed below)
>>>   * 1 [commit](
>>> https://github.com/apache/iceberg-python/commit/58389dfe5cf5f6ef6ea16c47cd11408c642fafd1)
>>> to bump version to `0.8.1`
>>>
>>> *Detailed Commits*
>>> * acbd071 Write `null` when there is no parent-snapshot-id (#1383)
>>> * bb078cf Add instruction for patch release (#1373)
>>> * ab43c6c fix `KeyError` raised by `add_files` when parquet file doe not
>>> have column stats (#1354)
>>> * cc1ab2c Improve documentation for "how to release" (#1359)
>>> * 64dc6fe Remove Python 3.13 upper bound restriction (#1355)
>>> * d86ab6e Allow leading underscore in column name used in row filter
>>> (#1358)
>>> * 7a4734e Replace reference of `Table.identifier` with `Table.name`
>>> (#1346)
>>> * a66ddc0 Ignore tables without `table_type` from Glue and Hive (#1332)
>>> * 2cbc77d Drop upper bounds for fsspec and it's implementations (#1341)
>>> * 7660a5b 0.8.0 post release steps (#1334)
>>> * b2f0a9e use the non-deprecated func (#1326)
>>>
>>>
>>> Please download, verify, and test.
>>>
>>> Please vote in the next 72 hours.
>>> [ ] +1 Release this as PyIceberg 0.8.1
>>> [ ] +0
>>> [ ] -1 Do not release this

Re: [VOTE] Release Apache PyIceberg 0.8.1rc1

2024-11-27 Thread Kevin Liu

Hey Sung,

Good point. For context, I accidentally generated and uploaded to PyPI a
version with `0.8.1` instead of `0.8.1rc1`. Fokko helped me yank that
version. https://pypi.org/project/pyiceberg/0.8.1/

If this RC passes, we can un-yank and reuse the currently uploaded version.
Otherwise, I can create a new patch version using `0.8.2`. How does that
sound?

Additionally, I created a PR to prevent this from happening again.
https://github.com/apache/iceberg-python/pull/1386
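
As a rough illustration of the kind of guard that can catch this (not necessarily what the PR above does), a release script could refuse to upload anything that is not an RC version. The pattern and the function name `require_release_candidate` are assumptions for the example.

```python
import re

# Assumed version shape for release candidates, e.g. 0.8.1rc1.
RC_PATTERN = re.compile(r"\d+\.\d+\.\d+rc\d+")


def require_release_candidate(version: str) -> str:
    """Return the version unchanged, or raise if it is not an RC version."""
    if not RC_PATTERN.fullmatch(version):
        raise ValueError(f"refusing to upload non-RC version: {version!r}")
    return version
```

Calling this before the upload step would have rejected `0.8.1` while letting `0.8.1rc1` through.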

Best,
Kevin Liu

On Wed, Nov 27, 2024 at 5:07 PM Sung Yun  wrote:

> Hi Kevin,
>
> Thank you so much for working on this release!
>
> I noticed that PyIceberg 0.8.1 was released and yanked [1] this morning.
> Similar to how we handled it when this happened last
> time, I think this would mean that we would need to now move on to the next
> version and publish it as a PyIceberg 0.8.2 release instead. Hence, I think
> it would make sense to start a new vote thread with the incremented version.
>
> Sung
>
> [1] https://pypi.org/project/pyiceberg/
>
> On Wed, Nov 27, 2024 at 7:55 PM Kevin Liu  wrote:
>
>> Hi Everyone,
>>
>> I propose that we release the following RC as the official PyIceberg
>> 0.8.1 release.
>>
>> The commit ID is a051584a3684392d2db6556449eb299145d47d15
>>
>> * This corresponds to the tag: pyiceberg-0.8.1rc1
>> (17124779c5294cb928f3807ed539f427f9b4bd2e)
>> *
>> https://github.com/apache/iceberg-python/releases/tag/pyiceberg-0.8.1rc1
>> *
>> https://github.com/apache/iceberg-python/tree/a051584a3684392d2db6556449eb299145d47d15
>>
>> The release tarball, signature, and checksums are here:
>>
>> * https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.8.1rc1/
>>
>> You can find the KEYS file here:
>>
>> * https://downloads.apache.org/iceberg/KEYS
>>
>> Convenience binary artifacts are staged on pypi:
>>
>> https://pypi.org/project/pyiceberg/0.8.1rc1/
>>
>> And can be installed using: pip3 install pyiceberg==0.8.1rc1
>>
>> Instructions for verifying a release can be found here:
>>
>> * https://py.iceberg.apache.org/verify-release/
>>
>> High-Level Summary
>> *Breaking Changes*
>> * The `Table.name` method now returns the table name *without the
>> catalog name*, as part of a broader effort to remove catalog references
>> in PyIceberg.
>>   * Replace usages of `Table.identifier` with `Table.name` in the codebase
>>   * Replace usages of the deprecated function
>> (`identifier_to_tuple_without_catalog`) in the codebase which removes
>> unnecessary warnings
>>
>>
>> *Bug fixes*
>> * Fix `add_files` for parquet files missing column statistics
>> * Allow leading underscore in column name used in row filter
>> * Ignore Glue and Hive tables missing the `table_type` property
>> * Write `null` in manifest list metadata when there is no
>> `parent-snapshot-id`
>>
>>
>> *Dependency Updates*
>> * Removed upper-bound restrictions on dependencies;
>> allow early testing of new versions:
>>   * Remove Python library version upper bound restriction; allow Python
>> 3.13
>>   * Remove fsspec library version upper bound restriction
>>
>>
>> *Documentation Updates*
>> * Improve “how to release” documentation
>> * Included post-release steps for version 0.8.0
>> * Included documentation updates in this patch release to reflect these
>> changes in https://py.iceberg.apache.org/
>>
>> *Commit Summary*
>> * [36 new commits since the `0.8.0` release](
>> https://github.com/apache/iceberg-python/compare/pyiceberg-0.8.0...acbd071375ac4cc2053435346737a3b1a64cce2e).
>>
>> * 12 new commits will be included in 0.8.1
>>   * 11 commits cherry-picked as bug fixes (listed below)
>>   * 1 [commit](
>> https://github.com/apache/iceberg-python/commit/58389dfe5cf5f6ef6ea16c47cd11408c642fafd1)
>> to bump version to `0.8.1`
>>
>> *Detailed Commits*
>> * acbd071 Write `null` when there is no parent-snapshot-id (#1383)
>> * bb078cf Add instruction for patch release (#1373)
>> * ab43c6c fix `KeyError` raised by `add_files` when parquet file doe not
>> have column stats (#1354)
>> * cc1ab2c Improve documentation for "how to release" (#1359)
>> * 64dc6fe Remove Python 3.13 upper bound restriction (#1355)
>> * d86ab6e Allow leading underscore in column name used in row filter
>> (#1358)
>> * 7a4734e Replace reference of `Table.identifier` with `Table.name`
>> (#1346)
>> * a66ddc0 Ignore tables without `table_type` from Glue and Hive (#1332)
>> * 2cbc77d Drop upper bounds for fsspec and it's implementations (#1341)
>> * 7660a5b 0.8.0 post release steps (#1334)
>> * b2f0a9e use the non-deprecated func (#1326)
>>
>>
>> Please download, verify, and test.
>>
>> Please vote in the next 72 hours.
>> [ ] +1 Release this as PyIceberg 0.8.1
>> [ ] +0
>> [ ] -1 Do not release this because...
>>
>> Best,
>> Kevin Liu
>>
>

[VOTE] Release Apache PyIceberg 0.8.1rc1

2024-11-27 Thread Kevin Liu

Hi Everyone,

I propose that we release the following RC as the official PyIceberg 0.8.1
release.

The commit ID is a051584a3684392d2db6556449eb299145d47d15

* This corresponds to the tag: pyiceberg-0.8.1rc1
(17124779c5294cb928f3807ed539f427f9b4bd2e)
* https://github.com/apache/iceberg-python/releases/tag/pyiceberg-0.8.1rc1
*
https://github.com/apache/iceberg-python/tree/a051584a3684392d2db6556449eb299145d47d15

The release tarball, signature, and checksums are here:

* https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.8.1rc1/

You can find the KEYS file here:

* https://downloads.apache.org/iceberg/KEYS

Convenience binary artifacts are staged on pypi:

https://pypi.org/project/pyiceberg/0.8.1rc1/

And can be installed using: pip3 install pyiceberg==0.8.1rc1

Instructions for verifying a release can be found here:

* https://py.iceberg.apache.org/verify-release/
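
For the checksum part of verification, a minimal sketch is below. It assumes the checksum file is SHA-512 and starts with the hex digest followed by the file name; the linked instructions are the authoritative steps, and `sha512_of` and `matches_checksum` are hypothetical helper names.

```python
import hashlib


def sha512_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-512 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, checksum_text: str) -> bool:
    """Compare a file against the first token of a checksum file's contents."""
    # Assumed format: "<hex digest>  <file name>".
    expected = checksum_text.split()[0].lower()
    return sha512_of(path) == expected
```

This only covers integrity; checking the signature against the KEYS file is a separate step.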

High-Level Summary
*Breaking Changes*
* The `Table.name` method now returns the table name *without the catalog
name*, as part of a broader effort to remove catalog references in
PyIceberg.
  * Replace usages of `Table.identifier` with `Table.name` in the codebase
  * Replace usages of the deprecated function
(`identifier_to_tuple_without_catalog`) in the codebase which removes
unnecessary warnings


*Bug fixes*
* Fix `add_files` for parquet files missing column statistics
* Allow leading underscore in column name used in row filter
* Ignore Glue and Hive tables missing the `table_type` property
* Write `null` in manifest list metadata when there is no
`parent-snapshot-id`


*Dependency Updates*
* Removed upper-bound restrictions on dependencies;
allow early testing of new versions:
  * Remove Python library version upper bound restriction; allow Python 3.13
  * Remove fsspec library version upper bound restriction


*Documentation Updates*
* Improve “how to release” documentation
* Included post-release steps for version 0.8.0
* Included documentation updates in this patch release to reflect these
changes in https://py.iceberg.apache.org/

*Commit Summary*
* [36 new commits since the `0.8.0` release](
https://github.com/apache/iceberg-python/compare/pyiceberg-0.8.0...acbd071375ac4cc2053435346737a3b1a64cce2e).

* 12 new commits will be included in 0.8.1
  * 11 commits cherry-picked as bug fixes (listed below)
  * 1 [commit](
https://github.com/apache/iceberg-python/commit/58389dfe5cf5f6ef6ea16c47cd11408c642fafd1)
to bump version to `0.8.1`

*Detailed Commits*
* acbd071 Write `null` when there is no parent-snapshot-id (#1383)
* bb078cf Add instruction for patch release (#1373)
* ab43c6c fix `KeyError` raised by `add_files` when parquet file doe not
have column stats (#1354)
* cc1ab2c Improve documentation for "how to release" (#1359)
* 64dc6fe Remove Python 3.13 upper bound restriction (#1355)
* d86ab6e Allow leading underscore in column name used in row filter (#1358)
* 7a4734e Replace reference of `Table.identifier` with `Table.name` (#1346)
* a66ddc0 Ignore tables without `table_type` from Glue and Hive (#1332)
* 2cbc77d Drop upper bounds for fsspec and it's implementations (#1341)
* 7660a5b 0.8.0 post release steps (#1334)
* b2f0a9e use the non-deprecated func (#1326)


Please download, verify, and test.

Please vote in the next 72 hours.
[ ] +1 Release this as PyIceberg 0.8.1
[ ] +0
[ ] -1 Do not release this because...

Best,
Kevin Liu

Re: [DISCUSS] Hive Support

2024-11-27 Thread Ajantha Bhat

+1 to remove support for both Hive2 and Hive3 in the latest Iceberg release
as they have reached EOL.

Hive4 is natively managing Iceberg integration, similar to how Trino
handles its Iceberg integration. Therefore, in my opinion, it would be
better for engines to manage the integration aspect, allowing the Iceberg
community to focus on the specification and table format.

- Ajantha

On Thu, Nov 28, 2024 at 12:47 AM Fokko Driesprong  wrote:

> Hey Cheng,
>
> Thanks for the suggestion. The nightly snapshots are available:
> https://repository.apache.org/content/groups/snapshots/org/apache/iceberg/iceberg-core/,
> which might help when working on features that are not released yet (e.g.,
> nanosecond timestamps). Besides that, we should run RCs against Hive to
> check if everything works as expected.
>
> I'm leaning toward removing Hive 2 and 3 as well.
>
> Kind regards,
> Fokko
>
> Op wo 27 nov 2024 om 20:05 schreef rdb...@gmail.com :
>
>> I think that we should remove Hive 2 and Hive 3. We already agreed to
>> remove Hive 2, but Hive 3 is not compatible with the project anymore and is
>> already EOL and will not see a release to update it so that it can be
>> compatible. Anyone using the existing Hive 3 support should be able to
>> continue using older releases.
>>
>> In general, I think it's a good idea to let people use older releases
>> when these situations happen. It is difficult for the project to continue
>> to support libraries that are EOL and I don't think there's a great
>> justification for it, considering Iceberg support in Hive 4 is native and
>> much better!
>>
>> On Wed, Nov 27, 2024 at 7:12 AM Cheng Pan  wrote:
>>
>>> That said, it would be helpful if they continue running
>>> tests against the latest stable Hive releases to ensure that any
>>> changes don’t unintentionally break something for Hive, which would be
>>> beyond our control.
>>>
>>>
>>> I believe we should continue maintaining a Hive Iceberg runtime test
>>> suite with the latest version of Hive in the Iceberg repository.
>>>
>>>
>>> i think we can keep some basic Hive4 tests in iceberg repo
>>>
>>>
>>> Instead of running basic tests in the Iceberg repo, maybe it makes more
>>> sense to let Iceberg publish daily snapshot jars to Nexus, and have a
>>> daily CI in Hive consume those jars and run the full Iceberg tests?
>>>
>>> Thanks,
>>> Cheng Pan
>>>
>>>

Re: [VOTE] Release Apache PyIceberg 0.8.1rc1

2024-11-27 Thread Sung Yun

Hi Kevin,

Thank you so much for working on this release!

I noticed that PyIceberg 0.8.1 was released and yanked [1] this morning.
Similar to how we handled it when this happened last time,
I think this would mean that we would need to now move on to the next
version and publish it as a PyIceberg 0.8.2 release instead. Hence, I think
it would make sense to start a new vote thread with the incremented version.

Sung

[1] https://pypi.org/project/pyiceberg/

On Wed, Nov 27, 2024 at 7:55 PM Kevin Liu  wrote:

> Hi Everyone,
>
> I propose that we release the following RC as the official PyIceberg 0.8.1
> release.
>
> The commit ID is a051584a3684392d2db6556449eb299145d47d15
>
> * This corresponds to the tag: pyiceberg-0.8.1rc1
> (17124779c5294cb928f3807ed539f427f9b4bd2e)
> * https://github.com/apache/iceberg-python/releases/tag/pyiceberg-0.8.1rc1
> *
> https://github.com/apache/iceberg-python/tree/a051584a3684392d2db6556449eb299145d47d15
>
> The release tarball, signature, and checksums are here:
>
> * https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-0.8.1rc1/
>
> You can find the KEYS file here:
>
> * https://downloads.apache.org/iceberg/KEYS
>
> Convenience binary artifacts are staged on pypi:
>
> https://pypi.org/project/pyiceberg/0.8.1rc1/
>
> And can be installed using: pip3 install pyiceberg==0.8.1rc1
>
> Instructions for verifying a release can be found here:
>
> * https://py.iceberg.apache.org/verify-release/
>
> High-Level Summary
> *Breaking Changes*
> * The `Table.name` method now returns the table name *without the catalog
> name*, as part of a broader effort to remove catalog references in
> PyIceberg.
>   * Replace usages of `Table.identifier` with `Table.name` in the codebase
>   * Replace usages of the deprecated function
> (`identifier_to_tuple_without_catalog`) in the codebase which removes
> unnecessary warnings
>
>
> *Bug fixes*
> * Fix `add_files` for parquet files missing column statistics
> * Allow leading underscore in column name used in row filter
> * Ignore Glue and Hive tables missing the `table_type` property
> * Write `null` in manifest list metadata when there is no
> `parent-snapshot-id`
>
>
> *Dependency Updates*
> * Removed upper-bound restrictions on dependencies;
> allow early testing of new versions:
>   * Remove Python library version upper bound restriction; allow Python
> 3.13
>   * Remove fsspec library version upper bound restriction
>
>
> *Documentation Updates*
> * Improve “how to release” documentation
> * Included post-release steps for version 0.8.0
> * Included documentation updates in this patch release to reflect these
> changes in https://py.iceberg.apache.org/
>
> *Commit Summary*
> * [36 new commits since the `0.8.0` release](
> https://github.com/apache/iceberg-python/compare/pyiceberg-0.8.0...acbd071375ac4cc2053435346737a3b1a64cce2e).
>
> * 12 new commits will be included in 0.8.1
>   * 11 commits cherry-picked as bug fixes (listed below)
>   * 1 [commit](
> https://github.com/apache/iceberg-python/commit/58389dfe5cf5f6ef6ea16c47cd11408c642fafd1)
> to bump version to `0.8.1`
>
> *Detailed Commits*
> * acbd071 Write `null` when there is no parent-snapshot-id (#1383)
> * bb078cf Add instruction for patch release (#1373)
> * ab43c6c fix `KeyError` raised by `add_files` when parquet file doe not
> have column stats (#1354)
> * cc1ab2c Improve documentation for "how to release" (#1359)
> * 64dc6fe Remove Python 3.13 upper bound restriction (#1355)
> * d86ab6e Allow leading underscore in column name used in row filter
> (#1358)
> * 7a4734e Replace reference of `Table.identifier` with `Table.name` (#1346)
> * a66ddc0 Ignore tables without `table_type` from Glue and Hive (#1332)
> * 2cbc77d Drop upper bounds for fsspec and it's implementations (#1341)
> * 7660a5b 0.8.0 post release steps (#1334)
> * b2f0a9e use the non-deprecated func (#1326)
>
>
> Please download, verify, and test.
>
> Please vote in the next 72 hours.
> [ ] +1 Release this as PyIceberg 0.8.1
> [ ] +0
> [ ] -1 Do not release this because...
>
> Best,
> Kevin Liu
>

Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-27 Thread Szehon Ho

Yea, I think that part is definitely kept.

Thanks
Szehon

On Wed, Nov 27, 2024 at 12:02 PM rdb...@gmail.com  wrote:

> I'd support changing the behavior if we still have a way to match the
> intent, which is to return true if the table exists in Hive and is an
> Iceberg table.
>
> On Wed, Nov 27, 2024 at 11:26 AM Szehon Ho 
> wrote:
>
>> Hm I think the thread got a bit sidetracked by the other question.
>>
>> The initial proposal by Steve is a performance improvement for
>> HiveCatalog's tableExists().  Currently it loads both Hive and Iceberg
>> table metadata, and if successful returns true.  The proposal is to load
>> from Hive only, and return true if Hive metadata identifies that an Iceberg
>> table exists with this name.
>>
>> Checking corruption of Iceberg's table metadata.json is a side-effect of
>> the current behavior, but would not anymore with the proposed change.
>> That's the question of the original thread, and so far there's agreement
>> that it is not necessarily part of this scope of HiveCatalog's
>> tableExists().
>>
>> At least this is my understanding.
>> Thanks,
>> Szehon
>>
>> On Wed, Nov 27, 2024 at 10:56 AM rdb...@gmail.com 
>> wrote:
>>
>>> What kind of corruption are you referring to? I would expect corruption
>>> to result in an exception when loading the table, but that the table should
>>> still exist. The problem is likely that we determine if a table exists by
>>> attempting to load it. We could fix that by not attempting to load the
>>> table. I think that's a reasonable solution.
>>>
>>> On Wed, Nov 27, 2024 at 12:45 AM Manu Zhang 
>>> wrote:
>>>
>>>> > The current behavior's intent is not to check whether the metadata is
>>>> > valid, it is to detect whether the table is an Iceberg table.
>>>>
>>>> Is there a way to detect this from HiveCatalog without loading the
>>>> table?
>>>>
>>>> On Wed, Nov 27, 2024 at 2:01 PM Péter Váry
>>>> wrote:

> I think we have an agreement, not to change the behavior wrt existing
> non-Iceberg tables, and throw an exception.
>
> Are we also in agreement with the original proposal to return true
> when the table exists but the metadata is somehow corrupted? Note: this is
> the proposed change of behavior why the thread was originally started.
>
> On Tue, Nov 26, 2024, 21:30 rdb...@gmail.com  wrote:
>
>> I'd argue against changing this. The current behavior's intent is not
>> to check whether the metadata is valid, it is to detect whether the table
>> is an Iceberg table. It ignores non-Iceberg tables. Changing that
>> behavior would be surprising, especially if we started throwing
>> exceptions.
>>
>> On Fri, Nov 22, 2024 at 2:01 PM Kevin Liu 
>> wrote:
>>
>>> > Should add, my personal preference is probably not to change the
>>> existing behavior for this part
>>>
>>> +1. I realized that this is not a new behavior. The `loadTable`
>>> implementation has this problem too.
>>> It would be good to have a test case specifically for this edge case
>>> and maybe call this out in the documentation.
>>>
>>> Thanks,
>>> Kevin Liu
>>>
>>> On Fri, Nov 22, 2024 at 11:57 AM Szehon Ho 
>>> wrote:
>>>
>>>>>>>> Should add, my personal preference is probably not to change the
>>>>>>>> existing behavior for this part (false, if exists a Hive table with
>>>>>>>> same name) at the moment, just adding another possibility for
>>>>>>>> consideration.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Szehon
>>>>>>>>
>>>>>>>> On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho
>>>>>>>> wrote:

> Thanks Kevin and Gabor, this is an interesting discussion.  I
> guess a third option, instead of returning true/false in this case, is
> to change it to throw a NoSuchIcebergTableException if it's a
> non-Iceberg table, which I think is actually what this PR does?
>
> Thanks
> Szehon
>
> On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab
>  wrote:
>
>> Hey,
>>
>> I think what Kevin says makes sense. However, it would then
>> confuse the opposite use case of this function. Let's assume that we
>> change the implementation of tableExists() to not load the table
>> internally:
>>
>> if (tableExists(table_name)) {
>>   table = loadTable(table_name);
>> }
>>
>> Here, you find that the table exists but when you try to load it
>> it fails because it's not an Iceberg table. I don't think that any of
>> these 2 are intuitive. I think the question here is how much an API of
>> the Iceberg table format should know about the existence of tables in
>> other formats.
>>
>> If `tableExists` is meant to check for conflicting entries in the
>>> HMS
>>
>> Another interpretation of calli

Re: [DISCUSS] Apache Iceberg Summit 2025 - Selection Committee

2024-11-27 Thread Christian Thiel

Hey JB,

happy to help any way I can. Thanks for organizing this!

Best,
Christian

On 27. Nov 2024, at 07:52, Fokko Driesprong  wrote:

Hey JB,

Thanks for organizing this. Happy to help!

Kind regards,
Fokko

On Wed, 27 Nov 2024 at 06:23, karuppayya wrote:
Hi JB, I am happy to help with this.
- Karuppayya

On Tue, Nov 26, 2024 at 8:55 PM Renjie Liu wrote:
Hi, JB:

Thanks for driving this. Happy to help!

On Wed, Nov 27, 2024 at 9:13 AM Bill Zhang  wrote:
Hi JB,

Happy to help.

Bill

> On Nov 26, 2024, at 4:42 AM, Jean-Baptiste Onofré wrote:
>
> Hi everyone,
>
> As you probably know, we've been having discussions about the Iceberg
> Summit 2025.
>
> The PMC pre-approved the Iceberg Summit proposal, and one of the first
> steps is to put together a selection committee that will be
> responsible for choosing talks and guiding the process.
> Once we have a selection committee, I will complete the concrete
> proposal for the ASF and the Iceberg PMC to request the ability to use
> the name Iceberg/Apache Iceberg.
>
> If you'd like to help and be part of the selection committee, please
> volunteer in a reply to this thread. Since we likely can't include
> everyone that volunteers, I propose that the PMC should choose the
> final committee from the set of people that volunteer.
>
> We'll leave this open up to Dec 10th to give people time (as
> Thanksgiving is this week).
>
> Thanks !
> Regards
> JB

Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-27 Thread Manu Zhang

>
> The current behavior's intent is not to check whether the metadata is
> valid, it is to detect whether the table is an Iceberg table.


Is there a way to detect this from HiveCatalog without loading the table?
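
One possible way, sketched below, is to look only at the parameters stored on the HMS table entry. As far as I know, Iceberg's Hive integration marks its tables with a `table_type` parameter whose value is `ICEBERG` (compared case-insensitively), which is the check behind `HiveOperationsBase.validateTableIsIceberg` mentioned further down this thread. The class name, the `register` helper, and the in-memory map standing in for the metastore are all hypothetical; a real implementation would fetch the table entry through the metastore client instead.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Sketch only: a Map stands in for the Hive Metastore (HMS).
class TableExistsSketch {
  static final String TABLE_TYPE_PROP = "table_type";
  static final String ICEBERG_TABLE_TYPE_VALUE = "iceberg";

  // Hypothetical metastore: table name -> HMS table parameters.
  private final Map<String, Map<String, String>> metastore = new HashMap<>();

  void register(String name, Map<String, String> parameters) {
    metastore.put(name, parameters);
  }

  // True only if the HMS entry exists and is marked as an Iceberg table.
  // No metadata.json is read, so corrupted Iceberg metadata still yields true.
  boolean tableExists(String name) {
    return Optional.ofNullable(metastore.get(name))
        .map(parameters -> parameters.get(TABLE_TYPE_PROP))
        .map(ICEBERG_TABLE_TYPE_VALUE::equalsIgnoreCase)
        .orElse(false);
  }

  public static void main(String[] args) {
    TableExistsSketch catalog = new TableExistsSketch();
    catalog.register("db.events", Map.of(TABLE_TYPE_PROP, "ICEBERG"));
    catalog.register("db.legacy", Map.of("EXTERNAL", "TRUE"));

    System.out.println(catalog.tableExists("db.events"));
    System.out.println(catalog.tableExists("db.legacy"));
    System.out.println(catalog.tableExists("db.missing"));
  }
}
```

With this shape a plain Hive table yields `false` rather than an exception, which matches the current behavior for non-Iceberg tables discussed in this thread.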


On Wed, Nov 27, 2024 at 2:01 PM Péter Váry 
wrote:

> I think we have an agreement, not to change the behavior wrt existing
> non-Iceberg tables, and throw an exception.
>
> Are we also in agreement with the original proposal to return true when
> the table exists but the metadata is somehow corrupted? Note: this is the
> proposed change of behavior why the thread was originally started.
>
> On Tue, Nov 26, 2024, 21:30 rdb...@gmail.com  wrote:
>
>> I'd argue against changing this. The current behavior's intent is not to
>> check whether the metadata is valid, it is to detect whether the table is
>> an Iceberg table. It ignores non-Iceberg tables. Changing that behavior
>> would be surprising, especially if we started throwing exceptions.
>>
>> On Fri, Nov 22, 2024 at 2:01 PM Kevin Liu  wrote:
>>
>>> > Should add, my personal preference is probably not to change the
>>> existing behavior for this part
>>>
>>> +1. I realized that this is not a new behavior. The `loadTable`
>>> implementation has this problem too.
>>> It would be good to have a test case specifically for this edge case and
>>> maybe call this out in the documentation.
>>>
>>> Thanks,
>>> Kevin Liu
>>>
>>> On Fri, Nov 22, 2024 at 11:57 AM Szehon Ho 
>>> wrote:
>>>
>>>> Should add, my personal preference is probably not to change the
>>>> existing behavior for this part (false, if exists a Hive table with same
>>>> name) at the moment, just adding another possibility for consideration.
>>>>
>>>> Thanks
>>>> Szehon
>>>>
>>>> On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho
>>>> wrote:
> Thanks Kevin and Gabor, this is an interesting discussion.  I guess a
> third option, instead of returning true/false in this case, is to change it
> to throw a NoSuchIcebergTableException if it's a non-Iceberg table, which
> I think is actually what this PR does?
>
> Thanks
> Szehon
>
> On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab
>  wrote:
>
>> Hey,
>>
>> I think what Kevin says makes sense. However, it would then confuse
>> the opposite use case of this function. Let's assume that we change the
>> implementation of tableExists() to not load the table internally:
>>
>> if (tableExists(table_name)) {
>>   table = loadTable(table_name);
>> }
>>
>> Here, you find that the table exists but when you try to load it it
>> fails because it's not an Iceberg table. I don't think that any of these
>> 2 are intuitive. I think the question here is how much an API of the
>> Iceberg table format should know about the existence of tables in other
>> formats.
>>
>> If `tableExists` is meant to check for conflicting entries in the HMS
>>
>> Another interpretation of calling Catalog.tableExists() on an Iceberg
>> API is instead "is there such an Iceberg table". TBH, not sure if any of
>> the 2 approaches are better than the other, I just wanted to show that
>> there is another side of the coin :)
>>
>> Regards,
>> Gabor
>>
>> On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu 
>> wrote:
>>
>>> Hi Steve,
>>>
>>> This makes sense to me. The semantics of `tableExists` focus on
>>> whether a table's name exists in the catalog. For the Hive catalog,
>>> checking the HMS entry should be sufficient.
>>>
>>> I do have a question about usage, though. Typically, I would use
>>> `tableExists` like this:
>>>
>>> ```
>>> if (!tableExists(table_name)) {
>>>   table = createTable(table_name);
>>> }
>>> ```
>>> What happens when a Hive table with the same name already exists in
>>> the catalog? In the current implementation, `tableExists` would return
>>> `false` because `HiveOperationsBase.validateTableIsIceberg` throws a
>>> `NoSuchTableException`.
>>> This would cause the code above to attempt to create the table, only
>>> to fail since the name already exists in the HMS.
>>> If `tableExists` is meant to check for conflicting entries in the
>>> HMS, perhaps it should return true even when a Hive table with the same
>>> name exists.
>>>
>>> I’d love to hear your thoughts on this.
>>>
>>> Best,
>>> Kevin Liu
>>>
>>> On Thu, Nov 21, 2024 at 5:22 PM Szehon Ho 
>>> wrote:
>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> It's a good performance find and improvement.   Left some comment
>>>>>>>> on the PR.
>>>>>>>>
>>>>>>>> IMO, the behavior actually more matches the API javadoc ("Check
>>>>>>>> whether table exists"), not whether it is corrupted or not, so I'm
>>>>>>>> supportive of it.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Szehon
>>>>>>>>
>>>>>>>> On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang
>>>>>>>> wrote:
>>

Re: Storing catalog directly on object store

2024-11-27 Thread Xuanwo

Hi

I believe we still need to deprecate HadoopCatalog since the operation is still 
not safe on Hadoop. As raised by Jack Ye before, I suggest we consider having a 
StorageCatalog or ObjectStorageCatalog that can only be used with storage 
services supporting conditional writes. That would be a good approach.
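
To make the conditional-write requirement concrete, here is a minimal sketch of a table commit expressed as a compare-and-swap on a single pointer object. Everything in it is hypothetical: `InMemoryObjectStore`, `putIfMatch`, and the object key are illustrative names, and the store only simulates If-Match style semantics in memory with a version counter playing the role of an ETag. It is not any vendor's API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class CasCatalogSketch {
  // Hypothetical in-memory object store; the version counter acts as an ETag.
  static class InMemoryObjectStore {
    private final Map<String, String> bodies = new HashMap<>();
    private final Map<String, Long> etags = new HashMap<>();

    // Current ETag, or null if the object does not exist.
    synchronized Long etag(String key) {
      return etags.get(key);
    }

    synchronized String get(String key) {
      return bodies.get(key);
    }

    // Writes only if the current ETag equals expectedEtag.
    // A null expectedEtag means "the object must not exist yet".
    synchronized boolean putIfMatch(String key, String body, Long expectedEtag) {
      Long current = etags.get(key);
      if (!Objects.equals(current, expectedEtag)) {
        return false; // another writer committed first
      }
      bodies.put(key, body);
      etags.put(key, current == null ? 1L : current + 1);
      return true;
    }
  }

  public static void main(String[] args) {
    InMemoryObjectStore store = new InMemoryObjectStore();
    String pointer = "warehouse/db/tbl/current-metadata";

    // First commit: the pointer must not exist yet.
    System.out.println(store.putIfMatch(pointer, "v1.metadata.json", null));

    // Two writers read the same ETag, then race to commit.
    Long seen = store.etag(pointer);
    System.out.println(store.putIfMatch(pointer, "v2-a.metadata.json", seen));
    System.out.println(store.putIfMatch(pointer, "v2-b.metadata.json", seen));
  }
}
```

The second racing writer is rejected because the ETag it read is stale; it would have to re-read the pointer and retry its commit. A store without such a conditional write cannot reject that writer, which is the safety concern raised above.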

On Wed, Nov 27, 2024, at 15:47, Nikhil Benesch wrote:
> Makes sense! I'd be eager to chat more about this but I'm afraid I won't be at
> re:Invent. Maybe we plan to circle back after re:Invent, once we see what AWS
> announces?
>
> On Tue, Nov 26, 2024 at 2:58 PM Jean-Baptiste Onofré  
> wrote:
>>
>> Hi Nikhil
>>
>> Thanks for your message, very interesting.
>>
>> I think it would be great to involve the Polaris project here as well,
>> as a REST Catalog implementation.
>> The Polaris community is discussing storage/backend right now, so it
>> would be the perfect timing to consider leveraging S3 conditional
>> writes (as a plugin for instance first).
>>
>> I would be happy to connect and know more about your perspective about that.
>>
>> Thanks,
>> Regards
>> JB
>>
>> PS: I will be at AWS re:Invent next week, so maybe we can connect there.
>>
>> On Tue, Nov 26, 2024 at 6:35 PM Nikhil Benesch  
>> wrote:
>> >
>> > Hi all,
>> >
>> > With Amazon S3 announcing support for the If-Match header yesterday [0],
>> > all the major object store implementations now support a compare-and-swap
>> > operation.
>> >
>> > As far as I can tell, this opens up the possibility of storing Iceberg
>> > catalogs directly on object storage, without the need for a separate
>> > metastore, and without violating any of Iceberg's ACID guarantees.
>> >
>> > It seems the immediate next step is to build an independent Java or REST
>> > catalog backend to prove this concept out. Long term, though, the ideal
>> > would be to have such a catalog backend be a first class citizen in the
>> > Iceberg project.
>> >
>> > Is anyone else in the Iceberg community barking up this tree? I'm a long
>> > term Iceberg enthusiast, but new to the community. I'd very much
>> > appreciate any pointers to current or past discussions on the topic. So
>> > far all I've been able to turn up is some light chatter from myself and
>> > others on Bluesky and Hacker News ([1][2][3]).
>> >
>> > Cheers,
>> > Nikhil
>> >
>> > [0]: 
>> > https://aws.amazon.com/about-aws/whats-new/2024/11/amazon-s3-functionality-conditional-writes/
>> > [1]: https://bsky.app/profile/benesch.bsky.social/post/3lauesxg3ic2c
>> > [2]: https://bsky.app/profile/eatonphil.bsky.social/post/3lbskq3jwk22e
>> > [3]: https://news.ycombinator.com/item?id=42240370

-- 
Xuanwo

https://xuanwo.io/

[DISCUSS] Enforce table properties at catalog level

2024-11-27 Thread Manu Zhang

Hi all,

Currently, we can *enforce default table properties* at catalog level with
configs like
spark.sql.catalog.*catalog-name*.table-override.*propertyKey* [1]. This
prevents users from overriding those properties when creating a table.
However, users can still override them later by altering the table.
This is inconsistent with the Spark doc, which says that a table-override
property can't be overridden by users. Which behavior is expected?

1. 
https://iceberg.apache.org/docs/nightly/spark-configuration/#catalog-configuration
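
The gap described above can be illustrated with a small sketch of property precedence. It assumes the behavior as stated in this message: at create time the catalog overrides are applied last and therefore win, while a later alter applies the user's changes without re-applying them. The class and the `createTable` / `alterTable` helpers are hypothetical stand-ins built on plain maps, not Iceberg APIs; `write.format.default` is used only as an example property key.

```java
import java.util.HashMap;
import java.util.Map;

class TableOverrideSketch {
  // Create: defaults first, then user values, then catalog overrides win.
  static Map<String, String> createTable(
      Map<String, String> defaults, Map<String, String> user, Map<String, String> overrides) {
    Map<String, String> props = new HashMap<>(defaults);
    props.putAll(user);
    props.putAll(overrides);
    return props;
  }

  // Alter: user changes are applied as-is; overrides are not re-applied.
  static Map<String, String> alterTable(Map<String, String> current, Map<String, String> changes) {
    Map<String, String> props = new HashMap<>(current);
    props.putAll(changes);
    return props;
  }

  public static void main(String[] args) {
    Map<String, String> overrides = Map.of("write.format.default", "parquet");

    // The user asks for orc at create time, but the override wins.
    Map<String, String> created =
        createTable(Map.of(), Map.of("write.format.default", "orc"), overrides);
    System.out.println(created.get("write.format.default"));

    // The same request through an alter is not blocked.
    Map<String, String> altered = alterTable(created, Map.of("write.format.default", "orc"));
    System.out.println(altered.get("write.format.default"));
  }
}
```

Enforcing the override on alter as well would amount to re-applying the override map at the end of `alterTable` (or rejecting changes to overridden keys); which of these is intended is the open question here.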



Thanks,
Manu

Re: Storing catalog directly on object store

2024-11-27 Thread Gabor Kaszab

Hi All,

Xuanwo, I recall the reasoning against HadoopCatalog was the other way
around: even though it is safe to use on HDFS, it is unsafe on object
storage. I believe that this gap of functionalities of object stores seems
to go away, so for me HadoopCatalog would even make more sense now than
before. The name might not be straightforward as it's not just for Hadoop.

Regards,
Gabor


On Wed, Nov 27, 2024 at 9:02 AM Xuanwo  wrote:

> Hi
>
> I believe we still need to deprecate HadoopCatalog since the operation is
> still not safe on Hadoop. As raised by Jack Ye before, I suggest we
> consider having a StorageCatalog or ObjectStorageCatalog that can only be
> used with storage services supporting conditional writes. That would be a
> good approach.
>
> On Wed, Nov 27, 2024, at 15:47, Nikhil Benesch wrote:
> > Makes sense! I'd be eager to chat more about this but I'm afraid I won't
> be at
> > re:Invent. Maybe we plan to circle back after re:Invent, once we see
> what AWS
> > announces?
> >
> > On Tue, Nov 26, 2024 at 2:58 PM Jean-Baptiste Onofré 
> wrote:
> >>
> >> Hi Nikhil
> >>
> >> Thanks for your message, very interesting.
> >>
> >> I think it would be great to involve the Polaris project here as well,
> >> as a REST Catalog implementation.
> >> The Polaris community is discussing storage/backend right now, so it
> >> would be the perfect timing to consider leveraging S3 conditional
> >> writes (as a plugin for instance first).
> >>
> >> I would be happy to connect and know more about your perspective about
> that.
> >>
> >> Thanks,
> >> Regards
> >> JB
> >>
> >> PS: I will be at AWS re:Invent next week, so maybe we can connect there.
> >>
> >> On Tue, Nov 26, 2024 at 6:35 PM Nikhil Benesch <
> nikhil.bene...@gmail.com> wrote:
> >> >
> >> > Hi all,
> >> >
> >> > With Amazon S3 announcing support for the If-Match header yesterday
> [0], all the
> >> > major object store implementations now support a compare-and-swap
> operation.
> >> >
> >> > As far as I can tell, this opens up the possibility of storing Iceberg
> >> > catalogs directly on object storage, without the need for a separate
> metastore,
> >> > and without violating any of Iceberg's ACID guarantees.
> >> >
> >> > It seems the immediate next step is to build an independent Java or
> REST catalog
> >> > backend to prove this concept out. Long term, though, the ideal would
> be to
> >> > have such a catalog backend be a first class citizen in the Iceberg
> project.
> >> >
> >> > Is anyone else in the Iceberg community barking up this tree? I'm a
> long term
> >> > Iceberg enthusiast, but new to the community. I'd very much
> appreciate any
> >> > pointers to current or past discussions on the topic. So far all I've
> been
> >> > able to turn up is some light chatter from myself and others on
> Bluesky and
> >> > Hacker News ([1][2][3]).
> >> >
> >> > Cheers,
> >> > Nikhil
> >> >
> >> > [0]:
> https://aws.amazon.com/about-aws/whats-new/2024/11/amazon-s3-functionality-conditional-writes/
> >> > [1]: https://bsky.app/profile/benesch.bsky.social/post/3lauesxg3ic2c
> >> > [2]:
> https://bsky.app/profile/eatonphil.bsky.social/post/3lbskq3jwk22e
> >> > [3]: https://news.ycombinator.com/item?id=42240370
>
> --
> Xuanwo
>
> https://xuanwo.io/
>

Re: [DISCUSS] Hive Support

2024-11-27 Thread Gabor Kaszab

Hi All,

As I see there is a general opinion on not keeping the Hive code in the
Iceberg repo, but maintaining a set of tests that verifies the actual
Iceberg code against the latest Hive release. For me it would seem a bit
odd to maintain a test suite for verifying some code that is not maintained
within this repo. In a similar fashion we could maintain a test suite for
any other query engine, not just for Hive. I think it's either code+tests
or none.

I'd rather challenge the general consensus of Option 1 (remove Hive code
from Iceberg) and I'd like to understand the motivation why this
whole replication of code happened between Iceberg and Hive. If I read well
between the lines Hive developers found it difficult to get their PRs
merged or even reviewed by Iceberg committers and hence decided to
replicate the code and do their own implementation. However, I haven't seen
any communication about raising awareness of these difficulties.
I think that people have put some serious efforts into the Hive code within
the Iceberg repo and before dropping it we should take a step back and see
if the current situation can be fixed somehow:
 - Can we raise awareness that the code reviewing bandwidth of the Hive PRs
is not sufficient? (If this was indeed the motivation of the Hive devs)
 - Would it be feasible to collect a list of PRs that would be required to
push the missing Hive related code into Iceberg?
 - Would it be possible to get some commitment from the people reviewing
Iceberg-Hive code that they will try to find some time taking a look at
these?

Let me know if the above doesn't make any sense, though!
Regards,
Gabor

On Tue, Nov 26, 2024 at 9:31 PM Simhadri G  wrote:

>
>
> Hi Everyone,
>
> Thank you, Peter, for the discussion!
>
> I’m also leaning toward option one. However, given that Apache Iceberg is
> designed to be engine-agnostic, I believe we should continue maintaining a
> Hive Iceberg runtime test suite with the latest version of Hive in the
> Iceberg repository. This will help identify any changes that could break
> Hive compatibility early on.
>
> So I agree with Ayush, Denys and Butao on option 1. I think options 2 and 3
> would be difficult, as they would require a significant amount of time and
> effort from the community to maintain.
>
>
> Thanks,
> Simhadri G
>
>
>
> On Tue, Nov 26, 2024, 7:50 AM Butao Zhang  wrote:
>
>> Hi folks,
>>
>> Firstly, thanks Peter for bringing this up! I also think option 1 is the
>> more reasonable solution right now, as we have developed lots of advanced
>> Iceberg features in the Hive repo, such as MoR, CoW, and compaction, and
>> these features are coupled with the Hive core code base. The Hive
>> runtime/connector in the Iceberg repo cannot easily support these advanced
>> features. So in the long term, dropping the Hive runtime from the Iceberg
>> repo and maintaining it in the Hive repo is more sensible.
>>
>> BTW, I have done some work on upgrading Iceberg in the Hive repo, like
>> HIVE-28495. We often backport Hive-Iceberg related commits from the
>> Iceberg repo to the Hive repo. What I noticed is that the iceberg-catalog
>> in the Hive repo (equivalent to hive-metastore in the Iceberg repo) rarely
>> changes. As Denys said above, *we could potentially drop it from Hive repo
>> and maybe rename to `hive-catalog` in iceberg.* I think it makes more
>> sense to keep the Hive catalog in one place. But I am not sure whether the
>> Hive catalog will become coupled with Hive core code when developing
>> upcoming advanced features. If it does, it's better for it to stay in the
>> Hive repo. Folks who know more about catalogs can give more context.
>>
>> About the Hive (Hive 4) test integration in the Iceberg repo: in general, I
>> think we can keep some basic Hive 4 tests in the Iceberg repo, as this not
>> only makes Iceberg core more stable, but also ensures that Hive 4's Iceberg
>> runtime is not broken at any point. I have seen that the Trino repo does
>> some Spark integration testing (
>> https://github.com/trinodb/trino/blob/master/testing/trino-product-tests/src/main/java/io/trino/tests/product/iceberg/TestIcebergSparkCompatibility.java)
>> . Maybe we can consider this approach.
>>
>>
>>
>> Thanks,
>> Butao Zhang
>> On 11/26/2024 05:50, Wing Yew Poon wrote:
>> Subject: Re: [DISCUSS] Hive Support
>> For the Hive runtime, would it be feasible for the Hive community to
>> contribute a suite of tests to the Iceberg repo that can be run with
>> dependencies from the latest Hive release (Hive 4.x), and then update them
>> from time to time as appropriate? The purpose of this suite would be to
>> test integration of Iceberg core with the Hive runtime. Perhaps the
>> existing tests in the mr and hive3 modules could be a starting point, or
>> you might decide on different tests altogether.
>> The development of the Hive runtime would then continue as now in the
>> Hive repo, but you gain better assurance of compatibility with ongoing

Re: Storing catalog directly on object store

2024-11-27 Thread Manu Zhang

I think one major issue with current HadoopCatalog is that there's no way
to manage tables by name. If adding one metadata layer on top of it, we
need to handle more consistency challenges.

Manu

On Wed, Nov 27, 2024 at 8:03 PM Gabor Kaszab  wrote:

> Hi All,
>
> Xuanwo, I recall the reasoning against HadoopCatalog was the other way
> around: even though it is safe to use on HDFS, it is unsafe on object
> storage. I believe that this gap of functionalities of object stores seems
> to go away, so for me HadoopCatalog would even make more sense now than
> before. The name might not be straightforward as it's not just for Hadoop.
>
> Regards,
> Gabor
>
>
> On Wed, Nov 27, 2024 at 9:02 AM Xuanwo  wrote:
>
>> Hi
>>
>> I believe we still need to deprecate HadoopCatalog since the operation is
>> still not safe on Hadoop. As raised by Jack Ye before, I suggest we
>> consider having a StorageCatalog or ObjectStorageCatalog that can only be
>> used with storage services supporting conditional writes. That would be a
>> good approach.
>>
>> On Wed, Nov 27, 2024, at 15:47, Nikhil Benesch wrote:
>> > Makes sense! I'd be eager to chat more about this but I'm afraid I
>> won't be at
>> > re:Invent. Maybe we plan to circle back after re:Invent, once we see
>> what AWS
>> > announces?
>> >
>> > On Tue, Nov 26, 2024 at 2:58 PM Jean-Baptiste Onofré 
>> wrote:
>> >>
>> >> Hi Nikhil
>> >>
>> >> Thanks for your message, very interesting.
>> >>
>> >> I think it would be great to involve the Polaris project here as well,
>> >> as a REST Catalog implementation.
>> >> The Polaris community is discussing storage/backend right now, so it
>> >> would be the perfect timing to consider leveraging S3 conditional
>> >> writes (as a plugin for instance first).
>> >>
>> >> I would be happy to connect and know more about your perspective about
>> that.
>> >>
>> >> Thanks,
>> >> Regards
>> >> JB
>> >>
>> >> PS: I will be at AWS re:Invent next week, so maybe we can connect
>> there.
>> >>
>> >> On Tue, Nov 26, 2024 at 6:35 PM Nikhil Benesch <
>> nikhil.bene...@gmail.com> wrote:
>> >> >
>> >> > Hi all,
>> >> >
>> >> > With Amazon S3 announcing support for the If-Match header yesterday
>> [0], all the
>> >> > major object store implementations now support a compare-and-swap
>> operation.
>> >> >
>> >> > As far as I can tell, this opens up the possibility of storing
>> Iceberg
>> >> > catalogs directly on object storage, without the need for a separate
>> metastore,
>> >> > and without violating any of Iceberg's ACID guarantees.
>> >> >
>> >> > It seems the immediate next step is to build an independent Java or
>> REST catalog
>> >> > backend to prove this concept out. Long term, though, the ideal
>> would be to
>> >> > have such a catalog backend be a first class citizen in the Iceberg
>> project.
>> >> >
>> >> > Is anyone else in the Iceberg community barking up this tree? I'm a
>> long term
>> >> > Iceberg enthusiast, but new to the community. I'd very much
>> appreciate any
>> >> > pointers to current or past discussions on the topic. So far all
>> I've been
>> >> > able to turn up is some light chatter from myself and others on
>> Bluesky and
>> >> > Hacker News ([1][2][3]).
>> >> >
>> >> > Cheers,
>> >> > Nikhil
>> >> >
>> >> > [0]:
>> https://aws.amazon.com/about-aws/whats-new/2024/11/amazon-s3-functionality-conditional-writes/
>> >> > [1]: https://bsky.app/profile/benesch.bsky.social/post/3lauesxg3ic2c
>> >> > [2]:
>> https://bsky.app/profile/eatonphil.bsky.social/post/3lbskq3jwk22e
>> >> > [3]: https://news.ycombinator.com/item?id=42240370
>>
>> --
>> Xuanwo
>>
>> https://xuanwo.io/
>>
>
