Re: [DISCUSS] Proposal to buffer manifest files before updating manifest-list

2024-11-22 Thread Péter Váry
Currently we have a 'static' 2 level manifest structure. If we introduce the 'everything is a manifest' concept then we will remove the limit on the levels. This would prevent concurrent reading of the embedded manifests (if the table has 5 levels of embedded manifests the reader needs to read thos

Re: [VOTE] Add Variant type to Iceberg Spec

2024-11-22 Thread Micah Kornfield
My (non-binding) vote is -1 until the variant spec is formally adopted in Parquet. On Fri, Nov 22, 2024 at 2:51 PM Aihua Xu wrote: > Hi everyone, > > I've updated the Iceberg spec to include the new Variant type as part of > #10831 . The changes are

[VOTE] Add Variant type to Iceberg Spec

2024-11-22 Thread Aihua Xu
Hi everyone, I've updated the Iceberg spec to include the new Variant type as part of #10831 . The changes are basically complete. This is a heads-up about the upcoming change. Please review and +1 to acknowledge, so we will merge. Thanks, Aihua

Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-22 Thread Kevin Liu
> Should add, my personal preference is probably not to change the existing behavior for this part +1. I realized that this is not a new behavior. The `loadTable` implementation has this problem too. It would be good to have a test case specifically for this edge case and maybe call this out in th

Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-22 Thread Szehon Ho
Should add, my personal preference is probably not to change the existing behavior for this part (false, if a Hive table with the same name exists) at the moment, just adding another possibility for consideration. Thanks Szehon On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho wrote: > Thanks Kevin and Gab

Re: [PROPOSAL] Create Iceberg DockerHub repository

2024-11-22 Thread Jean-Baptiste Onofré
Hi That's correct: in Sung's PR, I can see the secret.DOCKERHUB_USER and secret.DOCKERHUB_TOKEN. So, we should be able to publish docker images via this GitHub action ;) Regards JB On Fri, Nov 22, 2024 at 6:16 PM Fokko Driesprong wrote: > > I think Sung beat you to it: https://github.com/apache

Re: [DISCUSS] Proposal to buffer manifest files before updating manifest-list

2024-11-22 Thread Micah Kornfield
Would adding the ability to have a list of manifest lists solve this problem? This might be an incremental step to getting to "everything" is a manifest? For now I wanted to reuse the existing manifest-list and manifests fields. Regardless of the outcome, please let's not re-use a field in a w

Re: [DISCUSS] Proposal to buffer manifest files before updating manifest-list

2024-11-22 Thread Jan Kaul
Thanks for your feedback. About your concerns, Fokko: 1. Generally the number of manifest files in the manifests field shouldn't get too large. But I think you can already improve write amplification and conflict resolution by using up to 10 manifest files. The fact that the manifests fi

Re: [PROPOSAL] Create Iceberg DockerHub repository

2024-11-22 Thread Kevin Liu
Awesome! Thanks, Sung! :) On Fri, Nov 22, 2024 at 9:16 AM Fokko Driesprong wrote: > I think Sung beat you to it: https://github.com/apache/iceberg/pull/11632 > > As mentioned earlier it would be awesome if we could have a nightly build > so we can test all the different languages against the nig

Re: [PROPOSAL] Create Iceberg DockerHub repository

2024-11-22 Thread Fokko Driesprong
I think Sung beat you to it: https://github.com/apache/iceberg/pull/11632 As mentioned earlier it would be awesome if we could have a nightly build so we can test all the different languages against the nightly. In this case, when there are changes or new features, we can test/implement them right

Re: [PROPOSAL] Create Iceberg DockerHub repository

2024-11-22 Thread Kevin Liu
Thanks for setting this up, JB! It looks like PR #11283 is close to being merged. What is the deployment strategy for the Docker image? Ideally, this process could be fully automated using GitHub and GitHub Actions. I’d love to hear everyone’s though

Re: [VOTE] Release Apache Iceberg 1.7.1 RC1

2024-11-22 Thread Kevin Liu
Thanks for adding that PR to the patch release! This should only affect the release script, and only this once :). I see that the documentation site has already been updated. https://iceberg.apache.org/how-to-release/#setup Best, Kevin Liu On Fri, Nov 22, 2024 at 6:36 AM Bryan Keller wrote: > A

Re: [DISCUSS] Additional language implementations for Iceberg Puffin reader/writer

2024-11-22 Thread Zoltán Borók-Nagy
Awesome! In Impala we created our own implementations so far, but it will be nice to join forces and have a common library. Looking forward to the Slack channel. Cheers, Zoltan On Fri, Nov 22, 2024 at 5:01 PM Gang Wu wrote: > > I have created an issue [1] to collect initial ideas for the i

Re: [DISCUSS] Proposal to buffer manifest files before updating manifest-list

2024-11-22 Thread Russell Spitzer
I would much rather we switch to the "everything is a manifest" approach. Instead of manifest lists we only ever have manifests. A manifest can then link to data files or additional manifests. In the case of streaming, you then only ever have to read and write a single manifest. If we couple this wit
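A minimal sketch of the "everything is a manifest" idea described in the message above, written only to make the shape of the structure concrete. The class and function names (`Manifest`, `DataFile`, `collect_data_files`, `depth`) and the file paths are hypothetical and are not taken from the Iceberg spec or from any Iceberg library. A manifest entry is either a data file or another manifest, so nesting depth is not fixed.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class DataFile:
    # Hypothetical stand-in for a data file entry.
    path: str


@dataclass
class Manifest:
    # Hypothetical manifest: entries are data files or nested manifests.
    path: str
    entries: List[Union["Manifest", DataFile]] = field(default_factory=list)


def collect_data_files(manifest: Manifest) -> List[str]:
    """Walk the tree depth-first and return data file paths in entry order."""
    paths: List[str] = []
    for entry in manifest.entries:
        if isinstance(entry, Manifest):
            paths.extend(collect_data_files(entry))
        else:
            paths.append(entry.path)
    return paths


def depth(manifest: Manifest) -> int:
    """Number of manifest levels that must be read one after another."""
    child_depths = [depth(e) for e in manifest.entries if isinstance(e, Manifest)]
    return 1 + max(child_depths, default=0)


# Illustrative table: a root manifest linking to an older manifest and to a
# small manifest that a streaming writer would rewrite on each commit.
older = Manifest("m-old.avro", [DataFile("d1.parquet"), DataFile("d2.parquet")])
streaming = Manifest("m-stream.avro", [DataFile("d3.parquet")])
root = Manifest("root.avro", [older, streaming])

files = collect_data_files(root)
levels = depth(root)
```

In this toy model `depth` also illustrates the reader-side concern raised elsewhere in the thread: each extra level of nesting is one more manifest that has to be read before its children are known.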

Re: [DISCUSS] Additional language implementations for Iceberg Puffin reader/writer

2024-11-22 Thread Gang Wu
I have created an issue [1] to collect initial ideas for the iceberg-cpp project. Any feedback is appreciated. [1] https://github.com/apache/iceberg-cpp/issues/2 Best, Gang On Fri, Nov 22, 2024 at 10:11 PM Matt Topol wrote: > I will also help out with the iceberg-cpp effort, please include me

Re: [DISCUSS] REST: Way to query if metadata pointer is the latest

2024-11-22 Thread Taeyun Kim
Hi, Since ETags are opaque values to the client, attributing any semantic meaning to them in the interaction between the client and server would, in my opinion, constitute a misuse/abuse of the HTTP specification. On the other hand, the server can generate the ETag value as any string, as long
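To illustrate the point about opacity, here is a rough sketch of an ETag-based conditional request with the HTTP transport stripped away. The function names (`make_etag`, `handle_load_table`), the response shape, and the choice of hashing the metadata location are all assumptions made for this example; nothing here is taken from the Iceberg REST specification. The client only stores the ETag and echoes it back, and the server alone decides how the value is derived.

```python
import hashlib
from typing import Dict, Optional, Tuple


def make_etag(metadata_location: str) -> str:
    # Server-side detail: any stable string works. Hashing the metadata
    # location is just one hypothetical choice; clients must not rely on it.
    digest = hashlib.sha256(metadata_location.encode("utf-8")).hexdigest()
    return '"' + digest[:16] + '"'


def handle_load_table(
    current_metadata_location: str, if_none_match: Optional[str] = None
) -> Tuple[int, Dict[str, str], Optional[Dict[str, str]]]:
    """Toy handler returning (status, headers, body).

    Simplification: a real If-None-Match header may carry a list of
    entity tags or "*"; this sketch compares a single value.
    """
    etag = make_etag(current_metadata_location)
    if if_none_match == etag:
        # Client copy is current: 304 Not Modified, no body.
        return 304, {"ETag": etag}, None
    return 200, {"ETag": etag}, {"metadata-location": current_metadata_location}


# Client side: keep the ETag as an opaque token and send it back unchanged.
status1, headers1, body1 = handle_load_table("s3://bucket/tbl/metadata/v1.json")
cached_etag = headers1["ETag"]

status2, _, body2 = handle_load_table("s3://bucket/tbl/metadata/v1.json", cached_etag)
status3, _, body3 = handle_load_table("s3://bucket/tbl/metadata/v2.json", cached_etag)
```

The client never parses `cached_etag`; it only learns "unchanged" (304) or "changed" (200 with a new body), which keeps the exchange within ordinary HTTP conditional-request semantics.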

Re: [DISCUSS] Deprecate embedded manifests

2024-11-22 Thread Fokko Driesprong
Hey Jan, Thanks for the heads-up, that's an interesting proposal. I've shared my thoughts on the thread itself. Keep in mind that this would be a spec-change as well, as it now explicitly states that they can be populated both

Re: [DISCUSS] Proposal to buffer manifest files before updating manifest-list

2024-11-22 Thread Fokko Driesprong
Hi Jan, Thanks for sending out this proposal. While reading through it, two questions pop up: - You mentioned repurposing the manifests field. Currently, this field contains a list of paths that point to the manifest data. Would this also be your suggestion? This way, when committing the

Re: [PR] Add ASF yaml [iceberg-cpp]

2024-11-22 Thread via GitHub
Xuanwo merged PR #1: URL: https://github.com/apache/iceberg-cpp/pull/1

[PR] Add ASF yaml [iceberg-cpp]

2024-11-22 Thread via GitHub
Fokko opened a new pull request, #1: URL: https://github.com/apache/iceberg-cpp/pull/1 (no comment)

Re: [VOTE] Release Apache Iceberg 1.7.1 RC1

2024-11-22 Thread Bryan Keller
Apologies! I see Kevin updated this in the release script and docs already, in https://github.com/apache/iceberg/pull/11526. (The email template now points to https://downloads.apache.org/iceberg/KEYS.) Here's a PR to merge this to 1.7.x in case we have another patch release: https://github.com

Re: [DISCUSS] Deprecate embedded manifests

2024-11-22 Thread Jan Kaul
Hi all, I've been thinking about how we could make Iceberg tables more performant for streaming inserts. And I thought about using the manifests field as a buffer for manifest files before they are written to the manifest-list. This reduces the write amplification and simplifies the conflict

[DISCUSS] Proposal to buffer manifest files before updating manifest-list

2024-11-22 Thread Jan Kaul
Hi all, I'd like to propose an optimization for how we track manifest files in Iceberg tables, specifically focusing on reducing write amplification and simplifying conflict resolution during fast-append operations. Background: Replace vs. Change-Based Updates To frame this proposal,

Re: [DISCUSS] Hive Support

2024-11-22 Thread Manu Zhang
Hi Peter and Fokko, What about Cheng Pan's point that there will be duplicated implementations in Hive and Iceberg if we upgrade iceberg-hive3 to iceberg-hive4? On Fri, Nov 22, 2024 at 5:18 PM Fokko Driesprong wrote: > I agree with Péter, that sounds like the right approach to me as well. > > K

Re: [DISCUSS] Additional language implementations for Iceberg Puffin reader/writer

2024-11-22 Thread Matt Topol
I will also help out with the iceberg-cpp effort, please include me on the channel. While my focus will still be on iceberg-go, I'll happily review and contribute to the C++ implementation. I do also plan on eventually implementing puffin in the iceberg-go repo lol --Matt On Fri, Nov 22, 2024 at

Re: [PROPOSAL] Create Iceberg DockerHub repository

2024-11-22 Thread Jean-Baptiste Onofré
Hi folks, I created the iceberg repo on DockerHub (in the Apache org): https://hub.docker.com/r/apache/iceberg I created an "Iceberg team" on DockerHub. I created DOCKERHUB_USER and DOCKERHUB_TOKEN credentials for the Iceberg repo. That will allow us to push directly to the DockerHub repo from GitH

Re: [DISCUSS] Additional language implementations for Iceberg Puffin reader/writer

2024-11-22 Thread Raúl Cumplido
This sounds awesome. I am looking forward to the slack channel being available so I can also help! On Fri, Nov 22, 2024 at 10:03, Gang Wu () wrote: > > Thanks for the support, Fokko and JB! > > Please include me in the cpp slack channel for future cooperation. > > Best, > Gang > > On Fri, Nov

Re: [DISCUSS] REST: Way to query if metadata pointer is the latest

2024-11-22 Thread Zoltán Borók-Nagy
Hi, Separate version information forces the clients to manage a Table -> VersionIdentifier mapping which adds unnecessary complexity and can be error-prone. If the VersionIdentifier is embedded in the Table object then the application logic is much simpler, and the Catalog interface is not only s

Re: [DISCUSS] Deprecate embedded manifests

2024-11-22 Thread Fokko Driesprong
Hey Ryan, The goal of the deprecation is to prevent other implementations from producing it. PyIceberg, for example, does not support this and I think it would be good to avoid having others (Rust, Go, etc.) support this. Regarding the removal, Amogh expressed the same concern on the PR

Re: [DISCUSS] REST: Way to query if metadata pointer is the latest

2024-11-22 Thread Gabor Kaszab
Hi Taeyun, Thanks for the writeup! Let me reflect on some areas: the caller manages the version identifier separately. Since the callers of this interface would be the query engines themselves in most cases, this would mean that Impala, Spark, Hive, Trino, etc. would need to implement their

Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-22 Thread Szehon Ho
Thanks Kevin and Gabor, this is an interesting discussion. I guess a third option, instead of returning true/false in this case, is to change it to throw a NoSuchIcebergTableException if it's a non-Iceberg table, which I think is actually what this PR does? Thanks Szehon On Fri, Nov 22, 2024 at 1

Re: [DISCUSS] Hive Support

2024-11-22 Thread Fokko Driesprong
I agree with Péter, that sounds like the right approach to me as well. Kind regards, Fokko On Fri, Nov 22, 2024 at 07:38, Péter Váry wrote: > I would prefer B, and only revert to A if we find that B becomes too > complicated. > > On Fri, Nov 22, 2024, 04:26 Manu Zhang wrote: > >> Hi Peter, >> >>

Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-22 Thread Gabor Kaszab
Hey, I think what Kevin says makes sense. However, it would then confuse the opposite use case of this function. Let's assume that we change the implementation of tableExists() to not load the table internally: if (tableExists(table_name)) { table = loadTable(table_name); } Here, you find th
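The message above is cut off, so the following is only one possible reading of the concern, sketched with a toy in-memory catalog. `ToyHiveCatalog`, both `table_exists_*` methods, and the exception classes are hypothetical stand-ins written for this example and are not the HiveCatalog API; making the "not an Iceberg table" error a subclass of the generic "no such table" error is likewise an assumption of the sketch.

```python
class NoSuchTableError(Exception):
    """Toy error: no table with that name in the metastore."""


class NoSuchIcebergTableError(NoSuchTableError):
    """Toy error: the name exists but is not an Iceberg table."""


class ToyHiveCatalog:
    def __init__(self, metastore):
        # Maps table name -> table type, e.g. "ICEBERG" or "HIVE".
        self._metastore = dict(metastore)

    def load_table(self, name):
        if name not in self._metastore:
            raise NoSuchTableError(name)
        if self._metastore[name] != "ICEBERG":
            raise NoSuchIcebergTableError(name)
        return {"name": name}

    def table_exists_with_load(self, name):
        # Existence defined as "load_table would succeed".
        try:
            self.load_table(name)
            return True
        except NoSuchTableError:
            return False

    def table_exists_without_load(self, name):
        # Cheaper check that only consults the metastore entry.
        return name in self._metastore


catalog = ToyHiveCatalog({"db.events": "ICEBERG", "db.legacy": "HIVE"})

# The guard pattern from the message, applied to a plain Hive table.
guard_passed = catalog.table_exists_without_load("db.legacy")
try:
    catalog.load_table("db.legacy")
    load_failed = False
except NoSuchIcebergTableError:
    load_failed = True
```

With the metastore-only check, the guard passes for `db.legacy` and the subsequent load still raises, whereas the load-based check returns false for the same name. Which of the two behaviours callers expect is the question the thread is weighing.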

Re: [DISCUSS] Additional language implementations for Iceberg Puffin reader/writer

2024-11-22 Thread Gang Wu
Thanks for the support, Fokko and JB! Please include me in the cpp slack channel for future cooperation. Best, Gang On Fri, Nov 22, 2024 at 4:58 PM Jean-Baptiste Onofré wrote: > Hi Gabor, > > I think it makes sense to create iceberg-cpp resources (repository, > slack channel, ...): this can ga

Re: [DISCUSS] Additional language implementations for Iceberg Puffin reader/writer

2024-11-22 Thread Jean-Baptiste Onofré
Hi Gabor, I think it makes sense to create iceberg-cpp resources (repository, slack channel, ...): this can gather the efforts in the Iceberg lib and Puffin implementation. Fokko can help there, he can ping me if needed (from ASF standpoint) :) Regards JB On Fri, Nov 22, 2024 at 9:25 AM Gabor K

Re: [DISCUSS] Additional language implementations for Iceberg Puffin reader/writer

2024-11-22 Thread Fokko Driesprong
Hey Gabor, Sorry for not replying in this thread earlier. I fully missed this thread since it was right in the middle of my prenatal leave. As already mentioned there is the Iceberg-Rust effort, and the Iceberg-Go effort is progressing nicely (I would expect to have bindings to cpp at some point)

Re: [DISCUSS] Additional language implementations for Iceberg Puffin reader/writer

2024-11-22 Thread Gang Wu
Thanks Gabor for the update! Yes, I think the iceberg-rust and iceberg-go are going on pretty well. I have an interest in working on a brand new iceberg-cpp project. I have done some research and found that this is a long-awaited project but has made no progress so far. I'd like to hear the voice

Re: [DISCUSS] Additional language implementations for Iceberg Puffin reader/writer

2024-11-22 Thread Gabor Kaszab
Hi Iceberg Community, It's been a while since we started this discussion. I'd like to revive the conversation for two reasons: 1) I think I'll have some capacity starting from early next year to take care of the C++ Puffin stuff we already talked about above, and also from the Impala community we co

Re: [VOTE] Release Apache Iceberg 1.7.1 RC1

2024-11-22 Thread Jean-Baptiste Onofré
Hi Yufei, As discussed on the dev mailing list (with Fokko), the KEYS file to use is: https://dist.apache.org/repos/dist/release/iceberg/KEYS Regards JB On Fri, Nov 22, 2024 at 6:36 AM Yufei Gu wrote: > > Hi Bryan, > > This link seems broken, https://dist.apache.org/repos/dist/dev/iceberg/KEYS.