I think there is also a point we were discussing but never closed regarding AppendDeleteFiles: whether it should be supported. The recent development in Kafka, and vendor products like Upsolver Zero-ETL <https://www.upsolver.com/blog/upsolver-announces-zero-etl-and-lakehouse-optimization-for-apache-iceberg>, seem to suggest that there is demand for simply appending/streaming deletes to a table. So I think it would be ideal to support both data and delete files if we create a new append API.
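To make the suggestion concrete, here is a minimal sketch of what a combined request body could look like for the append endpoint proposed in the quoted message below. The endpoint path, "accept-delay-ms", and "data-files" come from that proposal; the "delete-files" field, the helper names, and the placeholder file entries are assumptions for illustration only and are not part of any agreed spec.

```python
import json


def append_path(prefix, namespace, table):
    """Path of the proposed append endpoint, as written in the quoted proposal."""
    return f"/v1/{prefix}/namespaces/{namespace}/tables/{table}/append"


def build_append_request(data_files, delete_files=None, accept_delay_ms=300000):
    """Build a request body for the proposed asynchronous append.

    "delete-files" is a hypothetical field: it sketches the idea of
    appending delete files alongside data files and is NOT in the proposal.
    """
    body = {
        "accept-delay-ms": accept_delay_ms,  # acceptable delay for processing
        "data-files": list(data_files),
    }
    if delete_files:
        body["delete-files"] = list(delete_files)  # assumption, see docstring
    return body


def status_id(location):
    """Extract the trailing {id} from the status location in the 202 response."""
    return location.rstrip("/").rsplit("/", 1)[-1]


if __name__ == "__main__":
    # Placeholder entries; real DataFile/DeleteFile objects carry more fields.
    request = build_append_request(
        data_files=[{"file-path": "s3://bucket/data/a.parquet"}],
        delete_files=[{"file-path": "s3://bucket/data/a-deletes.parquet"}],
    )
    print("POST", append_path("prod", "db", "events"))
    print(json.dumps(request, indent=2))
```

Keeping deletes in the same body as data files would let one request describe an update (delete plus append) as a single unit, but whether that is preferable to a separate endpoint is exactly the open question here.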
For RemoveDataFiles and RemoveDeleteFiles, does that mean we need another tables/{table}/remove endpoint? Also, how would such endpoints work with multi-table transactions? Let's think through those points.

-Jack

On Tue, Feb 20, 2024 at 3:13 PM Drew <img...@gmail.com> wrote:

> Hi everyone,
>
> We are discussing the REST spec changes to add support for DataFiles
> and DeleteFiles for both the append and scan planning APIs (PR:
> https://github.com/apache/iceberg/pull/9717). One thing that came up for
> appends was that this logic shouldn’t be in the table update API;
> instead, it should have a dedicated endpoint. This would be beneficial for a
> few use cases, such as asynchronous appends and batch commit support.
>
> I’d like to start a discussion on introducing this new
> endpoint and its functionality to support the ongoing fine-grained metadata
> commit efforts. From the discussion in the ContentFile spec change PR, the
> proposed endpoint was envisioned as an append update that handles update
> requests asynchronously. The link to that discussion can be found here:
> https://github.com/apache/iceberg/pull/9717#discussion_r1495005890. The
> proposed changes include:
>
> *Endpoint*:
> POST /v1/{prefix}/namespaces/{namespace}/tables/{table}/append
>
> *Request*:
> {
>   "accept-delay-ms": 300000, // acceptable delay for processing
>   "data-files": [...]
> }
>
> *Response*:
> 202 accepted
> {
>   "location": "/v1/{prefix}/namespaces/{namespace}/tables/{table}/status/{id}" // used to track status
> }
>
> I'm interested in gathering your thoughts on the asynchronous operation
> model and the suggested endpoint structure.
>
> Building on this, we previously discussed having these update options:
> RemoveDataFiles and RemoveDeleteFiles. Given this new endpoint structure,
> we should consider whether we should have unified or separate
> endpoints for these operations.
> For instance, should we organize these
> under a shared endpoint and specify operationType, or do we establish distinct
> endpoints for these operations? Given that appends can support batch
> processing, we can accommodate this in the request model.
>
> Thank you,
> Drew
>
> On Fri, Jan 26, 2024 at 5:06 PM Drew <img...@gmail.com> wrote:
>
>> Hey everyone,
>>
>> I wanted to provide a quick update on the progress of the commit API
>> proposal. Based on the feedback in the design doc and the Slack
>> conversation with Dan and Jack, we've reached an agreement that this is
>> more of a fine-grained metadata commit, rather than a data operation or
>> commit. For the next steps, I'll be focusing on validating the requirements
>> for the update requests. Additionally, I'll be working on adding the
>> necessary tests to ensure its end-to-end functionality.
>>
>> Thanks for all the feedback. I still have an open PR for appendFiles.
>> If you have a chance to review, I would appreciate any additional feedback
>> you may have.
>>
>> https://github.com/apache/iceberg/pull/9292
>>
>> Best,
>>
>> Drew
>>
>> On Fri, Jan 12, 2024 at 3:40 PM Drew <img...@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> I hope you all had great holidays! I wanted to resurface this proposal
>>> for RESTful Data operations.
>>>
>>> Currently, I have an open PR here:
>>> https://github.com/apache/iceberg/pull/9292
>>>
>>> Thanks,
>>> Drew
>>>
>>> On Wed, Dec 13, 2023 at 3:04 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>>> Thanks Drew for the quick turnaround, I will take a deeper look into
>>>> the PR.
>>>>
>>>> I think if we all agree that it is beneficial to have the
>>>> AppendFiles(DataFile[]) API (maybe we should call it AppendRows instead), I
>>>> would like to know if it also makes sense to have:
>>>> 1. DeleteRows(DeleteFile[]), which can allow users to describe the
>>>> deletion of rows easily through the equality delete spec
>>>> 2.
>>>> combine the 2 APIs of AppendRows and DeleteRows into one single type
>>>> of action
>>>>
>>>> I find it pretty intuitive from a user perspective to express deletion
>>>> of rows and commit them through equality deletes, and it would allow
>>>> performing updates through simple applications.
>>>>
>>>> -Jack
>>>>
>>>> On Wed, Dec 13, 2023 at 2:22 PM Drew <img...@gmail.com> wrote:
>>>>
>>>>> Hi Ryan,
>>>>>
>>>>> Thanks for the feedback, I'll start going through the comments left in
>>>>> the doc! You're right in pointing out that the logic here can be
>>>>> simplified to roll back a commit. For now I introduced a smaller PR that
>>>>> focuses on the append files operation.
>>>>>
>>>>> Github PR: https://github.com/apache/iceberg/pull/9292
>>>>> Drew
>>>>>
>>>>> On Mon, Dec 11, 2023 at 11:33 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>
>>>>>> > Based on my understanding of the proposal, I think it's more about
>>>>>> > the possibility of enabling other ways that do not require a full
>>>>>> > rollback. It's just that currently we implemented it as a rollback to
>>>>>> > prove the feasibility.
>>>>>>
>>>>>> My main question is this: what can be done besides rolling back a
>>>>>> commit? And why does that require 5 extra routes and metadata writes from
>>>>>> the REST service?
>>>>>>
>>>>>> On Mon, Dec 11, 2023 at 11:27 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>
>>>>>>> > The proposal is to roll back rewrite commits, but that's already
>>>>>>> > possible with the much simpler API that exists today.
>>>>>>>
>>>>>>> Based on my understanding of the proposal, I think it's more about
>>>>>>> the possibility of enabling other ways that do not require a full
>>>>>>> rollback. It's just that currently we implemented it as a rollback to
>>>>>>> prove the feasibility.
>>>>>>> But given that now we have full access to the changes of each
>>>>>>> data commit (compared to only the post-change snapshot), we could
>>>>>>> potentially reuse some files that have been rewritten.
>>>>>>>
>>>>>>> > I'm skeptical that there is a benefit to implementing the set of
>>>>>>> > data operations from the Java API
>>>>>>>
>>>>>>> +1, the current Java API might be a bit redundant, some APIs serve
>>>>>>> very similar purposes. I feel the important data actions to have from
>>>>>>> the end user's perspective are basically the ability to (1) AddRows, (2)
>>>>>>> DeleteRows?
>>>>>>>
>>>>>>> -Jack
>>>>>>>
>>>>>>> On Fri, Dec 8, 2023 at 5:01 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>
>>>>>>>> Thanks, Drew.
>>>>>>>>
>>>>>>>> I think it's a good idea in general to be able to perform commits
>>>>>>>> on the server-side, but I would much rather break this down into
>>>>>>>> smaller parts. I would definitely want to start with just file append
>>>>>>>> use cases, since I think that is the biggest win. It can reduce retries
>>>>>>>> and is an easy way to write from non-JVM languages or just simpler
>>>>>>>> applications.
>>>>>>>>
>>>>>>>> I'm skeptical that there is a benefit to implementing the set of
>>>>>>>> data operations from the Java API. That's primarily because I don't
>>>>>>>> think that use case 1 (better conflict resolution) is actually achieved.
>>>>>>>> You can avoid retries on the client, but the retries must happen
>>>>>>>> _somewhere_. The proposal is to roll back rewrite commits, but that's
>>>>>>>> already possible with the much simpler API that exists today. Maybe I'm
>>>>>>>> missing something?
>>>>>>>>
>>>>>>>> Even if I'm mistaken about being able to improve conflict
>>>>>>>> resolution, I think that there is quite a bit of work here and I'd
>>>>>>>> break this down either way.
>>>>>>>> Starting with append use cases makes a lot of sense
>>>>>>>> to me, but I'm interested to hear what others think as well.
>>>>>>>>
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>> On Fri, Dec 8, 2023 at 4:34 PM Gallardo, Drew
>>>>>>>> <d...@amazon.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> In regards to the multiple emails sent earlier, please use this
>>>>>>>>> one for discussions.
>>>>>>>>>
>>>>>>>>> Thank you!
>>>>>>>>>
>>>>>>>>> On 2023/12/07 00:47:42 Drew wrote:
>>>>>>>>> > Hi everyone,
>>>>>>>>> >
>>>>>>>>> > My name is Drew Gallardo, and I’m a part of the Iceberg team at Amazon EMR
>>>>>>>>> > and Athena. I’m reaching out to share a proposal that introduces data
>>>>>>>>> > commits as a part of the RESTCatalog. The current process for data commits
>>>>>>>>> > lives on the client side, and by shifting this logic into the REST catalog,
>>>>>>>>> > we can empower the catalog service with more control of this process.
>>>>>>>>> >
>>>>>>>>> > This proposal addresses specific use cases that showcase the benefits of
>>>>>>>>> > moving the commit logic to the service side. For instance, this shift
>>>>>>>>> > allows the user to refine conflict resolution mechanisms, giving precedence
>>>>>>>>> > to operations that modify the table state to ensure their completion
>>>>>>>>> > without conflict. Furthermore, our POC demonstrated an improvement in the
>>>>>>>>> > success rate of concurrent write operations against the GlueCatalog. This
>>>>>>>>> > all can be found in the detailed proposal below. Feel free to comment, and
>>>>>>>>> > add your suggestions!
>>>>>>>>> > >>>>>>>>> > Detailed proposal: >>>>>>>>> > >>>>>>>>> https://docs.google.com/document/d/1OG68EtPxLWvNBJACQwcMrRYuGJCnQas8_LSruTRcHG8/edit?usp=sharing >>>>>>>>> > Github POC: https://github.com/apache/iceberg/pull/9237 >>>>>>>>> > >>>>>>>>> > Looking forward to hearing back >>>>>>>>> > >>>>>>>>> > Thanks, >>>>>>>>> > >>>>>>>>> > Drew Gallardo >>>>>>>>> > Amazon EMR & Athena >>>>>>>>> > d...@amazon.com >>>>>>>>> > >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> Ryan Blue >>>>>>>> Tabular >>>>>>>> >>>>>>> >>>>>> >>>>>> -- >>>>>> Ryan Blue >>>>>> Tabular >>>>>> >>>>>