Thanks Drew for the quick turnaround, I will take a deeper look into the PR.

I think if we all agree that it is beneficial to have the AppendFiles(DataFile[]) API (maybe we should call it AppendRows instead), I would like to know if it also makes sense to:

1. have DeleteRows(DeleteFile[]), which would let users describe the deletion of rows easily through the equality delete spec
2. combine the two APIs, AppendRows and DeleteRows, into one single type of action

I find it pretty intuitive from a user perspective to express deletion of rows and commit them through equality deletes, and it would allow performing updates through simple applications.

-Jack

On Wed, Dec 13, 2023 at 2:22 PM Drew <img...@gmail.com> wrote:

> Hi Ryan,
>
> Thanks for the feedback, I'll start going through the comments left in the
> doc! You're right in pointing out that the logic here can be simplified to
> roll back a commit. For now I introduced a smaller PR, that focuses on the
> append files operation.
>
> Github PR: https://github.com/apache/iceberg/pull/9292
> Drew
>
>
> On Mon, Dec 11, 2023 at 11:33 AM Ryan Blue <b...@tabular.io> wrote:
>
>> > Based on my understanding of the proposal, I think it's more about the
>> > possibility of enabling other ways that do not require a full rollback.
>> > it's just currently we implemented it as a rollback to prove the
>> > feasibility.
>>
>> My main question is this: what can be done besides rolling back a commit?
>> And why does that require 5 extra routes and metadata writes from the REST
>> service?
>>
>> On Mon, Dec 11, 2023 at 11:27 AM Jack Ye <yezhao...@gmail.com> wrote:
>>
>>> > The proposal is to roll back rewrite commits, but that's already
>>> > possible with the much simpler API that exists today.
>>>
>>> Based on my understanding of the proposal, I think it's more about the
>>> possibility of enabling other ways that do not require a full rollback.
>>> it's just currently we implemented it as a rollback to prove the
>>> feasibility.
>>> But given that now we have full access to the changes of each
>>> data commit (compared to only the post-change snapshot), we could
>>> potentially reuse some files that have been rewritten.
>>>
>>> > I'm skeptical that there is a benefit to implementing the set of data
>>> > operations from the Java API
>>>
>>> +1, the current Java API might be a bit redundant, some APIs serve very
>>> similar purposes. I feel the important data actions to have from the end
>>> user's perspective are basically the ability to (1) AddRows, (2)
>>> DeleteRows?
>>>
>>> -Jack
>>>
>>> On Fri, Dec 8, 2023 at 5:01 PM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> Thanks, Drew.
>>>>
>>>> I think it's a good idea in general to be able to perform commits on
>>>> the server-side, but I would much rather break this down into smaller
>>>> parts. I would definitely want to start with just file append use cases,
>>>> since I think that is the biggest win. It can reduce retries and is an easy
>>>> way to write from non-JVM languages or just simpler applications.
>>>>
>>>> I'm skeptical that there is a benefit to implementing the set of data
>>>> operations from the Java API. That's primarily because I don't think that
>>>> use case 1 (better conflict resolution) is actually achieved. You can avoid
>>>> retries on the client, but the retries must happen _somewhere_. The
>>>> proposal is to roll back rewrite commits, but that's already possible with
>>>> the much simpler API that exists today. Maybe I'm missing something?
>>>>
>>>> Even if I'm mistaken about being able to improve conflict resolution, I
>>>> think that there is quite a bit of work here and I'd break this down either
>>>> way. Starting with append use cases makes a lot of sense to me, but I'm
>>>> interested to hear what others think as well.
>>>>
>>>> Ryan
>>>>
>>>> On Fri, Dec 8, 2023 at 4:34 PM Gallardo, Drew <d...@amazon.com.invalid>
>>>> wrote:
>>>>
>>>>> In regards to the multiple emails sent earlier, please use this one
>>>>> for discussions.
>>>>>
>>>>> Thank you!
>>>>>
>>>>>
>>>>> On 2023/12/07 00:47:42 Drew wrote:
>>>>> > Hi everyone,
>>>>> >
>>>>> > My name is Drew Gallardo, and I’m a part of the Iceberg team at Amazon EMR
>>>>> > and Athena. I’m reaching out to share a proposal that introduces data
>>>>> > commits as a part of the RESTCatalog. The current process for data commits
>>>>> > lives on the client side, and by shifting this logic into the REST catalog,
>>>>> > we can empower the catalog service with more control of this process.
>>>>> >
>>>>> > This proposal addresses specific use cases that showcase the benefits of
>>>>> > moving the commit logic to the service side. For instance, this shift
>>>>> > allows the user to refine conflict resolution mechanisms, giving precedence
>>>>> > to operations that modify the table state to ensure their completion
>>>>> > without conflict. Furthermore, our POC demonstrated an improvement in the
>>>>> > success rate of concurrent write operations against the GlueCatalog. This
>>>>> > all can be found in the detailed proposal below. Feel free to comment, and
>>>>> > add your suggestions!
>>>>> >
>>>>> > Detailed proposal:
>>>>> > https://docs.google.com/document/d/1OG68EtPxLWvNBJACQwcMrRYuGJCnQas8_LSruTRcHG8/edit?usp=sharing
>>>>> > Github POC: https://github.com/apache/iceberg/pull/9237
>>>>> >
>>>>> > Looking forward to hearing back
>>>>> >
>>>>> > Thanks,
>>>>> >
>>>>> > Drew Gallardo
>>>>> > Amazon EMR & Athena
>>>>> > d...@amazon.com
>>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>