Hi Ryan,

Thanks for the feedback. I'll start going through the comments left in the doc. You're right that the logic here can be simplified to rolling back a commit. For now, I've opened a smaller PR that focuses on the append files operation.

Github PR: https://github.com/apache/iceberg/pull/9292

Drew

On Mon, Dec 11, 2023 at 11:33 AM Ryan Blue <b...@tabular.io> wrote:

> > Based on my understanding of the proposal, I think it's more about the
> > possibility of enabling other ways that do not require a full rollback.
> > it's just currently we implemented it as a rollback to prove the
> > feasibility.
>
> My main question is this: what can be done besides rolling back a commit?
> And why does that require 5 extra routes and metadata writes from the REST
> service?
>
> On Mon, Dec 11, 2023 at 11:27 AM Jack Ye <yezhao...@gmail.com> wrote:
>
>> > The proposal is to roll back rewrite commits, but that's already
>> > possible with the much simpler API that exists today.
>>
>> Based on my understanding of the proposal, I think it's more about the
>> possibility of enabling other ways that do not require a full rollback.
>> it's just currently we implemented it as a rollback to prove the
>> feasibility. But given that now we have full access to the changes of each
>> data commit (compared to only the post-change snapshot), we could
>> potentially reuse some files that have been rewritten.
>>
>> > I'm skeptical that there is a benefit to implementing the set of data
>> > operations from the Java API
>>
>> +1, the current Java API might be a bit redundant, some APIs serve very
>> similar purposes. I feel the important data actions to have from the end
>> user's perspective are basically the ability to (1) AddRows, (2)
>> DeleteRows?
>>
>> -Jack
>>
>> On Fri, Dec 8, 2023 at 5:01 PM Ryan Blue <b...@tabular.io> wrote:
>>
>>> Thanks, Drew.
>>>
>>> I think it's a good idea in general to be able to perform commits on the
>>> server-side, but I would much rather break this down into smaller parts. I
>>> would definitely want to start with just file append use cases, since I
>>> think that is the biggest win. It can reduce retries and is an easy way to
>>> write from non-JVM languages or just simpler applications.
>>>
>>> I'm skeptical that there is a benefit to implementing the set of data
>>> operations from the Java API. That's primarily because I don't think that
>>> use case 1 (better conflict resolution) is actually achieved. You can avoid
>>> retries on the client, but the retries must happen _somewhere_. The
>>> proposal is to roll back rewrite commits, but that's already possible with
>>> the much simpler API that exists today. Maybe I'm missing something?
>>>
>>> Even if I'm mistaken about being able to improve conflict resolution, I
>>> think that there is quite a bit of work here and I'd break this down either
>>> way. Starting with append use cases makes a lot of sense to me, but I'm
>>> interested to hear what others think as well.
>>>
>>> Ryan
>>>
>>> On Fri, Dec 8, 2023 at 4:34 PM Gallardo, Drew <d...@amazon.com.invalid>
>>> wrote:
>>>
>>>> In regards to the multiple emails sent earlier, please use this one for
>>>> discussions.
>>>>
>>>> Thank you!
>>>>
>>>>
>>>> On 2023/12/07 00:47:42 Drew wrote:
>>>> > Hi everyone,
>>>> >
>>>> > My name is Drew Gallardo, and I’m a part of the Iceberg team at Amazon EMR
>>>> > and Athena. I’m reaching out to share a proposal that introduces data
>>>> > commits as a part of the RESTCatalog. The current process for data commits
>>>> > lives on the client side, and by shifting this logic into the REST catalog,
>>>> > we can empower the catalog service with more control of this process.
>>>> >
>>>> > This proposal addresses specific use cases that showcase the benefits of
>>>> > moving the commit logic to the service side. For instance, this shift
>>>> > allows the user to refine conflict resolution mechanisms, giving precedence
>>>> > to operations that modify the table state to ensure their completion
>>>> > without conflict. Furthermore, our POC demonstrated an improvement in the
>>>> > success rate of concurrent write operations against the GlueCatalog. This
>>>> > all can be found in the detailed proposal below. Feel free to comment, and
>>>> > add your suggestions!
>>>> >
>>>> > Detailed proposal:
>>>> > https://docs.google.com/document/d/1OG68EtPxLWvNBJACQwcMrRYuGJCnQas8_LSruTRcHG8/edit?usp=sharing
>>>> > Github POC: https://github.com/apache/iceberg/pull/9237
>>>> >
>>>> > Looking forward to hearing back
>>>> >
>>>> > Thanks,
>>>> >
>>>> > Drew Gallardo
>>>> > Amazon EMR & Athena
>>>> > d...@amazon.com
>>>> >
>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>
>
> --
> Ryan Blue
> Tabular
>