Hi Ryan,

Thanks for the feedback. I'll start going through the comments left in the
doc. You're right in pointing out that the logic here can be simplified to
roll back a commit. For now, I've introduced a smaller PR that focuses on the
append files operation.

GitHub PR: https://github.com/apache/iceberg/pull/9292
Drew


On Mon, Dec 11, 2023 at 11:33 AM Ryan Blue <b...@tabular.io> wrote:

> > Based on my understanding of the proposal, I think it's more about the
> possibility of enabling other ways that do not require a full rollback.
> It's just that we currently implemented it as a rollback to prove
> feasibility.
>
> My main question is this: what can be done besides rolling back a commit?
> And why does that require 5 extra routes and metadata writes from the REST
> service?
>
> On Mon, Dec 11, 2023 at 11:27 AM Jack Ye <yezhao...@gmail.com> wrote:
>
>> > The proposal is to roll back rewrite commits, but that's already
>> possible with the much simpler API that exists today.
>>
>> Based on my understanding of the proposal, I think it's more about the
>> possibility of enabling other ways that do not require a full rollback.
>> It's just that we currently implemented it as a rollback to prove
>> feasibility. But given that now we have full access to the changes of each
>> data commit (compared to only the post-change snapshot), we could
>> potentially reuse some files that have been rewritten.
>>
>> > I'm skeptical that there is a benefit to implementing the set of data
>> operations from the Java API
>>
>> +1, the current Java API might be a bit redundant; some APIs serve very
>> similar purposes. I feel the important data actions to have from the end
>> user's perspective are basically the ability to (1) AddRows and (2)
>> DeleteRows?
>>
>> -Jack
>>
>> On Fri, Dec 8, 2023 at 5:01 PM Ryan Blue <b...@tabular.io> wrote:
>>
>>> Thanks, Drew.
>>>
>>> I think it's a good idea in general to be able to perform commits on the
>>> server-side, but I would much rather break this down into smaller parts. I
>>> would definitely want to start with just file append use cases, since I
>>> think that is the biggest win. It can reduce retries and is an easy way to
>>> write from non-JVM languages or just simpler applications.
>>>
>>> I'm skeptical that there is a benefit to implementing the set of data
>>> operations from the Java API. That's primarily because I don't think that
>>> use case 1 (better conflict resolution) is actually achieved. You can avoid
>>> retries on the client, but the retries must happen _somewhere_. The
>>> proposal is to roll back rewrite commits, but that's already possible with
>>> the much simpler API that exists today. Maybe I'm missing something?
>>>
>>> Even if I'm mistaken about being able to improve conflict resolution, I
>>> think that there is quite a bit of work here and I'd break this down either
>>> way. Starting with append use cases makes a lot of sense to me, but I'm
>>> interested to hear what others think as well.
>>>
>>> Ryan
>>>
>>> On Fri, Dec 8, 2023 at 4:34 PM Gallardo, Drew <d...@amazon.com.invalid>
>>> wrote:
>>>
>>>> Regarding the multiple emails sent earlier, please use this one for
>>>> discussions.
>>>>
>>>> Thank you!
>>>>
>>>>
>>>> On 2023/12/07 00:47:42 Drew wrote:
>>>> > Hi everyone,
>>>> >
>>>> > My name is Drew Gallardo, and I’m a part of the Iceberg team at Amazon EMR
>>>> > and Athena. I’m reaching out to share a proposal that introduces data
>>>> > commits as a part of the RESTCatalog. The current process for data commits
>>>> > lives on the client side, and by shifting this logic into the REST catalog,
>>>> > we can empower the catalog service with more control of this process.
>>>> >
>>>> > This proposal addresses specific use cases that showcase the benefits of
>>>> > moving the commit logic to the service side. For instance, this shift
>>>> > allows the user to refine conflict resolution mechanisms, giving precedence
>>>> > to operations that modify the table state to ensure their completion
>>>> > without conflict. Furthermore, our POC demonstrated an improvement in the
>>>> > success rate of concurrent write operations against the GlueCatalog. This
>>>> > all can be found in the detailed proposal below. Feel free to comment, and
>>>> > add your suggestions!
>>>> >
>>>> > Detailed proposal:
>>>> >
>>>> > https://docs.google.com/document/d/1OG68EtPxLWvNBJACQwcMrRYuGJCnQas8_LSruTRcHG8/edit?usp=sharing
>>>> > GitHub POC: https://github.com/apache/iceberg/pull/9237
>>>> >
>>>> > Looking forward to hearing back
>>>> >
>>>> > Thanks,
>>>> >
>>>> > Drew Gallardo
>>>> > Amazon EMR & Athena
>>>> > d...@amazon.com
>>>> >
>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>
> --
> Ryan Blue
> Tabular
>
