Re: Proposal for RESTful Data Operations

Drew Wed, 13 Dec 2023 14:22:48 -0800

Hi Ryan,

Thanks for the feedback. I'll start going through the comments left in the
doc! You're right in pointing out that the logic here can be simplified to
rolling back a commit. For now I've opened a smaller PR that focuses on the
append files operation.


GitHub PR: https://github.com/apache/iceberg/pull/9292
Drew


On Mon, Dec 11, 2023 at 11:33 AM Ryan Blue <b...@tabular.io> wrote:

> > Based on my understanding of the proposal, I think it's more about the
> possibility of enabling other ways that do not require a full rollback.
> It's just that we currently implemented it as a rollback to prove the
> feasibility.
>
> My main question is this: what can be done besides rolling back a commit?
> And why does that require 5 extra routes and metadata writes from the REST
> service?
>
> On Mon, Dec 11, 2023 at 11:27 AM Jack Ye <yezhao...@gmail.com> wrote:
>
>> > The proposal is to roll back rewrite commits, but that's already
>> possible with the much simpler API that exists today.
>>
>> Based on my understanding of the proposal, I think it's more about the
>> possibility of enabling other ways that do not require a full rollback.
>> It's just that we currently implemented it as a rollback to prove the
>> feasibility. But given that now we have full access to the changes of each
>> data commit (compared to only the post-change snapshot), we could
>> potentially reuse some files that have been rewritten.
>>
>> > I'm skeptical that there is a benefit to implementing the set of data
>> operations from the Java API
>>
>> +1, the current Java API might be a bit redundant; some APIs serve very
>> similar purposes. I feel the important data actions to have from the end
>> user's perspective are basically the ability to (1) AddRows and (2)
>> DeleteRows?
>>
>> -Jack
>>
>> On Fri, Dec 8, 2023 at 5:01 PM Ryan Blue <b...@tabular.io> wrote:
>>
>>> Thanks, Drew.
>>>
>>> I think it's a good idea in general to be able to perform commits on the
>>> server-side, but I would much rather break this down into smaller parts. I
>>> would definitely want to start with just file append use cases, since I
>>> think that is the biggest win. It can reduce retries and is an easy way to
>>> write from non-JVM languages or just simpler applications.
>>>
>>> I'm skeptical that there is a benefit to implementing the set of data
>>> operations from the Java API. That's primarily because I don't think that
>>> use case 1 (better conflict resolution) is actually achieved. You can avoid
>>> retries on the client, but the retries must happen _somewhere_. The
>>> proposal is to roll back rewrite commits, but that's already possible with
>>> the much simpler API that exists today. Maybe I'm missing something?
>>>
>>> Even if I'm mistaken about being able to improve conflict resolution, I
>>> think that there is quite a bit of work here and I'd break this down either
>>> way. Starting with append use cases makes a lot of sense to me, but I'm
>>> interested to hear what others think as well.
>>>
>>> Ryan
>>>
>>> On Fri, Dec 8, 2023 at 4:34 PM Gallardo, Drew <d...@amazon.com.invalid>
>>> wrote:
>>>
>>>> Regarding the multiple emails sent earlier, please use this one for
>>>> discussion.
>>>>
>>>> Thank you!
>>>>
>>>>
>>>> On 2023/12/07 00:47:42 Drew wrote:
>>>> > Hi everyone,
>>>> >
>>>> > My name is Drew Gallardo, and I’m a part of the Iceberg team at Amazon EMR
>>>> > and Athena. I’m reaching out to share a proposal that introduces data
>>>> > commits as a part of the RESTCatalog. The current process for data commits
>>>> > lives on the client side, and by shifting this logic into the REST catalog,
>>>> > we can empower the catalog service with more control of this process.
>>>> >
>>>> > This proposal addresses specific use cases that showcase the benefits of
>>>> > moving the commit logic to the service side. For instance, this shift
>>>> > allows the user to refine conflict resolution mechanisms, giving precedence
>>>> > to operations that modify the table state to ensure their completion
>>>> > without conflict. Furthermore, our POC demonstrated an improvement in the
>>>> > success rate of concurrent write operations against the GlueCatalog. All of
>>>> > this can be found in the detailed proposal below. Feel free to comment and
>>>> > add your suggestions!
>>>> >
>>>> > Detailed proposal:
>>>> > https://docs.google.com/document/d/1OG68EtPxLWvNBJACQwcMrRYuGJCnQas8_LSruTRcHG8/edit?usp=sharing
>>>> > GitHub POC: https://github.com/apache/iceberg/pull/9237
>>>> >
>>>> > Looking forward to hearing back.
>>>> >
>>>> > Thanks,
>>>> >
>>>> > Drew Gallardo
>>>> > Amazon EMR & Athena
>>>> > d...@amazon.com
>>>> >
>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>
> --
> Ryan Blue
> Tabular
>
