Re: Proposal for RESTful Data Operations

Drew Fri, 26 Jan 2024 17:06:46 -0800

Hey everyone,

I wanted to provide a quick update on the progress of the commit API
proposal. Based on the feedback in the design doc and the Slack
conversation with Dan and Jack, we've reached an agreement that this is
more of a fine-grained metadata commit, rather than a data operation or
commit. For the next steps, I'll be focusing on validating the requirements
for the update requests. Additionally, I'll be working on adding the
necessary tests to ensure its end-to-end functionality.


Thanks for all the feedback. I still have an open PR for appendFiles. If
you have a chance to review it, I would appreciate any additional feedback you
may have.

https://github.com/apache/iceberg/pull/9292

Best,

Drew

On Fri, Jan 12, 2024 at 3:40 PM Drew <[email protected]> wrote:

> Hi everyone,
>
> I hope you all had great holidays! I wanted to resurface this proposal for
> RESTful Data operations.
>
> Currently, I have an open PR here:
> https://github.com/apache/iceberg/pull/9292
>
> Thanks,
> Drew
>
> On Wed, Dec 13, 2023 at 3:04 PM Jack Ye <[email protected]> wrote:
>
>> Thanks, Drew, for the quick turnaround. I will take a deeper look into the
>> PR.
>>
>> I think if we all agree that it is beneficial to have the
>> AppendFiles(DataFile[]) API (maybe we should call it AppendRows instead), I
>> would like to know if it also makes sense to have:
>> 1. DeleteRows(DeleteFile[]), which can allow users to describe the
>> deletion of rows easily through the equality delete spec
>> 2. a single type of action that combines the two APIs, AppendRows and
>> DeleteRows
>>
>> I find it pretty intuitive from a user perspective to express deletion of
>> rows and commit them through equality deletes, and it would allow
>> performing updates through simple applications.
>>
>> -Jack
>>
>> On Wed, Dec 13, 2023 at 2:22 PM Drew <[email protected]> wrote:
>>
>>> Hi Ryan,
>>>
>>> Thanks for the feedback. I'll start going through the comments left in
>>> the doc! You're right in pointing out that the logic here can be simplified
>>> to roll back a commit. For now, I've introduced a smaller PR that focuses on
>>> the append files operation.
>>>
>>> Github PR: https://github.com/apache/iceberg/pull/9292
>>> Drew
>>>
>>>
>>> On Mon, Dec 11, 2023 at 11:33 AM Ryan Blue <[email protected]> wrote:
>>>
>>>> > Based on my understanding of the proposal, I think it's more about
>>>> the possibility of enabling other ways that do not require a full rollback.
>>>> It's just that we currently implemented it as a rollback to prove the
>>>> feasibility.
>>>>
>>>> My main question is this: what can be done besides rolling back a
>>>> commit? And why does that require 5 extra routes and metadata writes from
>>>> the REST service?
>>>>
>>>> On Mon, Dec 11, 2023 at 11:27 AM Jack Ye <[email protected]> wrote:
>>>>
>>>>> > The proposal is to roll back rewrite commits, but that's already
>>>>> possible with the much simpler API that exists today.
>>>>>
>>>>> Based on my understanding of the proposal, I think it's more about the
>>>>> possibility of enabling other ways that do not require a full rollback.
>>>>> It's just that we currently implemented it as a rollback to prove the
>>>>> feasibility. But given that now we have full access to the changes of each
>>>>> data commit (compared to only the post-change snapshot), we could
>>>>> potentially reuse some files that have been rewritten.
>>>>>
>>>>> > I'm skeptical that there is a benefit to implementing the set of
>>>>> data operations from the Java API
>>>>>
>>>>> +1, the current Java API might be a bit redundant, since some APIs serve
>>>>> very similar purposes. I feel the important data actions to have from the
>>>>> end user's perspective are basically the ability to (1) AddRows and (2)
>>>>> DeleteRows?
>>>>>
>>>>> -Jack
>>>>>
>>>>> On Fri, Dec 8, 2023 at 5:01 PM Ryan Blue <[email protected]> wrote:
>>>>>
>>>>>> Thanks, Drew.
>>>>>>
>>>>>> I think it's a good idea in general to be able to perform commits on
>>>>>> the server-side, but I would much rather break this down into smaller
>>>>>> parts. I would definitely want to start with just file append use cases,
>>>>>> since I think that is the biggest win. It can reduce retries and is an easy
>>>>>> way to write from non-JVM languages or just simpler applications.
>>>>>>
>>>>>> I'm skeptical that there is a benefit to implementing the set of data
>>>>>> operations from the Java API. That's primarily because I don't think that
>>>>>> use case 1 (better conflict resolution) is actually achieved. You can avoid
>>>>>> retries on the client, but the retries must happen _somewhere_. The
>>>>>> proposal is to roll back rewrite commits, but that's already possible with
>>>>>> the much simpler API that exists today. Maybe I'm missing something?
>>>>>>
>>>>>> Even if I'm mistaken about being able to improve conflict resolution,
>>>>>> I think that there is quite a bit of work here and I'd break this down
>>>>>> either way. Starting with append use cases makes a lot of sense to me, but
>>>>>> I'm interested to hear what others think as well.
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> On Fri, Dec 8, 2023 at 4:34 PM Gallardo, Drew <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Regarding the multiple emails sent earlier, please use this one
>>>>>>> for discussions.
>>>>>>>
>>>>>>> Thank you!
>>>>>>>
>>>>>>>
>>>>>>> On 2023/12/07 00:47:42 Drew wrote:
>>>>>>> > Hi everyone,
>>>>>>> >
>>>>>>> > My name is Drew Gallardo, and I’m a part of the Iceberg team at Amazon EMR
>>>>>>> > and Athena. I’m reaching out to share a proposal that introduces data
>>>>>>> > commits as a part of the RESTCatalog. The current process for data commits
>>>>>>> > lives on the client side, and by shifting this logic into the REST catalog,
>>>>>>> > we can empower the catalog service with more control of this process.
>>>>>>> >
>>>>>>> > This proposal addresses specific use cases that showcase the benefits of
>>>>>>> > moving the commit logic to the service side. For instance, this shift
>>>>>>> > allows the user to refine conflict resolution mechanisms, giving precedence
>>>>>>> > to operations that modify the table state to ensure their completion
>>>>>>> > without conflict. Furthermore, our POC demonstrated an improvement in the
>>>>>>> > success rate of concurrent write operations against the GlueCatalog. This
>>>>>>> > all can be found in the detailed proposal below. Feel free to comment, and
>>>>>>> > add your suggestions!
>>>>>>> >
>>>>>>> > Detailed proposal:
>>>>>>> > https://docs.google.com/document/d/1OG68EtPxLWvNBJACQwcMrRYuGJCnQas8_LSruTRcHG8/edit?usp=sharing
>>>>>>> > Github POC: https://github.com/apache/iceberg/pull/9237
>>>>>>> >
>>>>>>> > Looking forward to hearing back
>>>>>>> >
>>>>>>> > Thanks,
>>>>>>> >
>>>>>>> > Drew Gallardo
>>>>>>> > Amazon EMR & Athena
>>>>>>> > [email protected]
>>>>>>> >
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>
