Re: Proposal for RESTful Data Operations

Drew Tue, 20 Feb 2024 15:13:27 -0800

Hi everyone,

We are discussing the REST spec changes that add support for DataFiles and
DeleteFiles in both the append and scan planning APIs (PR:
https://github.com/apache/iceberg/pull/9717). One thing that came up for
appends was that this logic shouldn’t be in the table update API; instead,
it should have a dedicated endpoint. This would benefit a few use cases,
such as asynchronous appends and batch commit support.


I’d like to start a discussion about introducing this new endpoint and its
functionality to support the ongoing fine-grained metadata commit efforts.
In the discussion on the ContentFile spec change PR, the proposed endpoint
was envisioned as an append endpoint that handles update requests
asynchronously. That discussion can be found here:
https://github.com/apache/iceberg/pull/9717#discussion_r1495005890. The
proposed changes include:

*Endpoint*:
POST /v1/{prefix}/namespaces/{namespace}/tables/{table}/append

*Request*:
{
  "accept-delay-ms": 300000, // acceptable delay for processing
  "data-files": [...]
}

*Response*:
202 Accepted
{
  "location": "/v1/{prefix}/namespaces/{namespace}/tables/{table}/status/{id}" // used to track status
}
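
To make the flow concrete, here is a minimal Python sketch of how a client
might build the request and read the response. Only the path, the
"accept-delay-ms" and "data-files" fields, the 202 status, and the
"location" field come from the proposal above; the helper names are
hypothetical, and the entries in "data-files" are opaque placeholders
because the DataFile shape is defined in the spec PR, not here.

```python
import json
from urllib.parse import quote


def build_append_request(prefix, namespace, table, data_files, accept_delay_ms=300000):
    """Return (path, json_body) for the proposed append endpoint.

    Hypothetical helper: the function name and signature are illustrative only.
    """
    path = "/v1/{}/namespaces/{}/tables/{}/append".format(
        quote(prefix, safe=""), quote(namespace, safe=""), quote(table, safe="")
    )
    body = json.dumps(
        {
            "accept-delay-ms": accept_delay_ms,  # acceptable delay for processing
            "data-files": list(data_files),      # opaque DataFile entries
        }
    )
    return path, body


def parse_append_response(status_code, body_text):
    """Return the status-tracking location from a 202 Accepted response."""
    if status_code != 202:
        raise ValueError(f"expected 202 Accepted, got {status_code}")
    return json.loads(body_text)["location"]
```

A client would POST the body to the returned path with any HTTP library and
then poll the returned location to track the status of the append.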

I'm interested in gathering your thoughts on the asynchronous operation
model and the suggested endpoint structure.

Building on this, we previously discussed having these update options:
RemoveDataFiles and RemoveDeleteFiles. Given this new endpoint structure,
we should consider whether these operations should share a unified endpoint
or have separate ones. For instance, should we organize them under a shared
endpoint and specify an operationType, or establish distinct endpoints for
each operation? Given that appends can support batch processing, we can
accommodate this in the request model.
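
As one possible shape for the unified option, here is a rough Python sketch
of a batch request in which each entry carries an operation type. Everything
in it is an assumption made for illustration: the "operation-type" and
"operations" field names, the three type values, the "delete-files" field,
and the helper names are not part of any agreed spec.

```python
from enum import Enum


class OperationType(Enum):
    # Hypothetical wire values for the operations discussed in this thread.
    APPEND = "append"
    REMOVE_DATA_FILES = "remove-data-files"
    REMOVE_DELETE_FILES = "remove-delete-files"


# Which file-list field each operation would carry (assumed field names).
FILES_FIELD = {
    OperationType.APPEND: "data-files",
    OperationType.REMOVE_DATA_FILES: "data-files",
    OperationType.REMOVE_DELETE_FILES: "delete-files",
}


def build_operation(op_type, files):
    """Build one entry of a unified request for the given operation type."""
    return {"operation-type": op_type.value, FILES_FIELD[op_type]: list(files)}


def build_batch(operations):
    """Build a batch body from (OperationType, files) pairs, keeping their order."""
    return {"operations": [build_operation(t, f) for t, f in operations]}
```

With separate endpoints, the "operation-type" field would disappear and the
path would select the operation instead; the per-operation file list would
stay the same.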

Thank you,
Drew

On Fri, Jan 26, 2024 at 5:06 PM Drew <img...@gmail.com> wrote:

> Hey everyone,
>
> I wanted to provide a quick update on the progress of the commit API
> proposal. Based on the feedback in the design doc and the Slack
> conversation with Dan and Jack, we've reached an agreement that this is
> more of a fine-grained metadata commit, rather than a data operation or
> commit. For the next steps, I'll be focusing on validating the requirements
> for the update requests. Additionally, I'll be working on adding the
> necessary tests to ensure its end-to-end functionality.
>
> Thanks for all the feedback. I still have an open PR for appendFiles.
> If you have a chance to review it, I would appreciate any additional
> feedback you may have.
>
> https://github.com/apache/iceberg/pull/9292
>
> Best,
>
> Drew
>
> On Fri, Jan 12, 2024 at 3:40 PM Drew <img...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> I hope you all had great holidays! I wanted to resurface this proposal
>> for RESTful Data operations.
>>
>> Currently, I have an open PR here:
>> https://github.com/apache/iceberg/pull/9292
>>
>> Thanks,
>> Drew
>>
>> On Wed, Dec 13, 2023 at 3:04 PM Jack Ye <yezhao...@gmail.com> wrote:
>>
>>> Thanks Drew for the quick turnaround, I will take a deeper look into the
>>> PR.
>>>
>>> I think if we all agree that it is beneficial to have the
>>> AppendFiles(DataFile[]) API (maybe we should call it AppendRows instead), I
>>> would like to know if it also makes sense to have:
>>> 1. DeleteRows(DeleteFile[]), which can allow users to describe the
>>> deletion of rows easily through the equality delete spec
>>> 2. combine the 2 APIs of AppendRows and DeleteRows to one single type of
>>> action
>>>
>>> I find it pretty intuitive from a user perspective to express deletion
>>> of rows and commit them through equality deletes, and it would allow
>>> performing updates through simple applications.
>>>
>>> -Jack
>>>
>>> On Wed, Dec 13, 2023 at 2:22 PM Drew <img...@gmail.com> wrote:
>>>
>>>> Hi Ryan,
>>>>
>>>> Thanks for the feedback. I'll start going through the comments left in
>>>> the doc! You're right in pointing out that the logic here can be simplified
>>>> to roll back a commit. For now, I have introduced a smaller PR that focuses
>>>> on the append files operation.
>>>>
>>>> Github PR: https://github.com/apache/iceberg/pull/9292
>>>> Drew
>>>>
>>>>
>>>> On Mon, Dec 11, 2023 at 11:33 AM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> > Based on my understanding of the proposal, I think it's more about
>>>>> > the possibility of enabling other ways that do not require a full
>>>>> > rollback. It's just that currently we implemented it as a rollback to
>>>>> > prove the feasibility.
>>>>>
>>>>> My main question is this: what can be done besides rolling back a
>>>>> commit? And why does that require 5 extra routes and metadata writes from
>>>>> the REST service?
>>>>>
>>>>> On Mon, Dec 11, 2023 at 11:27 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>
>>>>>> > The proposal is to roll back rewrite commits, but that's already
>>>>>> > possible with the much simpler API that exists today.
>>>>>>
>>>>>> Based on my understanding of the proposal, I think it's more about
>>>>>> the possibility of enabling other ways that do not require a full
>>>>>> rollback. It's just that currently we implemented it as a rollback to
>>>>>> prove the feasibility. But given that now we have full access to the
>>>>>> changes of each data commit (compared to only the post-change
>>>>>> snapshot), we could potentially reuse some files that have been
>>>>>> rewritten.
>>>>>>
>>>>>> > I'm skeptical that there is a benefit to implementing the set of
>>>>>> > data operations from the Java API
>>>>>>
>>>>>> +1. The current Java API might be a bit redundant; some APIs serve
>>>>>> very similar purposes. I feel the important data actions to have from the
>>>>>> end user's perspective are basically the ability to (1) AddRows and (2)
>>>>>> DeleteRows?
>>>>>>
>>>>>> -Jack
>>>>>>
>>>>>> On Fri, Dec 8, 2023 at 5:01 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>>> Thanks, Drew.
>>>>>>>
>>>>>>> I think it's a good idea in general to be able to perform commits on
>>>>>>> the server-side, but I would much rather break this down into smaller
>>>>>>> parts. I would definitely want to start with just file append use
>>>>>>> cases, since I think that is the biggest win. It can reduce retries and
>>>>>>> is an easy way to write from non-JVM languages or just simpler
>>>>>>> applications.
>>>>>>>
>>>>>>> I'm skeptical that there is a benefit to implementing the set of
>>>>>>> data operations from the Java API. That's primarily because I don't
>>>>>>> think that use case 1 (better conflict resolution) is actually
>>>>>>> achieved. You can avoid retries on the client, but the retries must
>>>>>>> happen _somewhere_. The proposal is to roll back rewrite commits, but
>>>>>>> that's already possible with the much simpler API that exists today.
>>>>>>> Maybe I'm missing something?
>>>>>>>
>>>>>>> Even if I'm mistaken about being able to improve conflict
>>>>>>> resolution, I think that there is quite a bit of work here and I'd break
>>>>>>> this down either way. Starting with append use cases makes a lot of
>>>>>>> sense to me, but I'm interested to hear what others think as well.
>>>>>>>
>>>>>>> Ryan
>>>>>>>
>>>>>>> On Fri, Dec 8, 2023 at 4:34 PM Gallardo, Drew <d...@amazon.com.invalid>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Regarding the multiple emails sent earlier, please use this one
>>>>>>>> for discussions.
>>>>>>>>
>>>>>>>> Thank you!
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2023/12/07 00:47:42 Drew wrote:
>>>>>>>> > Hi everyone,
>>>>>>>> >
>>>>>>>> > My name is Drew Gallardo, and I’m a part of the Iceberg team at
>>>>>>>> > Amazon EMR and Athena. I’m reaching out to share a proposal that
>>>>>>>> > introduces data commits as a part of the RESTCatalog. The current
>>>>>>>> > process for data commits lives on the client side, and by shifting
>>>>>>>> > this logic into the REST catalog, we can empower the catalog service
>>>>>>>> > with more control of this process.
>>>>>>>> >
>>>>>>>> > This proposal addresses specific use cases that showcase the
>>>>>>>> > benefits of moving the commit logic to the service side. For
>>>>>>>> > instance, this shift allows the user to refine conflict resolution
>>>>>>>> > mechanisms, giving precedence to operations that modify the table
>>>>>>>> > state to ensure their completion without conflict. Furthermore, our
>>>>>>>> > POC demonstrated an improvement in the success rate of concurrent
>>>>>>>> > write operations against the GlueCatalog. This all can be found in
>>>>>>>> > the detailed proposal below. Feel free to comment, and add your
>>>>>>>> > suggestions!
>>>>>>>> >
>>>>>>>> > Detailed proposal:
>>>>>>>> >
>>>>>>>> > https://docs.google.com/document/d/1OG68EtPxLWvNBJACQwcMrRYuGJCnQas8_LSruTRcHG8/edit?usp=sharing
>>>>>>>> > Github POC: https://github.com/apache/iceberg/pull/9237
>>>>>>>> >
>>>>>>>> > Looking forward to hearing back
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> >
>>>>>>>> > Drew Gallardo
>>>>>>>> > Amazon EMR & Athena
>>>>>>>> > d...@amazon.com
>>>>>>>> >
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>
