Re: Proposal for RESTful Data Operations

Jack Ye Tue, 20 Feb 2024 15:28:43 -0800

I think there is also a point we discussed but never closed: whether
AppendDeleteFiles should be supported. Recent developments in
Kafka, and vendor products like Upsolver Zero-ETL
<https://www.upsolver.com/blog/upsolver-announces-zero-etl-and-lakehouse-optimization-for-apache-iceberg>
seem to suggest that there is demand for people to also just
append/stream deletes to a table. So I think it would be ideal to
support both data and delete files if we create a new append API.
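
To make that concrete, here is a rough sketch (not a spec proposal) of what a
combined request body could look like. The "accept-delay-ms" and "data-files"
fields and the endpoint path come from Drew's proposal quoted below; the
"delete-files" field, the helper names, and the placeholder file entries are
assumptions made only for illustration.

```python
import json


def build_append_request(data_files, delete_files=None, accept_delay_ms=300000):
    """Build a JSON-serializable body for the proposed append endpoint.

    "accept-delay-ms" and "data-files" follow the proposal in this thread.
    "delete-files" is a HYPOTHETICAL field; nothing in the spec defines it.
    """
    body = {
        "accept-delay-ms": accept_delay_ms,
        "data-files": list(data_files),
    }
    if delete_files:
        # Only include the hypothetical field when deletes are being appended.
        body["delete-files"] = list(delete_files)
    return body


def append_path(prefix, namespace, table):
    """Return the route from the proposal with its placeholders filled in."""
    return f"/v1/{prefix}/namespaces/{namespace}/tables/{table}/append"


if __name__ == "__main__":
    # Placeholder entries; real content-file objects carry many more fields.
    request = build_append_request(
        data_files=[{"file-path": "s3://bucket/data/a.parquet"}],
        delete_files=[{"file-path": "s3://bucket/data/a-deletes.parquet"}],
    )
    print(append_path("prod", "db", "events"))
    print(json.dumps(request, indent=2))
```

Keeping deletes in a separate optional field would leave a data-only request
identical to the one already proposed.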


For RemoveDataFiles and RemoveDeleteFiles, does that mean we need another
tables/{table}/remove endpoint? Also, how would such endpoints work with
multi-table transactions? Let's think through those points.
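
As a starting point for that discussion, the two shapes could be sketched
roughly as follows. This is only an illustration under stated assumptions: the
operation-type values, the "operation-type" and "files" field names, and the
helper functions are all hypothetical; only the operation names and the
/append and /remove route suffixes come from this thread.

```python
from enum import Enum


class OperationType(Enum):
    # Hypothetical wire values; the thread only names the operations.
    APPEND_DATA_FILES = "append-data-files"
    APPEND_DELETE_FILES = "append-delete-files"
    REMOVE_DATA_FILES = "remove-data-files"
    REMOVE_DELETE_FILES = "remove-delete-files"


def separate_endpoint_path(table_path, action):
    """Option A: one route per action, e.g. .../append and .../remove."""
    if action not in ("append", "remove"):
        raise ValueError(f"unsupported action: {action}")
    return f"{table_path}/{action}"


def unified_request(operation, files):
    """Option B: one shared route; the body says which operation is meant."""
    return {"operation-type": operation.value, "files": list(files)}


if __name__ == "__main__":
    table_path = "/v1/prod/namespaces/db/tables/events"
    print(separate_endpoint_path(table_path, "remove"))
    print(unified_request(OperationType.REMOVE_DATA_FILES, []))
```

With the unified shape, the same body could in principle be listed once per
table inside a larger request, which may matter for the multi-table question.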

-Jack

On Tue, Feb 20, 2024 at 3:13 PM Drew <img...@gmail.com> wrote:

> Hi everyone,
>
> We are discussing the REST spec changes to add support for DataFiles
> and DeleteFiles for both the append and scan planning APIs (PR:
> https://github.com/apache/iceberg/pull/9717). One thing that came up for
> appends was that this logic shouldn’t be in the table update API but
> should instead have a dedicated endpoint. This would be beneficial for a
> few use cases, such as asynchronous appends and batch commit support.
>
> I’d like to start a discussion on thoughts around introducing this new
> endpoint and its functionality to support the ongoing fine-grained metadata
> commit efforts. From the discussion in the ContentFile spec change PR, the
> proposed endpoint was envisioned as an append update handling update
> requests asynchronously. The link to that discussion can be found here:
> https://github.com/apache/iceberg/pull/9717#discussion_r1495005890. The
> proposed changes include:
>
> *Endpoint*:
> POST /v1/{prefix}/namespaces/{namespace}/tables/{table}/append
>
> *Request*:
> {
>   "accept-delay-ms": 300000, // acceptable delay for processing
>   "data-files": [...]
> }
>
> *Response*:
> 202 Accepted
> {
>   "location": "/v1/{prefix}/namespaces/{namespace}/tables/{table}/status/{id}" // used to track status
> }
>
> I'm interested in gathering your thoughts on the asynchronous operation
> model and the suggested endpoint structure.
>
> Building on this, we previously discussed having these update options:
> RemoveDataFiles and RemoveDeleteFiles. Given this new endpoint structure,
> we should consider whether or not we should have unified or separate
> endpoints for these operations. For instance, should we organize these
> under a shared endpoint and specify operationType, or do we establish distinct
> endpoints for these operations? Given that appends can support batch
> processing, we can accommodate this in the request model.
>
> Thank you,
> Drew
>
> On Fri, Jan 26, 2024 at 5:06 PM Drew <img...@gmail.com> wrote:
>
>> Hey everyone,
>>
>> I wanted to provide a quick update on the progress of the commit API
>> proposal. Based on the feedback in the design doc and the Slack
>> conversation with Dan and Jack, we've reached an agreement that this is
>> more of a fine-grained metadata commit, rather than a data operation or
>> commit. For the next steps, I'll be focusing on validating the requirements
>> for the update requests. Additionally, I'll be working on adding the
>> necessary tests to ensure its end-to-end functionality.
>>
>> Thanks for all the feedback. I still have an open PR for appendFiles.
>> If you have a chance to review, I would appreciate any additional feedback
>> you may have.
>>
>> https://github.com/apache/iceberg/pull/9292
>>
>> Best,
>>
>> Drew
>>
>> On Fri, Jan 12, 2024 at 3:40 PM Drew <img...@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> I hope you all had great holidays! I wanted to resurface this proposal
>>> for RESTful Data operations.
>>>
>>> Currently, I have an open PR here:
>>> https://github.com/apache/iceberg/pull/9292
>>>
>>> Thanks,
>>> Drew
>>>
>>> On Wed, Dec 13, 2023 at 3:04 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>>> Thanks Drew for the quick turnaround, I will take a deeper look into
>>>> the PR.
>>>>
>>>> I think if we all agree that it is beneficial to have the
>>>> AppendFiles(DataFile[]) API (maybe we should call it AppendRows instead), I
>>>> would like to know if it also makes sense to have:
>>>> 1. DeleteRows(DeleteFile[]), which can allow users to describe the
>>>> deletion of rows easily through the equality delete spec
>>>> 2. combine the 2 APIs of AppendRows and DeleteRows to one single type
>>>> of action
>>>>
>>>> I find it pretty intuitive from a user perspective to express deletion
>>>> of rows and commit them through equality deletes, and it would allow
>>>> performing updates through simple applications.
>>>>
>>>> -Jack
>>>>
>>>> On Wed, Dec 13, 2023 at 2:22 PM Drew <img...@gmail.com> wrote:
>>>>
>>>>> Hi Ryan,
>>>>>
>>>>> Thanks for the feedback, I'll start going through the comments left in
>>>>> the doc! You're right in pointing out that the logic here can be
>>>>> simplified to roll back a commit. For now I introduced a smaller PR that
>>>>> focuses on the append files operation.
>>>>>
>>>>> Github PR: https://github.com/apache/iceberg/pull/9292
>>>>> Drew
>>>>>
>>>>>
>>>>> On Mon, Dec 11, 2023 at 11:33 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>
>>>>>> > Based on my understanding of the proposal, I think it's more about
>>>>>> > the possibility of enabling other ways that do not require a full
>>>>>> > rollback. It's just that we currently implemented it as a rollback to
>>>>>> > prove the feasibility.
>>>>>>
>>>>>> My main question is this: what can be done besides rolling back a
>>>>>> commit? And why does that require 5 extra routes and metadata writes from
>>>>>> the REST service?
>>>>>>
>>>>>> On Mon, Dec 11, 2023 at 11:27 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>
>>>>>>> > The proposal is to roll back rewrite commits, but that's already
>>>>>>> > possible with the much simpler API that exists today.
>>>>>>>
>>>>>>> Based on my understanding of the proposal, I think it's more about
>>>>>>> the possibility of enabling other ways that do not require a full
>>>>>>> rollback. It's just that we currently implemented it as a rollback to
>>>>>>> prove the feasibility. But given that we now have full access to the
>>>>>>> changes of each data commit (compared to only the post-change snapshot),
>>>>>>> we could potentially reuse some files that have been rewritten.
>>>>>>>
>>>>>>> > I'm skeptical that there is a benefit to implementing the set of
>>>>>>> > data operations from the Java API
>>>>>>>
>>>>>>> +1, the current Java API might be a bit redundant; some APIs serve
>>>>>>> very similar purposes. I feel the important data actions to have from
>>>>>>> the end user's perspective are basically the ability to (1) AddRows and
>>>>>>> (2) DeleteRows?
>>>>>>>
>>>>>>> -Jack
>>>>>>>
>>>>>>> On Fri, Dec 8, 2023 at 5:01 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>
>>>>>>>> Thanks, Drew.
>>>>>>>>
>>>>>>>> I think it's a good idea in general to be able to perform commits
>>>>>>>> on the server-side, but I would much rather break this down into
>>>>>>>> smaller parts. I would definitely want to start with just file append
>>>>>>>> use cases, since I think that is the biggest win. It can reduce retries
>>>>>>>> and is an easy way to write from non-JVM languages or just simpler
>>>>>>>> applications.
>>>>>>>>
>>>>>>>> I'm skeptical that there is a benefit to implementing the set of
>>>>>>>> data operations from the Java API. That's primarily because I don't
>>>>>>>> think that use case 1 (better conflict resolution) is actually achieved.
>>>>>>>> You can avoid retries on the client, but the retries must happen
>>>>>>>> _somewhere_. The proposal is to roll back rewrite commits, but that's
>>>>>>>> already possible with the much simpler API that exists today. Maybe I'm
>>>>>>>> missing something?
>>>>>>>>
>>>>>>>> Even if I'm mistaken about being able to improve conflict
>>>>>>>> resolution, I think that there is quite a bit of work here and I'd
>>>>>>>> break this down either way. Starting with append use cases makes a lot
>>>>>>>> of sense to me, but I'm interested to hear what others think as well.
>>>>>>>>
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>> On Fri, Dec 8, 2023 at 4:34 PM Gallardo, Drew
>>>>>>>> <d...@amazon.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> Regarding the multiple emails sent earlier, please use this
>>>>>>>>> one for discussions.
>>>>>>>>>
>>>>>>>>> Thank you!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2023/12/07 00:47:42 Drew wrote:
>>>>>>>>> > Hi everyone,
>>>>>>>>> >
>>>>>>>>> > My name is Drew Gallardo, and I’m a part of the Iceberg team at
>>>>>>>>> > Amazon EMR and Athena. I’m reaching out to share a proposal that
>>>>>>>>> > introduces data commits as a part of the RESTCatalog. The current
>>>>>>>>> > process for data commits lives on the client side, and by shifting
>>>>>>>>> > this logic into the REST catalog, we can empower the catalog service
>>>>>>>>> > with more control of this process.
>>>>>>>>> >
>>>>>>>>> > This proposal addresses specific use cases that showcase the
>>>>>>>>> > benefits of moving the commit logic to the service side. For
>>>>>>>>> > instance, this shift allows the user to refine conflict resolution
>>>>>>>>> > mechanisms, giving precedence to operations that modify the table
>>>>>>>>> > state to ensure their completion without conflict. Furthermore, our
>>>>>>>>> > POC demonstrated an improvement in the success rate of concurrent
>>>>>>>>> > write operations against the GlueCatalog. This all can be found in
>>>>>>>>> > the detailed proposal below. Feel free to comment, and add your
>>>>>>>>> > suggestions!
>>>>>>>>> >
>>>>>>>>> > Detailed proposal:
>>>>>>>>> > https://docs.google.com/document/d/1OG68EtPxLWvNBJACQwcMrRYuGJCnQas8_LSruTRcHG8/edit?usp=sharing
>>>>>>>>> > Github POC: https://github.com/apache/iceberg/pull/9237
>>>>>>>>> >
>>>>>>>>> > Looking forward to hearing back
>>>>>>>>> >
>>>>>>>>> > Thanks,
>>>>>>>>> >
>>>>>>>>> > Drew Gallardo
>>>>>>>>> > Amazon EMR & Athena
>>>>>>>>> > d...@amazon.com
>>>>>>>>> >
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>>
