Re: Proposal for REST APIs for Iceberg table scans

2024-06-19 Thread Jean-Baptiste Onofré
Hi Ryan, You are right: I can't access the document either. AFAIR, Jack did the doc, he will fix that soon I'm sure. Regards JB On Wed, Jun 19, 2024 at 1:17 AM Ryan Blue wrote: > It looks like the design doc from the original email is no longer > available. Could someone fix the permissions? >

Re: Proposal for REST APIs for Iceberg table scans

2024-06-18 Thread Ryan Blue
It looks like the design doc from the original email is no longer available. Could someone fix the permissions? On Mon, May 20, 2024 at 8:10 AM Jack Ye wrote: > We merged the spec change for content file in > https://github.com/apache/iceberg/pull/9717, the next step is to merge > the PlanTable

Re: Proposal for REST APIs for Iceberg table scans

2024-05-20 Thread Jack Ye
We merged the spec change for content file in https://github.com/apache/iceberg/pull/9717, the next step is to merge the PlanTable and PreplanTable API spec change in https://github.com/apache/iceberg/pull/9695. I guess people were a bit busy in the past few weeks due to the Iceberg summit, you sho

Re: Proposal for REST APIs for Iceberg table scans

2024-05-18 Thread Pucheng Yang
Hi all, I wonder if we have an ETA for this change? Thanks On Wed, Jan 31, 2024 at 10:30 AM Chertara, Rahil wrote: > Sure, I can look into adding this to the spec. > Thanks to everyone for sharing their thoughts, appreciate it! > > > > *From: *Ryan Blue > *Reply-To: *"dev@iceberg.apache.org" >

Re: Proposal for REST APIs for Iceberg table scans

2024-01-31 Thread Chertara, Rahil
Sure, I can look into adding this to the spec. Thanks to everyone for sharing their thoughts, appreciate it! From: Ryan Blue Reply-To: "dev@iceberg.apache.org" Date: Wednesday, January 31, 2024 at 10:22 AM To: "dev@iceberg.apache.org" Subject: RE: [EXTERNAL] Proposal for REST APIs for Iceberg t

Re: Proposal for REST APIs for Iceberg table scans

2024-01-31 Thread Ryan Blue
Looks good to me! Should we get a PR up to add it to the OpenAPI spec? On Wed, Jan 31, 2024 at 10:16 AM Jack Ye wrote: > Sounds good. I don't really have any strong opinions here. So looks like > we are landing on this? > > > *PreplanTable: POST /v1/namespaces/ns/tables/t/preplan*{ "filter": { >

Re: Proposal for REST APIs for Iceberg table scans

2024-01-31 Thread Jack Ye
Sounds good. I don't really have any strong opinions here. So looks like we are landing on this? *PreplanTable: POST /v1/namespaces/ns/tables/t/preplan*{ "filter": { "type": "in", "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"] } { "plan-tasks": [ { ... }, { ... } ] } // opaque objec
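A minimal sketch of how a client might assemble the PreplanTable request body shown above and read the `plan-tasks` list from the response. The route, the `filter`/`select` fields, and the `plan-tasks` key come from the message; the helper names are hypothetical, no HTTP call is made, and the plan tasks are left uninterpreted because the thread treats them as opaque objects.

```python
import json

# Route as written in the message above ("ns" and "t" are the example names).
PREPLAN_ROUTE = "/v1/namespaces/ns/tables/t/preplan"


def build_preplan_request(filter_expr, select):
    """Assemble the JSON body for a PreplanTable call (hypothetical helper)."""
    return {"filter": filter_expr, "select": list(select)}


def extract_plan_tasks(response_body):
    """Return the plan tasks untouched; the client should not inspect them."""
    return list(response_body["plan-tasks"])


body = build_preplan_request(
    {"type": "in", "term": "x", "values": [1, 2, 3]},
    ["x", "a.b"],
)
print("POST", PREPLAN_ROUTE)
print(json.dumps(body))
```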

Re: Proposal for REST APIs for Iceberg table scans

2024-01-31 Thread Ryan Blue
I agree with Dan. I'd rather have two endpoints instead of needing an option that changes the behavior entirely in the same route. I don't think that a `preplan` route would be too bad. On Wed, Jan 31, 2024 at 9:51 AM Daniel Weeks wrote: > I agree with the opaque tokens. > > However, I'm concern

Re: Proposal for REST APIs for Iceberg table scans

2024-01-31 Thread Daniel Weeks
I agree with the opaque tokens. However, I'm concerned we're overloading the endpoint to perform two distinctly different operations: distribute a plan and scan a plan. Changing the task-type then changes the behavior and the result. I feel it would be more straightforward to separate the distr

Re: Proposal for REST APIs for Iceberg table scans

2024-01-30 Thread Jack Ye
+1 for having the opaque plan tasks, that's probably the most flexible way forward. And let's call them *plan tasks* going forward to standardize the terminology. I think the name of the APIs can be determined based on the actual API shape. For example, if we centralize these 2 plan and pre-plan a

Re: Proposal for REST APIs for Iceberg table scans

2024-01-29 Thread Renjie Liu
> > But to move forward, I think we should go with the option that preserves > flexibility. I think the spec should state that plan tasks (if we call them > that) are a JSON object that should be sent as-is back to the REST service > to be used. +1 for this. > One more thing that I would also ch

Re: Proposal for REST APIs for Iceberg table scans

2024-01-29 Thread Ryan Blue
As you noted the main point we still need to decide on is whether to have a standard "shard" definition (e.g. manifest plan task) or to allow it to be opaque and specific to catalogs implementing the protocol. I've not replied because I keep coming back to this decision and I'm not sure whether the

Re: Proposal for REST APIs for Iceberg table scans

2024-01-29 Thread Chertara, Rahil
Hi all, hope everyone is doing well. I wanted to revive the discussion around the REST Table Scan API work. For a refresher here is the original proposal: https://docs.google.com/document/d/1FdjCnFZM1fNtgyb9-v9fU4FwOX4An-pqEwSaJe8RgUg/edit#heading=h.cftjlkb2wh4h as well as the PR: https://github.c

Re: Proposal for REST APIs for Iceberg table scans

2023-12-21 Thread Renjie Liu
I share the same concern with Micah. The shard detail should be implementation details of the server, rather than exposing directly to the client. If the goal is to make things stateless, we just need to attach a snapshot id + shard id, then a determined algorithm is supposed to give the same resul
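One way to picture the stateless idea above (snapshot id + shard id fed into a deterministic algorithm) is the sketch below. It is only an illustration under assumptions: the round-robin split over sorted manifest paths, the `num_shards` parameter, and the function name are hypothetical and are not taken from the proposal.

```python
def manifests_for_shard(snapshot_manifests, snapshot_id, shard_id, num_shards):
    """Deterministically pick the manifests belonging to one shard.

    snapshot_manifests maps a snapshot id to its manifest paths (assumed input).
    Sorting first makes the result independent of input order, so any server
    can answer the same (snapshot_id, shard_id) request without stored state.
    """
    ordered = sorted(snapshot_manifests[snapshot_id])
    return [path for i, path in enumerate(ordered) if i % num_shards == shard_id]


# Two servers holding the same manifest list in different orders agree.
server_a = {42: ["m3.avro", "m1.avro", "m2.avro"]}
server_b = {42: ["m2.avro", "m3.avro", "m1.avro"]}
shard_on_a = manifests_for_shard(server_a, 42, 0, 2)
shard_on_b = manifests_for_shard(server_b, 42, 0, 2)
```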

Re: Proposal for REST APIs for Iceberg table scans

2023-12-20 Thread Micah Kornfield
> > Also +1 for having a more strict definition of the shard. Having arbitrary > JSON was basically what we experimented with a string shard ID, and we > ended up with something very similar to the manifest plan task you describe > in the serialized ID string. IIUC the proposal correctly, I'd act

Re: Proposal for REST APIs for Iceberg table scans

2023-12-19 Thread Jack Ye
+1 for having /plan and /scan, sounds like a good idea to separate those 2 distinct actions. Also +1 for having a more strict definition of the shard. Having arbitrary JSON was basically what we experimented with a string shard ID, and we ended up with something very similar to the manifest plan t

Re: Proposal for REST APIs for Iceberg table scans

2023-12-14 Thread Ryan Blue
The tasks might look something like this: CombinedPlanTask - List ManifestPlanTask - int start - int length - ManifestFile dataManifest - List deleteManifests On Thu, Dec 14, 2023 at 4:07 PM Ryan Blue wrote: > Seems like that track has expired (This Internet-Draft will expire on 13 > May 2022)
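The task outline above could be modelled roughly as follows. This is a sketch under assumptions: the field names mirror the outline, the element types of the two lists are inferred, and `ManifestFile` is a hypothetical stand-in holding only a path rather than Iceberg's real class.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ManifestFile:
    path: str  # hypothetical stand-in; the real type carries more metadata


@dataclass(frozen=True)
class ManifestPlanTask:
    start: int
    length: int
    data_manifest: ManifestFile
    # Assumed to be a list of ManifestFile entries.
    delete_manifests: List[ManifestFile] = field(default_factory=list)


@dataclass(frozen=True)
class CombinedPlanTask:
    # Assumed to be a list of ManifestPlanTask entries.
    tasks: List[ManifestPlanTask]


task = ManifestPlanTask(start=0, length=100, data_manifest=ManifestFile("m1.avro"))
combined = CombinedPlanTask(tasks=[task])
```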

Re: Proposal for REST APIs for Iceberg table scans

2023-12-14 Thread Ryan Blue
Seems like that track has expired (This Internet-Draft will expire on 13 May 2022) Yeah, looks like we should just use POST. That’s too bad. QUERY seems like a good idea to me. Distinguish planning using shard or not I think this was a mistake on my part. I was still thinking that we would have

Re: Proposal for REST APIs for Iceberg table scans

2023-12-13 Thread Renjie Liu
About the pagination part, I did some investigation and found that OpenAPI doesn't have a spec for streaming responses, but that's actually an implementation detail. There are several ways to implement JSON streaming, and also there is an rfc

Re: Proposal for REST APIs for Iceberg table scans

2023-12-13 Thread Jack Ye
Seems like that track has expired (This Internet-Draft will expire on 13 May 2022), not sure how these RFCs are managed, but it does not seem hopeful to have this verb in. I think people are mostly using POST for this use case already. But overall I think we are in agreement with the general direc

Re: Proposal for REST APIs for Iceberg table scans

2023-12-13 Thread Ryan Blue
I just changed it to POST after looking into support for the QUERY method. It's a new HTTP method for cases like this where you don't want to pass everything through query params. Here's the QUERY method RFC, but I gues

Re: Proposal for REST APIs for Iceberg table scans

2023-12-13 Thread Jack Ye
Thanks, the Gist explains a lot of things. This is actually very close to our way of implementing the shard ID, we were defining the shard ID as a string, and the string content is actually something similar to the information of the JSON payload you showed, so we can persist minimum information in

Re: Proposal for REST APIs for Iceberg table scans

2023-12-13 Thread Ryan Blue
Jack, It sounds like what I’m proposing isn’t quite clear because your initial response was arguing for a sharding capability. I agree that sharding is a good idea. I’m less confident about two points: 1. Requiring that the service is stateful. As Renjie pointed out, that makes it harder to

Re: Proposal for REST APIs for Iceberg table scans

2023-12-13 Thread Jack Ye
After looking around, it seems like compared to OpenAPI, the AsyncAPI protocol (https://www.asyncapi.com/) could be a better option to describe streaming APIs. That might be one potential option, just put it out here. -Jack On Wed, Dec 13, 2023 at 11:52 AM Jack Ye wrote: > The current proposal

Re: Proposal for REST APIs for Iceberg table scans

2023-12-13 Thread Jack Ye
The current proposal definitely makes the server stateful. In our prototype we used other components like DynamoDB to keep track of states. If keeping it stateless is a tenet we can definitely make the proposal closer to that direction. Maybe one thing to make sure is, is this a core tenet of the

Re: Proposal for REST APIs for Iceberg table scans

2023-12-12 Thread Renjie Liu
Hi, Rahil and Jack: Thanks for raising this. My question is that the pagination and sharding will make the REST server stateful, e.g. a sequence of calls is required to go to the same server. In this case, how do we ensure the scalability of the REST server? On Wed, Dec 13, 2023 at 4:09 AM Fokko

Re: Proposal for REST APIs for Iceberg table scans

2023-12-12 Thread Fokko Driesprong
Hey Rahil and Jack, Thanks for bringing this up. Ryan and I also discussed this briefly in the early days of PyIceberg and it would have helped a lot in the speed of development. We went for the traditional approach because that would also support all the other catalogs, but now that the REST cata

Re: Proposal for REST APIs for Iceberg table scans

2023-12-11 Thread Jack Ye
Hi Ryan, thanks for the feedback! I was a part of this design discussion internally and can provide more details. One reason for separating the CreateScan operation was to make the API asynchronous and thus keep HTTP communications short. Consider the case where we only have GetScanTasks API, and

Re: Proposal for REST APIs for Iceberg table scans

2023-12-10 Thread Ryan Blue
Rahil, thanks for working on this. It has some really good ideas that we hadn't considered before like a way for the service to plan how to break up the work of scan planning. I really like that idea because it makes it much easier for the service to keep memory consumption low across requests. My