Re: Proposal for REST APIs for Iceberg table scans

2024-06-19 Thread Jean-Baptiste Onofré
ughts, appreciate it! >>>> >>>> >>>> >>>> *From: *Ryan Blue >>>> *Reply-To: *"dev@iceberg.apache.org" >>>> *Date: *Wednesday, January 31, 2024 at 10:22 AM >>>> *To: *"dev@iceberg.apache.org"

Re: Proposal for REST APIs for Iceberg table scans

2024-06-18 Thread Ryan Blue
their thoughts, appreciate it! >>> >>> >>> >>> *From: *Ryan Blue >>> *Reply-To: *"dev@iceberg.apache.org" >>> *Date: *Wednesday, January 31, 2024 at 10:22 AM >>> *To: *"dev@iceberg.apache.org" >>> *Subjec

Re: Proposal for REST APIs for Iceberg table scans

2024-05-20 Thread Jack Ye
he spec. >> Thanks to everyone for sharing their thoughts, appreciate it! >> >> >> >> *From: *Ryan Blue >> *Reply-To: *"dev@iceberg.apache.org" >> *Date: *Wednesday, January 31, 2024 at 10:22 AM >> *To: *"dev@iceberg.apac

Re: Proposal for REST APIs for Iceberg table scans

2024-05-18 Thread Pucheng Yang
"dev@iceberg.apache.org" > *Date: *Wednesday, January 31, 2024 at 10:22 AM > *To: *"dev@iceberg.apache.org" > *Subject: *RE: [EXTERNAL] Proposal for REST APIs for Iceberg table scans > > > > *CAUTION*: This email originated from outside of the organization. D

Re: Proposal for REST APIs for Iceberg table scans

2024-01-31 Thread Chertara, Rahil
": 1000, "manifest": { "path": "s3://some/manifest.avro", ...}, "delete-manifests": [...] }, { ... } ]} POST /v1/namespaces/ns/tables/t/scan { "filter": {"type": "in", "term": "x", "values"

Re: Proposal for REST APIs for Iceberg table scans

2024-01-31 Thread Ryan Blue
sks should be returned by a "plan" endpoint and the manifest plan tasks >>>>>> (or shards) should be returned by a "pre-plan" endpoint. Does anyone else >>>>>> like the names "pre-plan" and "plan" better? >>>>>>

Re: Proposal for REST APIs for Iceberg table scans

2024-01-31 Thread Jack Ye
>>>>>> Hi All hope everyone is doing well, >>>>>> >>>>>> >>>>>> Wanted to revive the discussion around the Rest Table Scan API work. >>>>>> For a refresher here is the original proposal: >>>>>> https

Re: Proposal for REST APIs for Iceberg table scans

2024-01-31 Thread Ryan Blue
w from >>>>> what was discussed. >>>>> >>>>> >>>>> *POST /v1/namespaces/ns/tables/t/plan *{ "filter": { "type": "in", >>>>> "term": "x", "values": [1, 2, 3] }, "select

Re: Proposal for REST APIs for Iceberg table scans

2024-01-31 Thread Daniel Weeks
/tables/t/plan *{ "filter": { "type": "in", >>>> "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"]} >>>> >>>> { "manifest-plan-tasks": [ >>>> { "start":
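
For readers following the thread, the request quoted above can be assembled roughly as follows. This is only a sketch: the path and the `filter` and `select` fields are copied from the quoted message, while `build_plan_request` and its parameter names are hypothetical, and nothing is sent over the network.

```python
import json


def build_plan_request(namespace, table, term, values, select):
    """Return (path, body) for a POST to the proposed plan endpoint.

    Hypothetical helper: only the URL shape and the JSON field names
    come from the thread; everything else is illustrative.
    """
    path = f"/v1/namespaces/{namespace}/tables/{table}/plan"
    body = {
        # An "in" filter on a single term, as in the quoted example.
        "filter": {"type": "in", "term": term, "values": list(values)},
        # Columns to project; nested fields use dotted names such as "a.b".
        "select": list(select),
    }
    return path, json.dumps(body)


path, body = build_plan_request("ns", "t", "x", [1, 2, 3], ["x", "a.b"])
print("POST", path)
print(body)
```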

Re: Proposal for REST APIs for Iceberg table scans

2024-01-30 Thread Jack Ye
e/manifest.avro", ...}, "delete-manifests": [...] }, >>> { ... } >>> ]} >>> >>> >>> *POST /v1/namespaces/ns/tables/t/scan *{ "filter": {"type": "in", >>> "term": "x", "v

Re: Proposal for REST APIs for Iceberg table scans

2024-01-29 Thread Renjie Liu
"length": 1000, "manifest": { >> "path": "s3://some/manifest.avro", ...}, "delete-manifests": [...] } } >> >> { "file-scan-tasks": [...] } >> >> >> *POST /v1/namespaces/ns/tables/t/scan *{ "filter"

Re: Proposal for REST APIs for Iceberg table scans

2024-01-29 Thread Ryan Blue
> > > > However IIRC Micah and Renjie had some concerns around this stricter > structure as this can make it harder to evolve in the future, as well as > some potential scalability challenges for larger tables that have many > manifest files. (Feel free to expand further on the con

Re: Proposal for REST APIs for Iceberg table scans

2024-01-29 Thread Chertara, Rahil
is stricter structure as this can make it harder to evolve in the future, as well as some potential scalability challenges for larger tables that have many manifest files. (Feel free to expand further on the concerns if my understanding is incorrect). Would appreciate if the community can leave

Re: Proposal for REST APIs for Iceberg table scans

2023-12-21 Thread Renjie Liu
I share the same concern as Micah. The shard details should be an implementation detail of the server, rather than being exposed directly to the client. If the goal is to make things stateless, we just need to attach a snapshot id + shard id, then a deterministic algorithm is supposed to give the same resul
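
The stateless idea in this message (snapshot id plus shard id, resolved by a deterministic algorithm) can be illustrated with a small sketch. Everything here is an assumption for illustration: the thread does not specify the algorithm, and `manifests_for_shard`, `shard_count`, and the round-robin split over sorted paths are hypothetical choices.

```python
def manifests_for_shard(manifest_paths, shard_id, shard_count):
    """Pick the manifests that belong to one shard of a snapshot.

    Hypothetical scheme: sort the snapshot's manifest paths, then assign
    them round-robin. Any server given the same snapshot and shard id
    computes the same answer, so no per-scan state has to be stored.
    """
    if not 0 <= shard_id < shard_count:
        raise ValueError("shard_id out of range")
    ordered = sorted(manifest_paths)  # fixed order, independent of input order
    return ordered[shard_id::shard_count]


# Two servers may list the same snapshot's manifests in different orders.
server_a = ["m3.avro", "m1.avro", "m2.avro", "m4.avro"]
server_b = ["m1.avro", "m4.avro", "m3.avro", "m2.avro"]
print(manifests_for_shard(server_a, 0, 2))
print(manifests_for_shard(server_b, 0, 2))
```

Both calls return the same shard because the assignment depends only on the snapshot's contents and the shard id, not on which server answers.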

Re: Proposal for REST APIs for Iceberg table scans

2023-12-20 Thread Micah Kornfield
> > Also +1 for having a more strict definition of the shard. Having arbitrary > JSON was basically what we experimented with a string shard ID, and we > ended up with something very similar to the manifest plan task you describe > in the serialized ID string. IIUC the proposal correctly, I'd act

Re: Proposal for REST APIs for Iceberg table scans

2023-12-19 Thread Jack Ye
+1 for having /plan and /scan, sounds like a good idea to separate those 2 distinct actions. Also +1 for having a more strict definition of the shard. Having arbitrary JSON was basically what we experimented with a string shard ID, and we ended up with something very similar to the manifest plan t

Re: Proposal for REST APIs for Iceberg table scans

2023-12-14 Thread Ryan Blue
The tasks might look something like this: CombinedPlanTask - List ManifestPlanTask - int start - int length - ManifestFile dataManifest - List deleteManifests On Thu, Dec 14, 2023 at 4:07 PM Ryan Blue wrote: > Seems like that track has expired (This Internet-Draft will expire on 13 > May 2022)
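
The task outline in this message can be written out as a small data model. This is a sketch under assumptions: the field names follow the outline (`start`, `length`, `dataManifest`, `deleteManifests`), but a plain path string stands in for `ManifestFile`, and `total_length` is a hypothetical helper that is not part of the proposal.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ManifestPlanTask:
    """A slice of one data manifest plus the delete manifests for it."""
    start: int
    length: int
    data_manifest: str  # assumption: a path stands in for ManifestFile
    delete_manifests: List[str] = field(default_factory=list)


@dataclass(frozen=True)
class CombinedPlanTask:
    """A group of manifest plan tasks handled as one unit."""
    tasks: List[ManifestPlanTask] = field(default_factory=list)


def total_length(combined: CombinedPlanTask) -> int:
    """Sum the `length` of every task (hypothetical helper)."""
    return sum(task.length for task in combined.tasks)


combined = CombinedPlanTask(tasks=[
    ManifestPlanTask(0, 1000, "s3://some/manifest.avro"),
    ManifestPlanTask(1000, 500, "s3://some/manifest.avro"),
])
print(total_length(combined))
```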

Re: Proposal for REST APIs for Iceberg table scans

2023-12-14 Thread Ryan Blue
Seems like that track has expired (This Internet-Draft will expire on 13 May 2022) Yeah, looks like we should just use POST. That’s too bad. QUERY seems like a good idea to me. Distinguish planning using shard or not I think this was a mistake on my part. I was still thinking that we would have

Re: Proposal for REST APIs for Iceberg table scans

2023-12-13 Thread Renjie Liu
About the pagination part, I did some investigation and found that OpenAPI doesn't have a spec for streaming responses, but that's actually an implementation detail. There are several ways to implement JSON streaming, and there is also an RFC

Re: Proposal for REST APIs for Iceberg table scans

2023-12-13 Thread Jack Ye
Seems like that track has expired (This Internet-Draft will expire on 13 May 2022), not sure how these RFCs are managed, but it does not seem hopeful to have this verb in. I think people are mostly using POST for this use case already. But overall I think we are in agreement with the general direc

Re: Proposal for REST APIs for Iceberg table scans

2023-12-13 Thread Ryan Blue
I just changed it to POST after looking into support for the QUERY method. It's a new HTTP method for cases like this where you don't want to pass everything through query params. Here's the QUERY method RFC, but I gues

Re: Proposal for REST APIs for Iceberg table scans

2023-12-13 Thread Jack Ye
Thanks, the Gist explains a lot of things. This is actually very close to our way of implementing the shard ID, we were defining the shard ID as a string, and the string content is actually something similar to the information of the JSON payload you showed, so we can persist minimum information in

Re: Proposal for REST APIs for Iceberg table scans

2023-12-13 Thread Ryan Blue
Jack, It sounds like what I’m proposing isn’t quite clear because your initial response was arguing for a sharding capability. I agree that sharding is a good idea. I’m less confident about two points: 1. Requiring that the service is stateful. As Renjie pointed out, that makes it harder to

Re: Proposal for REST APIs for Iceberg table scans

2023-12-13 Thread Jack Ye
After looking around, it seems like compared to OpenAPI, the AsyncAPI protocol (https://www.asyncapi.com/) could be a better option to describe streaming APIs. That might be one potential option, just put it out here. -Jack On Wed, Dec 13, 2023 at 11:52 AM Jack Ye wrote: > The current proposal

Re: Proposal for REST APIs for Iceberg table scans

2023-12-13 Thread Jack Ye
The current proposal definitely makes the server stateful. In our prototype we used other components like DynamoDB to keep track of state. If keeping it stateless is a tenet we can definitely move the proposal closer to that direction. Maybe one thing to make sure is, is this a core tenet of the

Re: Proposal for REST APIs for Iceberg table scans

2023-12-12 Thread Renjie Liu
Hi, Rahil and Jack: Thanks for raising this. My question is that the pagination and sharding will make the REST server stateful, e.g. a sequence of calls is required to go to the same server. In this case, how do we ensure the scalability of the REST server? On Wed, Dec 13, 2023 at 4:09 AM Fokko

Re: Proposal for REST APIs for Iceberg table scans

2023-12-12 Thread Fokko Driesprong
Hey Rahil and Jack, Thanks for bringing this up. Ryan and I also discussed this briefly in the early days of PyIceberg and it would have helped a lot in the speed of development. We went for the traditional approach because that would also support all the other catalogs, but now that the REST cata

Re: Proposal for REST APIs for Iceberg table scans

2023-12-11 Thread Jack Ye
Hi Ryan, thanks for the feedback! I was a part of this design discussion internally and can provide more details. One reason for separating the CreateScan operation was to make the API asynchronous and thus keep HTTP communications short. Consider the case where we only have GetScanTasks API, and

Re: Proposal for REST APIs for Iceberg table scans

2023-12-10 Thread Ryan Blue
Rahil, thanks for working on this. It has some really good ideas that we hadn't considered before like a way for the service to plan how to break up the work of scan planning. I really like that idea because it makes it much easier for the service to keep memory consumption low across requests. My

Proposal for REST APIs for Iceberg table scans

2023-12-08 Thread Chertara, Rahil
Hi all, My name is Rahil Chertara, and I’m a part of the Iceberg team at Amazon EMR and Athena. I’m reaching out to share a proposal for a new Scan API that will be utilized by the RESTCatalog. The process for table scan planning is currently done within client engines such as Apache Spark. By