> With start/limit each client can query for its own chunk without coordination.

Okay, I understand now. Would you need to parallelize the client for listing namespaces or tables? That seems odd to me.

On Tue, Dec 19, 2023 at 9:48 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:

> You can parallelize with opaque tokens by sending a starting point for the next request.

I meant we would have to wait for the server to return this starting point from the previous request? With start/limit each client can query for its own chunk without coordination.

On Tue, Dec 19, 2023 at 9:44 AM Ryan Blue <b...@tabular.io> wrote:

> I think start and offset have the advantage of being parallelizable (as compared to continuation tokens).

You can parallelize with opaque tokens by sending a starting point for the next request.

> On the other hand, using "asOf" can be complex to implement and may be too powerful for the pagination use case

I don't think that we want to add `asOf`. If the service chooses to do this, it would send a continuation token that has the information embedded.

On Tue, Dec 19, 2023 at 9:42 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:

Can we assume it is the responsibility of the server to ensure determinism (e.g., by caching the results along with a query ID)? I think start and offset have the advantage of being parallelizable (as compared to continuation tokens). On the other hand, using "asOf" can be complex to implement and may be too powerful for the pagination use case (because it allows querying the warehouse as of any point in time, not just now).

Thanks,
Walaa.

On Tue, Dec 19, 2023 at 9:40 AM Ryan Blue <b...@tabular.io> wrote:

I think you can solve the atomicity problem with a continuation token and server-side state. In general, I don't think this is a problem we should worry about a lot, since pagination commonly has this problem. But since we can build a system that allows you to solve it if you choose to, we should go with that design.

On Tue, Dec 19, 2023 at 9:13 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

Hi Jack,
Some answers inline.

> In addition to the start index approach, another potentially simple way to implement the continuation token is to use the last item name, when the listing is guaranteed to be in lexicographic order.

I think this is one viable implementation, but the reason that the token should be opaque is that it allows several different implementations without client-side changes.

> For example, if an element is added before the continuation token, then all future listing calls with the token would always skip that element.

IMO, I think this is fine. For some of the REST APIs it is likely important to put constraints on atomicity requirements; for others (e.g. list namespaces) I think it is OK to have looser requirements.

> If we want to enforce that level of atomicity, we probably want to introduce another time travel query parameter (e.g. asOf=1703003028000) to ensure that we are listing results at a specific point in time of the warehouse, so the complete result list is fixed.

Time travel might be useful in some cases, but I think it is orthogonal to services wishing to have guarantees around atomicity/consistency of results. If a server wants to ensure that results are atomic/consistent as of the start of the listing, it can embed the necessary timestamp in the token it returns and parse it out when fetching the next result.

I think this does raise a more general point around service definition evolution. There likely need to be metadata endpoints that expose either:
1. A version of the REST API supported.
2. Features the API supports (e.g. which query parameters are honored for a specific endpoint).

There are pros and cons to both approaches (apologies if I missed this in the spec or if it has already been discussed).

Cheers,
Micah
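To make the embedded-timestamp idea concrete, here is a minimal Python sketch of one way a server could build such a token. The helper names and the JSON-plus-base64 layout are assumptions, not anything from the spec; the only property the discussion relies on is that the client treats the token as opaque.

    # Sketch of an opaque continuation token that carries server-side state.
    # The encoding is an implementation detail hidden from the client.
    import base64
    import json
    import time

    def encode_page_token(last_returned_name: str, as_of_ms: int) -> str:
        """Pack the resume point and listing timestamp into an opaque string."""
        payload = {"last": last_returned_name, "as_of_ms": as_of_ms}
        return base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()

    def decode_page_token(token: str) -> dict:
        """Recover the resume point and timestamp when the next page is requested."""
        return json.loads(base64.urlsafe_b64decode(token.encode()).decode())

    # First page: the server fixes the listing time and hides it in the token.
    token = encode_page_token("accounting", as_of_ms=int(time.time() * 1000))
    assert decode_page_token(token)["last"] == "accounting"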
On Tue, Dec 19, 2023 at 8:25 AM Jack Ye <yezhao...@gmail.com> wrote:

Yes, I agree that it is better to not force the implementation in any particular direction, and a continuation token is probably better than enforcing a numeric start index.

In addition to the start index approach, another potentially simple way to implement the continuation token is to use the last item name, when the listing is guaranteed to be in lexicographic order. Compared to the start index approach, it does not need to worry about the start index shifting when something in the list is added or removed.

However, the issue of concurrent modification could still exist even with a continuation token. For example, if an element is added before the continuation token, then all future listing calls with the token would always skip that element. If we want to enforce that level of atomicity, we probably want to introduce another time travel query parameter (e.g. asOf=1703003028000) to ensure that we are listing results at a specific point in time of the warehouse, so the complete result list is fixed. (This is also the missing piece I forgot to mention in the start index approach to ensure it works in distributed settings.)

-Jack

On Tue, Dec 19, 2023, 9:51 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

I tried to cover these in more detail at:
https://docs.google.com/document/d/1bbfoLssY1szCO_Hm3_93ZcN0UAMpf7kjmpwHQngqQJ0/edit

On Sun, Dec 17, 2023 at 6:07 PM Renjie Liu <liurenjie2...@gmail.com> wrote:

+1 for this approach. I agree that the streaming approach requires HTTP clients and servers to have HTTP/2 streaming support, which is not compatible with old clients.

I share Micah's concern that start/limit alone may not be enough in a distributed environment where modifications happen during iteration. For compatibility, we need to consider several cases:

1. Old client <-> New server
2. New client <-> Old server

On Sat, Dec 16, 2023 at 6:51 AM Daniel Weeks <dwe...@apache.org> wrote:

I agree that we want to include this feature, and I raised similar concerns to what Micah already presented in talking with Ryan.

For backward compatibility, just adding a start and limit implies a deterministic order, which is not a current requirement of the REST spec.

Also, we need to consider whether the start/limit would need to be respected by the server. If existing implementations simply return all the results, will that be sufficient? There are a few edge cases that need to be considered here.

For the opaque key approach, I think adding a query param to trigger/continue pagination and introducing a continuation token in the ListNamespacesResponse might allow for more backward compatibility. In that scenario, pagination would only take place for clients who know how to paginate, and the ordering would not need to be deterministic.

-Dan
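Here is a minimal sketch of the opt-in behavior Dan describes, assuming hypothetical pageSize/pageToken parameters and a "next-page-token" response field (names are illustrative, nothing is settled in the spec). For brevity it uses Jack's last-item-name idea as the resume point; a real server would likely wrap it in an opaque encoding like the one sketched above.

    # The server only paginates when the client asks for it, so old clients
    # keep getting the full listing exactly as they do today.
    def build_list_namespaces_response(all_namespaces, page_token=None, page_size=None):
        if page_size is None and page_token is None:
            # Old client: return everything, no ordering guarantee needed.
            return {"namespaces": all_namespaces}

        # Deterministic order is only required for paginating clients.
        ordered = sorted(all_namespaces)
        # A stale or unknown token would need real error handling; skipped here.
        start = ordered.index(page_token) + 1 if page_token else 0
        page = ordered[start:start + (page_size or 100)]
        response = {"namespaces": page}
        if start + len(page) < len(ordered):
            # More results remain, so hand back a resume point.
            response["next-page-token"] = page[-1]
        return response

    print(build_list_namespaces_response(["c", "a", "b"], page_size=2))
    # {'namespaces': ['a', 'b'], 'next-page-token': 'b'}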
On Fri, Dec 15, 2023, 10:33 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

Just to clarify and add a small suggestion:

The behavior with no additional parameters requires the operations to happen as they do today for backwards compatibility (i.e. either all results are returned or a failure occurs).

For new parameters, I'd suggest an opaque start token (instead of a specific numeric offset) that can be returned by the service, plus a limit (as proposed above). If a start token is provided without a limit, a default limit can be chosen by the server. Servers might return fewer than the limit (i.e. clients are required to check for a next token to determine if iteration is complete). This enables server-side state if it is desired, but also makes deterministic listing much more feasible (deterministic responses are essentially impossible in the face of changing data if only a start offset is provided).

In an ideal world, specifying a limit would result in streaming responses being returned, with the last part containing a token if continuation is necessary. Given the conversation on the other thread about streaming, I'd imagine this is quite hard to model in an OpenAPI REST service.

Therefore it seems like using pagination with a token and limit would be preferred. If skipping to someplace in the middle of the namespaces is required, then I would suggest modelling those as first-class query parameters (e.g. "startAfterNamespace").

Cheers,
Micah

On Fri, Dec 15, 2023 at 10:08 AM Ryan Blue <b...@tabular.io> wrote:

+1 for this approach

I think it's good to use query params because it can be backward-compatible with the current behavior. If you get more than the limit back, then the service probably doesn't support pagination. And if a client doesn't support pagination, they get the same results that they would today. A streaming approach with a continuation link, like in the scan API discussion, wouldn't work because old clients don't know to make a second request.
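A rough sketch of the client side of that contract, assuming the same illustrative path, parameter, and field names as above: the loop decides whether to continue based only on the presence of a next token, never on the page size, so an old server that ignores the parameters and returns everything in one response simply ends the loop after a single request.

    import requests

    def list_namespaces(base_url: str, page_size: int = 100) -> list:
        namespaces, token = [], None
        while True:
            params = {"pageSize": page_size}
            if token is not None:
                params["pageToken"] = token
            response = requests.get(f"{base_url}/v1/namespaces", params=params)
            response.raise_for_status()
            body = response.json()
            namespaces.extend(body.get("namespaces", []))
            # Servers may return fewer items than the limit; only the token
            # tells us whether iteration is complete.
            token = body.get("next-page-token")
            if not token:
                return namespaces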
On Thu, Dec 14, 2023 at 10:07 AM Jack Ye <yezhao...@gmail.com> wrote:

Hi everyone,

During the conversation about the Scan API for the REST spec, we touched on the topic of pagination, for when a REST response is large or takes time to be produced.

I just want to discuss this separately, since we also see the issue for ListNamespaces and ListTables/Views when integrating with a large organization that has over 100k namespaces, and also a lot of tables in some namespaces.

Pagination requires either keeping state, or the response being deterministic such that the client can request a range of the full response. If we want to avoid keeping state, I think we need to allow some query parameters like:
- *start*: the start index of the first item in the response
- *limit*: the number of items to be returned in the response

So we can send a request like:

*GET /namespaces?start=300&limit=100*

*GET /namespaces/ns/tables?start=300&limit=100*

And the REST spec should enforce that the response returned for the paginated GET is deterministic.

Any thoughts on this?

Best,
Jack Ye
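For comparison, a rough sketch of the parallel fetch pattern that start/limit would enable, assuming the deterministic ordering called for above. The endpoint and parameter names follow the example requests, and the client still has to know (or probe for) the total count up front, which is part of the coordination trade-off Walaa and Ryan debate at the top of the thread.

    from concurrent.futures import ThreadPoolExecutor

    import requests

    def fetch_chunk(base_url: str, start: int, limit: int) -> list:
        response = requests.get(f"{base_url}/namespaces",
                                params={"start": start, "limit": limit})
        response.raise_for_status()
        return response.json().get("namespaces", [])

    def list_namespaces_parallel(base_url: str, total: int, chunk: int = 100) -> list:
        # Each worker fetches a fixed slice; no shared state or tokens needed,
        # but the result is only correct if the full listing is deterministic.
        starts = range(0, total, chunk)
        with ThreadPoolExecutor(max_workers=8) as pool:
            chunks = pool.map(lambda s: fetch_chunk(base_url, s, chunk), starts)
        return [ns for part in chunks for ns in part]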