> I think start and offset have the advantage of being parallelizable (as compared to continuation tokens).

You can parallelize with opaque tokens by sending a starting point for the next request.

> On the other hand, using "asOf" can be complex to implement and may be too powerful for the pagination use case

I don't think that we want to add `asOf`. If the service chooses to do this, it would send a continuation token that has the information embedded.

On Tue, Dec 19, 2023 at 9:42 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:

> Can we assume it is the responsibility of the server to ensure determinism (e.g., by caching the results along with a query ID)? I think start and offset have the advantage of being parallelizable (as compared to continuation tokens). On the other hand, using "asOf" can be complex to implement and may be too powerful for the pagination use case (because it allows querying the warehouse as of any point in time, not just now).
>
> Thanks,
> Walaa.
>
> On Tue, Dec 19, 2023 at 9:40 AM Ryan Blue <b...@tabular.io> wrote:
>
>> I think you can solve the atomicity problem with a continuation token and server-side state. In general, I don't think this is a problem we should worry about a lot, since pagination commonly has this problem. But since we can build a system that allows you to solve it if you choose to, we should go with that design.
>>
>> On Tue, Dec 19, 2023 at 9:13 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>
>>> Hi Jack,
>>> Some answers inline.
>>>
>>>> In addition to the start index approach, another potential simple way to implement the continuation token is to use the last item name, when the listing is guaranteed to be in lexicographic order.
>>>
>>> I think this is one viable implementation, but the reason that the token should be opaque is that it allows several different implementations without client-side changes.
>>>
>>>> For example, if an element is added before the continuation token, then all future listing calls with the token would always skip that element.
>>>
>>> IMO, I think this is fine. For some of the REST APIs it is likely important to put constraints on atomicity requirements; for others (e.g. list namespaces) I think it is OK to have looser requirements.
>>>
>>>> If we want to enforce that level of atomicity, we probably want to introduce another time travel query parameter (e.g. asOf=1703003028000) to ensure that we are listing results at a specific point in time of the warehouse, so the complete result list is fixed.
>>>
>>> Time travel might be useful in some cases, but I think it is orthogonal to services wishing to have guarantees around atomicity/consistency of results. If a server wants to ensure that results are atomic/consistent as of the start of the listing, it can embed the necessary timestamp in the token it returns and parse it out when fetching the next result.
>>>
>>> I think this does raise a more general point around service definition evolution. I think there likely need to be metadata endpoints that expose either:
>>> 1. A version of the REST API supported.
>>> 2. Features the API supports (e.g. which query parameters are honored for a specific endpoint).
>>>
>>> There are pros and cons to both approaches (apologies if I missed this in the spec or if it has already been discussed).
>>>
>>> Cheers,
>>> Micah
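
To make the opaque-token idea above concrete, here is a minimal sketch, in Python, of a server-side handler that embeds both a consistency timestamp and the last item it returned in a base64-encoded token, so no server-side session state is required. This is not from the spec; the helper all_namespaces_as_of, the parameter names, and the response field names are assumptions for illustration only.

    import base64
    import json
    import time

    def encode_token(as_of_ms, last_key):
        # Opaque to clients: just base64-encoded JSON carrying the listing
        # timestamp and the last item already returned.
        payload = json.dumps({"asOf": as_of_ms, "last": last_key})
        return base64.urlsafe_b64encode(payload.encode()).decode()

    def decode_token(token):
        return json.loads(base64.urlsafe_b64decode(token.encode()).decode())

    def list_namespaces(all_namespaces_as_of, token=None, limit=100):
        # all_namespaces_as_of(ts_ms) is a hypothetical helper that returns
        # namespace names in lexicographic order as they existed at ts_ms.
        if token is None:
            as_of_ms, last = int(time.time() * 1000), ""
        else:
            state = decode_token(token)
            as_of_ms, last = state["asOf"], state["last"]

        # Resume strictly after the last item already returned.
        remaining = [n for n in all_namespaces_as_of(as_of_ms) if n > last]
        page = remaining[:limit]
        next_token = (encode_token(as_of_ms, page[-1])
                      if page and len(remaining) > limit else None)
        return {"namespaces": page, "next-page-token": next_token}

Because clients never look inside the token, a server could later switch to numeric offsets, cached query IDs, or no timestamp at all without any client-side change, which is the flexibility argued for above.
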
>>>
>>> On Tue, Dec 19, 2023 at 8:25 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>>> Yes, I agree that it is better not to enforce an implementation in any particular direction, and a continuation token is probably better than enforcing a numeric start index.
>>>>
>>>> In addition to the start index approach, another potential simple way to implement the continuation token is to use the last item name, when the listing is guaranteed to be in lexicographic order. Compared to the start index approach, it does not need to worry about start indexes shifting when something in the list is added or removed.
>>>>
>>>> However, the issue of concurrent modification could still exist even with a continuation token. For example, if an element is added before the continuation token, then all future listing calls with the token would always skip that element. If we want to enforce that level of atomicity, we probably want to introduce another time travel query parameter (e.g. asOf=1703003028000) to ensure that we are listing results at a specific point in time of the warehouse, so the complete result list is fixed. (This is also the missing piece I forgot to mention in the start index approach to ensure it works in distributed settings.)
>>>>
>>>> -Jack
>>>>
>>>> On Tue, Dec 19, 2023, 9:51 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>>
>>>>> I tried to cover these in more detail at:
>>>>> https://docs.google.com/document/d/1bbfoLssY1szCO_Hm3_93ZcN0UAMpf7kjmpwHQngqQJ0/edit
>>>>>
>>>>> On Sun, Dec 17, 2023 at 6:07 PM Renjie Liu <liurenjie2...@gmail.com> wrote:
>>>>>
>>>>>> +1 for this approach. I agree that the streaming approach requires that HTTP clients and servers have HTTP/2 streaming support, which is not compatible with old clients.
>>>>>>
>>>>>> I share the same concern as Micah that start/limit alone may not be enough in a distributed environment where modification happens during iteration. For compatibility, we need to consider several cases:
>>>>>>
>>>>>> 1. Old client <-> New server
>>>>>> 2. New client <-> Old server
>>>>>>
>>>>>> On Sat, Dec 16, 2023 at 6:51 AM Daniel Weeks <dwe...@apache.org> wrote:
>>>>>>
>>>>>>> I agree that we want to include this feature, and I raised similar concerns to what Micah already presented in talking with Ryan.
>>>>>>>
>>>>>>> For backward compatibility, just adding a start and limit implies a deterministic order, which is not a current requirement of the REST spec.
>>>>>>>
>>>>>>> Also, we need to consider whether the start/limit would need to be respected by the server. If existing implementations simply return all the results, will that be sufficient? There are a few edge cases that need to be considered here.
>>>>>>>
>>>>>>> For the opaque key approach, I think adding a query param to trigger/continue pagination and introducing a continuation token in the ListNamespacesResponse might allow for more backward compatibility. In that scenario, pagination would only take place for clients who know how to paginate, and the ordering would not need to be deterministic.
>>>>>>>
>>>>>>> -Dan
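
A rough client-side sketch of the opt-in approach described above: the client requests pagination through query parameters and follows the continuation token returned in ListNamespacesResponse, while a server that ignores the parameters simply returns everything with no token, so old servers and old clients keep working unchanged. The parameter and field names (pageSize, pageToken, next-page-token) and the /v1/namespaces path are hypothetical, not spec wording.

    import requests

    def list_all_namespaces(base_url, page_size=100):
        # First request only asks for a page size; pagination is opt-in.
        params = {"pageSize": page_size}
        while True:
            resp = requests.get(base_url + "/v1/namespaces", params=params)
            resp.raise_for_status()
            body = resp.json()
            for ns in body.get("namespaces", []):
                yield ns
            token = body.get("next-page-token")
            if not token:
                # A server that does not paginate (or the last page) returns
                # no token, so iteration simply ends after one full listing.
                return
            params = {"pageSize": page_size, "pageToken": token}
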
>>>>>>>
>>>>>>> On Fri, Dec 15, 2023, 10:33 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Just to clarify and add a small suggestion:
>>>>>>>>
>>>>>>>> The behavior with no additional parameters requires the operations to happen as they do today for backwards compatibility (i.e. either all responses are returned or a failure occurs).
>>>>>>>>
>>>>>>>> For new parameters, I'd suggest an opaque start token (instead of a specific numeric offset) that can be returned by the service, and a limit (as proposed above). If a start token is provided without a limit, a default limit can be chosen by the server. Servers might return fewer than the limit (i.e. clients are required to check for a next token to determine if iteration is complete). This enables server-side state if it is desired, but also makes deterministic listing much more feasible (deterministic responses are essentially impossible in the face of changing data if only a start offset is provided).
>>>>>>>>
>>>>>>>> In an ideal world, specifying a limit would result in streaming responses being returned, with the last part containing a token if continuation is necessary. Given the conversation on the other thread about streaming, I'd imagine this is quite hard to model in an OpenAPI REST service.
>>>>>>>>
>>>>>>>> Therefore it seems like using pagination with token and offset would be preferred. If skipping to someplace in the middle of the namespaces is required, then I would suggest modelling those as first-class query parameters (e.g. "startAfterNamespace").
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Micah
>>>>>>>>
>>>>>>>> On Fri, Dec 15, 2023 at 10:08 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>
>>>>>>>>> +1 for this approach
>>>>>>>>>
>>>>>>>>> I think it's good to use query params because it can be backward-compatible with the current behavior. If you get more than the limit back, then the service probably doesn't support pagination. And if a client doesn't support pagination, they get the same results that they would today. A streaming approach with a continuation link like in the scan API discussion wouldn't work because old clients don't know to make a second request.
>>>>>>>>>
>>>>>>>>> On Thu, Dec 14, 2023 at 10:07 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi everyone,
>>>>>>>>>>
>>>>>>>>>> During the conversation about the Scan API for the REST spec, we touched on the topic of pagination when a REST response is large or takes time to produce.
>>>>>>>>>>
>>>>>>>>>> I just want to discuss this separately, since we also see the issue for ListNamespaces and ListTables/Views when integrating with a large organization that has over 100k namespaces, and also a lot of tables in some namespaces.
>>>>>>>>>>
>>>>>>>>>> Pagination requires either keeping state, or the response to be deterministic such that the client can request a range of the full response.
>>>>>>>>>> If we want to avoid keeping state, I think we need to allow some query parameters like:
>>>>>>>>>> - *start*: the start index of the item in the response
>>>>>>>>>> - *limit*: the number of items to be returned in the response
>>>>>>>>>>
>>>>>>>>>> So we can send a request like:
>>>>>>>>>>
>>>>>>>>>> *GET /namespaces?start=300&limit=100*
>>>>>>>>>>
>>>>>>>>>> *GET /namespaces/ns/tables?start=300&limit=100*
>>>>>>>>>>
>>>>>>>>>> And the REST spec should enforce that the response returned for a paginated GET is deterministic.
>>>>>>>>>>
>>>>>>>>>> Any thoughts on this?
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Jack Ye
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Tabular
>>
>> --
>> Ryan Blue
>> Tabular

--
Ryan Blue
Tabular
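
As a footnote to the start/limit proposal that opened this thread: numeric offsets do make page fetches independent, so they can be parallelized, but only if the server guarantees a deterministic order and the client knows the total count up front. A hypothetical sketch (the endpoint path and parameter names are illustrative, and the total would have to come from an earlier response or a separate count call):

    from concurrent.futures import ThreadPoolExecutor

    import requests

    def fetch_page(base_url, start, limit):
        resp = requests.get(base_url + "/v1/namespaces",
                            params={"start": start, "limit": limit})
        resp.raise_for_status()
        return resp.json().get("namespaces", [])

    def list_namespaces_parallel(base_url, total, limit=100):
        # Each page is an independent request, so pages can be fetched concurrently.
        starts = range(0, total, limit)
        with ThreadPoolExecutor(max_workers=8) as pool:
            pages = pool.map(lambda s: fetch_page(base_url, s, limit), starts)
        return [name for page in pages for name in page]

If the listing changes between requests, pages can overlap or drop items, which is exactly the consistency concern raised earlier in the thread and one reason the discussion leans toward opaque continuation tokens.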