I think you can solve the atomicity problem with a continuation token and server-side state. In general, I don't think this is a problem we should worry about a lot since pagination commonly has this problem. But since we can build a system that allows you to solve it if you choose to, we should go with that design.
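
To illustrate the token idea, here is a minimal Python sketch of an opaque continuation token that carries server-side paging state. Everything in it is an assumption made for illustration and is not taken from the REST spec: the in-memory `NAMESPACES` catalog, the `list_namespaces` function, the `asOf`/`last` token fields, and the `next-page-token` response key are all hypothetical. The token embeds the timestamp of the first request plus the last name returned, so an element added mid-listing cannot shift or leak into later pages. Deletions are not handled; a real server would need versioned metadata or tombstones for that.

```python
import base64
import json
from typing import Optional

# Hypothetical in-memory catalog: namespace name -> creation time (epoch ms).
NAMESPACES = {"a": 100, "b": 100, "c": 100, "d": 100, "e": 100}


def encode_token(as_of_ms: int, last_name: str) -> str:
    """Pack the paging state into an opaque, URL-safe string."""
    raw = json.dumps({"asOf": as_of_ms, "last": last_name}).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_token(token: str) -> dict:
    """Recover the paging state; clients never need to do this."""
    return json.loads(base64.urlsafe_b64decode(token.encode("ascii")))


def list_namespaces(now_ms: int, limit: int, page_token: Optional[str] = None) -> dict:
    """Return one page of names, listed as of the time of the first request."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    if page_token is None:
        # First request: pin the listing to the current time.
        as_of, last = now_ms, ""
    else:
        # Later requests: ignore now_ms and reuse the pinned timestamp.
        state = decode_token(page_token)
        as_of, last = state["asOf"], state["last"]
    # Only names that existed at as_of and sort after the last one returned.
    visible = sorted(
        name for name, created in NAMESPACES.items()
        if created <= as_of and name > last
    )
    page = visible[:limit]
    response = {"namespaces": page}
    if len(visible) > limit:
        # More results remain, so hand back a token for the next page.
        response["next-page-token"] = encode_token(as_of, page[-1])
    return response


def list_all(now_ms: int, limit: int) -> list:
    """Client loop: keep requesting until no next token is returned."""
    names, token = [], None
    while True:
        resp = list_namespaces(now_ms, limit, token)
        names.extend(resp["namespaces"])
        token = resp.get("next-page-token")
        if token is None:
            return names
```

Because the client only checks for the presence of a next token, the server could swap this encoding for a numeric offset, a last-key cursor, or a handle to stored state without any client-side change.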

On Tue, Dec 19, 2023 at 9:13 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Hi Jack,
> Some answers inline.
>
>> In addition to the start index approach, another potential simple way to
>> implement the continuation token is to use the last item name, when the
>> listing is guaranteed to be in lexicographic order.
>
> I think this is one viable implementation, but the reason that the token
> should be opaque is that it allows several different implementations
> without client side changes.
>
>> For example, if an element is added before the continuation token, then
>> all future listing calls with the token would always skip that element.
>
> IMO, I think this is fine, for some of the REST APIs it is likely
> important to put constraints on atomicity requirements, for others (e.g.
> list namespaces) I think it is OK to have looser requirements.
>
>> If we want to enforce that level of atomicity, we probably want to
>> introduce another time travel query parameter (e.g. asOf=1703003028000) to
>> ensure that we are listing results at a specific point of time of the
>> warehouse, so the complete result list is fixed.
>
> Time travel might be useful in some cases but I think it is orthogonal to
> services wishing to have guarantees around atomicity/consistency of
> results. If a server wants to ensure that results are atomic/consistent as
> of the start of the listing, it can embed the necessary timestamp in the
> token it returns and parse it out when fetching the next result.
>
> I think this does raise a more general point around service definition
> evolution in general. I think there likely need to be metadata endpoints
> that expose either:
> 1. A version of the REST API supported.
> 2. Features the API supports (e.g. which query parameters are honored for
> a specific endpoint).
>
> There are pros and cons to both approaches (apologies if I missed this in
> the spec or if it has already been discussed).
>
> Cheers,
> Micah
>
> On Tue, Dec 19, 2023 at 8:25 AM Jack Ye <yezhao...@gmail.com> wrote:
>
>> Yes I agree that it is better to not enforce the implementation to favor
>> any direction, and continuation token is probably better than enforcing a
>> numeric start index.
>>
>> In addition to the start index approach, another potential simple way to
>> implement the continuation token is to use the last item name, when the
>> listing is guaranteed to be in lexicographic order. Compared to the start
>> index approach, it does not need to worry about the change of start index
>> when something in the list is added or removed.
>>
>> However, the issue of concurrent modification could still exist even with
>> a continuation token. For example, if an element is added before the
>> continuation token, then all future listing calls with the token would
>> always skip that element. If we want to enforce that level of atomicity, we
>> probably want to introduce another time travel query parameter (e.g.
>> asOf=1703003028000) to ensure that we are listing results at a specific
>> point of time of the warehouse, so the complete result list is fixed. (This
>> is also the missing piece I forgot to mention in the start index approach
>> to ensure it works in distributed settings)
>>
>> -Jack
>>
>> On Tue, Dec 19, 2023, 9:51 AM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>>> I tried to cover these in more details at:
>>> https://docs.google.com/document/d/1bbfoLssY1szCO_Hm3_93ZcN0UAMpf7kjmpwHQngqQJ0/edit
>>>
>>> On Sun, Dec 17, 2023 at 6:07 PM Renjie Liu <liurenjie2...@gmail.com>
>>> wrote:
>>>
>>>> +1 for this approach. I agree that the streaming approach requires that
>>>> http client and servers have http 2 streaming support, which is not
>>>> compatible with old clients.
>>>>
>>>> I share the same concern with Micah that only start/limit may not be
>>>> enough in a distributed environment where modification happens during
>>>> iterations.
>>>> For compatibility, we need to consider several cases:
>>>>
>>>> 1. Old client <-> New Server
>>>> 2. New client <-> Old server
>>>>
>>>> On Sat, Dec 16, 2023 at 6:51 AM Daniel Weeks <dwe...@apache.org> wrote:
>>>>
>>>>> I agree that we want to include this feature and I raised similar
>>>>> concerns to what Micah already presented in talking with Ryan.
>>>>>
>>>>> For backward compatibility, just adding a start and limit implies a
>>>>> deterministic order, which is not a current requirement of the REST spec.
>>>>>
>>>>> Also, we need to consider whether the start/limit would need to be
>>>>> respected by the server. If existing implementations simply return all
>>>>> the results, will that be sufficient? There are a few edge cases that
>>>>> need to be considered here.
>>>>>
>>>>> For the opaque key approach, I think adding a query param to
>>>>> trigger/continue and introducing a continuation token in
>>>>> the ListNamespacesResponse might allow for more backward compatibility.
>>>>> In that scenario, pagination would only take place for clients who know
>>>>> how to paginate and the ordering would not need to be deterministic.
>>>>>
>>>>> -Dan
>>>>>
>>>>> On Fri, Dec 15, 2023, 10:33 AM Micah Kornfield <emkornfi...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Just to clarify and add a small suggestion:
>>>>>>
>>>>>> The behavior with no additional parameters requires the operations to
>>>>>> happen as they do today for backwards compatibility (i.e either all
>>>>>> responses are returned or a failure occurs).
>>>>>>
>>>>>> For new parameters, I'd suggest an opaque start token (instead of
>>>>>> specific numeric offset) that can be returned by the service and a limit
>>>>>> (as proposed above). If a start token is provided without a limit a
>>>>>> default limit can be chosen by the server. Servers might return less
>>>>>> than limit (i.e. clients are required to check for a next token to
>>>>>> determine if iteration is complete).
>>>>>> This enables server side state if it is desired
>>>>>> but also makes deterministic listing much more feasible (deterministic
>>>>>> responses are essentially impossible in the face of changing data if
>>>>>> only a start offset is provided).
>>>>>>
>>>>>> In an ideal world, specifying a limit would result in streaming
>>>>>> responses being returned with the last part either containing a token if
>>>>>> continuation is necessary. Given conversation on the other thread of
>>>>>> streaming, I'd imagine this is quite hard to model in an Open API REST
>>>>>> service.
>>>>>>
>>>>>> Therefore it seems like using pagination with token and offset would
>>>>>> be preferred. If skipping someplace in the middle of the namespaces is
>>>>>> required then I would suggest modelling those as first class query
>>>>>> parameters (e.g. "startAfterNamespace")
>>>>>>
>>>>>> Cheers,
>>>>>> Micah
>>>>>>
>>>>>> On Fri, Dec 15, 2023 at 10:08 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>>> +1 for this approach
>>>>>>>
>>>>>>> I think it's good to use query params because it can be
>>>>>>> backward-compatible with the current behavior. If you get more than the
>>>>>>> limit back, then the service probably doesn't support pagination. And
>>>>>>> if a client doesn't support pagination they get the same results that
>>>>>>> they would today. A streaming approach with a continuation link like in
>>>>>>> the scan API discussion wouldn't work because old clients don't know to
>>>>>>> make a second request.
>>>>>>>
>>>>>>> On Thu, Dec 14, 2023 at 10:07 AM Jack Ye <yezhao...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> During the conversation of the Scan API for REST spec, we touched
>>>>>>>> on the topic of pagination when REST response is large or takes time
>>>>>>>> to be produced.
>>>>>>>>
>>>>>>>> I just want to discuss this separately, since we also see the issue
>>>>>>>> for ListNamespaces and ListTables/Views, when integrating with a large
>>>>>>>> organization that has over 100k namespaces, and also a lot of tables in
>>>>>>>> some namespaces.
>>>>>>>>
>>>>>>>> Pagination requires either keeping state, or the response to be
>>>>>>>> deterministic such that the client can request a range of the full
>>>>>>>> response. If we want to avoid keeping state, I think we need to allow
>>>>>>>> some query parameters like:
>>>>>>>> - *start*: the start index of the item in the response
>>>>>>>> - *limit*: the number of items to be returned in the response
>>>>>>>>
>>>>>>>> So we can send a request like:
>>>>>>>>
>>>>>>>> *GET /namespaces?start=300&limit=100*
>>>>>>>>
>>>>>>>> *GET /namespaces/ns/tables?start=300&limit=100*
>>>>>>>>
>>>>>>>> And the REST spec should enforce that the response returned for the
>>>>>>>> paginated GET should be deterministic.
>>>>>>>>
>>>>>>>> Any thoughts on this?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Jack Ye
>>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>>

--
Ryan Blue
Tabular