Not necessarily. That is more of a general statement. The pagination discussion forked from server side scan planning.

On Tue, Dec 19, 2023 at 9:52 AM Ryan Blue <b...@tabular.io> wrote:

> > With start/limit each client can query for its own chunk without
> > coordination.
>
> Okay, I understand now. Would you need to parallelize the client for
> listing namespaces or tables? That seems odd to me.
>
> On Tue, Dec 19, 2023 at 9:48 AM Walaa Eldin Moustafa <
> wa.moust...@gmail.com> wrote:
>
>> > You can parallelize with opaque tokens by sending a starting point for
>> > the next request.
>>
>> I meant we would have to wait for the server to return this starting
>> point from the previous request. With start/limit each client can query
>> for its own chunk without coordination.
>>
>> On Tue, Dec 19, 2023 at 9:44 AM Ryan Blue <b...@tabular.io> wrote:
>>
>>> > I think start and offset have the advantage of being parallelizable
>>> > (as compared to continuation tokens).
>>>
>>> You can parallelize with opaque tokens by sending a starting point for
>>> the next request.
>>>
>>> > On the other hand, using "asOf" can be complex to implement and may
>>> > be too powerful for the pagination use case
>>>
>>> I don't think that we want to add `asOf`. If the service chooses to do
>>> this, it would send a continuation token that has the information embedded.
>>>
>>> On Tue, Dec 19, 2023 at 9:42 AM Walaa Eldin Moustafa <
>>> wa.moust...@gmail.com> wrote:
>>>
>>>> Can we assume it is the responsibility of the server to ensure
>>>> determinism (e.g., by caching the results along with query ID)? I think
>>>> start and offset have the advantage of being parallelizable (as compared
>>>> to continuation tokens). On the other hand, using "asOf" can be complex to
>>>> implement and may be too powerful for the pagination use case (because it
>>>> allows querying the warehouse as of any point in time, not just now).
>>>>
>>>> Thanks,
>>>> Walaa.
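
To make the start/limit point concrete, here is a minimal sketch of what "each client can query for its own chunk without coordination" could look like. Everything in it is an assumption for illustration: the in-memory NAMESPACES list stands in for the catalog, list_namespaces() stands in for GET /namespaces?start=...&limit=..., and the total count is assumed to be known up front. It also assumes a deterministic order and no concurrent modification, which are exactly the concerns raised further down the thread.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for a catalog; not a real REST endpoint.
NAMESPACES = [f"ns_{i:04d}" for i in range(1000)]


def list_namespaces(start: int, limit: int) -> list[str]:
    """Simulates GET /namespaces?start=<start>&limit=<limit>."""
    return NAMESPACES[start:start + limit]


def list_all_parallel(total: int, limit: int, workers: int = 4) -> list[str]:
    """Fetches every chunk concurrently; each worker derives its own start."""
    starts = range(0, total, limit)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the order of `starts`, so the chunks
        # concatenate back into the original listing order.
        chunks = pool.map(lambda s: list_namespaces(s, limit), starts)
        return [name for chunk in chunks for name in chunk]
```

Because every start offset is computed locally, no request has to wait for the response to a previous one; the price is that the result is only correct if the listing does not change between requests.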
>>>>
>>>> On Tue, Dec 19, 2023 at 9:40 AM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> I think you can solve the atomicity problem with a continuation token
>>>>> and server-side state. In general, I don't think this is a problem we
>>>>> should worry about a lot since pagination commonly has this problem. But
>>>>> since we can build a system that allows you to solve it if you choose to,
>>>>> we should go with that design.
>>>>>
>>>>> On Tue, Dec 19, 2023 at 9:13 AM Micah Kornfield <emkornfi...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Jack,
>>>>>> Some answers inline.
>>>>>>
>>>>>>> In addition to the start index approach, another potential simple
>>>>>>> way to implement the continuation token is to use the last item name,
>>>>>>> when the listing is guaranteed to be in lexicographic order.
>>>>>>
>>>>>> I think this is one viable implementation, but the reason that the
>>>>>> token should be opaque is that it allows several different
>>>>>> implementations without client side changes.
>>>>>>
>>>>>>> For example, if an element is added before the continuation token,
>>>>>>> then all future listing calls with the token would always skip that
>>>>>>> element.
>>>>>>
>>>>>> IMO this is fine. For some of the REST APIs it is likely important
>>>>>> to put constraints on atomicity requirements; for others (e.g. list
>>>>>> namespaces) I think it is OK to have looser requirements.
>>>>>>
>>>>>>> If we want to enforce that level of atomicity, we probably want to
>>>>>>> introduce another time travel query parameter (e.g. asOf=1703003028000)
>>>>>>> to ensure that we are listing results at a specific point of time of
>>>>>>> the warehouse, so the complete result list is fixed.
>>>>>>
>>>>>> Time travel might be useful in some cases but I think it is
>>>>>> orthogonal to services wishing to have guarantees around
>>>>>> atomicity/consistency of results.
>>>>>> If a server wants to ensure that results are atomic/consistent as of
>>>>>> the start of the listing, it can embed the necessary timestamp in the
>>>>>> token it returns and parse it out when fetching the next result.
>>>>>>
>>>>>> I think this does raise a more general point around service
>>>>>> definition evolution. I think there likely need to be metadata
>>>>>> endpoints that expose either:
>>>>>> 1. A version of the REST API supported.
>>>>>> 2. Features the API supports (e.g. which query parameters are
>>>>>> honored for a specific endpoint).
>>>>>>
>>>>>> There are pros and cons to both approaches (apologies if I missed
>>>>>> this in the spec or if it has already been discussed).
>>>>>>
>>>>>> Cheers,
>>>>>> Micah
>>>>>>
>>>>>>
>>>>>> On Tue, Dec 19, 2023 at 8:25 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>
>>>>>>> Yes I agree that it is better to not enforce the implementation to
>>>>>>> favor any direction, and continuation token is probably better than
>>>>>>> enforcing a numeric start index.
>>>>>>>
>>>>>>> In addition to the start index approach, another potential simple
>>>>>>> way to implement the continuation token is to use the last item name,
>>>>>>> when the listing is guaranteed to be in lexicographic order. Compared
>>>>>>> to the start index approach, it does not need to worry about the change
>>>>>>> of start index when something in the list is added or removed.
>>>>>>>
>>>>>>> However, the issue of concurrent modification could still exist even
>>>>>>> with a continuation token. For example, if an element is added before
>>>>>>> the continuation token, then all future listing calls with the token
>>>>>>> would always skip that element. If we want to enforce that level of
>>>>>>> atomicity, we probably want to introduce another time travel query
>>>>>>> parameter (e.g.
>>>>>>> asOf=1703003028000) to ensure that we are listing results at a specific
>>>>>>> point of time of the warehouse, so the complete result list is fixed.
>>>>>>> (This is also the missing piece I forgot to mention in the start index
>>>>>>> approach to ensure it works in distributed settings)
>>>>>>>
>>>>>>> -Jack
>>>>>>>
>>>>>>> On Tue, Dec 19, 2023, 9:51 AM Micah Kornfield <emkornfi...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I tried to cover these in more detail at:
>>>>>>>> https://docs.google.com/document/d/1bbfoLssY1szCO_Hm3_93ZcN0UAMpf7kjmpwHQngqQJ0/edit
>>>>>>>>
>>>>>>>> On Sun, Dec 17, 2023 at 6:07 PM Renjie Liu <liurenjie2...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> +1 for this approach. I agree that the streaming approach requires
>>>>>>>>> that HTTP clients and servers have HTTP/2 streaming support, which
>>>>>>>>> is not compatible with old clients.
>>>>>>>>>
>>>>>>>>> I share the same concern with Micah that only start/limit may not
>>>>>>>>> be enough in a distributed environment where modification happens
>>>>>>>>> during iterations. For compatibility, we need to consider several
>>>>>>>>> cases:
>>>>>>>>>
>>>>>>>>> 1. Old client <-> New Server
>>>>>>>>> 2. New client <-> Old server
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Dec 16, 2023 at 6:51 AM Daniel Weeks <dwe...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I agree that we want to include this feature and I raised similar
>>>>>>>>>> concerns to what Micah already presented in talking with Ryan.
>>>>>>>>>>
>>>>>>>>>> For backward compatibility, just adding a start and limit
>>>>>>>>>> implies a deterministic order, which is not a current requirement
>>>>>>>>>> of the REST spec.
>>>>>>>>>>
>>>>>>>>>> Also, we need to consider whether the start/limit would need to
>>>>>>>>>> be respected by the server. If existing implementations simply
>>>>>>>>>> return all the results, will that be sufficient?
>>>>>>>>>> There are a few edge cases that need to be considered here.
>>>>>>>>>>
>>>>>>>>>> For the opaque key approach, I think adding a query param to
>>>>>>>>>> trigger/continue and introducing a continuation token in
>>>>>>>>>> the ListNamespacesResponse might allow for more backward
>>>>>>>>>> compatibility. In that scenario, pagination would only take place
>>>>>>>>>> for clients who know how to paginate and the ordering would not
>>>>>>>>>> need to be deterministic.
>>>>>>>>>>
>>>>>>>>>> -Dan
>>>>>>>>>>
>>>>>>>>>> On Fri, Dec 15, 2023, 10:33 AM Micah Kornfield <
>>>>>>>>>> emkornfi...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Just to clarify and add a small suggestion:
>>>>>>>>>>>
>>>>>>>>>>> The behavior with no additional parameters requires the
>>>>>>>>>>> operations to happen as they do today for backwards compatibility
>>>>>>>>>>> (i.e. either all responses are returned or a failure occurs).
>>>>>>>>>>>
>>>>>>>>>>> For new parameters, I'd suggest an opaque start token (instead
>>>>>>>>>>> of a specific numeric offset) that can be returned by the service
>>>>>>>>>>> and a limit (as proposed above). If a start token is provided
>>>>>>>>>>> without a limit, a default limit can be chosen by the server.
>>>>>>>>>>> Servers might return fewer items than the limit (i.e. clients are
>>>>>>>>>>> required to check for a next token to determine if iteration is
>>>>>>>>>>> complete). This enables server side state if it is desired but
>>>>>>>>>>> also makes deterministic listing much more feasible (deterministic
>>>>>>>>>>> responses are essentially impossible in the face of changing data
>>>>>>>>>>> if only a start offset is provided).
>>>>>>>>>>>
>>>>>>>>>>> In an ideal world, specifying a limit would result in streaming
>>>>>>>>>>> responses being returned with the last part containing a token if
>>>>>>>>>>> continuation is necessary.
>>>>>>>>>>> Given the conversation on the other thread about streaming, I'd
>>>>>>>>>>> imagine this is quite hard to model in an OpenAPI REST service.
>>>>>>>>>>>
>>>>>>>>>>> Therefore it seems like using pagination with token and offset
>>>>>>>>>>> would be preferred. If skipping someplace in the middle of the
>>>>>>>>>>> namespaces is required then I would suggest modelling those as
>>>>>>>>>>> first class query parameters (e.g. "startAfterNamespace")
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Micah
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Dec 15, 2023 at 10:08 AM Ryan Blue <b...@tabular.io>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> +1 for this approach
>>>>>>>>>>>>
>>>>>>>>>>>> I think it's good to use query params because it can be
>>>>>>>>>>>> backward-compatible with the current behavior. If you get more
>>>>>>>>>>>> than the limit back, then the service probably doesn't support
>>>>>>>>>>>> pagination. And if a client doesn't support pagination they get
>>>>>>>>>>>> the same results that they would today. A streaming approach with
>>>>>>>>>>>> a continuation link like in the scan API discussion wouldn't work
>>>>>>>>>>>> because old clients don't know to make a second request.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Dec 14, 2023 at 10:07 AM Jack Ye <yezhao...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>
>>>>>>>>>>>>> During the conversation about the Scan API for the REST spec, we
>>>>>>>>>>>>> touched on the topic of pagination when a REST response is large
>>>>>>>>>>>>> or takes time to be produced.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I just want to discuss this separately, since we also see the
>>>>>>>>>>>>> issue for ListNamespaces and ListTables/Views, when integrating
>>>>>>>>>>>>> with a large organization that has over 100k namespaces, and also
>>>>>>>>>>>>> a lot of tables in some namespaces.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Pagination requires either keeping state, or the response to
>>>>>>>>>>>>> be deterministic such that the client can request a range of the
>>>>>>>>>>>>> full response. If we want to avoid keeping state, I think we need
>>>>>>>>>>>>> to allow some query parameters like:
>>>>>>>>>>>>> - *start*: the start index of the item in the response
>>>>>>>>>>>>> - *limit*: the number of items to be returned in the response
>>>>>>>>>>>>>
>>>>>>>>>>>>> So we can send a request like:
>>>>>>>>>>>>>
>>>>>>>>>>>>> *GET /namespaces?start=300&limit=100*
>>>>>>>>>>>>>
>>>>>>>>>>>>> *GET /namespaces/ns/tables?start=300&limit=100*
>>>>>>>>>>>>>
>>>>>>>>>>>>> And the REST spec should enforce that the response returned
>>>>>>>>>>>>> for the paginated GET should be deterministic.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Any thoughts on this?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>> Tabular
>>>>>>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>
> --
> Ryan Blue
> Tabular
>
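
For reference, the opaque-token listing discussed in this thread could be sketched roughly as below. This is only an illustration under stated assumptions, not what the REST spec defines: the parameter name page_token, the DEFAULT_LIMIT value, the base64/JSON token encoding, and the in-memory NAMESPACES list are all hypothetical. The server here uses the "last item name" strategy over a lexicographically ordered listing, but since the client treats the token as opaque, a server could instead embed a timestamp or a server-side state ID without any client change.

```python
import base64
import json
from typing import Optional

# Hypothetical in-memory catalog, kept in lexicographic order.
NAMESPACES = sorted(f"ns_{i:03d}" for i in range(250))
DEFAULT_LIMIT = 100  # hypothetical server-chosen default


def encode_token(last_name: str) -> str:
    """Packs the last returned name into a token the client cannot rely on."""
    payload = json.dumps({"last": last_name}).encode("utf-8")
    return base64.urlsafe_b64encode(payload).decode("ascii")


def decode_token(token: str) -> str:
    """Server-side only: recovers the last returned name from a token."""
    return json.loads(base64.urlsafe_b64decode(token.encode("ascii")))["last"]


def list_namespaces(page_token: Optional[str] = None,
                    limit: Optional[int] = None):
    """Server side: returns (items, next_token); next_token is None when done."""
    limit = limit or DEFAULT_LIMIT
    if page_token is None:
        remaining = NAMESPACES
    else:
        last = decode_token(page_token)
        # Resume strictly after the last name already returned.
        remaining = [name for name in NAMESPACES if name > last]
    page = remaining[:limit]
    next_token = encode_token(page[-1]) if len(remaining) > limit else None
    return page, next_token


def list_all(limit: int = 100) -> list[str]:
    """Client side: keeps following tokens until the server returns none."""
    names: list[str] = []
    token = None
    while True:
        page, token = list_namespaces(token, limit)
        names.extend(page)
        if token is None:
            return names
```

The client loop only checks whether a next token came back, so it also works when a server returns fewer items than the requested limit, as suggested earlier in the thread.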