Re: Pagination for List APIs in the REST spec

Walaa Eldin Moustafa Tue, 19 Dec 2023 09:55:51 -0800

Not necessarily. That is more of a general statement. The pagination
discussion forked from server side scan planning.


On Tue, Dec 19, 2023 at 9:52 AM Ryan Blue <b...@tabular.io> wrote:

> > With start/limit each client can query for its own chunk without
> coordination.
>
> Okay, I understand now. Would you need to parallelize the client for
> listing namespaces or tables? That seems odd to me.
>
> On Tue, Dec 19, 2023 at 9:48 AM Walaa Eldin Moustafa <
> wa.moust...@gmail.com> wrote:
>
>> > You can parallelize with opaque tokens by sending a starting point for
>> the next request.
>>
>> I meant we would have to wait for the server to return this starting
>> point from the previous request? With start/limit each client can query for
>> its own chunk without coordination.
>>
>> On Tue, Dec 19, 2023 at 9:44 AM Ryan Blue <b...@tabular.io> wrote:
>>
>>> > I think start and limit have the advantage of being parallelizable (as
>>> compared to continuation tokens).
>>>
>>> You can parallelize with opaque tokens by sending a starting point for
>>> the next request.
>>>
>>> > On the other hand, using "asOf" can be complex to implement and may
>>> be too powerful for the pagination use case
>>>
>>> I don't think that we want to add `asOf`. If the service chooses to do
>>> this, it would send a continuation token that has the information embedded.
>>>
>>> On Tue, Dec 19, 2023 at 9:42 AM Walaa Eldin Moustafa <
>>> wa.moust...@gmail.com> wrote:
>>>
>>>> Can we assume it is the responsibility of the server to ensure
>>>> determinism (e.g., by caching the results along with a query ID)? I think
>>>> start and limit have the advantage of being parallelizable (as compared to
>>>> continuation tokens). On the other hand, using "asOf" can be complex to
>>>> implement and may be too powerful for the pagination use case (because it
>>>> allows querying the warehouse as of any point in time, not just now).
>>>>
>>>> Thanks,
>>>> Walaa.
>>>>
>>>> On Tue, Dec 19, 2023 at 9:40 AM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> I think you can solve the atomicity problem with a continuation token
>>>>> and server-side state. In general, I don't think this is a problem we
>>>>> should worry about a lot since pagination commonly has this problem. But
>>>>> since we can build a system that allows you to solve it if you choose to,
>>>>> we should go with that design.
>>>>>
>>>>> On Tue, Dec 19, 2023 at 9:13 AM Micah Kornfield <emkornfi...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Jack,
>>>>>> Some answers inline.
>>>>>>
>>>>>>
>>>>>>> In addition to the start index approach, another potential simple
>>>>>>> way to implement the continuation token is to use the last item name,
>>>>>>> when the listing is guaranteed to be in lexicographic order.
>>>>>>
>>>>>>
>>>>>> I think this is one viable implementation, but the reason that the
>>>>>> token should be opaque is that it allows several different
>>>>>> implementations without client-side changes.
>>>>>>
>>>>>>> For example, if an element is added before the continuation token,
>>>>>>> then all future listing calls with the token would always skip that
>>>>>>> element.
>>>>>>
>>>>>>
>>>>>> IMO this is fine. For some of the REST APIs it is likely
>>>>>> important to put constraints on atomicity requirements; for others (e.g.
>>>>>> list namespaces) I think it is OK to have looser requirements.
>>>>>>
>>>>>>> If we want to enforce that level of atomicity, we probably want to
>>>>>>> introduce another time travel query parameter (e.g. asOf=1703003028000)
>>>>>>> to ensure that we are listing results at a specific point of time of the
>>>>>>> warehouse, so the complete result list is fixed.
>>>>>>
>>>>>>
>>>>>> Time travel might be useful in some cases but I think it is
>>>>>> orthogonal to services wishing to have guarantees around
>>>>>> atomicity/consistency of results.  If a server wants to ensure that
>>>>>> results are atomic/consistent as of the start of the listing, it can
>>>>>> embed the necessary timestamp in the token it returns and parse it out
>>>>>> when fetching the next result.
>>>>>>
>>>>>> I think this does raise a more general point around service
>>>>>> definition evolution.  I think there likely need to be metadata
>>>>>> endpoints that expose either:
>>>>>> 1.  A version of the REST API supported.
>>>>>> 2.  Features the API supports (e.g. which query parameters are
>>>>>> honored for a specific endpoint).
>>>>>>
>>>>>> There are pros and cons to both approaches (apologies if I missed
>>>>>> this in the spec or if it has already been discussed).
>>>>>>
>>>>>> Cheers,
>>>>>> Micah
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Dec 19, 2023 at 8:25 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>
>>>>>>> Yes, I agree that it is better not to force the implementation in any
>>>>>>> particular direction, and a continuation token is probably better than
>>>>>>> enforcing a numeric start index.
>>>>>>>
>>>>>>> In addition to the start index approach, another potential simple
>>>>>>> way to implement the continuation token is to use the last item name,
>>>>>>> when the listing is guaranteed to be in lexicographic order. Compared
>>>>>>> to the start index approach, it does not need to worry about the start
>>>>>>> index changing when something in the list is added or removed.
>>>>>>>
>>>>>>> However, the issue of concurrent modification could still exist even
>>>>>>> with a continuation token. For example, if an element is added before
>>>>>>> the continuation token, then all future listing calls with the token
>>>>>>> would always skip that element. If we want to enforce that level of
>>>>>>> atomicity, we probably want to introduce another time travel query
>>>>>>> parameter (e.g. asOf=1703003028000) to ensure that we are listing
>>>>>>> results at a specific point of time of the warehouse, so the complete
>>>>>>> result list is fixed. (This is also the missing piece I forgot to
>>>>>>> mention in the start index approach to ensure it works in distributed
>>>>>>> settings.)
>>>>>>>
>>>>>>> -Jack
>>>>>>>
>>>>>>> On Tue, Dec 19, 2023, 9:51 AM Micah Kornfield <emkornfi...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I tried to cover these in more detail at:
>>>>>>>> https://docs.google.com/document/d/1bbfoLssY1szCO_Hm3_93ZcN0UAMpf7kjmpwHQngqQJ0/edit
>>>>>>>>
>>>>>>>> On Sun, Dec 17, 2023 at 6:07 PM Renjie Liu <liurenjie2...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> +1 for this approach. I agree that the streaming approach requires
>>>>>>>>> that HTTP clients and servers have HTTP/2 streaming support, which is
>>>>>>>>> not compatible with old clients.
>>>>>>>>>
>>>>>>>>> I share the same concern as Micah that start/limit alone may not
>>>>>>>>> be enough in a distributed environment where modification happens
>>>>>>>>> during iteration. For compatibility, we need to consider several cases:
>>>>>>>>>
>>>>>>>>> 1. Old client <-> New Server
>>>>>>>>> 2. New client <-> Old server
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Dec 16, 2023 at 6:51 AM Daniel Weeks <dwe...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I agree that we want to include this feature and I raised similar
>>>>>>>>>> concerns to what Micah already presented in talking with Ryan.
>>>>>>>>>>
>>>>>>>>>> For backward compatibility, just adding a start and limit
>>>>>>>>>> implies a deterministic order, which is not a current requirement of
>>>>>>>>>> the REST spec.
>>>>>>>>>>
>>>>>>>>>> Also, we need to consider whether the start/limit would need to
>>>>>>>>>> be respected by the server.  If existing implementations simply
>>>>>>>>>> return all the results, will that be sufficient?  There are a few
>>>>>>>>>> edge cases that need to be considered here.
>>>>>>>>>>
>>>>>>>>>> For the opaque key approach, I think adding a query param to
>>>>>>>>>> trigger/continue and introducing a continuation token in
>>>>>>>>>> the ListNamespacesResponse might allow for more backward
>>>>>>>>>> compatibility.  In that scenario, pagination would only take place
>>>>>>>>>> for clients who know how to paginate and the ordering would not need
>>>>>>>>>> to be deterministic.
>>>>>>>>>>
>>>>>>>>>> -Dan
>>>>>>>>>>
>>>>>>>>>> On Fri, Dec 15, 2023, 10:33 AM Micah Kornfield <
>>>>>>>>>> emkornfi...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Just to clarify and add a small suggestion:
>>>>>>>>>>>
>>>>>>>>>>> The behavior with no additional parameters requires the
>>>>>>>>>>> operations to happen as they do today for backwards compatibility
>>>>>>>>>>> (i.e., either all responses are returned or a failure occurs).
>>>>>>>>>>>
>>>>>>>>>>> For new parameters, I'd suggest an opaque start token (instead
>>>>>>>>>>> of a specific numeric offset) that can be returned by the service
>>>>>>>>>>> and a limit (as proposed above). If a start token is provided
>>>>>>>>>>> without a limit, a default limit can be chosen by the server.
>>>>>>>>>>> Servers might return fewer than limit (i.e., clients are required
>>>>>>>>>>> to check for a next token to determine if iteration is complete).
>>>>>>>>>>> This enables server-side state if it is desired but also makes
>>>>>>>>>>> deterministic listing much more feasible (deterministic responses
>>>>>>>>>>> are essentially impossible in the face of changing data if only a
>>>>>>>>>>> start offset is provided).
>>>>>>>>>>>
>>>>>>>>>>> In an ideal world, specifying a limit would result in streaming
>>>>>>>>>>> responses being returned, with the last part containing a token if
>>>>>>>>>>> continuation is necessary.  Given the conversation on the other
>>>>>>>>>>> thread about streaming, I'd imagine this is quite hard to model in
>>>>>>>>>>> an OpenAPI REST service.
>>>>>>>>>>>
>>>>>>>>>>> Therefore it seems like using pagination with token and offset
>>>>>>>>>>> would be preferred.  If skipping someplace in the middle of the
>>>>>>>>>>> namespaces is required then I would suggest modelling those as
>>>>>>>>>>> first class query parameters (e.g. "startAfterNamespace").
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Micah
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Dec 15, 2023 at 10:08 AM Ryan Blue <b...@tabular.io>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> +1 for this approach
>>>>>>>>>>>>
>>>>>>>>>>>> I think it's good to use query params because it can be
>>>>>>>>>>>> backward-compatible with the current behavior. If you get more
>>>>>>>>>>>> than the limit back, then the service probably doesn't support
>>>>>>>>>>>> pagination. And if a client doesn't support pagination they get
>>>>>>>>>>>> the same results that they would today. A streaming approach with
>>>>>>>>>>>> a continuation link like in the scan API discussion wouldn't work
>>>>>>>>>>>> because old clients don't know to make a second request.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Dec 14, 2023 at 10:07 AM Jack Ye <yezhao...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>
>>>>>>>>>>>>> During the conversation about the Scan API for the REST spec, we
>>>>>>>>>>>>> touched on the topic of pagination when a REST response is large
>>>>>>>>>>>>> or takes time to be produced.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I just want to discuss this separately, since we also see the
>>>>>>>>>>>>> issue for ListNamespaces and ListTables/Views, when integrating
>>>>>>>>>>>>> with a large organization that has over 100k namespaces, and also
>>>>>>>>>>>>> a lot of tables in some namespaces.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Pagination requires either keeping state, or the response to
>>>>>>>>>>>>> be deterministic such that the client can request a range of the
>>>>>>>>>>>>> full response. If we want to avoid keeping state, I think we need
>>>>>>>>>>>>> to allow some query parameters like:
>>>>>>>>>>>>> - *start*: the start index of the item in the response
>>>>>>>>>>>>> - *limit*: the number of items to be returned in the response
>>>>>>>>>>>>>
>>>>>>>>>>>>> So we can send a request like:
>>>>>>>>>>>>>
>>>>>>>>>>>>> *GET /namespaces?start=300&limit=100*
>>>>>>>>>>>>>
>>>>>>>>>>>>> *GET /namespaces/ns/tables?start=300&limit=100*
>>>>>>>>>>>>>
>>>>>>>>>>>>> And the REST spec should enforce that the response returned
>>>>>>>>>>>>> for the paginated GET should be deterministic.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Any thoughts on this?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>> Tabular
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>
> --
> Ryan Blue
> Tabular
>
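
For illustration only (not part of the thread): a minimal Python sketch of
the opaque continuation token idea discussed above, assuming the server
lists names in lexicographic order and stores the last returned name in the
token, as Jack suggests. The function names, the page_token/limit parameter
names, and the base64-encoded JSON token layout are all hypothetical; none
of them come from the REST spec.

```python
import base64
import json
from bisect import bisect_right
from typing import List, Optional, Tuple


def encode_token(last_name: str) -> str:
    # Hypothetical token layout: base64url-encoded JSON. Clients treat it as opaque.
    payload = json.dumps({"after": last_name}).encode("utf-8")
    return base64.urlsafe_b64encode(payload).decode("ascii")


def decode_token(token: str) -> str:
    # Only the server needs to understand the token contents.
    return json.loads(base64.urlsafe_b64decode(token.encode("ascii")))["after"]


def list_namespaces(
    names: List[str], page_token: Optional[str] = None, limit: int = 2
) -> Tuple[List[str], Optional[str]]:
    """Server side: return one page plus a token, or None when listing is done."""
    ordered = sorted(names)  # assumes lexicographic listing order
    # Resume strictly after the last name that was already returned.
    start = 0 if page_token is None else bisect_right(ordered, decode_token(page_token))
    page = ordered[start:start + limit]
    has_more = start + limit < len(ordered)
    next_token = encode_token(page[-1]) if has_more and page else None
    return page, next_token


def list_all(names: List[str], limit: int = 2) -> List[str]:
    """Client side: keep requesting pages until no next token is returned."""
    results: List[str] = []
    token: Optional[str] = None
    while True:
        page, token = list_namespaces(names, token, limit)
        results.extend(page)
        if token is None:
            return results
```

Under these assumptions, a name inserted before the token position between
two requests is skipped by later pages (the behavior Jack describes), but no
item is returned twice, whereas a numeric start index would shift and repeat
an item. Because the client never parses the token, a server could also add
further fields to the payload (for example a timestamp, as Micah suggests)
without client-side changes.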
