Re: Pagination for List APIs in the REST spec

Ryan Blue Tue, 19 Dec 2023 09:44:51 -0800

> I think start and offset has the advantage of being parallelizable (as
compared to continuation tokens).


You can parallelize with opaque tokens by sending a starting point for the
next request.

> On the other hand, using "asOf" can be complex to implement and may be
too powerful for the pagination use case

I don't think that we want to add `asOf`. If the service chooses to do
this, it would send a continuation token that has the information embedded.
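
A minimal sketch of what "a continuation token that has the information embedded" could look like. Everything here is an assumption for illustration: the in-memory NAMESPACES data, the function names, the `next_token` field, and the base64-encoded JSON layout are hypothetical and not part of the REST spec. The server pins the listing to the timestamp of the first request and stores that timestamp in the token, so the client never needs an `asOf` parameter.

```python
import base64
import json
from typing import Optional

# Hypothetical in-memory catalog: namespace name -> creation time (epoch ms).
NAMESPACES = {"a": 100, "b": 200, "c": 300, "d": 400, "e": 500}


def encode_token(as_of_ms: int, last_name: str) -> str:
    """Pack server-side paging state into a string the client treats as opaque."""
    raw = json.dumps({"asOf": as_of_ms, "last": last_name}).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_token(token: str) -> dict:
    """Recover the paging state from a token previously issued by encode_token."""
    return json.loads(base64.urlsafe_b64decode(token.encode("ascii")))


def list_namespaces(now_ms: int, limit: int, page_token: Optional[str] = None) -> dict:
    """Return one page; the first call fixes the timestamp, later calls reuse it."""
    if page_token is None:
        as_of, last = now_ms, ""
    else:
        state = decode_token(page_token)
        as_of, last = state["asOf"], state["last"]
    # Only names that existed at the pinned timestamp and sort after the last one returned.
    visible = sorted(
        name for name, created in NAMESPACES.items() if created <= as_of and name > last
    )
    page = visible[:limit]
    next_token = encode_token(as_of, page[-1]) if len(visible) > limit else None
    return {"namespaces": page, "next_token": next_token}


def list_all(now_ms: int, limit: int) -> list:
    """Client loop: keep requesting pages until no next token is returned."""
    names, token = [], None
    while True:
        resp = list_namespaces(now_ms, limit, token)
        names.extend(resp["namespaces"])
        token = resp["next_token"]
        if token is None:
            return names
```

Because the client only echoes the token back, a service could later switch to a different layout (a start index, a server-side cursor ID) without any client change. The sketch only handles namespaces added after the first request; handling removals would need the server to keep more state.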

On Tue, Dec 19, 2023 at 9:42 AM Walaa Eldin Moustafa <wa.moust...@gmail.com>
wrote:

> Can we assume it is the responsibility of the server to ensure determinism
> (e.g., by caching the results along with query ID)? I think start and
> offset has the advantage of being parallelizable (as compared to
> continuation tokens). On the other hand, using "asOf" can be complex to
> implement and may be too powerful for the pagination use case (because it
> allows querying the warehouse as of any point in time, not just now).
>
> Thanks,
> Walaa.
>
> On Tue, Dec 19, 2023 at 9:40 AM Ryan Blue <b...@tabular.io> wrote:
>
>> I think you can solve the atomicity problem with a continuation token and
>> server-side state. In general, I don't think this is a problem we should
>> worry about a lot since pagination commonly has this problem. But since we
>> can build a system that allows you to solve it if you choose to, we should
>> go with that design.
>>
>> On Tue, Dec 19, 2023 at 9:13 AM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>>> Hi Jack,
>>> Some answers inline.
>>>
>>>
>>>> In addition to the start index approach, another potential simple way
>>>> to implement the continuation token is to use the last item name, when the
>>>> listing is guaranteed to be in lexicographic order.
>>>
>>>
>>> I think this is one viable implementation, but the reason that the token
>>> should be opaque is that it allows several different implementations
>>> without client side changes.
>>>
>>>> For example, if an element is added before the continuation token, then
>>>> all future listing calls with the token would always skip that element.
>>>
>>>
>>> IMO this is fine. For some of the REST APIs it is likely important to
>>> put constraints on atomicity requirements; for others (e.g. list
>>> namespaces) I think it is OK to have looser requirements.
>>>
>>>> If we want to enforce that level of atomicity, we probably want to
>>>> introduce another time travel query parameter (e.g. asOf=1703003028000) to
>>>> ensure that we are listing results at a specific point of time of the
>>>> warehouse, so the complete result list is fixed.
>>>
>>>
>>> Time travel might be useful in some cases but I think it is orthogonal
>>> to services wishing to have guarantees around atomicity/consistency of
>>> results. If a server wants to ensure that results are atomic/consistent as
>>> of the start of the listing, it can embed the necessary timestamp in the
>>> token it returns and parse it out when fetching the next result.
>>>
>>> I think this does raise a more general point about service definition
>>> evolution. I think there likely need to be metadata endpoints that
>>> expose either:
>>> 1. A version of the REST API supported.
>>> 2. Features the API supports (e.g. which query parameters are honored
>>> for a specific endpoint).
>>>
>>> There are pros and cons to both approaches (apologies if I missed this
>>> in the spec or if it has already been discussed).
>>>
>>> Cheers,
>>> Micah
>>>
>>> On Tue, Dec 19, 2023 at 8:25 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>>> Yes, I agree that it is better not to force implementations in any
>>>> particular direction, and a continuation token is probably better than
>>>> enforcing a numeric start index.
>>>>
>>>> In addition to the start index approach, another potential simple way
>>>> to implement the continuation token is to use the last item name, when the
>>>> listing is guaranteed to be in lexicographic order. Compared to the start
>>>> index approach, it does not need to worry about the change of start index
>>>> when something in the list is added or removed.
>>>>
>>>> However, the issue of concurrent modification could still exist even
>>>> with a continuation token. For example, if an element is added before the
>>>> continuation token, then all future listing calls with the token would
>>>> always skip that element. If we want to enforce that level of atomicity, we
>>>> probably want to introduce another time travel query parameter (e.g.
>>>> asOf=1703003028000) to ensure that we are listing results at a specific
>>>> point of time of the warehouse, so the complete result list is fixed. (This
>>>> is also the missing piece I forgot to mention in the start index approach
>>>> to ensure it works in distributed settings)
>>>>
>>>> -Jack
>>>>
>>>> On Tue, Dec 19, 2023, 9:51 AM Micah Kornfield <emkornfi...@gmail.com>
>>>> wrote:
>>>>
>>>>> I tried to cover these in more detail at:
>>>>> https://docs.google.com/document/d/1bbfoLssY1szCO_Hm3_93ZcN0UAMpf7kjmpwHQngqQJ0/edit
>>>>>
>>>>> On Sun, Dec 17, 2023 at 6:07 PM Renjie Liu <liurenjie2...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> +1 for this approach. I agree that the streaming approach requires
>>>>>> that HTTP clients and servers have HTTP/2 streaming support, which is
>>>>>> not compatible with old clients.
>>>>>>
>>>>>> I share the same concern with Micah that only start/limit may not be
>>>>>> enough in a distributed environment where modification happens during
>>>>>> iterations. For compatibility, we need to consider several cases:
>>>>>>
>>>>>> 1. Old client <-> New Server
>>>>>> 2. New client <-> Old server
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, Dec 16, 2023 at 6:51 AM Daniel Weeks <dwe...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> I agree that we want to include this feature and I raised similar
>>>>>>> concerns to what Micah already presented in talking with Ryan.
>>>>>>>
>>>>>>> For backward compatibility, just adding a start and limit implies a
>>>>>>> deterministic order, which is not a current requirement of the REST
>>>>>>> spec.
>>>>>>>
>>>>>>> Also, we need to consider whether the start/limit would need to be
>>>>>>> respected by the server. If existing implementations simply return
>>>>>>> all the results, will that be sufficient? There are a few edge cases
>>>>>>> that need to be considered here.
>>>>>>>
>>>>>>> For the opaque key approach, I think adding a query param to
>>>>>>> trigger/continue and introducing a continuation token in the
>>>>>>> ListNamespacesResponse might allow for more backward compatibility.
>>>>>>> In that scenario, pagination would only take place for clients who
>>>>>>> know how to paginate and the ordering would not need to be
>>>>>>> deterministic.
>>>>>>>
>>>>>>> -Dan
>>>>>>>
>>>>>>> On Fri, Dec 15, 2023, 10:33 AM Micah Kornfield <
>>>>>>> emkornfi...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Just to clarify and add a small suggestion:
>>>>>>>>
>>>>>>>> The behavior with no additional parameters requires the operations
>>>>>>>> to happen as they do today for backwards compatibility (i.e., either
>>>>>>>> all responses are returned or a failure occurs).
>>>>>>>>
>>>>>>>> For new parameters, I'd suggest an opaque start token (instead of a
>>>>>>>> specific numeric offset) that can be returned by the service and a
>>>>>>>> limit (as proposed above). If a start token is provided without a
>>>>>>>> limit, a default limit can be chosen by the server. Servers might
>>>>>>>> return fewer than limit items (i.e., clients are required to check
>>>>>>>> for a next token to determine if iteration is complete). This enables
>>>>>>>> server-side state if it is desired but also makes deterministic
>>>>>>>> listing much more feasible (deterministic responses are essentially
>>>>>>>> impossible in the face of changing data if only a start offset is
>>>>>>>> provided).
>>>>>>>>
>>>>>>>> In an ideal world, specifying a limit would result in streaming
>>>>>>>> responses being returned, with the last part containing a token if
>>>>>>>> continuation is necessary. Given the conversation on the other thread
>>>>>>>> about streaming, I'd imagine this is quite hard to model in an OpenAPI
>>>>>>>> REST service.
>>>>>>>>
>>>>>>>> Therefore it seems like using pagination with token and offset
>>>>>>>> would be preferred. If skipping to someplace in the middle of the
>>>>>>>> namespaces is required, then I would suggest modelling those as
>>>>>>>> first-class query parameters (e.g. "startAfterNamespace").
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Micah
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Dec 15, 2023 at 10:08 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>
>>>>>>>>> +1 for this approach
>>>>>>>>>
>>>>>>>>> I think it's good to use query params because it can be
>>>>>>>>> backward-compatible with the current behavior. If you get more than
>>>>>>>>> the limit back, then the service probably doesn't support
>>>>>>>>> pagination. And if a client doesn't support pagination they get the
>>>>>>>>> same results that they would today. A streaming approach with a
>>>>>>>>> continuation link like in the scan API discussion wouldn't work
>>>>>>>>> because old clients don't know to make a second request.
>>>>>>>>>
>>>>>>>>> On Thu, Dec 14, 2023 at 10:07 AM Jack Ye <yezhao...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi everyone,
>>>>>>>>>>
>>>>>>>>>> During the conversation about the Scan API for the REST spec, we
>>>>>>>>>> touched on the topic of pagination when a REST response is large
>>>>>>>>>> or takes time to be produced.
>>>>>>>>>>
>>>>>>>>>> I just want to discuss this separately, since we also see the
>>>>>>>>>> issue for ListNamespaces and ListTables/Views, when integrating
>>>>>>>>>> with a large organization that has over 100k namespaces, and also
>>>>>>>>>> a lot of tables in some namespaces.
>>>>>>>>>>
>>>>>>>>>> Pagination requires either keeping state, or the response to be
>>>>>>>>>> deterministic such that the client can request a range of the full
>>>>>>>>>> response. If we want to avoid keeping state, I think we need to
>>>>>>>>>> allow some query parameters like:
>>>>>>>>>> - *start*: the start index of the item in the response
>>>>>>>>>> - *limit*: the number of items to be returned in the response
>>>>>>>>>>
>>>>>>>>>> So we can send a request like:
>>>>>>>>>>
>>>>>>>>>> *GET /namespaces?start=300&limit=100*
>>>>>>>>>>
>>>>>>>>>> *GET /namespaces/ns/tables?start=300&limit=100*
>>>>>>>>>>
>>>>>>>>>> And the REST spec should enforce that the response returned for
>>>>>>>>>> the paginated GET should be deterministic.
>>>>>>>>>>
>>>>>>>>>> Any thoughts on this?
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Jack Ye
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Tabular
>>>>>>>>>
>>>>>>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>

-- 
Ryan Blue
Tabular
