> With start/limit each client can query for its own chunk without
> coordination.

Okay, I understand now. Would you need to parallelize the client for
listing namespaces or tables? That seems odd to me.
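For concreteness, the coordination-free chunking described in the quote above could look like the sketch below. The fetch helper and chunk size are hypothetical stand-ins; a real client would issue `GET /namespaces?start=...&limit=...` against the catalog.

```python
# Sketch: with start/limit, workers fetch disjoint chunks of a listing
# concurrently, with no coordination between requests. ALL_NAMESPACES
# stands in for the server's (deterministic) full listing.
from concurrent.futures import ThreadPoolExecutor

ALL_NAMESPACES = [f"ns_{i:04d}" for i in range(250)]
CHUNK = 100

def fetch_chunk(start, limit=CHUNK):
    # Placeholder for: GET /namespaces?start={start}&limit={limit}
    return ALL_NAMESPACES[start:start + limit]

def list_all_parallel(total, workers=4):
    starts = range(0, total, CHUNK)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = pool.map(fetch_chunk, starts)  # results come back in order
    return [item for chunk in chunks for item in chunk]
```

Note that correctness here depends on the server returning a deterministic ordering across requests, which is exactly the constraint debated in this thread.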

On Tue, Dec 19, 2023 at 9:48 AM Walaa Eldin Moustafa <wa.moust...@gmail.com>
wrote:

> > You can parallelize with opaque tokens by sending a starting point for
> the next request.
>
> I meant we would have to wait for the server to return this starting point
> from the previous request? With start/limit each client can query for its
> own chunk without coordination.
>
> On Tue, Dec 19, 2023 at 9:44 AM Ryan Blue <b...@tabular.io> wrote:
>
>> > I think start and offset have the advantage of being parallelizable (as
>> compared to continuation tokens).
>>
>> You can parallelize with opaque tokens by sending a starting point for
>> the next request.
>>
>> > On the other hand, using "asOf" can be complex to implement and may be
>> too powerful for the pagination use case
>>
>> I don't think that we want to add `asOf`. If the service chooses to do
>> this, it would send a continuation token that has the information embedded.
>>
>> On Tue, Dec 19, 2023 at 9:42 AM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>>
>>> Can we assume it is the responsibility of the server to ensure
>>> determinism (e.g., by caching the results along with a query ID)? I think
>>> start and offset have the advantage of being parallelizable (as compared
>>> to continuation tokens). On the other hand, using "asOf" can be complex
>>> to implement and may be too powerful for the pagination use case (because
>>> it allows querying the warehouse as of any point in time, not just now).
>>>
>>> Thanks,
>>> Walaa.
>>>
>>> On Tue, Dec 19, 2023 at 9:40 AM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> I think you can solve the atomicity problem with a continuation token
>>>> and server-side state. In general, I don't think this is a problem we
>>>> should worry about a lot since pagination commonly has this problem. But
>>>> since we can build a system that allows you to solve it if you choose to,
>>>> we should go with that design.
>>>>
>>>> On Tue, Dec 19, 2023 at 9:13 AM Micah Kornfield <emkornfi...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Jack,
>>>>> Some answers inline.
>>>>>
>>>>>
>>>>>> In addition to the start index approach, another potential simple way
>>>>>> to implement the continuation token is to use the last item name, when
>>>>>> the listing is guaranteed to be in lexicographic order.
>>>>>
>>>>>
>>>>> I think this is one viable implementation, but the reason that the
>>>>> token should be opaque is that it allows several different implementations
>>>>> without client side changes.
>>>>>
>>>>> For example, if an element is added before the continuation token,
>>>>>> then all future listing calls with the token would always skip that
>>>>>> element.
>>>>>
>>>>>
>>>>> IMO, I think this is fine; for some of the REST APIs it is likely
>>>>> important to put constraints on atomicity requirements, while for others
>>>>> (e.g. list namespaces) I think it is OK to have looser requirements.
>>>>>
>>>>>> If we want to enforce that level of atomicity, we probably want to
>>>>>> introduce another time travel query parameter (e.g. asOf=1703003028000)
>>>>>> to ensure that we are listing results at a specific point in time of
>>>>>> the warehouse, so the complete result list is fixed.
>>>>>
>>>>>
>>>>> Time travel might be useful in some cases, but I think it is orthogonal
>>>>> to services wishing to have guarantees around atomicity/consistency of
>>>>> results.  If a server wants to ensure that results are atomic/consistent
>>>>> as of the start of the listing, it can embed the necessary timestamp in
>>>>> the token it returns and parse it out when fetching the next result.
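As a purely illustrative sketch of the idea above (not part of any spec), the server could base64-encode both the last key returned and the listing's snapshot timestamp into the opaque token:

```python
# Hypothetical opaque continuation token: the client treats it as a blob,
# while the server round-trips a snapshot timestamp and the last key seen.
import base64
import json

def encode_token(last_key, as_of_ms):
    payload = json.dumps({"last": last_key, "asOf": as_of_ms})
    return base64.urlsafe_b64encode(payload.encode()).decode()

def decode_token(token):
    payload = json.loads(base64.urlsafe_b64decode(token.encode()))
    return payload["last"], payload["asOf"]
```

Because the token is opaque to clients, the server can later switch to a different encoding (or to server-side state keyed by an ID) without any client-side change.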
>>>>>
>>>>> I think this does raise a more general point around service definition
>>>>> evolution. I think there likely need to be metadata endpoints that
>>>>> expose either:
>>>>> 1. The version of the REST API supported.
>>>>> 2. The features the API supports (e.g. which query parameters are
>>>>> honored for a specific endpoint).
>>>>>
>>>>> There are pros and cons to both approaches (apologies if I missed this
>>>>> in the spec or if it has already been discussed).
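For illustration, a feature-discovery response for option 2 might look like the following; the shape and field names here are entirely made up, not taken from the spec:

```python
# Hypothetical capability-discovery payload a server could expose so
# clients can tell whether pagination parameters will be honored.
capabilities = {
    "rest-api-version": "v1",
    "features": {
        "list-namespaces": {"supports": ["pageToken", "pageSize"]},
        "list-tables": {"supports": []},  # this endpoint ignores pagination
    },
}

def endpoint_supports(caps, endpoint, param):
    # True if the given endpoint honors the given query parameter.
    return param in caps["features"].get(endpoint, {}).get("supports", [])
```

A client would consult this once at startup and fall back to unpaginated listing for endpoints that do not advertise the parameters.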
>>>>>
>>>>> Cheers,
>>>>> Micah
>>>>>
>>>>> On Tue, Dec 19, 2023 at 8:25 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>
>>>>>> Yes I agree that it is better to not enforce the implementation to
>>>>>> favor any direction, and continuation token is probably better than
>>>>>> enforcing a numeric start index.
>>>>>>
>>>>>> In addition to the start index approach, another potential simple way
>>>>>> to implement the continuation token is to use the last item name, when
>>>>>> the listing is guaranteed to be in lexicographic order. Compared to the
>>>>>> start index approach, it avoids the problem of start indexes shifting
>>>>>> when something in the list is added or removed.
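A minimal sketch of this last-item-name scheme (server-side logic assumed; names hypothetical):

```python
# Paginate a lexicographically sorted listing by using the last returned
# name as the continuation token; unlike numeric offsets, earlier
# insertions or deletions do not shift later pages.
import bisect

def list_page(names, token=None, limit=2):
    ordered = sorted(names)
    # Resume strictly after the token, or from the beginning if none.
    start = bisect.bisect_right(ordered, token) if token else 0
    page = ordered[start:start + limit]
    # A next token exists only if there are more items after this page.
    next_token = page[-1] if start + limit < len(ordered) else None
    return page, next_token
```

As the following paragraph notes, an item inserted lexicographically before an already-issued token will still be skipped by subsequent pages.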
>>>>>>
>>>>>> However, the issue of concurrent modification could still exist even
>>>>>> with a continuation token. For example, if an element is added before
>>>>>> the continuation token, then all future listing calls with the token
>>>>>> would always skip that element. If we want to enforce that level of
>>>>>> atomicity, we probably want to introduce another time travel query
>>>>>> parameter (e.g. asOf=1703003028000) to ensure that we are listing
>>>>>> results at a specific point in time of the warehouse, so the complete
>>>>>> result list is fixed. (This is also the missing piece I forgot to
>>>>>> mention in the start index approach to ensure it works in distributed
>>>>>> settings.)
>>>>>>
>>>>>> -Jack
>>>>>>
>>>>>> On Tue, Dec 19, 2023, 9:51 AM Micah Kornfield <emkornfi...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I tried to cover these in more detail at:
>>>>>>> https://docs.google.com/document/d/1bbfoLssY1szCO_Hm3_93ZcN0UAMpf7kjmpwHQngqQJ0/edit
>>>>>>>
>>>>>>> On Sun, Dec 17, 2023 at 6:07 PM Renjie Liu <liurenjie2...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> +1 for this approach. I agree that the streaming approach requires
>>>>>>>> that HTTP clients and servers have HTTP/2 streaming support, which is
>>>>>>>> not compatible with old clients.
>>>>>>>>
>>>>>>>> I share Micah's concern that start/limit alone may not be enough in a
>>>>>>>> distributed environment where modifications happen during iteration.
>>>>>>>> For compatibility, we need to consider several cases:
>>>>>>>>
>>>>>>>> 1. Old client <-> New Server
>>>>>>>> 2. New client <-> Old server
>>>>>>>>
>>>>>>>> On Sat, Dec 16, 2023 at 6:51 AM Daniel Weeks <dwe...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I agree that we want to include this feature and I raised similar
>>>>>>>>> concerns to what Micah already presented in talking with Ryan.
>>>>>>>>>
>>>>>>>>> For backward compatibility, just adding a start and limit
>>>>>>>>> implies a deterministic order, which is not a current requirement of 
>>>>>>>>> the
>>>>>>>>> REST spec.
>>>>>>>>>
>>>>>>>>> Also, we need to consider whether the start/limit would need to be
>>>>>>>>> respected by the server.  If existing implementations simply return 
>>>>>>>>> all the
>>>>>>>>> results, will that be sufficient?  There are a few edge cases that 
>>>>>>>>> need to
>>>>>>>>> be considered here.
>>>>>>>>>
>>>>>>>>> For the opaque key approach, I think adding a query param to
>>>>>>>>> trigger/continue pagination and introducing a continuation token in
>>>>>>>>> the ListNamespacesResponse might allow for more backward
>>>>>>>>> compatibility.  In that scenario, pagination would only take place
>>>>>>>>> for clients who know how to paginate, and the ordering would not
>>>>>>>>> need to be deterministic.
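To illustrate the backward compatibility being described here (field and parameter names are made up, not from the spec):

```python
# Old clients ignore an unknown "next-token" field; old servers ignore a
# pagination query param and return everything. Either side can lag
# behind without breaking the other.
legacy_response = {"namespaces": [["accounting"], ["billing"]]}

paginated_response = {
    "namespaces": [["accounting"]],
    "next-token": "opaque-token-abc",
}

def listing_complete(response):
    # A pagination-aware client keeps requesting while a token is present;
    # a legacy full response has no token, so it is complete by definition.
    return response.get("next-token") is None
```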
>>>>>>>>>
>>>>>>>>> -Dan
>>>>>>>>>
>>>>>>>>> On Fri, Dec 15, 2023, 10:33 AM Micah Kornfield <
>>>>>>>>> emkornfi...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Just to clarify and add a small suggestion:
>>>>>>>>>>
>>>>>>>>>> The behavior with no additional parameters requires the
>>>>>>>>>> operations to happen as they do today for backwards compatibility
>>>>>>>>>> (i.e. either all responses are returned or a failure occurs).
>>>>>>>>>>
>>>>>>>>>> For new parameters, I'd suggest an opaque start token (instead of
>>>>>>>>>> a specific numeric offset) that can be returned by the service, and
>>>>>>>>>> a limit (as proposed above). If a start token is provided without a
>>>>>>>>>> limit, a default limit can be chosen by the server.  Servers might
>>>>>>>>>> return fewer than limit items (i.e. clients are required to check
>>>>>>>>>> for a next token to determine if iteration is complete).  This
>>>>>>>>>> enables server-side state if it is desired but also makes
>>>>>>>>>> deterministic listing much more feasible (deterministic responses
>>>>>>>>>> are essentially impossible in the face of changing data if only a
>>>>>>>>>> start offset is provided).
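The client-side contract outlined above can be sketched as a simple loop; `fetch_page` is a hypothetical stand-in for the actual REST call:

```python
# Token-driven pagination loop: pages may contain fewer items than the
# requested limit, and iteration ends only when no next token is returned.
def list_all(fetch_page, limit=100):
    items, token = [], None
    while True:
        page, token = fetch_page(token, limit)
        items.extend(page)
        if token is None:
            return items
```

A fake `fetch_page` over an in-memory list is enough to exercise the loop, including the short-page case.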
>>>>>>>>>>
>>>>>>>>>> In an ideal world, specifying a limit would result in streaming
>>>>>>>>>> responses being returned, with the last part containing a token if
>>>>>>>>>> continuation is necessary.  Given the conversation on the other
>>>>>>>>>> thread about streaming, I'd imagine this is quite hard to model in
>>>>>>>>>> an OpenAPI REST service.
>>>>>>>>>>
>>>>>>>>>> Therefore it seems like using pagination with a token and limit
>>>>>>>>>> would be preferred.  If skipping to someplace in the middle of the
>>>>>>>>>> namespaces is required, then I would suggest modeling those as
>>>>>>>>>> first-class query parameters (e.g. "startAfterNamespace").
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Micah
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Dec 15, 2023 at 10:08 AM Ryan Blue <b...@tabular.io>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1 for this approach
>>>>>>>>>>>
>>>>>>>>>>> I think it's good to use query params because it can be
>>>>>>>>>>> backward-compatible with the current behavior. If you get more than 
>>>>>>>>>>> the
>>>>>>>>>>> limit back, then the service probably doesn't support pagination. 
>>>>>>>>>>> And if a
>>>>>>>>>>> client doesn't support pagination they get the same results that 
>>>>>>>>>>> they would
>>>>>>>>>>> today. A streaming approach with a continuation link like in the 
>>>>>>>>>>> scan API
>>>>>>>>>>> discussion wouldn't work because old clients don't know to make a 
>>>>>>>>>>> second
>>>>>>>>>>> request.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Dec 14, 2023 at 10:07 AM Jack Ye <yezhao...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>
>>>>>>>>>>>> During the conversation about the Scan API for the REST spec, we
>>>>>>>>>>>> touched on the topic of pagination when the REST response is
>>>>>>>>>>>> large or takes time to produce.
>>>>>>>>>>>>
>>>>>>>>>>>> I just want to discuss this separately, since we also see the
>>>>>>>>>>>> issue for ListNamespaces and ListTables/Views when integrating
>>>>>>>>>>>> with a large organization that has over 100k namespaces, as well
>>>>>>>>>>>> as many tables in some namespaces.
>>>>>>>>>>>>
>>>>>>>>>>>> Pagination requires either keeping state or a deterministic
>>>>>>>>>>>> response, so that the client can request a range of the full
>>>>>>>>>>>> response. If we want to avoid keeping state, I think we need to
>>>>>>>>>>>> allow some query parameters like:
>>>>>>>>>>>> - *start*: the index of the first item in the response
>>>>>>>>>>>> - *limit*: the maximum number of items to be returned in the
>>>>>>>>>>>> response
>>>>>>>>>>>>
>>>>>>>>>>>> So we can send a request like:
>>>>>>>>>>>>
>>>>>>>>>>>> *GET /namespaces?start=300&limit=100*
>>>>>>>>>>>>
>>>>>>>>>>>> *GET /namespaces/ns/tables?start=300&limit=100*
>>>>>>>>>>>>
>>>>>>>>>>>> And the REST spec should enforce that the response returned for
>>>>>>>>>>>> the paginated GET is deterministic.
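A sketch of the server side of this proposal; the start/limit parameters follow the request examples above, and the sorted ordering is just one way to satisfy the determinism requirement:

```python
# Deterministic start/limit listing: the requirement that paginated
# responses be deterministic is what makes a numeric range meaningful.
def list_namespaces(all_namespaces, start=0, limit=100):
    ordered = sorted(all_namespaces)  # a fixed, repeatable order
    return ordered[start:start + limit]
```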
>>>>>>>>>>>>
>>>>>>>>>>>> Any thoughts on this?
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Ryan Blue
>>>>>>>>>>> Tabular
>>>>>>>>>>>
>>>>>>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>

-- 
Ryan Blue
Tabular
