Yes, I think the continuation token should in general be opaque. I was
trying to give an example of an easy implementation, since there were some
general concerns that the proposed features should not be too complicated
to implement.

I also agree the asOf feature can be embedded in the token string if the
server can support it.

So it sounds like we have a general consensus on this topic: we should
add a continuation token and a limit as query parameters for the list APIs?
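To make that concrete, here is a minimal sketch of what the token-plus-limit contract could look like. Everything here is illustrative, not a spec proposal: the handler name, the `next-token` field, and the token format (a stringified index, which a real server would replace with something opaque) are all assumptions.

```python
# Toy sketch of opaque-token pagination for a list API. The token is opaque
# to the client; this toy server encodes it as the index of the next item.
NAMESPACES = [f"ns_{i:03d}" for i in range(7)]  # toy server-side data

def list_namespaces(token=None, limit=3):
    """Stand-in for GET /namespaces?pageToken=...&pageSize=... (names assumed)."""
    start = int(token) if token else 0
    page = NAMESPACES[start:start + limit]
    # Only return a token when more results remain.
    next_token = str(start + limit) if start + limit < len(NAMESPACES) else None
    return {"namespaces": page, "next-token": next_token}

def list_all():
    """Client loop: keep requesting pages until no continuation token comes back."""
    results, token = [], None
    while True:
        resp = list_namespaces(token=token, limit=3)
        results.extend(resp["namespaces"])
        token = resp["next-token"]
        if token is None:
            return results
```

Because the client only ever echoes the token back, the server is free to change the encoding later without any client-side change.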

Regarding backwards compatibility as the REST spec evolves, I agree with
Walaa that it would be nice to have a consistent way to describe the
supported features of an endpoint. Some configs are already returned by the
GetConfig API, like support for the metrics API, and that is also how we
prototyped changes to the Scan and UpdateTable APIs. But those are big
features gated by Boolean configs; we might want a consistent way to
describe smaller things, like supported query parameters and
request/response versions, through GetConfig.
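For illustration, one hypothetical shape such a capability listing through GetConfig could take. Every key below (`endpoint-capabilities`, `query-params`, the parameter names) is invented for this sketch; the spec would need to settle on real names.

```python
# Hypothetical GetConfig response advertising per-endpoint capabilities,
# so clients can discover small features like supported query parameters.
get_config_response = {
    "defaults": {},
    "overrides": {},
    # Invented capability section, not part of the current spec.
    "endpoint-capabilities": {
        "ListNamespaces": {"query-params": ["pageToken", "pageSize"]},
        "ListTables": {"query-params": ["pageToken", "pageSize"]},
    },
}

def supports_param(config, endpoint, param):
    """Client-side check: does the server advertise this query parameter?"""
    caps = config.get("endpoint-capabilities", {}).get(endpoint, {})
    return param in caps.get("query-params", [])
```

A client would call this once at startup and only send pagination parameters to servers that advertise them, which keeps old servers working unchanged.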

-Jack

On Tue, Dec 19, 2023, 11:55 AM Walaa Eldin Moustafa <wa.moust...@gmail.com>
wrote:

> Not necessarily. That is more of a general statement. The pagination
> discussion forked from server-side scan planning.
>
> On Tue, Dec 19, 2023 at 9:52 AM Ryan Blue <b...@tabular.io> wrote:
>
>> > With start/limit each client can query for its own chunk without
>> coordination.
>>
>> Okay, I understand now. Would you need to parallelize the client for
>> listing namespaces or tables? That seems odd to me.
>>
>> On Tue, Dec 19, 2023 at 9:48 AM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>>
>>> > You can parallelize with opaque tokens by sending a starting point for
>>> the next request.
>>>
>>> I meant that we would have to wait for the server to return this
>>> starting point from the previous request? With start/limit, each client
>>> can query for its own chunk without coordination.
>>>
>>> On Tue, Dec 19, 2023 at 9:44 AM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> > I think start and offset have the advantage of being parallelizable
>>>> (as compared to continuation tokens).
>>>>
>>>> You can parallelize with opaque tokens by sending a starting point for
>>>> the next request.
>>>>
>>>> > On the other hand, using "asOf" can be complex to implement and may
>>>> be too powerful for the pagination use case
>>>>
>>>> I don't think that we want to add `asOf`. If the service chooses to do
>>>> this, it would send a continuation token that has the information embedded.
>>>>
>>>> On Tue, Dec 19, 2023 at 9:42 AM Walaa Eldin Moustafa <
>>>> wa.moust...@gmail.com> wrote:
>>>>
>>>>> Can we assume it is the responsibility of the server to ensure
>>>>> determinism (e.g., by caching the results along with a query ID)? I
>>>>> think start and offset have the advantage of being parallelizable (as
>>>>> compared to continuation tokens). On the other hand, using "asOf" can
>>>>> be complex to implement and may be too powerful for the pagination use
>>>>> case (because it allows querying the warehouse as of any point in
>>>>> time, not just now).
>>>>>
>>>>> Thanks,
>>>>> Walaa.
>>>>>
>>>>> On Tue, Dec 19, 2023 at 9:40 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>
>>>>>> I think you can solve the atomicity problem with a continuation token
>>>>>> and server-side state. In general, I don't think this is a problem we
>>>>>> should worry about a lot since pagination commonly has this problem. But
>>>>>> since we can build a system that allows you to solve it if you choose to,
>>>>>> we should go with that design.
>>>>>>
>>>>>> On Tue, Dec 19, 2023 at 9:13 AM Micah Kornfield <
>>>>>> emkornfi...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Jack,
>>>>>>> Some answers inline.
>>>>>>>
>>>>>>>
>>>>>>>> In addition to the start index approach, another potential simple
>>>>>>>> way to implement the continuation token is to use the last item
>>>>>>>> name, when the listing is guaranteed to be in lexicographic order.
>>>>>>>
>>>>>>>
>>>>>>> I think this is one viable implementation, but the reason that the
>>>>>>> token should be opaque is that it allows several different
>>>>>>> implementations without client-side changes.
>>>>>>>
>>>>>>>> For example, if an element is added before the continuation token,
>>>>>>>> then all future listing calls with the token would always skip that
>>>>>>>> element.
>>>>>>>
>>>>>>>
>>>>>>> IMO, I think this is fine. For some of the REST APIs it is likely
>>>>>>> important to put constraints on atomicity requirements; for others
>>>>>>> (e.g. list namespaces) I think it is OK to have looser requirements.
>>>>>>>
>>>>>>>> If we want to enforce that level of atomicity, we probably want to
>>>>>>>> introduce another time travel query parameter (e.g. asOf=1703003028000)
>>>>>>>> to ensure that we are listing results at a specific point in time of
>>>>>>>> the warehouse, so the complete result list is fixed.
>>>>>>>
>>>>>>>
>>>>>>> Time travel might be useful in some cases but I think it is
>>>>>>> orthogonal to services wishing to have guarantees around
>>>>>>> atomicity/consistency of results.  If a server wants to ensure that
>>>>>>> results are atomic/consistent as of the start of the listing, it can
>>>>>>> embed the necessary timestamp in the token it returns and parse it
>>>>>>> out when fetching the next result.
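A minimal sketch of that token-embedding idea. The JSON-in-base64 encoding below is an assumption, and the field names (`asOf`, `next`) are invented; any server-private format works, since clients treat the token as opaque.

```python
# Sketch: a server wanting snapshot-consistent listing embeds a timestamp
# (plus its own cursor) inside the opaque continuation token, then parses
# it back out on the next request. Encoding is server-private.
import base64
import json
import time

def make_token(as_of_ms, next_index):
    """Encode the listing snapshot timestamp and cursor into an opaque token."""
    payload = json.dumps({"asOf": as_of_ms, "next": next_index})
    return base64.urlsafe_b64encode(payload.encode()).decode()

def parse_token(token):
    """Recover the snapshot timestamp and cursor on the follow-up request."""
    payload = json.loads(base64.urlsafe_b64decode(token.encode()))
    return payload["asOf"], payload["next"]

token = make_token(int(time.time() * 1000), 100)
as_of, nxt = parse_token(token)
```

The client never needs to know this structure exists, which is exactly why the consistency guarantee can stay a server implementation detail rather than a spec feature.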
>>>>>>>
>>>>>>> I think this does raise a more general point around service
>>>>>>> definition evolution.  I think there likely need to be metadata
>>>>>>> endpoints that expose either:
>>>>>>> 1.  A version of the REST API supported.
>>>>>>> 2.  Features the API supports (e.g. which query parameters are
>>>>>>> honored for a specific endpoint).
>>>>>>>
>>>>>>> There are pros and cons to both approaches (apologies if I missed
>>>>>>> this in the spec or if it has already been discussed).
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Micah
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Dec 19, 2023 at 8:25 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Yes, I agree that it is better not to force implementations in any
>>>>>>>> particular direction, and a continuation token is probably better
>>>>>>>> than enforcing a numeric start index.
>>>>>>>>
>>>>>>>> In addition to the start index approach, another potential simple
>>>>>>>> way to implement the continuation token is to use the last item
>>>>>>>> name, when the listing is guaranteed to be in lexicographic order.
>>>>>>>> Compared to the start index approach, it does not need to worry
>>>>>>>> about the change of start index when something in the list is added
>>>>>>>> or removed.
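A rough sketch of that last-item-name cursor, under toy assumptions: an in-memory sorted list stands in for the catalog backend, and `bisect` stands in for a range scan. It also demonstrates the concurrent-modification caveat raised below, where an item added before the cursor is skipped.

```python
# Keyset-style pagination: the continuation token is simply the last name
# returned, and the next page starts strictly after it in sort order.
import bisect

TABLES = sorted(["a", "b", "d", "e", "f"])  # toy lexicographically-ordered listing

def list_after(last_name=None, limit=2):
    """Return the next page after last_name, plus a token if more remain."""
    start = bisect.bisect_right(TABLES, last_name) if last_name else 0
    page = TABLES[start:start + limit]
    token = page[-1] if start + limit < len(TABLES) else None
    return page, token

page1, tok = list_after()       # first page: ["a", "b"], token "b"
TABLES.insert(1, "aa")          # concurrent add *before* the cursor
page2, _ = list_after(tok)      # resumes after "b": ["d", "e"]; "aa" is skipped
```

Unlike a numeric start index, existing results do not shift under the client, but anything inserted behind the cursor is missed for the rest of the iteration.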
>>>>>>>>
>>>>>>>> However, the issue of concurrent modification could still exist
>>>>>>>> even with a continuation token. For example, if an element is added
>>>>>>>> before the continuation token, then all future listing calls with
>>>>>>>> the token would always skip that element. If we want to enforce
>>>>>>>> that level of atomicity, we probably want to introduce another time
>>>>>>>> travel query parameter (e.g. asOf=1703003028000) to ensure that we
>>>>>>>> are listing results at a specific point in time of the warehouse,
>>>>>>>> so the complete result list is fixed. (This is also the missing
>>>>>>>> piece I forgot to mention in the start index approach to ensure it
>>>>>>>> works in distributed settings.)
>>>>>>>>
>>>>>>>> -Jack
>>>>>>>>
>>>>>>>> On Tue, Dec 19, 2023, 9:51 AM Micah Kornfield <
>>>>>>>> emkornfi...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I tried to cover these in more detail at:
>>>>>>>>> https://docs.google.com/document/d/1bbfoLssY1szCO_Hm3_93ZcN0UAMpf7kjmpwHQngqQJ0/edit
>>>>>>>>>
>>>>>>>>> On Sun, Dec 17, 2023 at 6:07 PM Renjie Liu <
>>>>>>>>> liurenjie2...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> +1 for this approach. I agree that the streaming approach
>>>>>>>>>> requires HTTP clients and servers to support HTTP/2 streaming,
>>>>>>>>>> which is not compatible with old clients.
>>>>>>>>>>
>>>>>>>>>> I share the same concern with Micah that start/limit alone may
>>>>>>>>>> not be enough in a distributed environment where modifications
>>>>>>>>>> happen during iteration. For compatibility, we need to consider
>>>>>>>>>> several cases:
>>>>>>>>>>
>>>>>>>>>> 1. Old client <-> New Server
>>>>>>>>>> 2. New client <-> Old server
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, Dec 16, 2023 at 6:51 AM Daniel Weeks <dwe...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I agree that we want to include this feature and I raised
>>>>>>>>>>> similar concerns to what Micah already presented in talking
>>>>>>>>>>> with Ryan.
>>>>>>>>>>>
>>>>>>>>>>> For backward compatibility, just adding a start and limit
>>>>>>>>>>> implies a deterministic order, which is not a current
>>>>>>>>>>> requirement of the REST spec.
>>>>>>>>>>>
>>>>>>>>>>> Also, we need to consider whether the start/limit would need to
>>>>>>>>>>> be respected by the server.  If existing implementations simply
>>>>>>>>>>> return all the results, will that be sufficient?  There are a
>>>>>>>>>>> few edge cases that need to be considered here.
>>>>>>>>>>>
>>>>>>>>>>> For the opaque key approach, I think adding a query param to
>>>>>>>>>>> trigger/continue pagination and introducing a continuation token
>>>>>>>>>>> in the ListNamespacesResponse might allow for more backward
>>>>>>>>>>> compatibility.  In that scenario, pagination would only take
>>>>>>>>>>> place for clients who know how to paginate, and the ordering
>>>>>>>>>>> would not need to be deterministic.
>>>>>>>>>>>
>>>>>>>>>>> -Dan
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Dec 15, 2023, 10:33 AM Micah Kornfield <
>>>>>>>>>>> emkornfi...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Just to clarify and add a small suggestion:
>>>>>>>>>>>>
>>>>>>>>>>>> The behavior with no additional parameters requires the
>>>>>>>>>>>> operations to happen as they do today for backwards
>>>>>>>>>>>> compatibility (i.e. either all responses are returned or a
>>>>>>>>>>>> failure occurs).
>>>>>>>>>>>>
>>>>>>>>>>>> For new parameters, I'd suggest an opaque start token (instead
>>>>>>>>>>>> of a specific numeric offset) that can be returned by the
>>>>>>>>>>>> service, and a limit (as proposed above). If a start token is
>>>>>>>>>>>> provided without a limit, a default limit can be chosen by the
>>>>>>>>>>>> server.  Servers might return fewer than the limit (i.e.
>>>>>>>>>>>> clients are required to check for a next token to determine if
>>>>>>>>>>>> iteration is complete).  This enables server-side state if it
>>>>>>>>>>>> is desired but also makes deterministic listing much more
>>>>>>>>>>>> feasible (deterministic responses are essentially impossible in
>>>>>>>>>>>> the face of changing data if only a start offset is provided).
>>>>>>>>>>>>
>>>>>>>>>>>> In an ideal world, specifying a limit would result in streaming
>>>>>>>>>>>> responses being returned, with the last part containing a token
>>>>>>>>>>>> if continuation is necessary.  Given the conversation on the
>>>>>>>>>>>> other thread about streaming, I'd imagine this is quite hard to
>>>>>>>>>>>> model in an OpenAPI REST service.
>>>>>>>>>>>>
>>>>>>>>>>>> Therefore it seems like using pagination with a token and limit
>>>>>>>>>>>> would be preferred.  If skipping to someplace in the middle of
>>>>>>>>>>>> the namespaces is required, then I would suggest modeling those
>>>>>>>>>>>> as first-class query parameters (e.g. "startAfterNamespace").
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Micah
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Dec 15, 2023 at 10:08 AM Ryan Blue <b...@tabular.io>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> +1 for this approach
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think it's good to use query params because it can be
>>>>>>>>>>>>> backward-compatible with the current behavior. If you get more
>>>>>>>>>>>>> than the limit back, then the service probably doesn't support
>>>>>>>>>>>>> pagination. And if a client doesn't support pagination, they
>>>>>>>>>>>>> get the same results that they would today. A streaming
>>>>>>>>>>>>> approach with a continuation link like in the scan API
>>>>>>>>>>>>> discussion wouldn't work because old clients don't know to
>>>>>>>>>>>>> make a second request.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Dec 14, 2023 at 10:07 AM Jack Ye <yezhao...@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> During the conversation about the Scan API for the REST spec,
>>>>>>>>>>>>>> we touched on the topic of pagination, for when a REST
>>>>>>>>>>>>>> response is large or takes time to produce.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I just want to discuss this separately, since we also see the
>>>>>>>>>>>>>> issue for ListNamespaces and ListTables/Views when integrating
>>>>>>>>>>>>>> with a large organization that has over 100k namespaces, and
>>>>>>>>>>>>>> also a lot of tables in some namespaces.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Pagination requires either keeping state or a deterministic
>>>>>>>>>>>>>> response, such that the client can request a range of the full
>>>>>>>>>>>>>> response. If we want to avoid keeping state, I think we need
>>>>>>>>>>>>>> to allow some query parameters like:
>>>>>>>>>>>>>> - *start*: the start index of the first item in the response
>>>>>>>>>>>>>> - *limit*: the number of items to be returned in the response
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So we can send a request like:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *GET /namespaces?start=300&limit=100*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *GET /namespaces/ns/tables?start=300&limit=100*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And the REST spec should enforce that the response returned
>>>>>>>>>>>>>> for a paginated GET is deterministic.
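As a sketch of how a server could satisfy that determinism requirement with start/limit: the backing store is assumed here to be an unordered in-memory set, and sorting stands in for whatever stable ordering a real catalog backend would provide.

```python
# Deterministic start/limit slicing: as long as the server imposes a stable
# total order, independent range requests compose into the full listing.
STORE = {"ns9", "ns1", "ns5", "ns3"}  # unordered backing store (toy data)

def get_namespaces(start=0, limit=100):
    """Stand-in for GET /namespaces?start=...&limit=... from the proposal."""
    listing = sorted(STORE)  # the deterministic order is the key requirement
    return listing[start:start + limit]

# Two clients can fetch disjoint chunks without any coordination:
combined = get_namespaces(0, 2) + get_namespaces(2, 2)
```

The trade-off, as discussed above, is that determinism only holds while the store is unchanged; concurrent adds or drops shift the indexes.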
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Any thoughts on this?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>> Tabular
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>
