> For the continuation token, I think one missing part is about the expiration
> time of this token, since this may affect the state cleaning process of the
> server.

Some storage services use a continuation token as a binary representation of internal state. For example, they serialize a structure into binary and then base64-encode it. Such servers don't need to maintain any state, which eliminates the need for state cleaning.
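A minimal sketch of that stateless-token idea, in Java. The payload fields, the "name|expiry" encoding, and the class name are illustrative assumptions rather than anything defined by the REST spec; the point is only that the listing position and expiry travel inside the token, so the server has nothing to clean up.

    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    // Hypothetical stateless page token: the listing position and an expiry are
    // serialized and base64-encoded, so the server keeps no per-token state.
    public final class PageToken {
      public final String lastSeenName;   // last namespace/table name returned
      public final long expiresAtMillis;  // server rejects tokens older than this

      public PageToken(String lastSeenName, long expiresAtMillis) {
        this.lastSeenName = lastSeenName;
        this.expiresAtMillis = expiresAtMillis;
      }

      // Serialize to an opaque string: here simply "name|expiry", then base64.
      public String encode() {
        String raw = lastSeenName + "|" + expiresAtMillis;
        return Base64.getUrlEncoder().withoutPadding()
            .encodeToString(raw.getBytes(StandardCharsets.UTF_8));
      }

      public static PageToken decode(String token) {
        String raw = new String(Base64.getUrlDecoder().decode(token), StandardCharsets.UTF_8);
        int sep = raw.lastIndexOf('|');
        return new PageToken(raw.substring(0, sep), Long.parseLong(raw.substring(sep + 1)));
      }
    }

Because the expiry travels inside the token, a server can enforce it while decoding, without adding an expiration field to the API surface.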
> Do servers need to expose the expiration time to clients?

If we choose to manage state on the server side, I recommend not revealing the expiration time to the client, at least not for now. We can introduce it when there's a practical need. It wouldn't constitute a breaking change, would it?

On Wed, Dec 20, 2023, at 10:57, Renjie Liu wrote:
> For the continuation token, I think one missing part is about the expiration
> time of this token, since this may affect the state cleaning process of the
> server. There are several things to discuss:
>
> 1. Should we leave it to the server to decide it, or allow the client to
> configure it in the API?
>
> Personally I think it would be enough for the server to determine it for now,
> since I don't see any use case for allowing clients to set the expiration
> time in the API.
>
> 2. Do servers need to expose the expiration time to clients?
>
> Personally I think it would be enough to expose this through the getConfig
> API to let users know about it. For now there is no requirement for a
> per-request expiration time.
>
> On Wed, Dec 20, 2023 at 2:49 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> IMO, parallelization needs to be a first-class entity in the
>> endpoint/service design to allow for flexibility (I scanned through the
>> original proposal for scan planning and it looked like it was on the
>> right track). Using offsets for parallelization is problematic from both a
>> consistency and a scalability perspective if you want to allow for
>> flexibility in implementation.
>>
>> In particular, I think the server needs APIs like:
>>
>> DoScan - returns a list of partitions (represented by an opaque entity).
>> The list of partitions should support pagination (in an ideal world, it
>> would be streaming).
>> GetTasksForPartition - returns scan tasks for a partition (should also be
>> paginated/streaming, but this is up for debate). I think it is an important
>> consideration to allow for empty partitions.
>>
>> With this implementation you don't necessarily require separate server-side
>> state (objects in GCS should be sufficient). I think, as Ryan suggested, one
>> implementation could be to have each partition correspond to a byte range in
>> a manifest file for returning the tasks.
>>
>> Thanks,
>> Micah
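A rough sketch of the two endpoint shapes described above, written as Java interfaces rather than OpenAPI; all type, method, and field names here are hypothetical placeholders, not part of the proposal.

    import java.util.List;

    // Two-level scan planning: plan a scan into opaque partition handles first,
    // then fetch the tasks for each partition separately; both calls paginate.
    interface ScanPlanningApi {

      // DoScan: returns opaque partition handles for a planned scan.
      PartitionPage doScan(String tableIdent, String filterExpr, String pageToken);

      // GetTasksForPartition: returns the scan tasks for one partition handle.
      // An empty task list is a legal result.
      TaskPage getTasksForPartition(String partitionHandle, String pageToken);
    }

    // Partition handles are opaque to clients; a server could, for example,
    // encode a manifest file path plus a byte range inside each handle.
    record PartitionPage(List<String> partitionHandles, String nextPageToken) {}

    // Scan tasks are simplified to strings here; a real response would carry
    // structured file scan tasks.
    record TaskPage(List<String> fileScanTasks, String nextPageToken) {}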
>> On Tue, Dec 19, 2023 at 9:55 AM Walaa Eldin Moustafa <wa.moust...@gmail.com>
>> wrote:
>>> Not necessarily. That is more of a general statement. The pagination
>>> discussion forked from server-side scan planning.
>>>
>>> On Tue, Dec 19, 2023 at 9:52 AM Ryan Blue <b...@tabular.io> wrote:
>>>> > With start/limit each client can query for its own chunk without
>>>> > coordination.
>>>>
>>>> Okay, I understand now. Would you need to parallelize the client for
>>>> listing namespaces or tables? That seems odd to me.
>>>>
>>>> On Tue, Dec 19, 2023 at 9:48 AM Walaa Eldin Moustafa
>>>> <wa.moust...@gmail.com> wrote:
>>>>> > You can parallelize with opaque tokens by sending a starting point for
>>>>> > the next request.
>>>>>
>>>>> I meant that we would have to wait for the server to return this starting
>>>>> point from the previous request? With start/limit each client can query
>>>>> for its own chunk without coordination.
>>>>>
>>>>> On Tue, Dec 19, 2023 at 9:44 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>> > I think start and offset have the advantage of being parallelizable (as
>>>>>> > compared to continuation tokens).
>>>>>>
>>>>>> You can parallelize with opaque tokens by sending a starting point for
>>>>>> the next request.
>>>>>>
>>>>>> > On the other hand, using "asOf" can be complex to implement and may
>>>>>> > be too powerful for the pagination use case
>>>>>>
>>>>>> I don't think that we want to add `asOf`. If the service chooses to do
>>>>>> this, it would send a continuation token that has the information
>>>>>> embedded.
>>>>>>
>>>>>> On Tue, Dec 19, 2023 at 9:42 AM Walaa Eldin Moustafa
>>>>>> <wa.moust...@gmail.com> wrote:
>>>>>>> Can we assume it is the responsibility of the server to ensure
>>>>>>> determinism (e.g., by caching the results along with a query ID)? I
>>>>>>> think start and offset have the advantage of being parallelizable (as
>>>>>>> compared to continuation tokens). On the other hand, using "asOf" can be
>>>>>>> complex to implement and may be too powerful for the pagination use case
>>>>>>> (because it allows querying the warehouse as of any point in time, not
>>>>>>> just now).
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Walaa.
>>>>>>>
>>>>>>> On Tue, Dec 19, 2023 at 9:40 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>> I think you can solve the atomicity problem with a continuation token
>>>>>>>> and server-side state. In general, I don't think this is a problem we
>>>>>>>> should worry about a lot, since pagination commonly has this problem.
>>>>>>>> But since we can build a system that allows you to solve it if you
>>>>>>>> choose to, we should go with that design.
>>>>>>>>
>>>>>>>> On Tue, Dec 19, 2023 at 9:13 AM Micah Kornfield
>>>>>>>> <emkornfi...@gmail.com> wrote:
>>>>>>>>> Hi Jack,
>>>>>>>>> Some answers inline.
>>>>>>>>>
>>>>>>>>>> In addition to the start index approach, another potential simple
>>>>>>>>>> way to implement the continuation token is to use the last item
>>>>>>>>>> name, when the listing is guaranteed to be in lexicographic order.
>>>>>>>>>
>>>>>>>>> I think this is one viable implementation, but the reason that the
>>>>>>>>> token should be opaque is that it allows several different
>>>>>>>>> implementations without client-side changes.
>>>>>>>>>
>>>>>>>>>> For example, if an element is added before the continuation token,
>>>>>>>>>> then all future listing calls with the token would always skip that
>>>>>>>>>> element.
>>>>>>>>>
>>>>>>>>> IMO, I think this is fine. For some of the REST APIs it is likely
>>>>>>>>> important to put constraints on atomicity requirements; for others
>>>>>>>>> (e.g. list namespaces) I think it is OK to have looser requirements.
>>>>>>>>>
>>>>>>>>>> If we want to enforce that level of atomicity, we probably want to
>>>>>>>>>> introduce another time travel query parameter (e.g.
>>>>>>>>>> asOf=1703003028000) to ensure that we are listing results at a
>>>>>>>>>> specific point in time of the warehouse, so the complete result list
>>>>>>>>>> is fixed.
>>>>>>>>>
>>>>>>>>> Time travel might be useful in some cases, but I think it is
>>>>>>>>> orthogonal to services wishing to have guarantees around
>>>>>>>>> atomicity/consistency of results. If a server wants to ensure that
>>>>>>>>> results are atomic/consistent as of the start of the listing, it can
>>>>>>>>> embed the necessary timestamp in the token it returns and parse it
>>>>>>>>> out when fetching the next result.
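A sketch of that last point: the first request pins a timestamp, the timestamp rides along inside the opaque token, and each following page is answered against the same point in time. This assumes a backend that can list as of a timestamp; the names below are invented for illustration.

    import java.util.List;

    // Hypothetical backend that can answer listings as of a point in time.
    interface VersionedCatalog {
      List<String> listNamespacesAsOf(long asOfMillis, String startAfter, int limit);
    }

    // State carried in the opaque token; in practice it would be serialized and
    // base64-encoded like the token sketch earlier in the thread.
    record ListState(long asOfMillis, String startAfter) {}

    final class ConsistentLister {
      // First page: pin the listing to "now".
      ListState firstPage() {
        return new ListState(System.currentTimeMillis(), "");
      }

      // Every later page reuses the pinned timestamp, so concurrent creates and
      // drops cannot shift or hide entries across pages.
      List<String> page(VersionedCatalog catalog, ListState state, int limit) {
        return catalog.listNamespacesAsOf(state.asOfMillis(), state.startAfter(), limit);
      }
    }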
>>>>>>>>> I think this does raise a more general point around service
>>>>>>>>> definition evolution. I think there likely need to be
>>>>>>>>> metadata endpoints that expose either:
>>>>>>>>> 1. A version of the REST API supported.
>>>>>>>>> 2. Features the API supports (e.g. which query parameters are
>>>>>>>>> honored for a specific endpoint).
>>>>>>>>>
>>>>>>>>> There are pros and cons to both approaches (apologies if I missed
>>>>>>>>> this in the spec or if it has already been discussed).
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Micah
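A small sketch of the second option: a config or metadata endpoint could advertise which query parameters each endpoint honors, so a client can detect pagination support before relying on it. The keys and values below are invented examples, not spec properties.

    import java.util.List;
    import java.util.Map;

    // Hypothetical capability advertisement returned by a config/metadata endpoint.
    public final class EndpointCapabilities {
      public static Map<String, List<String>> supportedQueryParams() {
        return Map.of(
            "GET /v1/namespaces", List.of("pageToken", "pageSize"),
            "GET /v1/namespaces/{namespace}/tables", List.of("pageToken", "pageSize"));
      }
    }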
>>>>>>>>>
>>>>>>>>> On Tue, Dec 19, 2023 at 8:25 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>>>>> Yes, I agree that it is better not to force the implementation in any
>>>>>>>>>> particular direction, and a continuation token is probably better than
>>>>>>>>>> enforcing a numeric start index.
>>>>>>>>>>
>>>>>>>>>> In addition to the start index approach, another potential simple
>>>>>>>>>> way to implement the continuation token is to use the last item
>>>>>>>>>> name, when the listing is guaranteed to be in lexicographic order.
>>>>>>>>>> Compared to the start index approach, it does not need to worry
>>>>>>>>>> about the start index shifting when something in the list is added
>>>>>>>>>> or removed.
>>>>>>>>>>
>>>>>>>>>> However, the issue of concurrent modification could still exist even
>>>>>>>>>> with a continuation token. For example, if an element is added
>>>>>>>>>> before the continuation token, then all future listing calls with
>>>>>>>>>> the token would always skip that element. If we want to enforce that
>>>>>>>>>> level of atomicity, we probably want to introduce another time
>>>>>>>>>> travel query parameter (e.g. asOf=1703003028000) to ensure that we
>>>>>>>>>> are listing results at a specific point in time of the warehouse, so
>>>>>>>>>> the complete result list is fixed. (This is also the missing piece I
>>>>>>>>>> forgot to mention in the start index approach to ensure it works in
>>>>>>>>>> distributed settings.)
>>>>>>>>>>
>>>>>>>>>> -Jack
>>>>>>>>>>
>>>>>>>>>> On Tue, Dec 19, 2023, 9:51 AM Micah Kornfield
>>>>>>>>>> <emkornfi...@gmail.com> wrote:
>>>>>>>>>>> I tried to cover these in more detail at:
>>>>>>>>>>> https://docs.google.com/document/d/1bbfoLssY1szCO_Hm3_93ZcN0UAMpf7kjmpwHQngqQJ0/edit
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Dec 17, 2023 at 6:07 PM Renjie Liu
>>>>>>>>>>> <liurenjie2...@gmail.com> wrote:
>>>>>>>>>>>> +1 for this approach. I agree that the streaming approach requires
>>>>>>>>>>>> that HTTP clients and servers have HTTP/2 streaming support, which
>>>>>>>>>>>> is not compatible with old clients.
>>>>>>>>>>>>
>>>>>>>>>>>> I share Micah's concern that start/limit alone may not be enough in
>>>>>>>>>>>> a distributed environment where modification happens during
>>>>>>>>>>>> iteration. For compatibility, we need to consider several cases:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Old client <-> New server
>>>>>>>>>>>> 2. New client <-> Old server
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Dec 16, 2023 at 6:51 AM Daniel Weeks <dwe...@apache.org>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> I agree that we want to include this feature, and I raised
>>>>>>>>>>>>> concerns similar to what Micah already presented when talking
>>>>>>>>>>>>> with Ryan.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For backward compatibility, just adding a start and limit implies
>>>>>>>>>>>>> a deterministic order, which is not a current requirement of the
>>>>>>>>>>>>> REST spec.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Also, we need to consider whether the start/limit would need to
>>>>>>>>>>>>> be respected by the server. If existing implementations simply
>>>>>>>>>>>>> return all the results, will that be sufficient? There are a few
>>>>>>>>>>>>> edge cases that need to be considered here.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For the opaque key approach, I think adding a query param to
>>>>>>>>>>>>> trigger/continue pagination and introducing a continuation token
>>>>>>>>>>>>> in the ListNamespacesResponse might allow for more backward
>>>>>>>>>>>>> compatibility. In that scenario, pagination would only take place
>>>>>>>>>>>>> for clients that know how to paginate, and the ordering would not
>>>>>>>>>>>>> need to be deterministic.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Dan
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Dec 15, 2023, 10:33 AM Micah Kornfield
>>>>>>>>>>>>> <emkornfi...@gmail.com> wrote:
>>>>>>>>>>>>>> Just to clarify and add a small suggestion:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The behavior with no additional parameters requires the
>>>>>>>>>>>>>> operations to happen as they do today, for backwards
>>>>>>>>>>>>>> compatibility (i.e. either all results are returned or a
>>>>>>>>>>>>>> failure occurs).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For new parameters, I'd suggest an opaque start token (instead
>>>>>>>>>>>>>> of a specific numeric offset) that can be returned by the
>>>>>>>>>>>>>> service, and a limit (as proposed above). If a start token is
>>>>>>>>>>>>>> provided without a limit, a default limit can be chosen by the
>>>>>>>>>>>>>> server. Servers might return less than the limit (i.e. clients
>>>>>>>>>>>>>> are required to check for a next token to determine whether
>>>>>>>>>>>>>> iteration is complete). This enables server-side state if it is
>>>>>>>>>>>>>> desired, but also makes deterministic listing much more feasible
>>>>>>>>>>>>>> (deterministic responses are essentially impossible in the face
>>>>>>>>>>>>>> of changing data if only a start offset is provided).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In an ideal world, specifying a limit would result in streaming
>>>>>>>>>>>>>> responses being returned, with the last part containing a token
>>>>>>>>>>>>>> if continuation is necessary. Given the conversation on the
>>>>>>>>>>>>>> other thread about streaming, I'd imagine this is quite hard to
>>>>>>>>>>>>>> model in an OpenAPI REST service.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Therefore it seems like using pagination with a token and limit
>>>>>>>>>>>>>> would be preferred. If skipping to someplace in the middle of
>>>>>>>>>>>>>> the namespaces is required, then I would suggest modelling that
>>>>>>>>>>>>>> as a first-class query parameter (e.g. "startAfterNamespace").
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Micah
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Dec 15, 2023 at 10:08 AM Ryan Blue <b...@tabular.io>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> +1 for this approach
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think it's good to use query params because it can be
>>>>>>>>>>>>>>> backward-compatible with the current behavior. If you get more
>>>>>>>>>>>>>>> than the limit back, then the service probably doesn't support
>>>>>>>>>>>>>>> pagination. And if a client doesn't support pagination, they
>>>>>>>>>>>>>>> get the same results that they would today. A streaming
>>>>>>>>>>>>>>> approach with a continuation link like in the scan API
>>>>>>>>>>>>>>> discussion wouldn't work because old clients don't know to make
>>>>>>>>>>>>>>> a second request.
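A client-side sketch of how that backward compatibility could work, assuming hypothetical pageSize/pageToken request parameters and a nextPageToken response field. An old server simply ignores the parameters and returns everything with no token, so the loop below finishes after one round trip, matching today's behavior.

    import java.util.ArrayList;
    import java.util.List;

    final class PaginatedListing {

      // Stand-in for one HTTP call, e.g. GET /v1/namespaces?pageSize=...&pageToken=...
      interface ListPageCall {
        ListPage list(int pageSize, String pageToken);
      }

      // nextPageToken is null when there are no more pages (or when the server
      // does not support pagination at all).
      record ListPage(List<String> namespaces, String nextPageToken) {}

      // Keep fetching pages while the server hands back a continuation token.
      static List<String> listAll(ListPageCall call, int pageSize) {
        List<String> all = new ArrayList<>();
        String token = null;
        do {
          ListPage page = call.list(pageSize, token);
          all.addAll(page.namespaces());
          token = page.nextPageToken();
        } while (token != null);
        return all;
      }
    }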
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Dec 14, 2023 at 10:07 AM Jack Ye <yezhao...@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> During the conversation about the Scan API for the REST spec,
>>>>>>>>>>>>>>>> we touched on the topic of pagination when a REST response is
>>>>>>>>>>>>>>>> large or takes time to be produced.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I just want to discuss this separately, since we also see the
>>>>>>>>>>>>>>>> issue for ListNamespaces and ListTables/Views when integrating
>>>>>>>>>>>>>>>> with a large organization that has over 100k namespaces, and
>>>>>>>>>>>>>>>> also a lot of tables in some namespaces.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Pagination requires either keeping state, or the response to
>>>>>>>>>>>>>>>> be deterministic so that the client can request a range of the
>>>>>>>>>>>>>>>> full response. If we want to avoid keeping state, I think we
>>>>>>>>>>>>>>>> need to allow some query parameters like:
>>>>>>>>>>>>>>>> - *start*: the start index of the first item in the response
>>>>>>>>>>>>>>>> - *limit*: the number of items to be returned in the response
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So we can send requests like:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *GET /namespaces?start=300&limit=100*
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *GET /namespaces/ns/tables?start=300&limit=100*
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> And the REST spec should enforce that the response returned
>>>>>>>>>>>>>>>> for a paginated GET is deterministic.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Any thoughts on this?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>> Tabular
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular

Xuanwo