Can we assume it is the responsibility of the server to ensure determinism
(e.g., by caching the results along with a query ID)? I think start/offset
has the advantage of being parallelizable (as compared to continuation
tokens). On the other hand, using "asOf" can be complex to implement and may
be too powerful for the pagination use case, because it allows querying the
warehouse as of any point in time, not just now.
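
A minimal sketch of that server-side caching idea, assuming a hypothetical
CachedListing helper (none of these names come from the spec): the server
materializes the full listing once per query ID and then serves deterministic
start/limit slices, which clients could even fetch in parallel.

    import java.util.List;
    import java.util.Map;
    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical server-side helper, not part of the REST spec.
    class CachedListing {
      private final Map<String, List<String>> snapshots = new ConcurrentHashMap<>();

      // First request: materialize and cache the full listing under a query ID.
      String startQuery(List<String> allNamespaces) {
        String queryId = UUID.randomUUID().toString();
        snapshots.put(queryId, List.copyOf(allNamespaces));
        return queryId;
      }

      // Later requests: any start/limit slice of the cached snapshot is
      // deterministic, so pages can be fetched in any order or in parallel.
      List<String> page(String queryId, int start, int limit) {
        List<String> snapshot = snapshots.get(queryId);
        int from = Math.min(start, snapshot.size());
        int to = Math.min(start + limit, snapshot.size());
        return snapshot.subList(from, to);
      }
    }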

Thanks,
Walaa.

On Tue, Dec 19, 2023 at 9:40 AM Ryan Blue <b...@tabular.io> wrote:

> I think you can solve the atomicity problem with a continuation token and
> server-side state. In general, I don't think this is something we should
> worry about much, since pagination commonly has this problem. But since we
> can build a system that lets you solve it if you choose to, we should go
> with that design.
>
> On Tue, Dec 19, 2023 at 9:13 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Hi Jack,
>> Some answers inline.
>>
>>
>>> In addition to the start index approach, another potential simple way to
>>> implement the continuation token is to use the last item name, when the
>>> listing is guaranteed to be in lexicographic order.
>>
>>
>> I think this is one viable implementation, but the reason the token
>> should be opaque is that it allows several different implementations
>> without client-side changes.
>>
>> For example, if an element is added before the continuation token, then
>>> all future listing calls with the token would always skip that element.
>>
>>
>> IMO, this is fine. For some of the REST APIs it is likely important to put
>> constraints on atomicity requirements; for others (e.g. list namespaces) I
>> think it is OK to have looser requirements.
>>
>> If we want to enforce that level of atomicity, we probably want to
>>> introduce another time travel query parameter (e.g. asOf=1703003028000) to
>>> ensure that we are listing results at a specific point of time of the
>>> warehouse, so the complete result list is fixed.
>>
>>
>> Time travel might be useful in some cases but I think it is orthogonal to
>> services wishing to have guarantees around  atomicity/consistency of
>> results.  If a server wants to ensure that results are atomic/consistent as
>> of the start of the listing, it can embed the necessary timestamp in the
>> token it returns and parse it out when fetching the next result.
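>>
>> As a rough sketch of that (assuming a hypothetical PageTokens helper; the
>> encoding is entirely up to the server, since clients treat the token as
>> opaque):
>>
>>   import java.nio.charset.StandardCharsets;
>>   import java.util.Base64;
>>
>>   // Hypothetical sketch: the server packs whatever it needs (here the last
>>   // item returned plus an as-of timestamp) into an opaque token; clients
>>   // never interpret it.
>>   class PageTokens {
>>     static String encode(String lastItem, long asOfMillis) {
>>       String raw = lastItem + "\u0000" + asOfMillis;
>>       return Base64.getUrlEncoder()
>>           .encodeToString(raw.getBytes(StandardCharsets.UTF_8));
>>     }
>>
>>     static String[] decode(String token) {
>>       String raw = new String(Base64.getUrlDecoder().decode(token),
>>           StandardCharsets.UTF_8);
>>       return raw.split("\u0000", 2);  // [lastItem, asOfMillis]
>>     }
>>   }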
>>
>> I think this raises a more general point about service definition
>> evolution. There likely need to be metadata endpoints that expose either:
>> 1.  A version of the REST API supported.
>> 2.  Features the API supports (e.g. which query parameters are honored
>> for a specific endpoint).
>>
>> There are pros and cons to both approaches (apologies if I missed this in
>> the spec or if it has already been discussed).
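>>
>> Purely as an illustration of option 2 (the field names here are made up,
>> not something the spec defines today), a capabilities-style response could
>> be as small as:
>>
>>   import java.util.Map;
>>   import java.util.Set;
>>
>>   // Hypothetical capability response shape; names are illustrative only.
>>   record CatalogCapabilities(
>>       String apiVersion,                      // e.g. "1.0"
>>       Map<String, Set<String>> queryParams) { // endpoint -> honored params
>>   }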
>>
>> Cheers,
>> Micah
>>
>> On Tue, Dec 19, 2023 at 8:25 AM Jack Ye <yezhao...@gmail.com> wrote:
>>
>>> Yes, I agree that it is better not to force the implementation in any
>>> particular direction, and a continuation token is probably better than
>>> enforcing a numeric start index.
>>>
>>> In addition to the start index approach, another potentially simple way to
>>> implement the continuation token is to use the last item name, when the
>>> listing is guaranteed to be in lexicographic order. Compared to the start
>>> index approach, it avoids the problem of the start index shifting when
>>> something in the list is added or removed.
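>>>
>>> A minimal sketch of that, assuming a hypothetical LexicographicPaging
>>> helper on the server side:
>>>
>>>   import java.util.List;
>>>   import java.util.stream.Collectors;
>>>
>>>   // Hypothetical sketch: resume a lexicographically ordered listing after
>>>   // the last item named in the token, so inserts or deletes never shift a
>>>   // start index.
>>>   class LexicographicPaging {
>>>     static List<String> pageAfter(List<String> sortedItems, String lastSeen, int limit) {
>>>       return sortedItems.stream()
>>>           .filter(item -> lastSeen == null || item.compareTo(lastSeen) > 0)
>>>           .limit(limit)
>>>           .collect(Collectors.toList());
>>>     }
>>>   }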
>>>
>>> However, the issue of concurrent modification could still exist even
>>> with a continuation token. For example, if an element is added before the
>>> continuation token, then all future listing calls with the token would
>>> always skip that element. If we want to enforce that level of atomicity, we
>>> probably want to introduce another time travel query parameter (e.g.
>>> asOf=1703003028000) to ensure that we are listing results at a specific
>>> point in time of the warehouse, so the complete result list is fixed. (This
>>> is also the missing piece I forgot to mention in the start index approach
>>> to ensure it works in distributed settings.)
>>>
>>> -Jack
>>>
>>> On Tue, Dec 19, 2023, 9:51 AM Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>>
>>>> I tried to cover these in more detail at:
>>>> https://docs.google.com/document/d/1bbfoLssY1szCO_Hm3_93ZcN0UAMpf7kjmpwHQngqQJ0/edit
>>>>
>>>> On Sun, Dec 17, 2023 at 6:07 PM Renjie Liu <liurenjie2...@gmail.com>
>>>> wrote:
>>>>
>>>>> +1 for this approach. I agree that the streaming approach requires
>>>>> HTTP clients and servers to have HTTP/2 streaming support, which is not
>>>>> compatible with old clients.
>>>>>
>>>>> I share Micah's concern that start/limit alone may not be enough in a
>>>>> distributed environment where modifications happen during iteration. For
>>>>> compatibility, we need to consider several cases:
>>>>>
>>>>> 1. Old client <-> New server
>>>>> 2. New client <-> Old server
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Dec 16, 2023 at 6:51 AM Daniel Weeks <dwe...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> I agree that we want to include this feature and I raised similar
>>>>>> concerns to what Micah already presented in talking with Ryan.
>>>>>>
>>>>>> For backward compatibility, just adding a start and limit implies a
>>>>>> deterministic order, which is not a current requirement of the REST spec.
>>>>>>
>>>>>> Also, we need to consider whether the start/limit would need to be
>>>>>> respected by the server. If existing implementations simply return all
>>>>>> the results, will that be sufficient? There are a few edge cases that
>>>>>> need to be considered here.
>>>>>>
>>>>>> For the opaque key approach, I think adding a query param to
>>>>>> trigger/continue and introducing a continuation token in
>>>>>> the ListNamespacesResponse might allow for more backward compatibility.
>>>>>> In that scenario, pagination would only take place for clients who know
>>>>>> how to paginate and the ordering would not need to be deterministic.
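>>>>>>
>>>>>> For instance (field names are illustrative only, not what the spec
>>>>>> defines), the response could carry an optional token that old clients
>>>>>> simply ignore:
>>>>>>
>>>>>>   import java.util.List;
>>>>>>
>>>>>>   // Illustrative shape only: an optional token field is ignored by old
>>>>>>   // clients, so the response stays backward compatible.
>>>>>>   record ListNamespacesResponse(List<List<String>> namespaces,
>>>>>>                                 String nextPageToken) {}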
>>>>>>
>>>>>> -Dan
>>>>>>
>>>>>> On Fri, Dec 15, 2023, 10:33 AM Micah Kornfield <emkornfi...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Just to clarify and add a small suggestion:
>>>>>>>
>>>>>>> The behavior with no additional parameters requires the operations
>>>>>>> to happen as they do today for backwards compatibility (i.e. either all
>>>>>>> responses are returned or a failure occurs).
>>>>>>>
>>>>>>> For new parameters, I'd suggest an opaque start token (instead of a
>>>>>>> specific numeric offset) that can be returned by the service, and a limit
>>>>>>> (as proposed above). If a start token is provided without a limit, a
>>>>>>> default limit can be chosen by the server. Servers might return fewer
>>>>>>> than limit results (i.e. clients are required to check for a next token
>>>>>>> to determine if iteration is complete). This enables server-side state
>>>>>>> if it is desired, but also makes deterministic listing much more feasible
>>>>>>> (deterministic responses are essentially impossible in the face of
>>>>>>> changing data if only a start offset is provided).
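>>>>>>>
>>>>>>> To make that client-side contract concrete, a sketch of the loop (all
>>>>>>> type and method names here are hypothetical, not from the spec):
>>>>>>>
>>>>>>>   import java.util.ArrayList;
>>>>>>>   import java.util.List;
>>>>>>>
>>>>>>>   // Hypothetical client types for illustration only.
>>>>>>>   interface CatalogClient {
>>>>>>>     Page listNamespaces(String pageToken, int limit);
>>>>>>>   }
>>>>>>>
>>>>>>>   record Page(List<String> items, String nextToken) {}
>>>>>>>
>>>>>>>   class PagingExample {
>>>>>>>     // Keep requesting pages until the server stops returning a next
>>>>>>>     // token; servers may return fewer than `limit` items per page.
>>>>>>>     static List<String> listAll(CatalogClient client, int limit) {
>>>>>>>       List<String> all = new ArrayList<>();
>>>>>>>       String token = null;
>>>>>>>       do {
>>>>>>>         Page page = client.listNamespaces(token, limit);  // null token -> first page
>>>>>>>         all.addAll(page.items());
>>>>>>>         token = page.nextToken();  // null when iteration is complete
>>>>>>>       } while (token != null);
>>>>>>>       return all;
>>>>>>>     }
>>>>>>>   }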
>>>>>>>
>>>>>>> In an ideal world, specifying a limit would result in streaming
>>>>>>> responses being returned, with the last part containing a token if
>>>>>>> continuation is necessary. Given the conversation on the other thread
>>>>>>> about streaming, I'd imagine this is quite hard to model in an OpenAPI
>>>>>>> REST service.
>>>>>>>
>>>>>>> Therefore it seems like using pagination with a token and limit would
>>>>>>> be preferred. If skipping to someplace in the middle of the namespaces is
>>>>>>> required, then I would suggest modelling that as a first-class query
>>>>>>> parameter (e.g. "startAfterNamespace").
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Micah
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Dec 15, 2023 at 10:08 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>
>>>>>>>> +1 for this approach
>>>>>>>>
>>>>>>>> I think it's good to use query params because it can be
>>>>>>>> backward-compatible with the current behavior. If you get more than the
>>>>>>>> limit back, then the service probably doesn't support pagination. And if
>>>>>>>> a client doesn't support pagination, they get the same results that they
>>>>>>>> would today. A streaming approach with a continuation link like in the
>>>>>>>> scan API discussion wouldn't work because old clients don't know to make
>>>>>>>> a second request.
>>>>>>>>
>>>>>>>> On Thu, Dec 14, 2023 at 10:07 AM Jack Ye <yezhao...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> During the conversation about the Scan API for the REST spec, we
>>>>>>>>> touched on the topic of pagination when a REST response is large or
>>>>>>>>> takes time to produce.
>>>>>>>>>
>>>>>>>>> I just want to discuss this separately, since we also see the issue for
>>>>>>>>> ListNamespaces and ListTables/Views when integrating with a large
>>>>>>>>> organization that has over 100k namespaces, and also a lot of tables in
>>>>>>>>> some namespaces.
>>>>>>>>>
>>>>>>>>> Pagination requires either keeping state, or the response being
>>>>>>>>> deterministic so that the client can request a range of the full
>>>>>>>>> response. If we want to avoid keeping state, I think we need to allow
>>>>>>>>> some query parameters like:
>>>>>>>>> - *start*: the index of the first item to return in the response
>>>>>>>>> - *limit*: the number of items to be returned in the response
>>>>>>>>>
>>>>>>>>> So we can send a request like:
>>>>>>>>>
>>>>>>>>> *GET /namespaces?start=300&limit=100*
>>>>>>>>>
>>>>>>>>> *GET /namespaces/ns/tables?start=300&limit=100*
>>>>>>>>>
>>>>>>>>> And the REST spec should require that the response returned for a
>>>>>>>>> paginated GET is deterministic.
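>>>>>>>>>
>>>>>>>>> As a sketch of how a client could walk the listing with these two
>>>>>>>>> parameters (the Lister interface is hypothetical, just standing in for
>>>>>>>>> the GET call):
>>>>>>>>>
>>>>>>>>>   import java.util.ArrayList;
>>>>>>>>>   import java.util.List;
>>>>>>>>>
>>>>>>>>>   // Hypothetical client-side sketch for the start/limit proposal:
>>>>>>>>>   // advance the start index by the page size until a short page
>>>>>>>>>   // (fewer than limit items) comes back.
>>>>>>>>>   class StartLimitPaging {
>>>>>>>>>     interface Lister { List<String> list(int start, int limit); }
>>>>>>>>>
>>>>>>>>>     static List<String> listAll(Lister lister, int limit) {
>>>>>>>>>       List<String> all = new ArrayList<>();
>>>>>>>>>       for (int start = 0; ; start += limit) {
>>>>>>>>>         List<String> page = lister.list(start, limit);
>>>>>>>>>         all.addAll(page);
>>>>>>>>>         if (page.size() < limit) {
>>>>>>>>>           return all;  // last page
>>>>>>>>>         }
>>>>>>>>>       }
>>>>>>>>>     }
>>>>>>>>>   }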
>>>>>>>>>
>>>>>>>>> Any thoughts on this?
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Jack Ye
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>>>>>>>
>>>>>>>
>
> --
> Ryan Blue
> Tabular
>
