Yes, I agree that it is better not to force implementations in any
particular direction, and a continuation token is probably better than
enforcing a numeric start index.

Besides the start index approach, another simple way to implement the
continuation token is to use the last item name, provided the listing is
guaranteed to be in lexicographic order. Unlike a start index, the last
item name does not shift when something in the list is added or removed.
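
To make the last-item-name idea concrete, here is a minimal sketch. It
assumes an in-memory, lexicographically sorted list of names; list_page
and its parameters are hypothetical names for illustration only and are
not part of the REST spec.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def list_page(
    names: List[str], limit: int, token: Optional[str] = None
) -> Tuple[List[str], Optional[str]]:
    """Return one page of names and the token for the next page.

    `names` must be sorted lexicographically. `token` is the last name
    returned by the previous page, or None for the first page. The
    returned token is None when there is nothing left to list.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Resume strictly after the token; this works even if the token
    # itself has been removed from the list in the meantime.
    start = 0 if token is None else bisect_right(names, token)
    page = names[start:start + limit]
    next_token = page[-1] if start + limit < len(names) else None
    return page, next_token


names = ["a", "b", "c", "d", "e"]
page1, token = list_page(names, 2)         # (["a", "b"], "b")
names.remove("a")                          # concurrent removal before the token
page2, token = list_page(names, 2, token)  # (["c", "d"], "d")
```

With a numeric start index of 2, the second call would return "d" and "e"
and silently skip "c", because the removal shifts every later item by one.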

However, concurrent modification can still cause problems even with a
continuation token. For example, if an element is added before the
continuation token, then all future listing calls with that token will
skip that element. If we want to enforce that level of atomicity, we
probably want to introduce another time travel query parameter (e.g.
asOf=1703003028000) so that we list results as of a specific point in time
of the warehouse and the complete result list is fixed. (This is also the
piece I forgot to mention for the start index approach; it needs the same
guarantee to work in distributed settings.)
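
As a rough sketch of how an asOf parameter could pin the result list, the
example below assumes the catalog records a creation and drop timestamp
for every entry. Entry, list_as_of and the timestamp fields are
hypothetical and only illustrate the idea; a real catalog may track
history very differently.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Entry:
    name: str
    created_ms: int
    dropped_ms: Optional[int] = None  # None means the entry still exists


def list_as_of(
    entries: List[Entry], as_of_ms: int, limit: int, token: Optional[str] = None
) -> Tuple[List[str], Optional[str]]:
    """List names as they existed at `as_of_ms`, one page at a time."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Keep only entries created at or before as_of_ms and not yet dropped
    # at that time, so later changes cannot alter the result list.
    visible = sorted(
        e.name
        for e in entries
        if e.created_ms <= as_of_ms
        and (e.dropped_ms is None or e.dropped_ms > as_of_ms)
    )
    start = 0 if token is None else bisect_right(visible, token)
    page = visible[start:start + limit]
    next_token = page[-1] if start + limit < len(visible) else None
    return page, next_token


entries = [Entry("b", 100), Entry("c", 100), Entry("d", 100)]
page1, token = list_as_of(entries, 1000, 2)         # (["b", "c"], "c")
entries.append(Entry("a", 2000))                    # added before the token
entries[2].dropped_ms = 2000                        # "d" dropped mid-iteration
page2, token = list_as_of(entries, 1000, 2, token)  # (["d"], None)
```

Both pages together still describe the warehouse as of time 1000, no
matter what was added or dropped while the client was iterating.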

-Jack

On Tue, Dec 19, 2023, 9:51 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

> I tried to cover these in more detail at:
> https://docs.google.com/document/d/1bbfoLssY1szCO_Hm3_93ZcN0UAMpf7kjmpwHQngqQJ0/edit
>
> On Sun, Dec 17, 2023 at 6:07 PM Renjie Liu <liurenjie2...@gmail.com>
> wrote:
>
>> +1 for this approach. I agree that the streaming approach requires that
>> HTTP clients and servers have HTTP/2 streaming support, which is not
>> compatible with old clients.
>>
>> I share the same concern with Micah that only start/limit may not be
>> enough in a distributed environment where modification happens during
>> iterations. For compatibility, we need to consider several cases:
>>
>> 1. Old client <-> New Server
>> 2. New client <-> Old server
>>
>>
>>
>> On Sat, Dec 16, 2023 at 6:51 AM Daniel Weeks <dwe...@apache.org> wrote:
>>
>>> I agree that we want to include this feature and I raised similar
>>> concerns to what Micah already presented in talking with Ryan.
>>>
>>> For backward compatibility, just adding a start and limit implies a
>>> deterministic order, which is not a current requirement of the REST spec.
>>>
>>> Also, we need to consider whether the start/limit would need to be
>>> respected by the server.  If existing implementations simply return all the
>>> results, will that be sufficient?  There are a few edge cases that need to
>>> be considered here.
>>>
>>> For the opaque key approach, I think adding a query param to
>>> trigger/continue and introducing a continuation token in
>>> the ListNamespacesResponse might allow for more backward compatibility.  In
>>> that scenario, pagination would only take place for clients who know how to
>>> paginate and the ordering would not need to be deterministic.
>>>
>>> -Dan
>>>
>>> On Fri, Dec 15, 2023, 10:33 AM Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>>
>>>> Just to clarify and add a small suggestion:
>>>>
>>>> The behavior with no additional parameters requires the operations to
>>>> happen as they do today for backwards compatibility (i.e., either all
>>>> responses are returned or a failure occurs).
>>>>
>>>> For new parameters, I'd suggest an opaque start token (instead of
>>>> specific numeric offset) that can be returned by the service and a limit
>>>> (as proposed above). If a start token is provided without a limit, a
>>>> default limit can be chosen by the server.  Servers might return fewer
>>>> items than the limit (i.e., clients are required to check for a next token
>>>> to determine if iteration is complete).  This enables server side state if
>>>> it is desired
>>>> but also makes deterministic listing much more feasible (deterministic
>>>> responses are essentially impossible in the face of changing data if only a
>>>> start offset is provided).
>>>>
>>>> In an ideal world, specifying a limit would result in streaming
>>>> responses being returned, with the last part containing a token if
>>>> continuation is necessary.  Given the conversation on the other thread
>>>> about streaming, I'd imagine this is quite hard to model in an OpenAPI
>>>> REST service.
>>>>
>>>> Therefore it seems like using pagination with token and offset would be
>>>> preferred.  If skipping someplace in the middle of the namespaces is
>>>> required then I would suggest modelling those as first class query
>>>> parameters (e.g. "startAfterNamespace")
>>>>
>>>> Cheers,
>>>> Micah
>>>>
>>>>
>>>> On Fri, Dec 15, 2023 at 10:08 AM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> +1 for this approach
>>>>>
>>>>> I think it's good to use query params because it can be
>>>>> backward-compatible with the current behavior. If you get more than the
>>>>> limit back, then the service probably doesn't support pagination. And if a
>>>>> client doesn't support pagination they get the same results that they
>>>>> would today. A streaming approach with a continuation link like in the
>>>>> scan API
>>>>> discussion wouldn't work because old clients don't know to make a second
>>>>> request.
>>>>>
>>>>> On Thu, Dec 14, 2023 at 10:07 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> During the conversation about the Scan API for the REST spec, we
>>>>>> touched on the topic of pagination when a REST response is large or
>>>>>> takes time to be produced.
>>>>>>
>>>>>> I just want to discuss this separately, since we also see the issue
>>>>>> for ListNamespaces and ListTables/Views, when integrating with a large
>>>>>> organization that has over 100k namespaces, and also a lot of tables in
>>>>>> some namespaces.
>>>>>>
>>>>>> Pagination requires either keeping state, or the response to be
>>>>>> deterministic such that the client can request a range of the full
>>>>>> response. If we want to avoid keeping state, I think we need to allow
>>>>>> some query parameters like:
>>>>>> - *start*: the start index of the item in the response
>>>>>> - *limit*: the number of items to be returned in the response
>>>>>>
>>>>>> So we can send a request like:
>>>>>>
>>>>>> *GET /namespaces?start=300&limit=100*
>>>>>>
>>>>>> *GET /namespaces/ns/tables?start=300&limit=100*
>>>>>>
>>>>>> And the REST spec should enforce that the response returned for the
>>>>>> paginated GET should be deterministic.
>>>>>>
>>>>>> Any thoughts on this?
>>>>>>
>>>>>> Best,
>>>>>> Jack Ye
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>
