I think you can solve the atomicity problem with a continuation token and
server-side state. In general, I don't think this is a problem we should
worry about a lot since pagination commonly has this problem. But since we
can build a system that allows you to solve it if you choose to, we should
go with that design.
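
As a rough illustration of what I mean (only a sketch, not part of the
proposal): the class, method, and parameter names below are all made up, and
the "server" is just an in-memory object. The first request freezes a sorted
snapshot, and each opaque token maps to a position in that snapshot, so
writes made while a client is paging do not change what it sees.

```python
import secrets


class NamespaceLister:
    """Hypothetical in-memory server that paginates over a frozen snapshot."""

    def __init__(self, namespaces):
        self.namespaces = list(namespaces)  # live, mutable catalog state
        self._sessions = {}                 # token -> (snapshot, next offset)

    def list_page(self, page_token=None, limit=100):
        if limit < 1:
            raise ValueError("limit must be positive")
        if page_token is None:
            # First request: freeze a sorted copy so later writes are invisible.
            snapshot, offset = sorted(self.namespaces), 0
        else:
            # Tokens are single-use; pop raises KeyError for unknown tokens.
            snapshot, offset = self._sessions.pop(page_token)
        page = snapshot[offset:offset + limit]
        next_offset = offset + len(page)
        next_token = None
        if next_offset < len(snapshot):
            next_token = secrets.token_urlsafe(16)  # opaque to the client
            self._sessions[next_token] = (snapshot, next_offset)
        return page, next_token


def list_all(server, limit=100):
    """Client loop: keep requesting until the server returns no token."""
    results, token = [], None
    while True:
        page, token = server.list_page(page_token=token, limit=limit)
        results.extend(page)
        if token is None:
            return results
```

A real service would also need to expire abandoned sessions. A server that
does not care about atomicity could instead encode something like the last
item name in the token and keep no state at all, and the client loop would
not change.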

On Tue, Dec 19, 2023 at 9:13 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> Hi Jack,
> Some answers inline.
>
>
>> In addition to the start index approach, another potential simple way to
>> implement the continuation token is to use the last item name, when the
>> listing is guaranteed to be in lexicographic order.
>
>
> I think this is one viable implementation, but the reason that the token
> should be opaque is that it allows several different implementations
> without client side changes.
>
>> For example, if an element is added before the continuation token, then
>> all future listing calls with the token would always skip that element.
>
>
> IMO this is fine. For some of the REST APIs it is likely important to put
> constraints on atomicity requirements; for others (e.g. list namespaces)
> I think it is OK to have looser requirements.
>
>> If we want to enforce that level of atomicity, we probably want to
>> introduce another time travel query parameter (e.g. asOf=1703003028000) to
>> ensure that we are listing results at a specific point of time of the
>> warehouse, so the complete result list is fixed.
>
>
> Time travel might be useful in some cases but I think it is orthogonal to
> services wishing to have guarantees around atomicity/consistency of
> results.  If a server wants to ensure that results are atomic/consistent as
> of the start of the listing, it can embed the necessary timestamp in the
> token it returns and parse it out when fetching the next result.
>
> I think this does raise a more general point about service definition
> evolution.  There likely need to be metadata endpoints that expose
> either:
> 1.  A version of the REST API supported.
> 2.  Features the API supports (e.g. which query parameters are honored for
> a specific endpoint).
>
> There are pros and cons to both approaches (apologies if I missed this in
> the spec or if it has already been discussed).
>
> Cheers,
> Micah
>
> On Tue, Dec 19, 2023 at 8:25 AM Jack Ye <yezhao...@gmail.com> wrote:
>
>> Yes, I agree that it is better not to force the implementation in any
>> particular direction, and a continuation token is probably better than
>> enforcing a numeric start index.
>>
>> In addition to the start index approach, another potential simple way to
>> implement the continuation token is to use the last item name, when the
>> listing is guaranteed to be in lexicographic order. Compared to the start
>> index approach, it does not need to worry about the change of start index
>> when something in the list is added or removed.
>>
>> However, the issue of concurrent modification could still exist even with
>> a continuation token. For example, if an element is added before the
>> continuation token, then all future listing calls with the token would
>> always skip that element. If we want to enforce that level of atomicity, we
>> probably want to introduce another time travel query parameter (e.g.
>> asOf=1703003028000) to ensure that we are listing results at a specific
>> point of time of the warehouse, so the complete result list is fixed. (This
>> is also the missing piece I forgot to mention in the start index approach
>> to ensure it works in distributed settings)
>>
>> -Jack
>>
>> On Tue, Dec 19, 2023, 9:51 AM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>>> I tried to cover these in more detail at:
>>> https://docs.google.com/document/d/1bbfoLssY1szCO_Hm3_93ZcN0UAMpf7kjmpwHQngqQJ0/edit
>>>
>>> On Sun, Dec 17, 2023 at 6:07 PM Renjie Liu <liurenjie2...@gmail.com>
>>> wrote:
>>>
>>>> +1 for this approach. I agree that the streaming approach requires that
>>>> HTTP clients and servers have HTTP/2 streaming support, which is not
>>>> compatible with old clients.
>>>>
>>>> I share the same concern with Micah that only start/limit may not be
>>>> enough in a distributed environment where modification happens during
>>>> iterations. For compatibility, we need to consider several cases:
>>>>
>>>> 1. Old client <-> New Server
>>>> 2. New client <-> Old server
>>>>
>>>>
>>>>
>>>> On Sat, Dec 16, 2023 at 6:51 AM Daniel Weeks <dwe...@apache.org> wrote:
>>>>
>>>>> I agree that we want to include this feature and I raised similar
>>>>> concerns to what Micah already presented in talking with Ryan.
>>>>>
>>>>> For backward compatibility, just adding a start and limit implies a
>>>>> deterministic order, which is not a current requirement of the REST spec.
>>>>>
>>>>> Also, we need to consider whether the start/limit would need to be
>>>>> respected by the server.  If existing implementations simply return all
>>>>> the results, will that be sufficient?  There are a few edge cases that
>>>>> need to be considered here.
>>>>>
>>>>> For the opaque key approach, I think adding a query param to
>>>>> trigger/continue and introducing a continuation token in
>>>>> the ListNamespacesResponse might allow for more backward compatibility.
>>>>> In that scenario, pagination would only take place for clients who know
>>>>> how to paginate and the ordering would not need to be deterministic.
>>>>>
>>>>> -Dan
>>>>>
>>>>> On Fri, Dec 15, 2023, 10:33 AM Micah Kornfield <emkornfi...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Just to clarify and add a small suggestion:
>>>>>>
>>>>>> The behavior with no additional parameters requires the operations to
>>>>>> happen as they do today for backwards compatibility (i.e., either all
>>>>>> responses are returned or a failure occurs).
>>>>>>
>>>>>> For new parameters, I'd suggest an opaque start token (instead of a
>>>>>> specific numeric offset) that can be returned by the service and a limit
>>>>>> (as proposed above). If a start token is provided without a limit, a
>>>>>> default limit can be chosen by the server.  Servers might return fewer
>>>>>> than limit items (i.e., clients are required to check for a next token
>>>>>> to determine if iteration is complete).  This enables server-side state
>>>>>> if it is desired but also makes deterministic listing much more feasible
>>>>>> (deterministic responses are essentially impossible in the face of
>>>>>> changing data if only a start offset is provided).
>>>>>>
>>>>>> In an ideal world, specifying a limit would result in streaming
>>>>>> responses being returned with the last part containing a token if
>>>>>> continuation is necessary.  Given the conversation on the other thread
>>>>>> about streaming, I'd imagine this is quite hard to model in an OpenAPI
>>>>>> REST service.
>>>>>>
>>>>>> Therefore it seems like using pagination with a token and limit would
>>>>>> be preferred.  If skipping someplace in the middle of the namespaces is
>>>>>> required then I would suggest modelling those as first class query
>>>>>> parameters (e.g. "startAfterNamespace").
>>>>>>
>>>>>> Cheers,
>>>>>> Micah
>>>>>>
>>>>>>
>>>>>> On Fri, Dec 15, 2023 at 10:08 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>>> +1 for this approach
>>>>>>>
>>>>>>> I think it's good to use query params because it can be
>>>>>>> backward-compatible with the current behavior. If you get more than the
>>>>>>> limit back, then the service probably doesn't support pagination. And
>>>>>>> if a client doesn't support pagination they get the same results that
>>>>>>> they would today. A streaming approach with a continuation link like in
>>>>>>> the scan API discussion wouldn't work because old clients don't know to
>>>>>>> make a second request.
>>>>>>>
>>>>>>> On Thu, Dec 14, 2023 at 10:07 AM Jack Ye <yezhao...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> During the conversation about the Scan API for the REST spec, we
>>>>>>>> touched on the topic of pagination when a REST response is large or
>>>>>>>> takes time to be produced.
>>>>>>>>
>>>>>>>> I just want to discuss this separately, since we also see the issue
>>>>>>>> for ListNamespaces and ListTables/Views, when integrating with a large
>>>>>>>> organization that has over 100k namespaces, and also a lot of tables in
>>>>>>>> some namespaces.
>>>>>>>>
>>>>>>>> Pagination requires either keeping state, or the response to be
>>>>>>>> deterministic such that the client can request a range of the full
>>>>>>>> response. If we want to avoid keeping state, I think we need to allow
>>>>>>>> some query parameters like:
>>>>>>>> - *start*: the start index of the item in the response
>>>>>>>> - *limit*: the number of items to be returned in the response
>>>>>>>>
>>>>>>>> So we can send a request like:
>>>>>>>>
>>>>>>>> *GET /namespaces?start=300&limit=100*
>>>>>>>>
>>>>>>>> *GET /namespaces/ns/tables?start=300&limit=100*
>>>>>>>>
>>>>>>>> And the REST spec should enforce that the response returned for the
>>>>>>>> paginated GET should be deterministic.
>>>>>>>>
>>>>>>>> Any thoughts on this?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Jack Ye
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>>>
>>>>>>

-- 
Ryan Blue
Tabular
