Re: Pagination for List APIs in the REST spec

Pucheng Yang Sat, 18 May 2024 16:06:04 -0700

Hi all, is there an ETA for this? Thanks.

On Wed, Dec 20, 2023 at 6:03 PM Renjie Liu <liurenjie2...@gmail.com> wrote:


>> I think if servers provide a meaningful error message on expiration,
>> hopefully this would be a good first step in debugging.  I think saying
>> tokens should generally support O(Minutes) at least should cover most
>> use-cases?
>>
>
> Sounds reasonable to me. Clients just need to be aware that the token is
> for transient usage and should not store it for too long.
>
> On Thu, Dec 21, 2023 at 8:43 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>>> Overall, I don't think it's a good idea to add parallel listing for
>>> things like tables and namespaces as it just adds complexity for an
>>> incredibly narrow (and possibly poorly designed) use case.
>>
>>
>> +1 I think that there are likely a few ways parallelization of table and
>> namespace listing can be incorporated in the future into the API if
>> necessary.
>>
>> I think the one place where parallelization is important immediately is
>> for Planning,  but that is already a separate thread.  Apologies if I
>> forked the conversation too far from that.
>>
>> On Wed, Dec 20, 2023 at 4:06 PM Daniel Weeks <dwe...@apache.org> wrote:
>>
>>> Overall, I don't think it's a good idea to add parallel listing for
>>> things like tables and namespaces as it just adds complexity for an
>>> incredibly narrow (and possibly poorly designed) use case.
>>>
>>> I feel we should leave it up to the server to define whether it will
>>> provide consistency across paginated listing and avoid
>>> bleeding time-travel like concepts (like 'asOf') into the API.  I really
>>> just don't see what practical value it provides as there are no explicit or
>>> consistently held guarantees around these operations.
>>>
>>> I'd agree with Micah's argument that if the server does provide stronger
>>> guarantees, it should manage those via the opaque token and respond with
>>> meaningful errors if it cannot satisfy the internal constraints it imposes
>>> (like timeouts).
>>>
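A minimal sketch of the server-managed option mentioned above: the server keeps listing state behind an opaque token and answers with a descriptive error once its own timeout has passed. The names `ListingSessions` and `TokenExpired`, the single-use tokens, and the in-memory dictionary are assumptions made for illustration; nothing here comes from the REST spec.

```python
import secrets
import time
from typing import Dict, List, Optional, Tuple


class TokenExpired(Exception):
    """Raised so the client can see why its continuation token was rejected."""


class ListingSessions:
    """Hypothetical in-memory store of listing state keyed by opaque tokens."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        # token -> (expiry time, snapshot of names, offset of the next page)
        self._sessions: Dict[str, Tuple[float, List[str], int]] = {}

    def start(self, snapshot: List[str], limit: int):
        """Begin a listing over a snapshot and return the first page."""
        return self._page(list(snapshot), 0, limit)

    def resume(self, token: str, limit: int):
        """Continue a listing, or fail with a message the client can act on."""
        entry = self._sessions.pop(token, None)
        if entry is None:
            raise TokenExpired("unknown or already used page token")
        expires_at, snapshot, offset = entry
        if self._clock() > expires_at:
            raise TokenExpired(
                f"page token expired after {self._ttl}s; restart the listing")
        return self._page(snapshot, offset, limit)

    def _page(self, snapshot: List[str], offset: int, limit: int):
        items = snapshot[offset:offset + limit]
        next_offset = offset + limit
        if next_offset >= len(snapshot):
            return items, None  # no token means the listing is complete
        token = secrets.token_urlsafe(16)
        self._sessions[token] = (self._clock() + self._ttl, snapshot, next_offset)
        return items, token


# Usage with a fake clock so that expiration can be shown without waiting.
now = [0.0]
sessions = ListingSessions(ttl_seconds=300, clock=lambda: now[0])
page, token = sessions.start(["a", "b", "c"], limit=2)
now[0] = 301.0
try:
    sessions.resume(token, limit=2)
    outcome = "ok"
except TokenExpired as exc:
    outcome = str(exc)
```

The expiration time never appears in the response, so a server could change its timeout later without affecting clients.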
>>> It would help to have articulable use cases to really invest in more
>>> complexity in this area and I feel like we're drifting a little into the
>>> speculative at this point.
>>>
>>> -Dan
>>>
>>>
>>>
>>> On Wed, Dec 20, 2023 at 3:27 PM Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>>
>>>>> I agree that this is not quite useful for clients at this moment. But
>>>>> I'm thinking that maybe exposing this will help debugging or diagnosing,
>>>>> users just need to be aware of this potential expiration.
>>>>
>>>>
>>>> I think if servers provide a meaningful error message on expiration,
>>>> hopefully this would be a good first step in debugging.  I think saying
>>>> tokens should generally support O(Minutes) at least should cover most
>>>> use-cases?
>>>>
>>>> On Tue, Dec 19, 2023 at 9:18 PM Renjie Liu <liurenjie2...@gmail.com>
>>>> wrote:
>>>>
>>>>>> If we choose to manage state on the server side, I recommend not
>>>>>> revealing the expiration time to the client, at least not for now. We can
>>>>>> introduce it when there's a practical need. It wouldn't constitute a
>>>>>> breaking change, would it?
>>>>>
>>>>>
>>>>> I agree that this is not quite useful for clients at this moment. But
>>>>> I'm thinking that maybe exposing this will help debugging or diagnosing,
>>>>> users just need to be aware of this potential expiration.
>>>>>
>>>>> On Wed, Dec 20, 2023 at 11:09 AM Xuanwo <xua...@apache.org> wrote:
>>>>>
>>>>>> > For the continuation token, I think one missing part is about the
>>>>>> expiration time of this token, since this may affect the state cleaning
>>>>>> process of the server.
>>>>>>
>>>>>> Some storage services use a continuation token as a binary
>>>>>> representation of internal states. For example, they serialize a
>>>>>> structure into binary and then perform base64 encoding. Services don't
>>>>>> need to maintain state, eliminating the need for state cleaning.
>>>>>>
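A minimal sketch of the stateless token described above, assuming the internal state is a small dictionary (here just the last returned name) and that JSON is an acceptable serialization. The function names `encode_token` and `decode_token` are hypothetical.

```python
import base64
import json


def encode_token(state: dict) -> str:
    """Serialize the server's paging state into an opaque, URL-safe string."""
    raw = json.dumps(state, sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_token(token: str) -> dict:
    """Recover the paging state from a token; no server-side lookup is needed."""
    raw = base64.urlsafe_b64decode(token.encode("ascii"))
    return json.loads(raw.decode("utf-8"))


token = encode_token({"last_name": "ns_0299"})
state = decode_token(token)
```

Because everything needed to resume travels in the token, there is nothing for the server to clean up. A real service would probably also sign or encrypt the payload so that clients cannot tamper with it.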
>>>>>> > Do servers need to expose the expiration time to clients?
>>>>>>
>>>>>> If we choose to manage state on the server side, I recommend not
>>>>>> revealing the expiration time to the client, at least not for now. We can
>>>>>> introduce it when there's a practical need. It wouldn't constitute a
>>>>>> breaking change, would it?
>>>>>>
>>>>>> On Wed, Dec 20, 2023, at 10:57, Renjie Liu wrote:
>>>>>>
>>>>>> For the continuation token, I think one missing part is about the
>>>>>> expiration time of this token, since this may affect the state
>>>>>> cleaning process of the server. There are several things to discuss:
>>>>>>
>>>>>> 1. Should we leave it to the server to decide it or allow the client
>>>>>> to configure it in the API?
>>>>>>
>>>>>> Personally I think it would be enough for the server to determine it
>>>>>> for now, since I don't see any use case for letting clients set the
>>>>>> expiration time in the API.
>>>>>>
>>>>>> 2. Do servers need to expose the expiration time to clients?
>>>>>>
>>>>>> Personally I think it would be enough to expose this through the
>>>>>> getConfig API to let users know this. For now there is no requirement for
>>>>>> a per-request expiration time.
>>>>>>
>>>>>> On Wed, Dec 20, 2023 at 2:49 AM Micah Kornfield <
>>>>>> emkornfi...@gmail.com> wrote:
>>>>>>
>>>>>> IMO, parallelization needs to be a first class entity in the end
>>>>>> point/service design to allow for flexibility (I scanned through the
>>>>>> original proposal for the scan planning and it looked like it was on the
>>>>>> right track).  Using offsets for parallelization is problematic from
>>>>>> both a consistency and scalability perspective if you want to allow for
>>>>>> flexibility in implementation.
>>>>>>
>>>>>> In particular, I think the server needs APIs like:
>>>>>>
>>>>>> DoScan - returns a list of partitions (represented by an opaque
>>>>>> entity).  The list of partitions should support pagination (in an ideal
>>>>>> world, it would be streaming).
>>>>>> GetTasksForPartition - returns scan tasks for a partition (should
>>>>>> also be paginated/streaming, but this is up for debate).  I think it is
>>>>>> an important consideration to allow for empty partitions.
>>>>>>
>>>>>> With this implementation you don't necessarily require separate
>>>>>> server side state (objects in GCS should be sufficient). I think, as Ryan
>>>>>> suggested, one implementation could be to have each partition correspond
>>>>>> to a byte-range in a manifest file for returning the tasks.
>>>>>>
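A rough sketch of the byte-range idea above, assuming a partition is simply a half-open `(start, end)` range over one manifest file. `plan_partitions` and its parameters are hypothetical, and mapping a range back to scan tasks is left out.

```python
from typing import List, Tuple


def plan_partitions(manifest_length: int, target_bytes: int) -> List[Tuple[int, int]]:
    """Split [0, manifest_length) into half-open byte ranges of at most target_bytes."""
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")
    return [(start, min(start + target_bytes, manifest_length))
            for start in range(0, manifest_length, target_bytes)]


partitions = plan_partitions(manifest_length=2500, target_bytes=1000)
```

A range in which no manifest entry begins would yield no tasks, which is one way an empty partition could arise.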
>>>>>> Thanks,
>>>>>> Micah
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Dec 19, 2023 at 9:55 AM Walaa Eldin Moustafa <
>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>
>>>>>> Not necessarily. That is more of a general statement. The pagination
>>>>>> discussion forked from server side scan planning.
>>>>>>
>>>>>> On Tue, Dec 19, 2023 at 9:52 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>> > With start/limit each client can query for its own chunk without
>>>>>> coordination.
>>>>>>
>>>>>> Okay, I understand now. Would you need to parallelize the client for
>>>>>> listing namespaces or tables? That seems odd to me.
>>>>>>
>>>>>> On Tue, Dec 19, 2023 at 9:48 AM Walaa Eldin Moustafa <
>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>
>>>>>> > You can parallelize with opaque tokens by sending a starting point
>>>>>> for the next request.
>>>>>>
>>>>>> I meant we would have to wait for the server to return this starting
>>>>>> point from the previous request? With start/limit each client can query
>>>>>> for its own chunk without coordination.
>>>>>>
>>>>>>
>>>>>> On Tue, Dec 19, 2023 at 9:44 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>> > I think start and offset has the advantage of being parallelizable
>>>>>> (as compared to continuation tokens).
>>>>>>
>>>>>> You can parallelize with opaque tokens by sending a starting point
>>>>>> for the next request.
>>>>>>
>>>>>> > On the other hand, using "asOf" can be complex to implement and
>>>>>> may be too powerful for the pagination use case
>>>>>>
>>>>>> I don't think that we want to add `asOf`. If the service chooses to
>>>>>> do this, it would send a continuation token that has the
>>>>>> information embedded.
>>>>>>
>>>>>> On Tue, Dec 19, 2023 at 9:42 AM Walaa Eldin Moustafa <
>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>
>>>>>> Can we assume it is the responsibility of the server to ensure
>>>>>> determinism (e.g., by caching the results along with query ID)? I think
>>>>>> start and offset has the advantage of being parallelizable (as compared
>>>>>> to continuation tokens). On the other hand, using "asOf" can be complex
>>>>>> to implement and may be too powerful for the pagination use case (because
>>>>>> it allows querying the warehouse as of any point in time, not just now).
>>>>>>
>>>>>> Thanks,
>>>>>> Walaa.
>>>>>>
>>>>>> On Tue, Dec 19, 2023 at 9:40 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>> I think you can solve the atomicity problem with a continuation token
>>>>>> and server-side state. In general, I don't think this is a problem we
>>>>>> should worry about a lot since pagination commonly has this problem. But
>>>>>> since we can build a system that allows you to solve it if you choose to,
>>>>>> we should go with that design.
>>>>>>
>>>>>> On Tue, Dec 19, 2023 at 9:13 AM Micah Kornfield <
>>>>>> emkornfi...@gmail.com> wrote:
>>>>>>
>>>>>> Hi Jack,
>>>>>> Some answers inline.
>>>>>>
>>>>>>
>>>>>> > In addition to the start index approach, another potential simple way
>>>>>> to implement the continuation token is to use the last item name, when
>>>>>> the listing is guaranteed to be in lexicographic order.
>>>>>>
>>>>>>
>>>>>> I think this is one viable implementation, but the reason that the
>>>>>> token should be opaque is that it allows several different
>>>>>> implementations without client side changes.
>>>>>>
>>>>>> > For example, if an element is added before the continuation token,
>>>>>> then all future listing calls with the token would always skip that
>>>>>> element.
>>>>>>
>>>>>>
>>>>>> IMO this is fine. For some of the REST APIs it is likely
>>>>>> important to put constraints on atomicity requirements; for others (e.g.
>>>>>> list namespaces) I think it is OK to have looser requirements.
>>>>>>
>>>>>> > If we want to enforce that level of atomicity, we probably want to
>>>>>> introduce another time travel query parameter (e.g. asOf=1703003028000)
>>>>>> to ensure that we are listing results at a specific point of time of the
>>>>>> warehouse, so the complete result list is fixed.
>>>>>>
>>>>>>
>>>>>> Time travel might be useful in some cases but I think it is
>>>>>> orthogonal to services wishing to have guarantees around
>>>>>> atomicity/consistency of results.  If a server wants to ensure that
>>>>>> results are atomic/consistent as of the start of the listing, it can
>>>>>> embed the necessary timestamp in the token it returns and parse it out
>>>>>> when fetching the next result.
>>>>>>
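A minimal sketch of embedding the timestamp in the token, as described above. It assumes a hypothetical catalog of `(name, created_at_ms)` pairs and a JSON-plus-base64 token; `list_page` and the `as_of` / `after` field names are illustrative only.

```python
import base64
import json
from typing import Optional

# Hypothetical catalog contents: (name, creation time in milliseconds).
CATALOG = [("a", 100), ("b", 200), ("c", 300), ("d", 400)]


def list_page(limit: int, now_ms: int, token: Optional[str] = None):
    """Return (names, next_token), pinning the listing to the first call's time."""
    if token is None:
        state = {"as_of": now_ms, "after": ""}
    else:
        state = json.loads(base64.urlsafe_b64decode(token))
    # Only names that existed at as_of and sort after the last returned name.
    visible = sorted(name for name, created in CATALOG
                     if created <= state["as_of"] and name > state["after"])
    page = visible[:limit]
    if len(visible) <= limit:
        return page, None
    state["after"] = page[-1]
    next_token = base64.urlsafe_b64encode(json.dumps(state).encode()).decode()
    return page, next_token


first, tok = list_page(limit=2, now_ms=350)
CATALOG.append(("bb", 360))  # created after the listing started
second, tok2 = list_page(limit=2, now_ms=500, token=tok)
```

The client never sees an `asOf` parameter; the second call ignores its own `now_ms` and reuses the time carried in the token, so `"bb"` and `"d"` stay out of this listing.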
>>>>>> I think this does raise a more general point around service
>>>>>> definition evolution.  I think there likely need to be metadata
>>>>>> endpoints that expose either:
>>>>>> 1.  A version of the REST API supported.
>>>>>> 2.  Features the API supports (e.g. which query parameters are
>>>>>> honored for a specific endpoint).
>>>>>>
>>>>>> There are pros and cons to both approaches (apologies if I missed
>>>>>> this in the spec or if it has already been discussed).
>>>>>>
>>>>>> Cheers,
>>>>>> Micah
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Dec 19, 2023 at 8:25 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>
>>>>>> Yes I agree that it is better to not enforce the implementation to
>>>>>> favor any direction, and continuation token is probably better than
>>>>>> enforcing a numeric start index.
>>>>>>
>>>>>> In addition to the start index approach, another potential simple way
>>>>>> to implement the continuation token is to use the last item name, when
>>>>>> the listing is guaranteed to be in lexicographic order. Compared to the
>>>>>> start index approach, it does not need to worry about the change of start
>>>>>> index when something in the list is added or removed.
>>>>>>
>>>>>> However, the issue of concurrent modification could still exist even
>>>>>> with a continuation token. For example, if an element is added before the
>>>>>> continuation token, then all future listing calls with the token would
>>>>>> always skip that element. If we want to enforce that level of atomicity,
>>>>>> we probably want to introduce another time travel query parameter (e.g.
>>>>>> asOf=1703003028000) to ensure that we are listing results at a specific
>>>>>> point of time of the warehouse, so the complete result list is fixed.
>>>>>> (This is also the missing piece I forgot to mention in the start index
>>>>>> approach to ensure it works in distributed settings)
>>>>>>
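The difference between the two approaches above can be sketched as follows, assuming listings are sorted lexicographically. `page_by_offset` and `page_after_name` are hypothetical helpers, not spec operations.

```python
from bisect import bisect_right
from typing import List, Optional


def page_by_offset(names: List[str], start: int, limit: int) -> List[str]:
    """Start-index paging: positions shift whenever the list changes."""
    return sorted(names)[start:start + limit]


def page_after_name(names: List[str], last_name: Optional[str], limit: int) -> List[str]:
    """Name-based paging: resume strictly after the last returned name."""
    ordered = sorted(names)
    begin = bisect_right(ordered, last_name) if last_name is not None else 0
    return ordered[begin:begin + limit]


names = ["b", "c", "d", "e"]
first = page_after_name(names, None, 2)
names.append("a")  # concurrent insert that sorts before the cursor
by_offset = page_by_offset(names, 2, 2)
by_name = page_after_name(names, first[-1], 2)
```

After the insert, the offset-based second page repeats `"c"`, while the name-based page continues with `"d"` and `"e"`. In both cases `"a"` is never returned to this client, which is the skipped-element case described above.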
>>>>>> -Jack
>>>>>>
>>>>>> On Tue, Dec 19, 2023, 9:51 AM Micah Kornfield <emkornfi...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> I tried to cover these in more detail at:
>>>>>> https://docs.google.com/document/d/1bbfoLssY1szCO_Hm3_93ZcN0UAMpf7kjmpwHQngqQJ0/edit
>>>>>>
>>>>>> On Sun, Dec 17, 2023 at 6:07 PM Renjie Liu <liurenjie2...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> +1 for this approach. I agree that the streaming approach requires
>>>>>> that HTTP clients and servers support HTTP/2 streaming, which is not
>>>>>> compatible with old clients.
>>>>>>
>>>>>> I share the same concern with Micah that only start/limit may not be
>>>>>> enough in a distributed environment where modification happens during
>>>>>> iterations. For compatibility, we need to consider several cases:
>>>>>>
>>>>>> 1. Old client <-> New Server
>>>>>> 2. New client <-> Old server
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, Dec 16, 2023 at 6:51 AM Daniel Weeks <dwe...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>> I agree that we want to include this feature and I raised similar
>>>>>> concerns to what Micah already presented in talking with Ryan.
>>>>>>
>>>>>> For backward compatibility, just adding a start and limit implies a
>>>>>> deterministic order, which is not a current requirement of the REST spec.
>>>>>>
>>>>>> Also, we need to consider whether the start/limit would need to be
>>>>>> respected by the server.  If existing implementations simply return all
>>>>>> the results, will that be sufficient?  There are a few edge cases that
>>>>>> need to be considered here.
>>>>>>
>>>>>> For the opaque key approach, I think adding a query param to
>>>>>> trigger/continue and introducing a continuation token in
>>>>>> the ListNamespacesResponse might allow for more backward compatibility.
>>>>>> In that scenario, pagination would only take place for clients who know
>>>>>> how to paginate and the ordering would not need to be deterministic.
>>>>>>
>>>>>> -Dan
>>>>>>
>>>>>> On Fri, Dec 15, 2023, 10:33 AM Micah Kornfield <emkornfi...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Just to clarify and add a small suggestion:
>>>>>>
>>>>>> The behavior with no additional parameters requires the operations to
>>>>>> happen as they do today for backwards compatibility (i.e., either all
>>>>>> responses are returned or a failure occurs).
>>>>>>
>>>>>> For new parameters, I'd suggest an opaque start token (instead of
>>>>>> specific numeric offset) that can be returned by the service and a limit
>>>>>> (as proposed above). If a start token is provided without a limit, a
>>>>>> default limit can be chosen by the server.  Servers might return fewer
>>>>>> than limit items (i.e. clients are required to check for a next token to
>>>>>> determine if iteration is complete).  This enables server side state if
>>>>>> it is desired but also makes deterministic listing much more feasible
>>>>>> (deterministic responses are essentially impossible in the face of
>>>>>> changing data if only a start offset is provided).
>>>>>>
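A minimal sketch of the client side of this proposal, assuming each response can be reduced to a pair `(items, next_token)`. `list_all` and the stand-in server `fake_fetch` are hypothetical; the token is a number in disguise only because the fake server needs something to resume from, and the client never inspects it.

```python
from typing import Callable, Iterator, List, Optional, Tuple

Page = Tuple[List[str], Optional[str]]


def list_all(fetch: Callable[[Optional[str], int], Page], limit: int) -> Iterator[str]:
    """Yield every item, following next tokens until the server stops sending one."""
    token: Optional[str] = None
    while True:
        items, token = fetch(token, limit)
        yield from items
        # A short (or even empty) page does not mean the listing is finished;
        # only a missing next token does.
        if token is None:
            return


def fake_fetch(token: Optional[str], limit: int) -> Page:
    """Stand-in server that returns fewer items than requested on each call."""
    data = ["ns1", "ns2", "ns3", "ns4", "ns5"]
    start = int(token) if token else 0
    end = min(start + max(1, limit - 1), len(data))
    return data[start:end], (str(end) if end < len(data) else None)


result = list(list_all(fake_fetch, limit=3))
```

The loop terminates on the absence of a token rather than on page size, so it keeps working if a server chooses a smaller default limit.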
>>>>>> In an ideal world, specifying a limit would result in streaming
>>>>>> responses being returned with the last part containing a token if
>>>>>> continuation is necessary.  Given conversation on the other thread of
>>>>>> streaming, I'd imagine this is quite hard to model in an Open API REST
>>>>>> service.
>>>>>>
>>>>>> Therefore it seems like using pagination with token and offset would
>>>>>> be preferred.  If skipping someplace in the middle of the namespaces is
>>>>>> required then I would suggest modelling those as first class query
>>>>>> parameters (e.g. "startAfterNamespace")
>>>>>>
>>>>>> Cheers,
>>>>>> Micah
>>>>>>
>>>>>>
>>>>>> On Fri, Dec 15, 2023 at 10:08 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>> +1 for this approach
>>>>>>
>>>>>> I think it's good to use query params because it can be
>>>>>> backward-compatible with the current behavior. If you get more than the
>>>>>> limit back, then the service probably doesn't support pagination. And if
>>>>>> a client doesn't support pagination they get the same results that they
>>>>>> would today. A streaming approach with a continuation link like in the
>>>>>> scan API discussion wouldn't work because old clients don't know to make
>>>>>> a second request.
>>>>>>
>>>>>> On Thu, Dec 14, 2023 at 10:07 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> During the discussion of the Scan API for the REST spec, we touched on
>>>>>> the topic of pagination when a REST response is large or takes time to
>>>>>> be produced.
>>>>>>
>>>>>> I just want to discuss this separately, since we also see the issue
>>>>>> for ListNamespaces and ListTables/Views, when integrating with a large
>>>>>> organization that has over 100k namespaces, and also a lot of tables in
>>>>>> some namespaces.
>>>>>>
>>>>>> Pagination requires either keeping state, or the response to be
>>>>>> deterministic such that the client can request a range of the full
>>>>>> response. If we want to avoid keeping state, I think we need to allow
>>>>>> some query parameters like:
>>>>>> - *start*: the start index of the item in the response
>>>>>> - *limit*: the number of items to be returned in the response
>>>>>>
>>>>>> So we can send a request like:
>>>>>>
>>>>>> *GET /namespaces?start=300&limit=100*
>>>>>>
>>>>>> *GET /namespaces/ns/tables?start=300&limit=100*
>>>>>>
>>>>>> And the REST spec should enforce that the response returned for the
>>>>>> paginated GET should be deterministic.
>>>>>>
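A small sketch of how a client could build the start/limit requests shown above. It assumes the client already knows the total item count, which the proposal does not provide; `chunk_requests` is a hypothetical helper.

```python
from typing import List
from urllib.parse import urlencode


def chunk_requests(path: str, total: int, limit: int) -> List[str]:
    """Build one request path per chunk of at most `limit` items."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return [f"{path}?{urlencode({'start': start, 'limit': limit})}"
            for start in range(0, total, limit)]


requests = chunk_requests("/namespaces", total=250, limit=100)
```

Each path is independent of the others, so the requests could be issued in any order. They only produce a consistent listing if the server returns items in a deterministic order that does not change between calls.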
>>>>>> Any thoughts on this?
>>>>>>
>>>>>> Best,
>>>>>> Jack Ye
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>>>
>>>>>> Xuanwo
>>>>>>
>>>>>>
