> Overall, I don't think it's a good idea to add parallel listing for things like tables and namespaces as it just adds complexity for an incredibly narrow (and possibly poorly designed) use case.

+1. I think that there are likely a few ways parallelization of table and namespace listing can be incorporated into the API in the future if necessary. I think the one place where parallelization is important immediately is for Planning, but that is already a separate thread. Apologies if I forked the conversation too far from that.

On Wed, Dec 20, 2023 at 4:06 PM Daniel Weeks <dwe...@apache.org> wrote:

> Overall, I don't think it's a good idea to add parallel listing for things like tables and namespaces as it just adds complexity for an incredibly narrow (and possibly poorly designed) use case.

> I feel we should leave it up to the server to define whether it will provide consistency across paginated listing and avoid bleeding time-travel-like concepts (like 'asOf') into the API. I really just don't see what practical value it provides, as there are no explicit or consistently held guarantees around these operations.

> I'd agree with Micah's argument that if the server does provide stronger guarantees, it should manage those via the opaque token and respond with meaningful errors if it cannot satisfy the internal constraints it imposes (like timeouts).

> It would help to have articulable use cases to really invest in more complexity in this area, and I feel like we're drifting a little into the speculative at this point.

> -Dan

> On Wed, Dec 20, 2023 at 3:27 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

>>> I agree that this is not quite useful for clients at this moment. But I'm thinking that maybe exposing this will help debugging or diagnosing; users just need to be aware of this potential expiration.

>> I think if servers provide a meaningful error message on expiration, hopefully this would be a good first step in debugging. I think saying tokens should generally be valid for O(minutes) at least should cover most use cases?

>> On Tue, Dec 19, 2023 at 9:18 PM Renjie Liu <liurenjie2...@gmail.com> wrote:

>>>> If we choose to manage state on the server side, I recommend not revealing the expiration time to the client, at least not for now. We can introduce it when there's a practical need. It wouldn't constitute a breaking change, would it?

>>> I agree that this is not quite useful for clients at this moment. But I'm thinking that maybe exposing this will help debugging or diagnosing; users just need to be aware of this potential expiration.

>>> On Wed, Dec 20, 2023 at 11:09 AM Xuanwo <xua...@apache.org> wrote:

>>>> > For the continuation token, I think one missing part is about the expiration time of this token, since this may affect the state cleaning process of the server.

>>>> Some storage services use a continuation token as a binary representation of internal states. For example, they serialize a structure into binary and then perform base64 encoding. Services don't need to maintain state, eliminating the need for state cleaning.

>>>> > Do servers need to expose the expiration time to clients?

>>>> If we choose to manage state on the server side, I recommend not revealing the expiration time to the client, at least not for now. We can introduce it when there's a practical need. It wouldn't constitute a breaking change, would it?
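
A minimal sketch of what such a stateless, opaque continuation token could look like, assuming the server serializes its listing position (and, if it wants consistency, a snapshot timestamp) into JSON and base64-encodes it. The field and function names here are illustrative assumptions, not anything defined in the REST spec:

```python
# Hypothetical sketch only: the payload fields ("last", "asOfMs") and the
# helper names are illustrative, not part of the Iceberg REST spec.
import base64
import json


def encode_token(last_namespace: str, snapshot_ts_ms: int) -> str:
    """Pack the server's listing position into an opaque, URL-safe string."""
    state = {"last": last_namespace, "asOfMs": snapshot_ts_ms}
    return base64.urlsafe_b64encode(json.dumps(state).encode()).decode()


def decode_token(token: str) -> dict:
    """Recover the listing position on the next request; clients never parse this."""
    return json.loads(base64.urlsafe_b64decode(token.encode()))


# Round trip: the client just echoes the string back on its next request.
token = encode_token("accounting.daily_reports", 1703003028000)
assert decode_token(token) == {"last": "accounting.daily_reports", "asOfMs": 1703003028000}
```

Because all of the state lives in the token itself, the server has nothing to expire or clean up, which is the property described above.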

>>>> On Wed, Dec 20, 2023, at 10:57, Renjie Liu wrote:

>>>> For the continuation token, I think one missing part is about the expiration time of this token, since this may affect the state cleaning process of the server. There are several things to discuss:

>>>> 1. Should we leave it to the server to decide, or allow the client to configure it in the API?

>>>> Personally I think it would be enough for the server to determine it for now, since I don't see any use case for allowing clients to set the expiration time in the API.

>>>> 2. Do servers need to expose the expiration time to clients?

>>>> Personally I think it would be enough to expose this through the getConfig API to let users know about it. For now there is no requirement for a per-request expiration time.

>>>> On Wed, Dec 20, 2023 at 2:49 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

>>>> IMO, parallelization needs to be a first-class entity in the endpoint/service design to allow for flexibility (I scanned through the original proposal for the scan planning and it looked like it was on the right track). Using offsets for parallelization is problematic from both a consistency and scalability perspective if you want to allow for flexibility in implementation.

>>>> In particular, I think the server needs APIs like:

>>>> DoScan - returns a list of partitions (represented by an opaque entity). The list of partitions should support pagination (in an ideal world, it would be streaming).
>>>> GetTasksForPartition - returns scan tasks for a partition (should also be paginated/streaming, but this is up for debate). I think it is an important consideration to allow for empty partitions.

>>>> With this implementation you don't necessarily require separate server-side state (objects in GCS should be sufficient). As Ryan suggested, one implementation could be to have each partition correspond to a byte range in a manifest file for returning the tasks.

>>>> Thanks,
>>>> Micah

>>>> On Tue, Dec 19, 2023 at 9:55 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:

>>>> Not necessarily. That is more of a general statement. The pagination discussion forked from server-side scan planning.

>>>> On Tue, Dec 19, 2023 at 9:52 AM Ryan Blue <b...@tabular.io> wrote:

>>>> > With start/limit each client can query for its own chunk without coordination.

>>>> Okay, I understand now. Would you need to parallelize the client for listing namespaces or tables? That seems odd to me.

>>>> On Tue, Dec 19, 2023 at 9:48 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:

>>>> > You can parallelize with opaque tokens by sending a starting point for the next request.

>>>> I meant that we would have to wait for the server to return this starting point from the previous request? With start/limit each client can query for its own chunk without coordination.

>>>> On Tue, Dec 19, 2023 at 9:44 AM Ryan Blue <b...@tabular.io> wrote:

>>>> > I think start and offset has the advantage of being parallelizable (as compared to continuation tokens).

>>>> You can parallelize with opaque tokens by sending a starting point for the next request.

>>>> > On the other hand, using "asOf" can be complex to implement and may be too powerful for the pagination use case

>>>> I don't think that we want to add `asOf`. If the service chooses to do this, it would send a continuation token that has the information embedded.

>>>> On Tue, Dec 19, 2023 at 9:42 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:

>>>> Can we assume it is the responsibility of the server to ensure determinism (e.g., by caching the results along with a query ID)? I think start and offset has the advantage of being parallelizable (as compared to continuation tokens). On the other hand, using "asOf" can be complex to implement and may be too powerful for the pagination use case (because it allows querying the warehouse as of any point in time, not just now).

>>>> Thanks,
>>>> Walaa.

>>>> On Tue, Dec 19, 2023 at 9:40 AM Ryan Blue <b...@tabular.io> wrote:

>>>> I think you can solve the atomicity problem with a continuation token and server-side state. In general, I don't think this is a problem we should worry about a lot, since pagination commonly has this problem. But since we can build a system that allows you to solve it if you choose to, we should go with that design.

>>>> On Tue, Dec 19, 2023 at 9:13 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

>>>> Hi Jack,
>>>> Some answers inline.

>>>> > In addition to the start index approach, another potential simple way to implement the continuation token is to use the last item name, when the listing is guaranteed to be in lexicographic order.

>>>> I think this is one viable implementation, but the reason that the token should be opaque is that it allows several different implementations without client-side changes.

>>>> > For example, if an element is added before the continuation token, then all future listing calls with the token would always skip that element.

>>>> IMO, I think this is fine. For some of the REST APIs it is likely important to put constraints on atomicity requirements; for others (e.g. list namespaces) I think it is OK to have looser requirements.

>>>> > If we want to enforce that level of atomicity, we probably want to introduce another time travel query parameter (e.g. asOf=1703003028000) to ensure that we are listing results at a specific point in time of the warehouse, so the complete result list is fixed.

>>>> Time travel might be useful in some cases, but I think it is orthogonal to services wishing to have guarantees around atomicity/consistency of results. If a server wants to ensure that results are atomic/consistent as of the start of the listing, it can embed the necessary timestamp in the token it returns and parse it out when fetching the next result.

>>>> I think this does raise a more general point around service definition evolution. I think there likely need to be metadata endpoints that expose either:
>>>> 1. A version of the REST API supported.
>>>> 2. Features the API supports (e.g. which query parameters are honored for a specific endpoint).

>>>> There are pros and cons to both approaches (apologies if I missed this in the spec or if it has already been discussed).

>>>> Cheers,
>>>> Micah

>>>> On Tue, Dec 19, 2023 at 8:25 AM Jack Ye <yezhao...@gmail.com> wrote:

>>>> Yes, I agree that it is better to not enforce the implementation to favor any direction, and a continuation token is probably better than enforcing a numeric start index.

>>>> In addition to the start index approach, another potential simple way to implement the continuation token is to use the last item name, when the listing is guaranteed to be in lexicographic order. Compared to the start index approach, it does not need to worry about the start index changing when something in the list is added or removed.

>>>> However, the issue of concurrent modification could still exist even with a continuation token. For example, if an element is added before the continuation token, then all future listing calls with the token would always skip that element. If we want to enforce that level of atomicity, we probably want to introduce another time travel query parameter (e.g. asOf=1703003028000) to ensure that we are listing results at a specific point in time of the warehouse, so the complete result list is fixed. (This is also the missing piece I forgot to mention for the start index approach to ensure it works in distributed settings.)

>>>> -Jack

>>>> On Tue, Dec 19, 2023, 9:51 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

>>>> I tried to cover these in more detail at:
>>>> https://docs.google.com/document/d/1bbfoLssY1szCO_Hm3_93ZcN0UAMpf7kjmpwHQngqQJ0/edit

>>>> On Sun, Dec 17, 2023 at 6:07 PM Renjie Liu <liurenjie2...@gmail.com> wrote:

>>>> +1 for this approach. I agree that the streaming approach requires that HTTP clients and servers have HTTP/2 streaming support, which is not compatible with old clients.

>>>> I share the same concern with Micah that only start/limit may not be enough in a distributed environment where modifications happen during iteration. For compatibility, we need to consider several cases:

>>>> 1. Old client <-> new server
>>>> 2. New client <-> old server

>>>> On Sat, Dec 16, 2023 at 6:51 AM Daniel Weeks <dwe...@apache.org> wrote:

>>>> I agree that we want to include this feature, and I raised similar concerns to what Micah already presented in talking with Ryan.

>>>> For backward compatibility, just adding a start and limit implies a deterministic order, which is not a current requirement of the REST spec.

>>>> Also, we need to consider whether the start/limit would need to be respected by the server. If existing implementations simply return all the results, will that be sufficient? There are a few edge cases that need to be considered here.

>>>> For the opaque key approach, I think adding a query param to trigger/continue and introducing a continuation token in the ListNamespacesResponse might allow for more backward compatibility. In that scenario, pagination would only take place for clients that know how to paginate, and the ordering would not need to be deterministic.

>>>> -Dan

>>>> On Fri, Dec 15, 2023, 10:33 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

>>>> Just to clarify and add a small suggestion:

>>>> The behavior with no additional parameters requires the operations to happen as they do today for backwards compatibility (i.e. either all responses are returned or a failure occurs).

>>>> For new parameters, I'd suggest an opaque start token (instead of a specific numeric offset) that can be returned by the service and a limit (as proposed above). If a start token is provided without a limit, a default limit can be chosen by the server.
>>>> Servers might return less than the limit (i.e. clients are required to check for a next token to determine if iteration is complete). This enables server-side state if it is desired, but also makes deterministic listing much more feasible (deterministic responses are essentially impossible in the face of changing data if only a start offset is provided).

>>>> In an ideal world, specifying a limit would result in streaming responses being returned, with the last part containing a token if continuation is necessary. Given the conversation on the other thread about streaming, I'd imagine this is quite hard to model in an OpenAPI REST service.

>>>> Therefore it seems like using pagination with token and offset would be preferred. If skipping to someplace in the middle of the namespaces is required, then I would suggest modelling that as a first-class query parameter (e.g. "startAfterNamespace").

>>>> Cheers,
>>>> Micah

>>>> On Fri, Dec 15, 2023 at 10:08 AM Ryan Blue <b...@tabular.io> wrote:

>>>> +1 for this approach

>>>> I think it's good to use query params because it can be backward-compatible with the current behavior. If you get more than the limit back, then the service probably doesn't support pagination. And if a client doesn't support pagination, they get the same results that they would today. A streaming approach with a continuation link like in the scan API discussion wouldn't work because old clients don't know to make a second request.

>>>> On Thu, Dec 14, 2023 at 10:07 AM Jack Ye <yezhao...@gmail.com> wrote:

>>>> Hi everyone,

>>>> During the conversation about the Scan API for the REST spec, we touched on the topic of pagination for when a REST response is large or takes time to produce.

>>>> I just want to discuss this separately, since we also see the issue for ListNamespaces and ListTables/Views when integrating with a large organization that has over 100k namespaces, and also a lot of tables in some namespaces.

>>>> Pagination requires either keeping state, or the response to be deterministic such that the client can request a range of the full response. If we want to avoid keeping state, I think we need to allow some query parameters like:
>>>> - *start*: the index of the first item in the response
>>>> - *limit*: the number of items to be returned in the response

>>>> So we can send a request like:

>>>> *GET /namespaces?start=300&limit=100*
>>>> *GET /namespaces/ns/tables?start=300&limit=100*

>>>> And the REST spec should enforce that the response returned for the paginated GET should be deterministic.

>>>> Any thoughts on this?

>>>> Best,
>>>> Jack Ye

>>>> --
>>>> Ryan Blue
>>>> Tabular

>>>> Xuanwo
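
For concreteness, here is a minimal sketch of the client-side loop for the opaque-token approach discussed above, assuming hypothetical `pageToken` and `limit` query parameters and a `next-page-token` response field (placeholder names, not anything from the spec). An old server that ignores these parameters returns the full listing in one response, so the loop terminates after a single request, which is the backward-compatibility property discussed in the thread:

```python
# Illustrative only: "pageToken", "limit", and "next-page-token" are assumed
# names for this sketch, not identifiers from the Iceberg REST spec.
import requests


def list_namespaces(base_url: str, limit: int = 100) -> list:
    """Page through the namespace listing using an opaque continuation token."""
    namespaces = []
    token = None
    while True:
        params = {"limit": limit}
        if token:
            params["pageToken"] = token
        resp = requests.get(f"{base_url}/namespaces", params=params)
        resp.raise_for_status()
        body = resp.json()
        namespaces.extend(body.get("namespaces", []))
        # A server that does not paginate simply omits the token, so both old
        # and new servers terminate this loop correctly.
        token = body.get("next-page-token")
        if not token:
            return namespaces
```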