> I agree that this is not quite useful for clients at this moment. But I'm
> thinking that maybe exposing this will help debugging or diagnosing; users
> just need to be aware of this potential expiration.

I think if servers provide a meaningful error message on expiration, hopefully this would be a good first step in debugging. I think saying tokens should generally be valid for at least O(minutes) should cover most use cases?

On Tue, Dec 19, 2023 at 9:18 PM Renjie Liu <liurenjie2...@gmail.com> wrote:

> If we choose to manage state on the server side, I recommend not revealing
> the expiration time to the client, at least not for now. We can introduce
> it when there's a practical need. It wouldn't constitute a breaking change,
> would it?

I agree that this is not quite useful for clients at this moment. But I'm thinking that maybe exposing this will help debugging or diagnosing; users just need to be aware of this potential expiration.

On Wed, Dec 20, 2023 at 11:09 AM Xuanwo <xua...@apache.org> wrote:

> For the continuation token, I think one missing part is about the
> expiration time of this token, since this may affect the state cleaning
> process of the server.

Some storage services use a continuation token as a binary representation of internal states. For example, they serialize a structure into binary and then perform base64 encoding. Services don't need to maintain state, eliminating the need for state cleaning.

> Do servers need to expose the expiration time to clients?

If we choose to manage state on the server side, I recommend not revealing the expiration time to the client, at least not for now. We can introduce it when there's a practical need. It wouldn't constitute a breaking change, would it?
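As an illustration of the stateless approach Xuanwo describes (the token is the server's serialized state, not a key into stored state), here is a minimal sketch; the field names and the HMAC signing step are assumptions for illustration, not part of any spec:

```python
import base64
import hashlib
import hmac
import json

SECRET = b"server-side-secret"  # assumed: a real deployment would use a managed key


def encode_token(state: dict) -> str:
    """Serialize internal paging state into an opaque, tamper-evident token."""
    payload = json.dumps(state, sort_keys=True).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(sig + payload).decode()


def decode_token(token: str) -> dict:
    """Validate and deserialize a token previously produced by encode_token."""
    raw = base64.urlsafe_b64decode(token.encode())
    sig, payload = raw[:32], raw[32:]
    if not hmac.compare_digest(sig, hmac.new(SECRET, payload, hashlib.sha256).digest()):
        raise ValueError("invalid or tampered continuation token")
    return json.loads(payload)


# Example: the "state" could be a position in a listing or a byte range in a manifest.
token = encode_token({"last_namespace": "db42", "page_size": 100})
print(decode_token(token))
```

Because the token carries its own state, the server has nothing to clean up when a token is abandoned.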
On Wed, Dec 20, 2023, at 10:57, Renjie Liu wrote:

For the continuation token, I think one missing part is about the expiration time of this token, since this may affect the state cleaning process of the server. There are several things to discuss:

1. Should we leave it to the server to decide, or allow the client to configure it in the API?

Personally I think it would be enough for the server to determine it for now, since I don't see any use case for allowing clients to set the expiration time in the API.

2. Do servers need to expose the expiration time to clients?

Personally I think it would be enough to expose this through the getConfig API to let users know this. For now there is no requirement for a per-request expiration time.

On Wed, Dec 20, 2023 at 2:49 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

IMO, parallelization needs to be a first-class entity in the endpoint/service design to allow for flexibility (I scanned through the original proposal for scan planning and it looked like it was on the right track). Using offsets for parallelization is problematic from both a consistency and a scalability perspective if you want to allow for flexibility in implementation.

In particular, I think the server needs APIs like:

DoScan - returns a list of partitions (represented by an opaque entity). The list of partitions should support pagination (in an ideal world, it would be streaming).
GetTasksForPartition - returns scan tasks for a partition (should also be paginated/streaming, but this is up for debate). I think it is an important consideration to allow for empty partitions.

With this implementation you don't necessarily require separate server-side state (objects in GCS should be sufficient). I think, as Ryan suggested, one implementation could be to have each partition correspond to a byte range in a manifest file for returning the tasks.

Thanks,
Micah
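A rough client-side sketch of the two-call flow Micah outlines. DoScan and GetTasksForPartition are not endpoints in the current REST spec; the URL paths, query parameters, and response fields below are all hypothetical, used only to show how pagination and empty partitions would compose:

```python
import requests

BASE = "https://catalog.example.com/v1"  # assumed base URL


def plan_scan(table: str):
    """Fetch partitions (opaque handles), then fetch tasks per partition, with paging."""
    page_token = None
    while True:
        params = {"pageToken": page_token} if page_token else {}
        resp = requests.post(f"{BASE}/tables/{table}/plan", params=params).json()
        for partition in resp["partitions"]:  # opaque entities chosen by the server
            yield from tasks_for_partition(table, partition)
        page_token = resp.get("nextPageToken")
        if not page_token:
            break


def tasks_for_partition(table: str, partition: str):
    """A partition may legitimately be empty; the loop handles that naturally."""
    page_token = None
    while True:
        params = {"partition": partition}
        if page_token:
            params["pageToken"] = page_token
        resp = requests.get(f"{BASE}/tables/{table}/tasks", params=params).json()
        yield from resp.get("tasks", [])
        page_token = resp.get("nextPageToken")
        if not page_token:
            break
```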
On Tue, Dec 19, 2023 at 9:55 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:

Not necessarily. That is more of a general statement. The pagination discussion forked from server-side scan planning.

On Tue, Dec 19, 2023 at 9:52 AM Ryan Blue <b...@tabular.io> wrote:

> With start/limit each client can query for its own chunk without coordination.

Okay, I understand now. Would you need to parallelize the client for listing namespaces or tables? That seems odd to me.

On Tue, Dec 19, 2023 at 9:48 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:

> You can parallelize with opaque tokens by sending a starting point for the next request.

I meant we would have to wait for the server to return this starting point from the past request? With start/limit each client can query for its own chunk without coordination.

On Tue, Dec 19, 2023 at 9:44 AM Ryan Blue <b...@tabular.io> wrote:

> I think start and offset have the advantage of being parallelizable (as compared to continuation tokens).

You can parallelize with opaque tokens by sending a starting point for the next request.

> On the other hand, using "asOf" can be complex to implement and may be too powerful for the pagination use case

I don't think that we want to add `asOf`. If the service chooses to do this, it would send a continuation token that has the information embedded.

On Tue, Dec 19, 2023 at 9:42 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:

Can we assume it is the responsibility of the server to ensure determinism (e.g., by caching the results along with a query ID)? I think start and offset have the advantage of being parallelizable (as compared to continuation tokens). On the other hand, using "asOf" can be complex to implement and may be too powerful for the pagination use case (because it allows querying the warehouse as of any point in time, not just now).

Thanks,
Walaa.

On Tue, Dec 19, 2023 at 9:40 AM Ryan Blue <b...@tabular.io> wrote:

I think you can solve the atomicity problem with a continuation token and server-side state. In general, I don't think this is a problem we should worry about a lot, since pagination commonly has this problem. But since we can build a system that allows you to solve it if you choose to, we should go with that design.
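The parallelization trade-off being debated above can be made concrete with a small sketch: numeric start/limit lets independent workers fetch disjoint chunks with no coordination, while an opaque token forces a sequential chain of requests. The endpoint path, parameter names, and response fields are assumptions:

```python
import concurrent.futures
import requests

BASE = "https://catalog.example.com/v1"  # assumed base URL


def fetch_chunk(start: int, limit: int):
    resp = requests.get(f"{BASE}/namespaces", params={"start": start, "limit": limit})
    return resp.json().get("namespaces", [])


def list_parallel(total_hint: int, limit: int = 100):
    """start/limit: each worker grabs its own range; this only works if ordering is deterministic."""
    starts = range(0, total_hint, limit)
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        for chunk in pool.map(lambda s: fetch_chunk(s, limit), starts):
            yield from chunk


def list_sequential_with_token():
    """Opaque token: each request depends on the previous response, so requests serialize."""
    token = None
    while True:
        params = {"pageToken": token} if token else {}
        body = requests.get(f"{BASE}/namespaces", params=params).json()
        yield from body.get("namespaces", [])
        token = body.get("nextPageToken")
        if not token:
            break
```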
On Tue, Dec 19, 2023 at 9:13 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

Hi Jack,
Some answers inline.

> In addition to the start index approach, another potential simple way to
> implement the continuation token is to use the last item name, when the
> listing is guaranteed to be in lexicographic order.

I think this is one viable implementation, but the reason that the token should be opaque is that it allows several different implementations without client-side changes.

> For example, if an element is added before the continuation token, then
> all future listing calls with the token would always skip that element.

IMO, I think this is fine. For some of the REST APIs it is likely important to put constraints on atomicity requirements; for others (e.g. list namespaces) I think it is OK to have looser requirements.

> If we want to enforce that level of atomicity, we probably want to
> introduce another time travel query parameter (e.g. asOf=1703003028000) to
> ensure that we are listing results at a specific point in time of the
> warehouse, so the complete result list is fixed.

Time travel might be useful in some cases, but I think it is orthogonal to services wishing to have guarantees around atomicity/consistency of results. If a server wants to ensure that results are atomic/consistent as of the start of the listing, it can embed the necessary timestamp in the token it returns and parse it out when fetching the next result.

I think this does raise a more general point around service definition evolution. There likely need to be metadata endpoints that expose either:
1. A version of the REST API supported.
2. Features the API supports (e.g. which query parameters are honored for a specific endpoint).

There are pros and cons to both approaches (apologies if I missed this in the spec or if it has already been discussed).

Cheers,
Micah

On Tue, Dec 19, 2023 at 8:25 AM Jack Ye <yezhao...@gmail.com> wrote:

Yes, I agree that it is better not to force the implementation in any particular direction, and a continuation token is probably better than enforcing a numeric start index.

In addition to the start index approach, another potential simple way to implement the continuation token is to use the last item name, when the listing is guaranteed to be in lexicographic order. Compared to the start index approach, it does not need to worry about the start index shifting when something in the list is added or removed.

However, the issue of concurrent modification could still exist even with a continuation token. For example, if an element is added before the continuation token, then all future listing calls with the token would always skip that element. If we want to enforce that level of atomicity, we probably want to introduce another time travel query parameter (e.g. asOf=1703003028000) to ensure that we are listing results at a specific point in time of the warehouse, so the complete result list is fixed. (This is also the missing piece I forgot to mention in the start index approach to ensure it works in distributed settings.)

-Jack
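To illustrate Micah's point that a timestamp can live inside the opaque token rather than in a spec-level asOf parameter, here is a minimal sketch combining the last-item-name approach with an embedded snapshot timestamp. The token format and the `all_items_as_of` helper are assumptions for illustration only:

```python
import base64
import json
import time


def first_page_token(last_item: str) -> str:
    """Issued with the first page: remember where we stopped and when listing began."""
    state = {"last": last_item, "as_of_ms": int(time.time() * 1000)}
    return base64.urlsafe_b64encode(json.dumps(state).encode()).decode()


def next_page(all_items_as_of, token: str, limit: int = 100):
    """all_items_as_of(ts_ms) is a stand-in for reading the catalog as of a timestamp."""
    state = json.loads(base64.urlsafe_b64decode(token.encode()))
    items = sorted(all_items_as_of(state["as_of_ms"]))
    remaining = [it for it in items if it > state["last"]]
    page = remaining[:limit]
    new_state = dict(state, last=page[-1]) if page else state
    next_token = base64.urlsafe_b64encode(json.dumps(new_state).encode()).decode()
    return page, (next_token if len(remaining) > limit else None)
```

Clients never see the timestamp; a server that does not want this guarantee can simply omit it from the token.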
On Tue, Dec 19, 2023, 9:51 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

I tried to cover these in more detail at:
https://docs.google.com/document/d/1bbfoLssY1szCO_Hm3_93ZcN0UAMpf7kjmpwHQngqQJ0/edit

On Sun, Dec 17, 2023 at 6:07 PM Renjie Liu <liurenjie2...@gmail.com> wrote:

+1 for this approach. I agree that the streaming approach requires that HTTP clients and servers have HTTP/2 streaming support, which is not compatible with old clients.

I share the same concern as Micah that start/limit alone may not be enough in a distributed environment where modification happens during iteration. For compatibility, we need to consider several cases:

1. Old client <-> New server
2. New client <-> Old server

On Sat, Dec 16, 2023 at 6:51 AM Daniel Weeks <dwe...@apache.org> wrote:

I agree that we want to include this feature, and I raised similar concerns to what Micah already presented in talking with Ryan.

For backward compatibility, just adding a start and limit implies a deterministic order, which is not a current requirement of the REST spec.

Also, we need to consider whether the start/limit would need to be respected by the server. If existing implementations simply return all the results, will that be sufficient? There are a few edge cases that need to be considered here.

For the opaque key approach, I think adding a query param to trigger/continue pagination and introducing a continuation token in the ListNamespacesResponse might allow for more backward compatibility. In that scenario, pagination would only take place for clients who know how to paginate, and the ordering would not need to be deterministic.

-Dan

On Fri, Dec 15, 2023, 10:33 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

Just to clarify and add a small suggestion:

The behavior with no additional parameters requires the operations to happen as they do today for backwards compatibility (i.e. either all responses are returned or a failure occurs).

For new parameters, I'd suggest an opaque start token (instead of a specific numeric offset) that can be returned by the service, and a limit (as proposed above). If a start token is provided without a limit, a default limit can be chosen by the server. Servers might return less than the limit (i.e. clients are required to check for a next token to determine if iteration is complete). This enables server-side state if it is desired, but also makes deterministic listing much more feasible (deterministic responses are essentially impossible in the face of changing data if only a start offset is provided).

In an ideal world, specifying a limit would result in streaming responses being returned, with the last part containing a token if continuation is necessary. Given the conversation on the other thread about streaming, I'd imagine this is quite hard to model in an OpenAPI REST service.

Therefore it seems like using pagination with a token and limit would be preferred. If skipping to someplace in the middle of the namespaces is required, then I would suggest modelling that as first-class query parameters (e.g. "startAfterNamespace").

Cheers,
Micah

On Fri, Dec 15, 2023 at 10:08 AM Ryan Blue <b...@tabular.io> wrote:

+1 for this approach.

I think it's good to use query params because it can be backward-compatible with the current behavior. If you get more than the limit back, then the service probably doesn't support pagination. And if a client doesn't support pagination, they get the same results that they would today. A streaming approach with a continuation link like in the scan API discussion wouldn't work because old clients don't know to make a second request.
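A sketch of the backward-compatible behavior Dan and Ryan describe: a new client asks for pages, but if the server ignores pagination (an old server), it returns everything and no next token, and the same loop still works. The parameter and field names (pageSize, pageToken, nextPageToken) are assumptions, not the spec's final names:

```python
import requests

BASE = "https://catalog.example.com/v1"  # assumed base URL


def list_namespaces(page_size: int = 100):
    token = None
    while True:
        params = {"pageSize": page_size}
        if token:
            params["pageToken"] = token
        body = requests.get(f"{BASE}/namespaces", params=params).json()
        yield from body.get("namespaces", [])
        token = body.get("nextPageToken")
        # Old servers return the full listing and never set nextPageToken,
        # so the loop terminates after one iteration, matching today's behavior.
        if not token:
            break
```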
On Thu, Dec 14, 2023 at 10:07 AM Jack Ye <yezhao...@gmail.com> wrote:

Hi everyone,

During the conversation about the Scan API for the REST spec, we touched on the topic of pagination when a REST response is large or takes time to be produced.

I just want to discuss this separately, since we also see the issue for ListNamespaces and ListTables/Views when integrating with a large organization that has over 100k namespaces, and also a lot of tables in some namespaces.

Pagination requires either keeping state, or the response being deterministic so that the client can request a range of the full response. If we want to avoid keeping state, I think we need to allow some query parameters like:
- *start*: the start index of the item in the response
- *limit*: the number of items to be returned in the response

So we can send a request like:

*GET /namespaces?start=300&limit=100*
*GET /namespaces/ns/tables?start=300&limit=100*

And the REST spec should enforce that the response returned for the paginated GET is deterministic.

Any thoughts on this?

Best,
Jack Ye
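A minimal sketch of the determinism requirement in this proposal: for GET /namespaces?start=300&limit=100 to be meaningful, the server has to page over a stable, totally ordered view of the listing. The handler shape and backing collection here are assumptions, shown only to make the requirement concrete:

```python
def list_namespaces_page(all_namespaces, start: int, limit: int):
    """Return one deterministic page; sorting gives every request the same order."""
    ordered = sorted(all_namespaces)  # stable total order, e.g. lexicographic
    return ordered[start:start + limit]


# Example: two clients asking for different ranges see consistent, non-overlapping chunks.
namespaces = {"db3", "db1", "db2", "analytics", "prod"}
print(list_namespaces_page(namespaces, 0, 2))  # ['analytics', 'db1']
print(list_namespaces_page(namespaces, 2, 2))  # ['db2', 'db3']
```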