Hi all, is there an ETA for this? Thanks.

On Wed, Dec 20, 2023 at 6:03 PM Renjie Liu <liurenjie2...@gmail.com> wrote:
>> I think if servers provide a meaningful error message on expiration
>> hopefully, this would be a good first step in debugging. I think saying
>> tokens should generally support O(Minutes) at least should cover most
>> use-cases?
>>
>
> Sounds reasonable to me. Clients just need to be aware that the token is
> for transient usage and should not store it for too long.
>
> On Thu, Dec 21, 2023 at 8:43 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>>> Overall, I don't think it's a good idea to add parallel listing for
>>> things like tables and namespaces as it just adds complexity for an
>>> incredibly narrow (and possibly poorly designed) use case.
>>
>>
>> +1 I think that there are likely a few ways parallelization of table and
>> namespace listing can be incorporated in the future into the API if
>> necessary.
>>
>> I think the one place where parallelization is important immediately is
>> for Planning, but that is already a separate thread. Apologies if I
>> forked the conversation too far from that.
>>
>> On Wed, Dec 20, 2023 at 4:06 PM Daniel Weeks <dwe...@apache.org> wrote:
>>
>>> Overall, I don't think it's a good idea to add parallel listing for
>>> things like tables and namespaces as it just adds complexity for an
>>> incredibly narrow (and possibly poorly designed) use case.
>>>
>>> I feel we should leave it up to the server to define whether it will
>>> provide consistency across paginated listing and avoid
>>> bleeding time-travel like concepts (like 'asOf') into the API. I really
>>> just don't see what practical value it provides as there are no explicit or
>>> consistently held guarantees around these operations.
>>>
>>> I'd agree with Micah's argument that if the server does provide stronger
>>> guarantees, it should manage those via the opaque token and respond with
>>> meaningful errors if it cannot satisfy the internal constraints it imposes
>>> (like timeouts).
>>>
>>> It would help to have articulable use cases to really invest in more
>>> complexity in this area and I feel like we're drifting a little into the
>>> speculative at this point.
>>>
>>> -Dan
>>>
>>> On Wed, Dec 20, 2023 at 3:27 PM Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>>
>>>>> I agree that this is not quite useful for clients at this moment. But
>>>>> I'm thinking that maybe exposing this will help debugging or diagnosing,
>>>>> user just need to be aware of this potential expiration.
>>>>
>>>>
>>>> I think if servers provide a meaningful error message on expiration
>>>> hopefully, this would be a good first step in debugging. I think saying
>>>> tokens should generally support O(Minutes) at least should cover most
>>>> use-cases?
>>>>
>>>> On Tue, Dec 19, 2023 at 9:18 PM Renjie Liu <liurenjie2...@gmail.com>
>>>> wrote:
>>>>
>>>>>> If we choose to manage state on the server side, I recommend not
>>>>>> revealing the expiration time to the client, at least not for now. We can
>>>>>> introduce it when there's a practical need. It wouldn't constitute a
>>>>>> breaking change, would it?
>>>>>
>>>>>
>>>>> I agree that this is not quite useful for clients at this moment. But
>>>>> I'm thinking that maybe exposing this will help debugging or diagnosing,
>>>>> user just need to be aware of this potential expiration.
>>>>>
>>>>> On Wed, Dec 20, 2023 at 11:09 AM Xuanwo <xua...@apache.org> wrote:
>>>>>
>>>>>> > For the continuation token, I think one missing part is about the
>>>>>> > expiration time of this token, since this may affect the state cleaning
>>>>>> > process of the server.
>>>>>>
>>>>>> Some storage services use a continuation token as a binary
>>>>>> representation of internal states. For example, they serialize a
>>>>>> structure into binary and then perform base64 encoding. Services don't
>>>>>> need to maintain state, eliminating the need for state cleaning.
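A minimal sketch of the stateless token described above, combined with the "meaningful error on expiration" idea from earlier in the thread. Everything here is an assumption for illustration: the field names (`last_name`, `issued_at_ms`), the five-minute lifetime, the `TokenExpiredError` class and the use of JSON as the serialization are not part of any proposal in this thread. A real service would probably also sign or encrypt the token, which is omitted.

```python
import base64
import json


class TokenExpiredError(Exception):
    """Hypothetical error a server could map to a descriptive HTTP response."""


def encode_token(state: dict) -> str:
    # Serialize the listing state to bytes, then base64-encode it so the
    # client only ever sees an opaque string.
    raw = json.dumps(state, sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_token(token: str, now_ms: int, max_age_ms: int = 5 * 60 * 1000) -> dict:
    # All state travels inside the token, so the server keeps nothing
    # between requests and has nothing to clean up.
    state = json.loads(base64.urlsafe_b64decode(token.encode("ascii")))
    if now_ms - state["issued_at_ms"] > max_age_ms:
        raise TokenExpiredError("continuation token expired; restart the listing")
    return state


# Assumed example state: where the last page ended and when it was issued.
state = {"last_name": "ns_0299", "issued_at_ms": 1703003028000}
token = encode_token(state)
restored = decode_token(token, now_ms=1703003028000 + 1000)
```

Because the expiry check reads a timestamp embedded in the token, the lifetime stays a server-side decision and never has to appear in the API.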
>>>>>>
>>>>>> > Do servers need to expose the expiration time to clients?
>>>>>>
>>>>>> If we choose to manage state on the server side, I recommend not
>>>>>> revealing the expiration time to the client, at least not for now. We can
>>>>>> introduce it when there's a practical need. It wouldn't constitute a
>>>>>> breaking change, would it?
>>>>>>
>>>>>> On Wed, Dec 20, 2023, at 10:57, Renjie Liu wrote:
>>>>>>
>>>>>> For the continuation token, I think one missing part is about the
>>>>>> expiration time of this token, since this may affect the state
>>>>>> cleaning process of the server. There are several things to discuss:
>>>>>>
>>>>>> 1. Should we leave it to the server to decide it or allow the client
>>>>>> to config in api?
>>>>>>
>>>>>> Personally I think it would be enough for the server to determine it
>>>>>> for now, since I don't see any usage to allow clients to set the
>>>>>> expiration time in api.
>>>>>>
>>>>>> 2. Do servers need to expose the expiration time to clients?
>>>>>>
>>>>>> Personally I think it would be enough to expose this through the
>>>>>> getConfig api to let users know this. For now there is no requirement for
>>>>>> per request expiration time.
>>>>>>
>>>>>> On Wed, Dec 20, 2023 at 2:49 AM Micah Kornfield
>>>>>> <emkornfi...@gmail.com> wrote:
>>>>>>
>>>>>> IMO, parallelization needs to be a first class entity in the end
>>>>>> point/service design to allow for flexibility (I scanned through the
>>>>>> original proposal for the scan planning and it looked like it was on the
>>>>>> right track). Using offsets for parallelization is problematic from
>>>>>> both a consistency and scalability perspective if you want to allow for
>>>>>> flexibility in implementation.
>>>>>>
>>>>>> In particular, I think the server needs APIs like:
>>>>>>
>>>>>> DoScan - returns a list of partitions (represented by an opaque
>>>>>> entity). The list of partitions should support pagination (in an ideal
>>>>>> world, it would be streaming).
>>>>>> GetTasksForPartition - Returns scan tasks for a partition (should
>>>>>> also be paginated/streaming, but this is up for debate). I think it is
>>>>>> an important consideration to allow for empty partitions.
>>>>>>
>>>>>> With this implementation you don't necessarily require separate
>>>>>> server side state (objects in GCS should be sufficient), I think as Ryan
>>>>>> suggested, one implementation could be to have each partition correspond
>>>>>> to a byte-range in a manifest file for returning the tasks.
>>>>>>
>>>>>> Thanks,
>>>>>> Micah
>>>>>>
>>>>>> On Tue, Dec 19, 2023 at 9:55 AM Walaa Eldin Moustafa
>>>>>> <wa.moust...@gmail.com> wrote:
>>>>>>
>>>>>> Not necessarily. That is more of a general statement. The pagination
>>>>>> discussion forked from server side scan planning.
>>>>>>
>>>>>> On Tue, Dec 19, 2023 at 9:52 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>> > With start/limit each client can query for its own chunk without
>>>>>> > coordination.
>>>>>>
>>>>>> Okay, I understand now. Would you need to parallelize the client for
>>>>>> listing namespaces or tables? That seems odd to me.
>>>>>>
>>>>>> On Tue, Dec 19, 2023 at 9:48 AM Walaa Eldin Moustafa
>>>>>> <wa.moust...@gmail.com> wrote:
>>>>>>
>>>>>> > You can parallelize with opaque tokens by sending a starting point
>>>>>> > for the next request.
>>>>>>
>>>>>> I meant we would have to wait for the server to return this starting
>>>>>> point from the past request? With start/limit each client can query for
>>>>>> its own chunk without coordination.
>>>>>>
>>>>>> On Tue, Dec 19, 2023 at 9:44 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>> > I think start and offset has the advantage of being parallelizable
>>>>>> > (as compared to continuation tokens).
>>>>>>
>>>>>> You can parallelize with opaque tokens by sending a starting point
>>>>>> for the next request.
>>>>>>
>>>>>> > On the other hand, using "asOf" can be complex to implement and
>>>>>> > may be too powerful for the pagination use case
>>>>>>
>>>>>> I don't think that we want to add `asOf`. If the service chooses to
>>>>>> do this, it would send a continuation token that has the
>>>>>> information embedded.
>>>>>>
>>>>>> On Tue, Dec 19, 2023 at 9:42 AM Walaa Eldin Moustafa
>>>>>> <wa.moust...@gmail.com> wrote:
>>>>>>
>>>>>> Can we assume it is the responsibility of the server to ensure
>>>>>> determinism (e.g., by caching the results along with query ID)? I think
>>>>>> start and offset has the advantage of being parallelizable (as compared
>>>>>> to continuation tokens). On the other hand, using "asOf" can be complex
>>>>>> to implement and may be too powerful for the pagination use case (because
>>>>>> it allows to query the warehouse as of any point of time, not just now).
>>>>>>
>>>>>> Thanks,
>>>>>> Walaa.
>>>>>>
>>>>>> On Tue, Dec 19, 2023 at 9:40 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>> I think you can solve the atomicity problem with a continuation token
>>>>>> and server-side state. In general, I don't think this is a problem we
>>>>>> should worry about a lot since pagination commonly has this problem. But
>>>>>> since we can build a system that allows you to solve it if you choose to,
>>>>>> we should go with that design.
>>>>>>
>>>>>> On Tue, Dec 19, 2023 at 9:13 AM Micah Kornfield
>>>>>> <emkornfi...@gmail.com> wrote:
>>>>>>
>>>>>> Hi Jack,
>>>>>> Some answers inline.
>>>>>>
>>>>>> > In addition to the start index approach, another potential simple way
>>>>>> > to implement the continuation token is to use the last item name, when
>>>>>> > the listing is guaranteed to be in lexicographic order.
>>>>>>
>>>>>> I think this is one viable implementation, but the reason that the
>>>>>> token should be opaque is that it allows several different
>>>>>> implementations without client side changes.
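To make the "last item name" variant concrete, here is a rough sketch over an in-memory sorted list. The function name `list_page` and its signature are invented for this example and are not from the REST spec. The token is simply the last name returned; a client would still be expected to treat it as opaque.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def list_page(
    sorted_names: List[str], limit: int, token: Optional[str] = None
) -> Tuple[List[str], Optional[str]]:
    """Return one page of names plus a next token (None when the listing is done)."""
    # Resume strictly after the last name the client has already seen.
    start = 0 if token is None else bisect_right(sorted_names, token)
    page = sorted_names[start:start + limit]
    has_more = start + limit < len(sorted_names)
    next_token = page[-1] if (page and has_more) else None
    return page, next_token


names = ["a", "b", "c", "d", "e"]
first_page, token = list_page(names, limit=2)

# A removal before the token does not shift the resume point, unlike a
# numeric start index.
after_removal, _ = list_page(["b", "c", "d", "e"], limit=2, token=token)

# An insertion before the token ("aa") is never returned to this client.
after_insert, _ = list_page(["a", "aa", "b", "c", "d", "e"], limit=2, token=token)
```

If the server later switches to a different scheme, for instance a base64-encoded structure, clients that only echo the token back are unaffected.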
>>>>>>
>>>>>> > For example, if an element is added before the continuation token,
>>>>>> > then all future listing calls with the token would always skip that
>>>>>> > element.
>>>>>>
>>>>>> IMO, I think this is fine, for some of the REST APIs it is likely
>>>>>> important to put constraints on atomicity requirements, for others (e.g.
>>>>>> list namespaces) I think it is OK to have looser requirements.
>>>>>>
>>>>>> > If we want to enforce that level of atomicity, we probably want to
>>>>>> > introduce another time travel query parameter (e.g. asOf=1703003028000)
>>>>>> > to ensure that we are listing results at a specific point of time of the
>>>>>> > warehouse, so the complete result list is fixed.
>>>>>>
>>>>>> Time travel might be useful in some cases but I think it is
>>>>>> orthogonal to services wishing to have guarantees around
>>>>>> atomicity/consistency of results. If a server wants to ensure that
>>>>>> results are atomic/consistent as of the start of the listing, it can
>>>>>> embed the necessary timestamp in the token it returns and parse it out
>>>>>> when fetching the next result.
>>>>>>
>>>>>> I think this does raise a more general point around service
>>>>>> definition evolution in general. I think there likely need to be
>>>>>> metadata endpoints that expose either:
>>>>>> 1. A version of the REST API supported.
>>>>>> 2. Features the API supports (e.g. which query parameters are
>>>>>> honored for a specific endpoint).
>>>>>>
>>>>>> There are pros and cons to both approaches (apologies if I missed
>>>>>> this in the spec or if it has already been discussed).
>>>>>>
>>>>>> Cheers,
>>>>>> Micah
>>>>>>
>>>>>> On Tue, Dec 19, 2023 at 8:25 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>
>>>>>> Yes I agree that it is better to not enforce the implementation to
>>>>>> favor any direction, and continuation token is probably better than
>>>>>> enforcing a numeric start index.
>>>>>>
>>>>>> In addition to the start index approach, another potential simple way
>>>>>> to implement the continuation token is to use the last item name, when
>>>>>> the listing is guaranteed to be in lexicographic order. Compared to the
>>>>>> start index approach, it does not need to worry about the change of start
>>>>>> index when something in the list is added or removed.
>>>>>>
>>>>>> However, the issue of concurrent modification could still exist even
>>>>>> with a continuation token. For example, if an element is added before the
>>>>>> continuation token, then all future listing calls with the token would
>>>>>> always skip that element. If we want to enforce that level of atomicity,
>>>>>> we probably want to introduce another time travel query parameter (e.g.
>>>>>> asOf=1703003028000) to ensure that we are listing results at a specific
>>>>>> point of time of the warehouse, so the complete result list is fixed.
>>>>>> (This is also the missing piece I forgot to mention in the start index
>>>>>> approach to ensure it works in distributed settings)
>>>>>>
>>>>>> -Jack
>>>>>>
>>>>>> On Tue, Dec 19, 2023, 9:51 AM Micah Kornfield <emkornfi...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> I tried to cover these in more details at:
>>>>>> https://docs.google.com/document/d/1bbfoLssY1szCO_Hm3_93ZcN0UAMpf7kjmpwHQngqQJ0/edit
>>>>>>
>>>>>> On Sun, Dec 17, 2023 at 6:07 PM Renjie Liu <liurenjie2...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> +1 for this approach. I agree that the streaming approach requires
>>>>>> that http client and servers have http 2 streaming support, which is not
>>>>>> compatible with old clients.
>>>>>>
>>>>>> I share the same concern with Micah that only start/limit may not be
>>>>>> enough in a distributed environment where modification happens during
>>>>>> iterations. For compatibility, we need to consider several cases:
>>>>>>
>>>>>> 1. Old client <-> New Server
>>>>>> 2. New client <-> Old server
>>>>>>
>>>>>> On Sat, Dec 16, 2023 at 6:51 AM Daniel Weeks <dwe...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>> I agree that we want to include this feature and I raised similar
>>>>>> concerns to what Micah already presented in talking with Ryan.
>>>>>>
>>>>>> For backward compatibility, just adding a start and limit implies a
>>>>>> deterministic order, which is not a current requirement of the REST spec.
>>>>>>
>>>>>> Also, we need to consider whether the start/limit would need to be
>>>>>> respected by the server. If existing implementations simply return all
>>>>>> the results, will that be sufficient? There are a few edge cases that
>>>>>> need to be considered here.
>>>>>>
>>>>>> For the opaque key approach, I think adding a query param to
>>>>>> trigger/continue and introducing a continuation token in
>>>>>> the ListNamespacesResponse might allow for more backward compatibility.
>>>>>> In that scenario, pagination would only take place for clients who know
>>>>>> how to paginate and the ordering would not need to be deterministic.
>>>>>>
>>>>>> -Dan
>>>>>>
>>>>>> On Fri, Dec 15, 2023, 10:33 AM Micah Kornfield <emkornfi...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Just to clarify and add a small suggestion:
>>>>>>
>>>>>> The behavior with no additional parameters requires the operations to
>>>>>> happen as they do today for backwards compatibility (i.e. either all
>>>>>> responses are returned or a failure occurs).
>>>>>>
>>>>>> For new parameters, I'd suggest an opaque start token (instead of
>>>>>> specific numeric offset) that can be returned by the service and a limit
>>>>>> (as proposed above). If a start token is provided without a limit a
>>>>>> default limit can be chosen by the server. Servers might return less
>>>>>> than limit (i.e. clients are required to check for a next token to
>>>>>> determine if iteration is complete).
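The client-side rule in the paragraph above (stop on a missing next token, not on a short page) might look roughly like the loop below. `fetch_page` stands in for whatever call the client makes to the list endpoint, and `fake_fetch` is a toy in-memory server that deliberately returns fewer items than the limit. Both are hypothetical and exist only for this sketch.

```python
from typing import Callable, Iterator, List, Optional, Tuple

Page = Tuple[List[str], Optional[str]]


def list_all(
    fetch_page: Callable[[Optional[str], int], Page], limit: int = 100
) -> Iterator[str]:
    """Yield every item, following next tokens until the server stops sending one."""
    token: Optional[str] = None
    while True:
        items, token = fetch_page(token, limit)
        yield from items
        # A short (or even empty) page is not a stop signal; only the
        # absence of a next token ends the iteration.
        if token is None:
            return


def fake_fetch(token: Optional[str], limit: int) -> Page:
    # Toy server: seven namespaces, pages always one item smaller than the limit.
    data = [f"ns_{i:03d}" for i in range(7)]
    start = 0 if token is None else int(token)
    end = min(start + max(1, limit - 1), len(data))
    next_token = str(end) if end < len(data) else None
    return data[start:end], next_token


all_names = list(list_all(fake_fetch, limit=3))
```

A server that ignores pagination entirely fits the same loop: it returns everything in one response with no token, and the loop exits after the first call.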
>>>>>> This enables server side state if it is desired
>>>>>> but also makes deterministic listing much more feasible (deterministic
>>>>>> responses are essentially impossible in the face of changing data if
>>>>>> only a start offset is provided).
>>>>>>
>>>>>> In an ideal world, specifying a limit would result in streaming
>>>>>> responses being returned with the last part either containing a token if
>>>>>> continuation is necessary. Given conversation on the other thread of
>>>>>> streaming, I'd imagine this is quite hard to model in an Open API REST
>>>>>> service.
>>>>>>
>>>>>> Therefore it seems like using pagination with token and offset would
>>>>>> be preferred. If skipping someplace in the middle of the namespaces is
>>>>>> required then I would suggest modelling those as first class query
>>>>>> parameters (e.g. "startAfterNamespace")
>>>>>>
>>>>>> Cheers,
>>>>>> Micah
>>>>>>
>>>>>> On Fri, Dec 15, 2023 at 10:08 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>> +1 for this approach
>>>>>>
>>>>>> I think it's good to use query params because it can be
>>>>>> backward-compatible with the current behavior. If you get more than the
>>>>>> limit back, then the service probably doesn't support pagination. And if
>>>>>> a client doesn't support pagination they get the same results that they
>>>>>> would today. A streaming approach with a continuation link like in the
>>>>>> scan API discussion wouldn't work because old clients don't know to make
>>>>>> a second request.
>>>>>>
>>>>>> On Thu, Dec 14, 2023 at 10:07 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> During the conversation of the Scan API for REST spec, we touched on
>>>>>> the topic of pagination when REST response is large or takes time to be
>>>>>> produced.
>>>>>>
>>>>>> I just want to discuss this separately, since we also see the issue
>>>>>> for ListNamespaces and ListTables/Views, when integrating with a large
>>>>>> organization that has over 100k namespaces, and also a lot of tables in
>>>>>> some namespaces.
>>>>>>
>>>>>> Pagination requires either keeping state, or the response to be
>>>>>> deterministic such that the client can request a range of the full
>>>>>> response. If we want to avoid keeping state, I think we need to allow
>>>>>> some query parameters like:
>>>>>> - *start*: the start index of the item in the response
>>>>>> - *limit*: the number of items to be returned in the response
>>>>>>
>>>>>> So we can send a request like:
>>>>>>
>>>>>> *GET /namespaces?start=300&limit=100*
>>>>>>
>>>>>> *GET /namespaces/ns/tables?start=300&limit=100*
>>>>>>
>>>>>> And the REST spec should enforce that the response returned for the
>>>>>> paginated GET should be deterministic.
>>>>>>
>>>>>> Any thoughts on this?
>>>>>>
>>>>>> Best,
>>>>>> Jack Ye
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>>> Xuanwo
>>>>>>