Thanks Robert, this is helpful feedback. I think there may be a scope mismatch between the intent of the current document and how “push mode” is being interpreted. The current doc is mainly trying to capture the architectural directions and terminology discussed during the sprint, especially the distinction between pull mode and push mode. The goal is not yet to standardize a full distributed task execution or reliability contract. To share some more context, we agreed to publish a short doc on architectural directions in two sprints (one in Feb, one in April). This PR (3990) is based on that. I think JB initiated it a few months ago.
I agree that the topics you raised (durable task state, retry semantics, failure handling, credential scoping, request budgets, operational guarantees, etc.) are important discussions, especially once we move toward production semantics for async execution. But I do not think the current document is trying to define those guarantees yet. It is intended more as a design/proposal-level document describing possible execution/deployment models and the general direction the community discussed.

I also agree that we should avoid overstating the maturity of push mode. We can clarify in the document that push mode is still conceptual/proposed and that the detailed operational and reliability contracts remain future work.

Yufei

On Tue, May 19, 2026 at 5:48 AM Robert Stupp <[email protected]> wrote:

> Hi all,
>
> thanks for creating the doc and for splitting the discussion into pull and
> push mode.
>
> I think that terminology is useful and helps to separate two very different
> cases.
>
> I agree that pull and push are useful options to discuss.
> I also think this is the right time to clarify whether push mode should be
> release documentation already, and what contract would be behind it.
>
> I am not objecting to the direction.
>
> I am objecting to publishing push mode as release documentation before we
> have defined its contract.
>
> Pull mode mostly looks like a normal REST/OAuth client pattern.
> I am not sure that needs a separate Delegation Service specification.
> I think pull mode is a good fit when the external service owns the
> workflow.
>
> When Polaris exposes the operation as Polaris behavior, for example DROP
> TABLE PURGE or server-side scan planning, Polaris owns the contract.
>
> For purge, that means durable state and eventual completion.
> For scan planning, that means bounded request behavior: timeouts,
> cancellation, resource limits, result-size limits, fallback behavior, and
> cache ownership.
>
> After that, pull vs push is mostly about where execution runs.
>
> Remote push mode is still different operationally:
>
> Polaris needs to coordinate with another separately deployed service that
> can fail independently, but users will still hold Polaris responsible for
> the correct result.
> That means the contract must define retry, failure handling, credentials,
> status, and operator controls.
>
> It also crosses security and service boundaries.
>
> The contract needs to define who the worker acts as, which credentials it
> gets, and how those credentials are scoped.
> It also needs to define how Polaris and the worker safely talk to each
> other across Kubernetes service, network, and proxy boundaries.
>
> Once documented as release behavior, users will expect Polaris to define
> what happens when Polaris, the worker, the object store, or the network
> fails.
>
> I do not think that contract exists yet.
> So I think this should either stay a design/proposal note for now, or the
> release documentation should clearly say that the push-mode contract is
> still TBD.
>
> I think the good news is that the "Asynchronous & Reliable Tasks" proposal
> already gives us a simpler foundation:
> Polaris should own the durable task state, meaning the persistent record of
> what work exists, whether it finished, and what needs retry.
> With that, the default deployment can stay simple, and remote execution can
> still be added later as an optional executor backend.
>
> I also think we should separate the advanced deployment option from the
> common user path.
>
> A remote push-mode Delegation Service can be useful for deployments that
> already have the operational machinery for separate worker services.
> But for many self-hosted users it also means another service to deploy,
> secure, monitor, scale, upgrade, and debug.
>
> So I would prefer that the common path stays simple first: Polaris owns the
> durable task state, and operators can run the worker in the same deployment
> or same image.
>
> Remote execution can then be added as an optional executor backend without
> making it the baseline model for everyone.
>
> The failure cases below are why I think this matters.
> They are not a request to solve every detail in this PR.
>
> For example:
>
> * What happens if the user-visible drop succeeds, but the purge task is not
>   recorded yet?
>   This matters when entities and tasks are served by different SPIs or
>   backends.
>   Atomicity across those writes cannot then be assumed.
>
> * What happens if a worker deletes some files and then crashes?
>   Who owns retry?
>   Where is progress recorded?
>   Can another node safely resume a crashed node's work?
>
> * What happens if the worker needs to call Polaris after the table is
>   already hidden or dropped from the normal API surface?
>   This creates a cyclic dependency unless the task contains the information
>   needed to continue without rediscovering the table through loadTable.
>
> * Server-side scan planning is also not a simple service call.
>   It either needs a query engine, or the relevant planning parts of one.
>   At minimum, the contract needs request budgets: timeouts, cancellation,
>   backpressure, result-size limits, fallback behavior, and cache ownership.
>
> The existing proposals already contain most of the useful building blocks.
>
> For me, the safer order is to define the guarantees first, then document
> the deployment modes on top.
>
> One possible path could roughly look like this:
>
> 1. Define how destructive operations persist the intent for DROP TABLE
>    PURGE.
>    The important part is that the user-visible drop and the purge intent
>    are recorded atomically.
>
> 2. Building on the "Asynchronous & Reliable Tasks" work for the durable
>    Polaris task control plane gives us deterministic task IDs, task state,
>    retry/lost-task recovery, and admin-visible status.
>
> 3. Using the "Object store functionality" work as the execution library
>    for purge/file cleanup gives us streaming file discovery, bulk deletes,
>    rate limiting, stats, and lower heap pressure.
>
> 4. Wire DROP TABLE PURGE to a reliable task behavior using those object
>    store operations.
>    Once Polaris returns success, the table is hidden from normal catalog
>    APIs and the purge intent is durable.
>    File deletion can continue asynchronously and survive process restarts.
>
> 5. Then consider deployment variants.
>    A same-image task runner gives self-hosted operators isolation and
>    separate scaling without a second protocol or persistence model.
>    A remote Delegation Service can still be added later as an optional
>    executor backend if SaaS deployments need that shape.
>
> This is not meant to block pull/push terminology.
> It is also not meant to rule out remote execution.
> I am mostly trying to avoid publishing push mode as supported release
> behavior before the task, security, request-budget, and operational
> contracts are defined.
>
> So I would prefer to keep this PR as a design/proposal note for now, or
> make the released documentation explicit that push mode is still TBD.
>
> My worry is that otherwise we ship a simple-looking doc that commits the
> project to a surprisingly complex distributed-systems design.
>
> Robert
>
> On Wed, May 13, 2026 at 11:50 PM Yufei Gu <[email protected]> wrote:
>
> > Hi folks,
> >
> > Sharing a few updates regarding the delegation service design doc. JB
> > and I will be co-authoring the document, and the PR has been updated
> > accordingly.
> >
> > Please take a look at the latest changes here:
> > https://github.com/apache/polaris/pull/3990
> >
> > Yufei
> >
> >
> > On Tue, Apr 14, 2026 at 1:56 PM Yufei Gu <[email protected]> wrote:
> >
> > > Hi everyone,
> > >
> > > We had a productive discussion on the delegation service during the
> > > Polaris Sprint on April 7, thanks all for the great input.
> > >
> > > As a quick summary, the current direction is to condense the design
> > > doc[1] and focus on the two options the community seems to prefer
> > > moving forward with: pull mode and push mode. The goal is to keep the
> > > doc concise and briefly describe these two modes.
> > >
> > > Please let me know if I missed anything. Looking forward to your
> > > feedback.
> > >
> > > 1. https://github.com/apache/polaris/pull/3990
> > >
> > > Thanks,
> > > Yufei
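
As a rough illustration of the first failure case Robert raises (the user-visible drop succeeds, but the purge task is not recorded), here is a minimal Python sketch of recording the drop and the purge intent in one transaction, with a deterministic task ID. It assumes entities and tasks live in a single transactional store, which Robert points out cannot be assumed when they are served by different SPIs or backends. SQLite is used only to keep the sketch runnable; every table, column, and function name below is hypothetical and is not a Polaris API.

```python
import hashlib
import sqlite3


def open_store() -> sqlite3.Connection:
    # Assumption: one database stands in for a backend that holds both
    # catalog entities and purge tasks, so a single transaction covers both.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entities ("
        " name TEXT PRIMARY KEY,"
        " metadata_location TEXT NOT NULL,"
        " dropped INTEGER NOT NULL DEFAULT 0)"
    )
    conn.execute(
        "CREATE TABLE purge_tasks ("
        " task_id TEXT PRIMARY KEY,"
        " metadata_location TEXT NOT NULL,"
        " state TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def task_id_for(name: str, metadata_location: str) -> str:
    # Deterministic ID: retrying the same drop maps to the same task row.
    key = f"purge:{name}:{metadata_location}".encode()
    return hashlib.sha256(key).hexdigest()[:16]


def drop_table_purge(
    conn: sqlite3.Connection, name: str, fail_before_commit: bool = False
) -> str:
    # "with conn" commits on success and rolls back if an exception escapes,
    # so the hidden flag and the purge task become durable together or not at all.
    with conn:
        row = conn.execute(
            "SELECT metadata_location FROM entities WHERE name = ? AND dropped = 0",
            (name,),
        ).fetchone()
        if row is None:
            raise KeyError(name)
        location = row[0]
        conn.execute("UPDATE entities SET dropped = 1 WHERE name = ?", (name,))
        task_id = task_id_for(name, location)
        # The task carries the metadata location, so a worker does not have to
        # rediscover the (now hidden) table through the catalog API.
        conn.execute(
            "INSERT OR IGNORE INTO purge_tasks VALUES (?, ?, 'PENDING')",
            (task_id, location),
        )
        if fail_before_commit:
            raise RuntimeError("simulated crash before commit")
    return task_id


def visible_tables(conn: sqlite3.Connection) -> list:
    rows = conn.execute("SELECT name FROM entities WHERE dropped = 0 ORDER BY name")
    return [r[0] for r in rows]


def pending_tasks(conn: sqlite3.Connection) -> list:
    rows = conn.execute(
        "SELECT task_id, metadata_location FROM purge_tasks WHERE state = 'PENDING'"
    )
    return [(r[0], r[1]) for r in rows]
```

If the simulated failure fires, the table stays visible and no task exists; after a successful call, the table is hidden and exactly one pending task remains for a worker to pick up. The sketch does not cover worker retry, progress tracking, or credentials, which are the parts of the contract the thread describes as still open.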
