Hi Robert,

The PR is currently a draft, and my intent when creating it was to facilitate discussion on the dev@ mailing list.
I am fine with moving it to the proposal area for now. We can move it back to the documentation once we have reached a consensus.

I chose to start this as a PR rather than a Google Doc for two reasons:

1. To evaluate how efficiently we can collaborate via PR and explore the related changes needed in the Polaris core (API/SPI, etc.).
2. To simplify the merge process once we have consensus, as the ultimate goal is to update the documentation.

Regards,
JB

On Wed, May 20, 2026 at 1:23 PM Robert Stupp <[email protected]> wrote:
> Thanks Yufei, that helps.
>
> If the intent is proposal/design-level direction, I think we are mostly aligned then.
>
> My main concern is the placement/wording of the doc.
>
> If this is published as release documentation, users will read it as supported behavior.
>
> So I think the PR should make this very explicit:
> push mode is conceptual/proposed, and the concrete task lifecycle, reliability, security, request-budget, and operational contracts are future work.
>
> Maybe the cleanest option is to keep this under the existing community/proposals area for now, rather than under release documentation.
> That would match the current status better: useful architectural direction, but not yet a supported push-mode contract.
>
> Thanks also for the context from the sprint discussions, that is useful background.
>
> For the project decision, I think we should make sure the desired direction is explicit on the dev list.
> Same for the open contract questions.
> Then the community can validate or challenge them here and build consensus on that.
>
> With that clarification, I think the pull/push terminology is useful.
>
> For the actual execution semantics, I still think the safer foundation is the durable task-state approach from the async/reliable tasks proposal.
>
> Polaris owns the persistent record of what work exists, whether it finished, and what needs retry.
>
> Remote execution can then still be added later as an optional executor backend, without making it the baseline model for everyone.
>
> Robert
>
> On Wed, May 20, 2026 at 2:53 AM Yufei Gu <[email protected]> wrote:
>
> > Thanks Robert, this is helpful feedback.
> >
> > I think there may be a scope mismatch between the intent of the current document and how “push mode” is being interpreted. The current doc is mainly trying to capture architectural directions and terminology discussed during the sprint, especially the distinction between pull mode and push mode. The goal is not yet to standardize a full distributed task execution or reliability contract. To share some more context, we agreed to publish a short doc for architectural directions in two sprints (one in Feb, one in April). This PR (3990) is based on it. I think JB initialized it a few months ago.
> >
> > I agree that the topics you raised (durable task state, retry semantics, failure handling, credential scoping, request budgets, operational guarantees, etc.) are important discussions, especially once we move toward production semantics for async execution. But I do not think the current document is trying to define those guarantees yet. It is more intended as a design/proposal-level document describing possible execution/deployment models and the general direction the community discussed.
> >
> > I also agree that we should avoid overstating the maturity of push mode. We can clarify in the document that push mode is still conceptual/proposed and that the detailed operational and reliability contracts remain future work.
> >
> > Yufei
> >
> > On Tue, May 19, 2026 at 5:48 AM Robert Stupp <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > thanks for creating the doc and for splitting the discussion into pull and push mode.
> > >
> > > I think that terminology is useful and helps to separate two very different cases.
> > >
> > > I agree that pull and push are useful options to discuss.
> > > I also think this is the right time to clarify whether push mode should be release documentation already, and what contract would be behind it.
> > >
> > > I am not objecting to the direction.
> > >
> > > I am objecting to publishing push mode as release documentation before we have defined its contract.
> > >
> > > Pull mode mostly looks like a normal REST/OAuth client pattern.
> > > I am not sure that needs a separate Delegation Service specification.
> > > I think pull mode is a good fit when the external service owns the workflow.
> > >
> > > When Polaris exposes the operation as Polaris behavior, for example DROP TABLE PURGE or server-side scan planning, Polaris owns the contract.
> > >
> > > For purge, that means durable state and eventual completion.
> > > For scan planning, that means bounded request behavior: timeouts, cancellation, resource limits, result-size limits, fallback behavior, and cache ownership.
> > >
> > > After that, pull vs push is mostly about where execution runs.
> > >
> > > Remote push mode is still different operationally:
> > >
> > > Polaris needs to coordinate with another separately deployed service that can fail independently, but users will still hold Polaris responsible for the correct result.
> > > That means the contract must define retry, failure handling, credentials, status, and operator controls.
> > >
> > > It also crosses security and service boundaries.
> > >
> > > The contract needs to define who the worker acts as, which credentials it gets, and how those credentials are scoped.
> > > It also needs to define how Polaris and the worker safely talk to each other across Kubernetes service, network, and proxy boundaries.
> > >
> > > Once documented as release behavior, users will expect Polaris to define what happens when Polaris, the worker, the object store, or the network fails.
> > >
> > > I do not think that contract exists yet.
> > > So I think this should either stay a design/proposal note for now, or the release documentation should clearly say that the push-mode contract is still TBD.
> > >
> > > I think the good news is that the "Asynchronous & Reliable Tasks" proposal already gives us a simpler foundation:
> > > Polaris should own the durable task state, meaning the persistent record of what work exists, whether it finished, and what needs retry.
> > > With that, the default deployment can stay simple, and remote execution can still be added later as an optional executor backend.
> > >
> > > I also think we should separate the advanced deployment option from the common user path.
> > >
> > > A remote push-mode Delegation Service can be useful for deployments that already have the operational machinery for separate worker services.
> > > But for many self-hosted users it also means another service to deploy, secure, monitor, scale, upgrade, and debug.
> > >
> > > So I would prefer that the common path stays simple first: Polaris owns the durable task state, and operators can run the worker in the same deployment or same image.
> > >
> > > Remote execution can then be added as an optional executor backend without making it the baseline model for everyone.
> > >
> > > The failure cases below are why I think this matters.
> > > They are not a request to solve every detail in this PR.
> > >
> > > For example:
> > >
> > > * What happens if the user-visible drop succeeds, but the purge task is not recorded yet?
> > >   This matters when entities and tasks are served by different SPIs or backends.
> > >   Atomicity across those writes cannot then be assumed.
> > >
> > > * What happens if a worker deletes some files and then crashes?
> > >   Who owns retry?
> > >   Where is progress recorded?
> > >   Can another node safely resume a crashed node's work?
> > >
> > > * What happens if the worker needs to call Polaris after the table is already hidden or dropped from the normal API surface?
> > >   This creates a cyclic dependency unless the task contains the information needed to continue without rediscovering the table through loadTable.
> > >
> > > * Server-side scan planning is also not a simple service call.
> > >   It either needs a query engine, or the relevant planning parts of one.
> > >   At minimum, the contract needs request budgets: timeouts, cancellation, backpressure, result-size limits, fallback behavior, and cache ownership.
> > >
> > > The existing proposals already contain most of the useful building blocks.
> > >
> > > For me, the safer order is to define the guarantees first, then document the deployment modes on top.
> > >
> > > One possible path could roughly look like this:
> > >
> > > 1. Define how destructive operations persist the intent for DROP TABLE PURGE.
> > >    The important part is that the user-visible drop and the purge intent are recorded atomically.
> > >
> > > 2. Building on the "Asynchronous & Reliable Tasks" work for the durable Polaris task control plane gives us deterministic task IDs, task state, retry/lost-task recovery, and admin-visible status.
> > >
> > > 3. Using the "Object store functionality" work as the execution library for purge/file cleanup gives us streaming file discovery, bulk deletes, rate limiting, stats, and lower heap pressure.
> > >
> > > 4. Wire DROP TABLE PURGE to a reliable task behavior using those object store operations.
> > >    Once Polaris returns success, the table is hidden from normal catalog APIs and the purge intent is durable.
> > >    File deletion can continue asynchronously and survive process restarts.
> > >
> > > 5. Then consider deployment variants.
> > >    A same-image task runner gives self-hosted operators isolation and separate scaling without a second protocol or persistence model.
> > >    A remote Delegation Service can still be added later as an optional executor backend if SaaS deployments need that shape.
> > >
> > > This is not meant to block pull/push terminology.
> > > It is also not meant to rule out remote execution.
> > > I am mostly trying to avoid publishing push mode as supported release behavior before the task, security, request-budget, and operational contracts are defined.
> > >
> > > So I would prefer to keep this PR as a design/proposal note for now, or make the released documentation explicit that push mode is still TBD.
> > >
> > > My worry is that otherwise we ship a simple-looking doc that commits the project to a surprisingly complex distributed-systems design.
> > >
> > > Robert
> > >
> > > On Wed, May 13, 2026 at 11:50 PM Yufei Gu <[email protected]> wrote:
> > >
> > > > Hi folks,
> > > >
> > > > Sharing a few updates regarding the delegation service design doc. JB and I will be co-authoring the document, and the PR has been updated accordingly.
> > > >
> > > > Please take a look at the latest changes here:
> > > > https://github.com/apache/polaris/pull/3990
> > > >
> > > > Yufei
> > > >
> > > > On Tue, Apr 14, 2026 at 1:56 PM Yufei Gu <[email protected]> wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > We had a productive discussion on the delegation service during the Polaris Sprint on April 7, thanks all for the great input.
> > > > >
> > > > > As a quick summary, the current direction is to condense the design doc[1] and focus on the two options the community seems to prefer moving forward with: pull mode and push mode.
> > > > > The goal is to keep the doc concise and briefly describe these two modes.
> > > > >
> > > > > Please let me know if I missed anything. Looking forward to your feedback.
> > > > >
> > > > > 1. https://github.com/apache/polaris/pull/3990
> > > > >
> > > > > Thanks,
> > > > > Yufei
