Re: [DISCUSS] Delegation Service design doc direction (pull vs push modes)

Jean-Baptiste Onofré Wed, 20 May 2026 07:21:38 -0700

Hi Robert,

The PR is currently a draft, and my intent when creating it was to
facilitate discussion on the dev@ mailing list.


I am fine with moving it to the proposal area for now. We can move it back
to the documentation once we have reached a consensus.

I chose to start this as a PR rather than a Google Doc for two reasons:
1. To evaluate how efficiently we can collaborate via PR and explore the
related changes needed in the Polaris core (API/SPI, etc.).
2. To simplify the merge process once we have consensus, as the ultimate
goal is to update the documentation.

Regards,
JB

On Wed, May 20, 2026 at 1:23 PM Robert Stupp <[email protected]> wrote:

> Thanks Yufei, that helps.
>
> If the intent is proposal/design-level direction, I think we are mostly
> aligned then.
>
> My main concern is the placement/wording of the doc.
>
> If this is published as release documentation, users will read it as
> supported behavior.
>
> So I think the PR should make this very explicit:
> push mode is conceptual/proposed, and the concrete task lifecycle,
> reliability, security, request-budget, and operational contracts are future
> work.
>
> Maybe the cleanest option is to keep this under the existing
> community/proposals area for now, rather than under release documentation.
> That would match the current status better: useful architectural direction,
> but not yet a supported push-mode contract.
>
> Thanks also for the context from the sprint discussions, that is useful
> background.
>
> For the project decision, I think we should make sure the desired direction
> is explicit on the dev list.
> Same for the open contract questions.
> Then the community can validate or challenge them here and build consensus
> on that.
>
> With that clarification, I think the pull/push terminology is useful.
>
> For the actual execution semantics, I still think the safer foundation is
> the durable task-state approach from the async/reliable tasks proposal.
>
> Polaris owns the persistent record of what work exists, whether it
> finished, and what needs retry.
>
> Remote execution can then still be added later as an optional executor
> backend, without making it the baseline model for everyone.
>
> Robert
>
> On Wed, May 20, 2026 at 2:53 AM Yufei Gu <[email protected]> wrote:
>
> > Thanks Robert, this is helpful feedback.
> >
> > I think there may be a scope mismatch between the intent of the current
> > document and how “push mode” is being interpreted. The current doc is
> > mainly trying to capture architectural directions and terminology
> > discussed during the sprint, especially the distinction between pull
> > mode and push mode. The goal is not yet to standardize a full
> > distributed task execution or reliability contract. To share some more
> > context, we agreed to publish a short doc for architectural directions
> > in two sprints (one in Feb, one in April). This PR (3990) is based on
> > it. I think JB initiated it a few months ago.
> >
> > I agree the topics you raised, durable task state, retry semantics,
> > failure handling, credential scoping, request budgets, operational
> > guarantees, etc., are important discussions, especially once we move
> > toward production semantics for async execution. But I do not think the
> > current document is trying to define those guarantees yet. It is more
> > intended as a design/proposal level document describing possible
> > execution/deployment models and the general direction the community
> > discussed.
> >
> > I also agree that we should avoid overstating the maturity of push mode.
> > We can clarify in the document that push mode is still
> > conceptual/proposed and that the detailed operational and reliability
> > contracts remain future work.
> >
> > Yufei
> >
> >
> > On Tue, May 19, 2026 at 5:48 AM Robert Stupp <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > thanks for creating the doc and for splitting the discussion into pull
> > > and push mode.
> > >
> > > I think that terminology is useful and helps to separate two very
> > > different cases.
> > >
> > > I agree that pull and push are useful options to discuss.
> > > I also think this is the right time to clarify whether push mode should
> > > be release documentation already, and what contract would be behind it.
> > >
> > > I am not objecting to the direction.
> > >
> > > I am objecting to publishing push mode as release documentation before
> > > we have defined its contract.
> > >
> > > Pull mode mostly looks like a normal REST/OAuth client pattern.
> > > I am not sure that needs a separate Delegation Service specification.
> > > I think pull mode is a good fit when the external service owns the
> > > workflow.
> > >
> > > When Polaris exposes the operation as Polaris behavior, for example
> > > DROP TABLE PURGE or server-side scan planning, Polaris owns the
> > > contract.
> > >
> > > For purge, that means durable state and eventual completion.
> > > For scan planning, that means bounded request behavior: timeouts,
> > > cancellation, resource limits, result-size limits, fallback behavior,
> > > and cache ownership.
> > >
> > > After that, pull vs push is mostly about where execution runs.
> > >
> > > Remote push mode is still different operationally:
> > >
> > > Polaris needs to coordinate with another separately deployed service
> > > that can fail independently, but users will still hold Polaris
> > > responsible for the correct result.
> > > That means the contract must define retry, failure handling,
> > > credentials, status, and operator controls.
> > >
> > > It also crosses security and service boundaries.
> > >
> > > The contract needs to define who the worker acts as, which credentials
> > > it gets, and how those credentials are scoped.
> > > It also needs to define how Polaris and the worker safely talk to each
> > > other across Kubernetes service, network, and proxy boundaries.
> > >
> > > Once documented as release behavior, users will expect Polaris to
> > > define what happens when Polaris, the worker, the object store, or the
> > > network fails.
> > >
> > > I do not think that contract exists yet.
> > > So I think this should either stay a design/proposal note for now, or
> > > the release documentation should clearly say that the push-mode
> > > contract is still TBD.
> > >
> > > I think the good news is that the "Asynchronous & Reliable Tasks"
> > > proposal already gives us a simpler foundation:
> > > Polaris should own the durable task state, meaning the persistent
> > > record of what work exists, whether it finished, and what needs retry.
> > > With that, the default deployment can stay simple, and remote execution
> > > can still be added later as an optional executor backend.
> > >
> > > I also think we should separate the advanced deployment option from the
> > > common user path.
> > >
> > > A remote push-mode Delegation Service can be useful for deployments
> > > that already have the operational machinery for separate worker
> > > services.
> > > But for many self-hosted users it also means another service to deploy,
> > > secure, monitor, scale, upgrade, and debug.
> > >
> > > So I would prefer that the common path stays simple first: Polaris owns
> > > the durable task state, and operators can run the worker in the same
> > > deployment or same image.
> > >
> > > Remote execution can then be added as an optional executor backend
> > > without making it the baseline model for everyone.
> > >
> > > The failure cases below are why I think this matters.
> > > They are not a request to solve every detail in this PR.
> > >
> > > For example:
> > >
> > > * What happens if the user-visible drop succeeds, but the purge task is
> > >   not recorded yet?
> > >   This matters when entities and tasks are served by different SPIs or
> > >   backends.
> > >   Atomicity across those writes cannot then be assumed.
> > >
> > > * What happens if a worker deletes some files and then crashes?
> > >   Who owns retry?
> > >   Where is progress recorded?
> > >   Can another node safely resume a crashed node's work?
> > >
> > > * What happens if the worker needs to call Polaris after the table is
> > >   already hidden or dropped from the normal API surface?
> > >   This creates a cyclic dependency unless the task contains the
> > >   information needed to continue without rediscovering the table
> > >   through loadTable.
> > >
> > > * Server-side scan planning is also not a simple service call.
> > >   It either needs a query engine, or the relevant planning parts of
> > >   one.
> > >   At minimum, the contract needs request budgets: timeouts,
> > >   cancellation, backpressure, result-size limits, fallback behavior,
> > >   and cache ownership.
> > >
> > > The existing proposals already contain most of the useful building
> > > blocks.
> > >
> > > For me, the safer order is to define the guarantees first, then
> > > document the deployment modes on top.
> > >
> > > One possible path could roughly look like this:
> > >
> > > 1. Define how destructive operations persist the intent for DROP TABLE
> > >    PURGE.
> > >    The important part is that the user-visible drop and the purge
> > >    intent are recorded atomically.
> > >
> > > 2. Building on the "Asynchronous & Reliable Tasks" work for the durable
> > >    Polaris task control plane gives us deterministic task IDs, task
> > >    state, retry/lost-task recovery, and admin-visible status.
> > >
> > > 3. Using the "Object store functionality" work as the execution library
> > >    for purge/file cleanup gives us streaming file discovery, bulk
> > >    deletes, rate limiting, stats, and lower heap pressure.
> > >
> > > 4. Wire DROP TABLE PURGE to a reliable task behavior using those object
> > >    store operations.
> > >    Once Polaris returns success, the table is hidden from normal
> > >    catalog APIs and the purge intent is durable.
> > >    File deletion can continue asynchronously and survive process
> > >    restarts.
> > >
> > > 5. Then consider deployment variants.
> > >    A same-image task runner gives self-hosted operators isolation and
> > >    separate scaling without a second protocol or persistence model.
> > >    A remote Delegation Service can still be added later as an optional
> > >    executor backend if SaaS deployments need that shape.
> > >
> > > This is not meant to block pull/push terminology.
> > > It is also not meant to rule out remote execution.
> > > I am mostly trying to avoid publishing push mode as supported release
> > > behavior before the task, security, request-budget, and operational
> > > contracts are defined.
> > >
> > > So I would prefer to keep this PR as a design/proposal note for now, or
> > > make the released documentation explicit that push mode is still TBD.
> > >
> > > My worry is that otherwise we ship a simple-looking doc that commits
> > > the project to a surprisingly complex distributed-systems design.
> > >
> > > Robert
> > >
> > > On Wed, May 13, 2026 at 11:50 PM Yufei Gu <[email protected]> wrote:
> > >
> > > > Hi folks,
> > > >
> > > > Sharing a few updates regarding the delegation service design doc. JB
> > > > and I will be co-authoring the document, and the PR has been updated
> > > > accordingly.
> > > >
> > > > Please take a look at the latest changes here:
> > > > https://github.com/apache/polaris/pull/3990
> > > >
> > > > Yufei
> > > >
> > > >
> > > > On Tue, Apr 14, 2026 at 1:56 PM Yufei Gu <[email protected]> wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > We had a productive discussion on the delegation service during the
> > > > > Polaris Sprint on April 7, thanks all for the great input.
> > > > >
> > > > > As a quick summary, the current direction is to condense the design
> > > > > doc[1] and focus on the two options the community seems to prefer
> > > > > moving forward with: pull mode and push mode. The goal is to keep
> > > > > the doc concise and briefly describe these two modes.
> > > > >
> > > > > Please let me know if I missed anything. Looking forward to your
> > > > > feedback.
> > > > >
> > > > > 1. https://github.com/apache/polaris/pull/3990
> > > > >
> > > > > Thanks,
> > > > > Yufei
> > > > >
> > > >
> > >
> >
>
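
For readers following the thread, the "durable task state" idea quoted above
(steps 1, 2, and 4 of Robert's list) can be illustrated with a minimal sketch.
This is not Polaris code and not the design from the proposals. It assumes
entities and tasks live in the same transactional store (an in-memory SQLite
database here), and every name in it (open_store, task_id, drop_table_purge,
run_purge, the table layout) is hypothetical. The file list is passed in
directly for brevity; a real task would have to carry enough information to
continue without calling loadTable.

```python
import hashlib
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory store holding entities and tasks side by side."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tables (name TEXT PRIMARY KEY, dropped INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE tasks (id TEXT PRIMARY KEY, table_name TEXT NOT NULL,"
        " state TEXT NOT NULL, files_done INTEGER NOT NULL)"
    )
    return conn


def task_id(table_name: str) -> str:
    """Deterministic ID: retrying the same drop maps to the same task row."""
    return hashlib.sha256(f"purge:{table_name}".encode()).hexdigest()[:16]


def drop_table_purge(conn: sqlite3.Connection, table_name: str) -> str:
    """Hide the table and record the purge intent in one transaction."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute(
            "UPDATE tables SET dropped = 1 WHERE name = ?", (table_name,)
        )
        conn.execute(
            "INSERT OR IGNORE INTO tasks VALUES (?, ?, 'PENDING', 0)",
            (task_id(table_name), table_name),
        )
    return task_id(table_name)


def run_purge(conn, tid, files, crash_after=None):
    """Delete files one at a time, persisting progress after each delete."""
    (done,) = conn.execute(
        "SELECT files_done FROM tasks WHERE id = ?", (tid,)
    ).fetchone()
    deleted = []
    for index in range(done, len(files)):
        if crash_after is not None and len(deleted) == crash_after:
            raise RuntimeError("simulated worker crash")
        deleted.append(files[index])  # stand-in for an object store delete
        with conn:
            conn.execute(
                "UPDATE tasks SET files_done = ? WHERE id = ?", (index + 1, tid)
            )
    with conn:
        conn.execute("UPDATE tasks SET state = 'DONE' WHERE id = ?", (tid,))
    return deleted
```

Because progress is written to the task row after each delete, any runner
(same process, same image, or remote) can resume from files_done after a
crash. The sketch says nothing about the case Robert raises where entities
and tasks sit behind different SPIs or backends; there, the single
transaction used above is not available and atomicity needs its own design.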
