Re: [PROPOSAL] Asynchronous & Reliable Tasks

Robert Stupp Mon, 04 Aug 2025 02:17:48 -0700

Right, the idea is to have a "common abstraction" first.
I'm actively looking into exactly that at the moment. Will come up
with a couple of PRs to enable this.
Some of it is implicitly covered by the work that Christopher's
contributing, although it's rather orthogonal.


On Fri, Aug 1, 2025 at 6:54 PM Eric Maynard <[email protected]> wrote:
>
> I agree with Robert that the current implementation is not good and should
> be ripped out ASAP. However, I see this effort as complementary to Will's
> refactor, not as a dependency. We should first add a layer of abstraction
> between the business logic in Polaris and the task execution -- once that's
> in place, we can replace the existing task implementation behind that
> abstraction. At the same time, adding this abstraction will unlock the
> ability for us to implement remote task execution as well.
>
> --EM
>
> On Fri, Aug 1, 2025 at 6:31 AM Yufei Gu <[email protected]> wrote:
>
> > Thanks for the async task proposal. I think it's the right direction
> > for async light tasks. Meanwhile, we will still need other models:
> > 1. A scalable way to execute synchronous tasks
> > 2. A scalable way to execute heavy async tasks, e.g., table maintenance
> > tasks.
> >
> > The delegation service[1] is a good candidate for that.
> >
> > [1] https://docs.google.com/document/d/1AhR-cZ6WW6M-z8v53txOfcWvkDXvS-0xcMe3zjLMLj8/edit?tab=t.0#heading=h.xjibr7sfbv6a
> >
> > Yufei
> >
> >
> > On Thu, Jul 31, 2025 at 11:37 AM Russell Spitzer <
> > [email protected]>
> > wrote:
> >
> > > I'm fine with the plan, although I think we should probably change step 4
> > > to allow both the current implementation and the new implementation to
> > > exist at the same time with a flag for switching over to the new task
> > > implementation. While the new implementation may be much better, it is a
> > > pretty significant behavior change that I think should be opt-in until
> > > it's been in Polaris for a release or two. After that we could force all
> > > users to switch once it's been out in the wild for a bit.
> > >
> > > On 2025/07/30 01:30:43 William Hyun wrote:
> > > > >
> > > > > > Considering the current issues, I don't think it's worth the effort
> > > > > > to keep the current implementation.
> > > >
> > > >
> > > > It seems risky to me to not support the current implementation at least
> > > > for the period where the new tasks implementation is unstable.
> > > >
> > > > Bests,
> > > > William
> > > >
> > > > On Tue, Jul 29, 2025 at 3:49 AM Robert Stupp <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > (starting w/ a recap for everybody watching this thread)
> > > > > The goal of this is to have a mechanism to guarantee the _eventual_
> > > > > execution of a task. That may happen immediately on the same node or
> > > > > at a later time on another node.
> > > > > This particular "async reliable tasks" is to ensure that tasks run
> > > > > eventually in any Polaris node. The related "Delegation Service"
> > > > > proposal is to let tasks run in a separate, different remote service.
> > > > > But it requires a "local fallback" in case the remote service is not
> > > > > available, which would be provided by this proposal.
> > > > >
> > > > > Currently, all scheduled and running tasks are "lost", if Polaris is
> > > > > stopped, killed or crashed. So I'd prefer to get this proposal in
> > > > > first to address the current issues and have a reliable fallback for
> > > > > the Delegation Service.
> > > > >
> > > > > Considering the current issues, I don't think it's worth the effort
> > > > > to keep the current implementation.
> > > > >
> > > > > Neither this proposal nor the Delegation Service should rely on
> > > > > Polaris entities. Both should rather have targeted definitions for
> > > > > the tasks to execute, which contain exactly (and not more) what the
> > > > > tasks need to be executed.
> > > > >
> > > > > So I think the following steps (approx 1 PR for each) would be:
> > > > > 1. Add the tasks API (the draft PR [1])
> > > > > 2. Add the tasks implementation, w/o any persistence integration but
> > > > > with mock testing
> > > > > 3. Add persistence integration
> > > > > 4. Replace current task implementation with the new one
> > > > >
> > > > > I'll probably have more details soon-ish.
> > > > >
> > > > > Robert
> > > > >
> > > > > [1] https://github.com/apache/polaris/pull/2180
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Jul 28, 2025 at 6:22 AM William Hyun <[email protected]>
> > > wrote:
> > > > > >
> > > > > > Hey Robert!
> > > > > >
> > > > > > Thank you for the draft PR.
> > > > > > I have taken a look and the general approach seems good to me.
> > > > > > However, one of my concerns would be the timeline to deliver this
> > > > > > new task framework refactoring as this could be intrusive due to
> > > > > > the scope of the change.
> > > > > > What do you plan as the ETA for delivering this change?
> > > > > >
> > > > > > It seems we need to support both the pre-existing (v1) and new task
> > > > > > framework (v2) until we are sure that v2 is stabilized so that we
> > > > > > can delete v1.
> > > > > > With the Delegation Service proposal being a new feature for users,
> > > > > > I am proposing to include it within the 1.1 release as a small,
> > > > > > optional extension and also support it in v2 by implementing v2's
> > > > > > SPI module as we previously discussed.
> > > > > > I also have opened a PR demonstrating what the Delegation Service
> > > > > > looks like here:
> > > > > >
> > > > > > - https://github.com/apache/polaris/pull/2193
> > > > > >
> > > > > > WDYT?
> > > > > >
> > > > > > Bests,
> > > > > > William
> > > > > >
> > > > > > On Thu, Jul 24, 2025 at 11:18 AM Robert Stupp <[email protected]>
> > > wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > As discussed on the Polaris Community Sync today, we're aligned
> > > > > > > that the current tasks handling needs some refactoring.
> > > > > > >
> > > > > > > This proposal focuses on the "eventual execution" of a task.
> > > > > > > Implementations would follow.
> > > > > > > The "Delegation Service" [1] proposal focuses on the execution of
> > > > > > > tasks "outside" of Polaris.
> > > > > > >
> > > > > > > I've pushed a draft PR [2] with the Java interfaces and value
> > > > > > > types for the API, the SPI (behavior implementation) and store
> > > > > > > (used by tasks implementations).
> > > > > > >
> > > > > > > The only entry point is the `org.apache.polaris.tasks.api.Tasks`
> > > > > > > interface with a function defining the behavior and providing a
> > > > > > > parameter object (if necessary), returning a `TaskSubmission`.
> > > > > > > Call sites _may_ subscribe to a `CompletionStage`, but the idea is
> > > > > > > that it's rather "fire and forget" and the task behavior does
> > > > > > > "everything that's needed". This allows the task to be executed on
> > > > > > > any node. There's no guarantee in any form that a task will run
> > > > > > > "locally" or on any other specific node. Every Polaris node can
> > > > > > > handle task execution and perform failure/retry handling. Polaris
> > > > > > > nodes may use a "server" implementation or a "client"
> > > > > > > implementation or a "remote" implementation - that's defined upon
> > > > > > > deployment or by configuration (TBD).
> > > > > > >
> > > > > > > I think that we can get to a Polaris internal API/SPI that can be
> > > > > > > leveraged by both proposals.
> > > > > > > This proposal is implementation and persistence backend agnostic.
> > > > > > > There could be a "server" implementation that can run tasks, a
> > > > > > > "client" implementation that can only submit tasks (think: from
> > > > > > > the polaris-admin tool), and an implementation for the delegation
> > > > > > > service to execute tasks remotely.
> > > > > > >
> > > > > > > I do have a working implementation sitting around locally that's
> > > > > > > passing tests exercising concurrency, multi-node and failure
> > > > > > > scenarios. Since there's only a store-implementation for NoSQL, I
> > > > > > > haven't pushed that yet. Adding a store-implementation that
> > > > > > > solely uses `BasePersistence` (JDBC) is not such a big deal.
> > > > > > >
> > > > > > > If we're okay with the approach in general, I can follow up with
> > > > > > > a more concrete implementation including the "purge table" use
> > > > > > > case and maybe another example use case.
> > > > > > >
> > > > > > > Robert
> > > > > > >
> > > > > > > [1] https://lists.apache.org/thread/ph10th4ocjczpf5gz17mqys4fkp5qrzw
> > > > > > > [2] https://github.com/apache/polaris/pull/2180
> > > > > > >
> > > > > > > On Mon, May 19, 2025 at 12:05 PM Robert Stupp <[email protected]>
> > > wrote:
> > > > > > > >
> > > > > > > > Yes, each "task behavior" has an ID. I've chosen the term "task
> > > > > > > > behavior" over "type", because it doesn't only define "what's
> > > > > > > > done" but also "when" it's done (delay) and "how it behaves"
> > > > > > > > (retries on failures).
> > > > > > > >
> > > > > > > > On 14.05.25 04:25, Adnan Hemani wrote:
> > > > > > > > > Hi Robert,
> > > > > > > > >
> > > > > > > > > Firstly, thanks for this document. One quick question: is the
> > > > > > > > > `behavior ID` basically the task type? This part was slightly
> > > > > > > > > unclear to me.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Adnan Hemani
> > > > > > > > >
> > > > > > > > >> On May 9, 2025, at 6:07 AM, Robert Stupp <[email protected]>
> > > wrote:
> > > > > > > > >>
> > > > > > > > >> Hi,
> > > > > > > > >>
> > > > > > > > >> Polaris is a service, which has to eventually perform
> > > > > > > > >> operations asynchronously. Polaris is also meant to be backed
> > > > > > > > >> by multiple server instances (think: high-availability &
> > > > > > > > >> load-balancing setups).
> > > > > > > > >>
> > > > > > > > >> During runtime, things can go sideways in many ways. Server
> > > > > > > > >> instances may crash, be killed or whatever... Task executions
> > > > > > > > >> may fail because some other remote service fails, because
> > > > > > > > >> configuration values (and credentials) are wrong, or because
> > > > > > > > >> of other error situations.
> > > > > > > > >>
> > > > > > > > >> Task execution should be resilient to both kinds of
> > > > > > > > >> scenarios: being able to eventually recover from a
> > > > > > > > >> "dead/lost node" scenario and to retry failed tasks.
> > > > > > > > >>
> > > > > > > > >> Each individual task should also be executed only once.
> > > > > > > > >>
> > > > > > > > >> There are also different kinds of tasks with different
> > > > > > > > >> behaviors: the "function" being executed and the retry
> > > > > > > > >> behavior.
> > > > > > > > >>
> > > > > > > > >> Proposal doc for this:
> > > > > > > > >> https://www.google.com/url?q=https://docs.google.com/document/d/17D28E2ne5dzOHWc9DJ91Yz3lnQOtgmWaA_TBNdXv0sY/edit?tab%3Dt.0&source=gmail-imap&ust=1747400861000000&usg=AOvVaw3x56ChuB1ga0MelG6URxxi
> > > > > > > > >>
> > > > > > > > >> Robert
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> --
> > > > > > > > >> Robert Stupp
> > > > > > > > >> @snazy
> > > > > > > > >>
> > > > > > > > --
> > > > > > > > Robert Stupp
> > > > > > > > @snazy
> > > > > > > >
> > > > >
> > > >
> > >
> >
