Re: [DISCUSS] Polaris Delegation Service for Long-Running Tasks

William Hyun Thu, 17 Jul 2025 14:44:57 -0700

Hello all,

I've put together a proof-of-concept to help us explore some of the
implementation details and move the conversation forward.
This PoC demonstrates a basic task delegation flow from a Polaris
instance to a Delegation Service instance.
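
For readers who want the shape of that flow before opening the PR, here is a
minimal Python sketch. It is not code from the PoC: `PurgeTask`,
`DelegationUnavailable`, `run_purge`, and the two callables are hypothetical
names, and the fallback to local execution is assumed from the reply to Robert
quoted further down in this thread.

```python
from dataclasses import dataclass
from typing import Callable, List


class DelegationUnavailable(Exception):
    """Hypothetical error raised when the Delegation Service cannot be reached."""


@dataclass
class PurgeTask:
    # Hypothetical payload; the real task schema may look different.
    table: str
    data_files: List[str]


def run_purge(
    task: PurgeTask,
    delegate: Callable[[PurgeTask], str],
    run_locally: Callable[[PurgeTask], str],
) -> str:
    """Try the remote Delegation Service first, then fall back to local execution."""
    try:
        return delegate(task)
    except DelegationUnavailable:
        # Assumed fallback: execute inside Polaris, as it does today.
        return run_locally(task)


if __name__ == "__main__":
    def unreachable(task: PurgeTask) -> str:
        raise DelegationUnavailable("connection refused")

    def local(task: PurgeTask) -> str:
        return f"purged {len(task.data_files)} files locally"

    print(run_purge(PurgeTask("db.t", ["a.parquet", "b.parquet"]), unreachable, local))
```

Passing both execution paths in as callables keeps the sketch independent of
any particular transport or client library.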


You can get started with the quickstart guide and review the code here:
- Quickstart Guide:
https://docs.google.com/document/d/1QfXL786fT0S3R6vMkK7dV_0CuYNUD4S4e5tpK5JHl1k/edit?usp=sharing
- Pull Request: https://github.com/williamhyun/polaris/pull/1

I would appreciate it if you could take some time to run through the
quickstart and share your thoughts, questions, or any concerns you
have in this thread or directly on the pull request.

Bests,
William


On Tue, Jul 15, 2025 at 2:24 AM Robert Stupp <sn...@snazy.de> wrote:
>
> Feel free to add it to the Polaris community sync agenda for Thu next week
> (https://docs.google.com/document/d/1TAAMjCtk4KuWSwfxpCBhhK9vM1k_3n7YE4L28slclXU)
>
> On Tue, Jul 15, 2025 at 10:03 AM William Hyun <will...@apache.org> wrote:
> >
> > Hey Robert,
> >
> > Thank you for your review and comments!
> > To address some of your concerns,
> > 1. Polaris would fall back to local execution (current behavior) in this 
> > case.
> > 2. The delegation service would update the task status as a terminal
> > failure in its persistence, allowing users to retry once a reliable
> > Polaris instance is able to communicate with the delegation service.
> > 3. Additional systems for handling retries can be explored in
> > further discussions, but they are currently not part of the MVP.
> >
> > These mostly seem to be implementation details, and I would be happy
> > to have a discussion with you on this!
> >
> > Bests,
> > William
> >
> > On Wed, Jul 9, 2025 at 7:36 AM Robert Stupp <sn...@snazy.de> wrote:
> > >
> > > Hi all,
> > >
> > > Overall Polaris deserves a thorough asynchronous task handling 
> > > infrastructure.
> > >
> > > The general difference to my proposal [1] is that this one is a
> > > dedicated service. It seems that there will be different
> > > implementations of task types depending on whether those are run
> > > inside Polaris or inside the new service, at least the (integration)
> > > test and maintenance efforts are higher. Having "dedicated task
> > > runners" (instances that do not serve IRC requests but only run tasks
> > > asynchronously) is possible with [1].
> > >
> > > The "dedicated service" proposal needs some clarification on a few 
> > > concerns.
> > > 1. Resiliency of Polaris in case the remote delegation service is not
> > > or not reliably available?
> > > 2. Resiliency of the delegation service in case Polaris is not or not
> > > reliably available?
> > > 3. I suspect that both sides require additional retry handling logic
> > > in case the respective remote side is not available. Are additional
> > > queuing/messaging systems needed?
> > >
> > > [1] does not require additional credential vending endpoints and does
> > > not require additional infrastructure (k8s, persistent state) nor an
> > > additional or separate code base.
> > >
> > > In summary, [1] would share the exact same code base in every setup,
> > > whether a user wants all server instance(s) to serve IRC and tasks or
> > > whether a user really wants dedicated instances only for tasks. This
> > > means no additional testing overhead, no new publicly accessible
> > > security-related endpoints, no new services to care about and
> > > maintain, no cross-service communication, and no additional
> > > configuration overhead for users.
> > >
> > > PS: I have to mention that I'm a bit disappointed by this counter
> > > proposal to [1], where the latter did not receive a lot of attention
> > > since May 19.
> > >
> > > [1] https://lists.apache.org/thread/gg0kn89vmblmjgllxn7jkn8ky2k28f5l
> > >
> > >
> > > On Thu, Jun 26, 2025 at 12:53 AM William Hyun <will...@apache.org> wrote:
> > > >
> > > > Hi Anurag,
> > > >
> > > > Thank you for your interest and taking the time to review the design 
> > > > doc!
> > > >
> > > > To answer some of your questions:
> > > > 1. The source of truth for all delegated tasks is within the
> > > > Delegation Service's own persistence layer.
> > > > 2. The current document abstracts away the implementation details of
> > > > the Delegation Service. The intent is to first agree on the high-level
> > > > architecture and the API contract between the services. For the
> > > > synchronous MVP, there is no traditional in-memory or message broker
> > > > queue. Instead, the persistence layer itself acts as a durable log; a
> > > > task is persisted upon submission and then processed by the API
> > > > thread. An example task execution loop has been added onto the
> > > > appendix outlining this approach.
> > > > 3. The plan is to provide the Delegation Service as a new, separate
> > > > Docker image to be deployed alongside the existing Polaris container.
> > > > We envision a one-to-one Polaris to Delegation Service security binding
> > > > enforced through the security measures outlined in the document. I
> > > > have included a new entry in the appendix discussing the high-level
> > > > approach.
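
A minimal sketch of the persist-then-process approach described in point 2
above, assuming SQLite as a stand-in persistence layer. The table layout, the
status names (`SUBMITTED`, `COMPLETED`, `FAILED`), and the function names are
hypothetical and are not taken from the design document or its appendix.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the (hypothetical) task store and create the table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        "id INTEGER PRIMARY KEY, payload TEXT NOT NULL, status TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def submit_and_run(conn, payload, handler):
    """Persist the task first, then process it on the calling (API) thread."""
    cur = conn.execute(
        "INSERT INTO tasks (payload, status) VALUES (?, 'SUBMITTED')", (payload,)
    )
    task_id = cur.lastrowid
    conn.commit()  # the row is durable before any work starts

    try:
        handler(payload)
        status = "COMPLETED"
    except Exception:
        # Recorded as a terminal failure so a caller can decide to retry later.
        status = "FAILED"

    conn.execute("UPDATE tasks SET status = ? WHERE id = ?", (status, task_id))
    conn.commit()
    return task_id, status


def task_status(conn, task_id):
    """Read the status back from persistence, the source of truth."""
    row = conn.execute(
        "SELECT status FROM tasks WHERE id = ?", (task_id,)
    ).fetchone()
    return row[0] if row else None
```

Because the status is always read back from the store, any node with access to
the same persistence could answer a status query, not only the node that
accepted the submission.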
> > > >
> > > > Thanks again for the valuable questions. Please let me know if these
> > > > clarifications address your concerns or if you have any further
> > > > thoughts.
> > > >
> > > > Bests,
> > > > William
> > > >
> > > > On Tue, Jun 24, 2025 at 5:35 PM Anurag Mantripragada
> > > > <amantriprag...@apple.com.invalid> wrote:
> > > > >
> > > > > Thank you for your proposal, William.
> > > > >
> > > > > This type of companion service is necessary, as evidenced by the 
> > > > > other proposal on asynchronous tasks. Overall, this is a promising 
> > > > > start. I understand that the scope for this proposal is limited, so 
> > > > > please feel free to indicate that it is not in scope. However, I have 
> > > > > a few questions:
> > > > >
> > > > > 1. Could you clarify in the documentation the source of truth for 
> > > > > task status? From your diagram, it appears that it is in the 
> > > > > delegation service.
> > > > > 2. The implementation details of the service are abstracted away. Are 
> > > > > these not in scope for this design? (For instance, do we have a task 
> > > > > queue in the delegation service?)
> > > > > 3. Could you provide additional details on how this service will be 
> > > > > deployed?
> > > > >
> > > > > It becomes very complicated when we transition from a synchronous 
> > > > > model to an asynchronous model. (Handling failures, task executor 
> > > > > unavailability, status updates, etc.) We can have a separate 
> > > > > discussion for those.
> > > > >
> > > > > Thank you,
> > > > > Anurag Mantripragada
> > > > >
> > > > >
> > > > > > On Jun 24, 2025, at 11:56 AM, William Hyun <will...@apache.org> 
> > > > > > wrote:
> > > > > >
> > > > > > Hey Dmitri,
> > > > > >
> > > > > > Thank you for your comments!
> > > > > >
> > > > > > I would like to first clarify that while the initial use case is
> > > > > > internal, we are not closing the door completely on having the
> > > > > > Delegation Service be accessible through user-driven clients.
> > > > > > We would love this service to eventually be deployed and run
> > > > > > independently from the Polaris Catalog to handle scheduled,
> > > > > > asynchronous tasks as Eric mentioned above with compaction.
> > > > > > We believe the REST API is the foundational building block for that
> > > > > > evolution and the initial proposal aims to simply introduce the
> > > > > > framework to the Polaris ecosystem with the purge table task as the
> > > > > > main focal point.
> > > > > >
> > > > > > Secondly, in addressing the concern about task failures, I have
> > > > > > added a section in the appendix discussing the expected behavior
> > > > > > of failed tasks.
> > > > > > Please feel free to take a look and let me know what you think!
> > > > > > - https://docs.google.com/document/d/1AhR-cZ6WW6M-z8v53txOfcWvkDXvS-0xcMe3zjLMLj8/edit?tab=t.0#heading=h.fr5gi42vvat3
> > > > > >
> > > > > > Bests,
> > > > > > William
> > > > > >
> > > > > >
> > > > > > On Mon, Jun 23, 2025 at 4:42 PM Dmitri Bourlatchkov 
> > > > > > <di...@apache.org> wrote:
> > > > > >>
> > > > > >> Apologies for missing the reference to Robert's doc. I hope it 
> > > > > >> does not
> > > > > >> invalidate my comments :)
> > > > > >>
> > > > > >> This is certainly up for discussion.
> > > > > >>
> > > > > >> To clarify my concern about the REST API: If we are to have 
> > > > > >> resilient tasks
> > > > > >> and the node that serves the initial REST request fails, other 
> > > > > >> nodes will
> > > > > >> have to be able to provide responses about the task instead of the 
> > > > > >> failed
> > > > > >> node. Ultimately the data will come from persistence (I assume). 
> > > > > >> Also, I
> > > > > >> suppose the Tasks Service is meant for internal interactions (not 
> > > > > >> for
> > > > > >> user-driven clients). Therefore, it seems to me that the REST API 
> > > > > >> is
> > > > > >> somewhat superficial in this case.
> > > > > >>
> > > > > >> Like I mentioned before, this is just what I thought after a quick 
> > > > > >> review.
> > > > > >> I'll certainly have a deeper look later.
> > > > > >>
> > > > > >> Cheers,
> > > > > >> Dmitri.
> > > > > >>
> > > > > >> On Mon, Jun 23, 2025 at 6:02 PM Eric Maynard 
> > > > > >> <eric.w.mayn...@gmail.com>
> > > > > >> wrote:
> > > > > >>
> > > > > >>> Hey Dmitri,
> > > > > >>>
> > > > > >>> There's a section in the email above and the linked doc that 
> > > > > >>> talks about
> > > > > >>> the linked proposal. See "Relationship to the "Asynchronous & 
> > > > > >>> Reliable
> > > > > >>> Tasks" Proposal".
> > > > > >>>
> > > > > >>> As for pulling away from a REST API in favor of driving things 
> > > > > >>> directly
> > > > > >>> from persistence, there's a lot to discuss here. Bear in mind 
> > > > > >>> that the
> > > > > >>> design goes into detail about one proposed "TaskExecutor" 
> > > > > >>> implementation;
> > > > > >>> maybe another TaskExecutor could work exactly like you describe. 
> > > > > >>> But the
> > > > > >>> reason that this implementation proposes to be driven by a REST 
> > > > > >>> API is that
> > > > > >>> there's a lot of interesting future work -- see the "Future Work" 
> > > > > >>> section
> > > > > >>> of the doc for some examples -- that can be added on to the REST 
> > > > > >>> API. In
> > > > > >>> particular, table maintenance actions like compaction.
> > > > > >>>
> > > > > >>> --EM
> > > > > >>>
> > > > > >>> On Mon, Jun 23, 2025 at 2:31 PM Dmitri Bourlatchkov 
> > > > > >>> <di...@apache.org>
> > > > > >>> wrote:
> > > > > >>>
> > > > > >>>> Hi All,
> > > > > >>>>
> > > > > > >>>> A previous proposal by Robert [1] from May 9 appears to be
> > > > > > >>>> related. I think we should consider both at the same time,
> > > > > > >>>> possibly as alternatives, but perhaps also sharing / reusing
> > > > > > >>>> their respective ideas.
> > > > > >>>>
> > > > > >>>> A few notes after a quick review:
> > > > > >>>>
> > > > > > >>>> * Separate scaling for task executors seems reasonable at first
> > > > > > >>>> glance, but it adds deployment complexity. If we go with this
> > > > > > >>>> approach, I believe it would be worth making this deployment
> > > > > > >>>> strategy optional. In other words let admin users decide
> > > > > > >>>> whether they want to have extra nodes dedicated to specific
> > > > > > >>>> tasks or whether they are ok with having uniform nodes.
> > > > > >>>>
> > > > > > >>>> * I'm not sure a separate rich REST API for submitting tasks is
> > > > > > >>>> really necessary. Proper synchronization among multiple nodes
> > > > > > >>>> will probably require roundtrips to Persistence anyway, so task
> > > > > > >>>> submission could probably be done via Persistence.
> > > > > >>>>
> > > > > >>>> [1] 
> > > > > >>>> https://lists.apache.org/thread/gg0kn89vmblmjgllxn7jkn8ky2k28f5l
> > > > > >>>>
> > > > > >>>> Thanks,
> > > > > >>>> Dmitri.
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> On Mon, Jun 23, 2025 at 3:12 PM William Hyun 
> > > > > >>>> <will...@apache.org> wrote:
> > > > > >>>>
> > > > > >>>>> Hello Polaris Community,
> > > > > >>>>>
> > > > > >>>>> I would like to share my proposal for a new service, the Polaris
> > > > > >>>>> Delegation Service, and to share the design document for 
> > > > > >>>>> discussion
> > > > > >>>>> and feedback. The Delegation Service is intended to optionally 
> > > > > >>>>> be
> > > > > >>>>> deployed alongside Polaris to handle the execution of certain
> > > > > >>>>> long-running tasks.
> > > > > >>>>>
> > > > > >>>>> 1. Motivation
> > > > > >>>>> The Polaris Catalog is optimized for low-latency metadata 
> > > > > >>>>> operations.
> > > > > >>>>> However, certain tasks such as purging data files for dropped 
> > > > > >>>>> tables
> > > > > >>>>> are resource-intensive and can impact its core performance. The
> > > > > >>>>> motivation for this new service is to decouple these I/O-heavy
> > > > > >>>>> background tasks from the main catalog, ensuring it remains 
> > > > > >>>>> highly
> > > > > >>>>> responsive while allowing the task execution workload to be 
> > > > > >>>>> managed
> > > > > >>>>> and scaled independently.
> > > > > >>>>>
> > > > > >>>>> 2. Proposal
> > > > > >>>>> We propose an optional, independent Delegation Service 
> > > > > >>>>> responsible for
> > > > > >>>>> executing these offloaded operations.
> > > > > >>>>> The MVP will focus on synchronously handling the data file 
> > > > > >>>>> deletion
> > > > > >>>>> process for DROP TABLE WITH PURGE commands.
> > > > > >>>>>
> > > > > >>>>> 3. Relationship to the "Asynchronous & Reliable Tasks" Proposal
> > > > > >>>>> This proposal is designed to be highly synergistic with the 
> > > > > >>>>> existing
> > > > > >>>>> "Asynchronous & Reliable Tasks" proposal.
> > > > > >>>>>
> > > > > >>>>> The Asynchronous Task proposal describes a general internal 
> > > > > >>>>> framework
> > > > > >>>>> for reliably scheduling and managing the lifecycle of any task 
> > > > > >>>>> within
> > > > > >>>>> Polaris. On the other hand, this proposal defines a specific, 
> > > > > >>>>> external
> > > > > >>>>> worker service optimized for executing a particular class of 
> > > > > >>>>> I/O-heavy
> > > > > >>>>> tasks.
> > > > > >>>>>
> > > > > >>>>> The Delegation Service does not alter the core Polaris task 
> > > > > >>>>> schema.
> > > > > >>>>> This allows it to seamlessly act as a specialized "backend" 
> > > > > >>>>> worker
> > > > > >>>>> that can execute tasks scheduled and managed by the more 
> > > > > >>>>> advanced
> > > > > >>>>> Asynchronous Task Framework, which would serve as the reliable
> > > > > >>>>> "frontend." This relationship is explored further in section 
> > > > > >>>>> 10.2 of
> > > > > >>>>> the document.
> > > > > >>>>>
> > > > > >>>>> Please find the detailed design document here for review:
> > > > > >>>>> -
> > > > > >>>>>
> > > > > >>>>
> > > > > >>> https://docs.google.com/document/d/1AhR-cZ6WW6M-z8v53txOfcWvkDXvS-0xcMe3zjLMLj8/edit?usp=sharing
> > > > > >>>>>
> > > > > >>>>> Best Regards,
> > > > > >>>>> William
> > > > > >>>>>
> > > > > >>>>
> > > > > >>>
> > > > >
