Re: [DISCUSS] FLIP-588: Support per-job delegation tokens

Alan Sheinberg via dev Tue, 23 Jun 2026 15:04:32 -0700

Hi Gabor,

Thanks for the thoughtful comments. I just wanted to chime in on some of
the thinking Aleksandr and I have had.

> Up until now DelegationTokenProvider instances were singletons and loaded
> by the
> service loader. Now we plan to add stop function, does that mean we plan
> to change
> the lifecycle?

No, the lifecycle is unchanged.  The stop method is intended as a hook for
cleaning up where necessary, for example shutting down thread pools or
releasing other resources cleanly.
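
As a rough illustration, a provider that owns an executor could release it
in stop().  This is only a sketch: PooledTokenProvider and its executor are
made-up names, the five second timeout is arbitrary, and the rest of the
SPI is left out.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical provider; only the proposed stop() hook is shown.
class PooledTokenProvider {
    // Stand-in for any resource the provider owns.
    private final ExecutorService pool = Executors.newSingleThreadExecutor();

    public void stop() {
        pool.shutdown(); // stop accepting work, let running tasks finish
        try {
            if (!pool.awaitTermination(5, TimeUnit.SECONDS)) {
                pool.shutdownNow(); // give up waiting and interrupt tasks
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt(); // preserve interrupt status
        }
    }

    boolean isStopped() {
        return pool.isShutdown();
    }
}
```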

> Having a generic way to ask the delegation token manager to re-obtain is a
> long standing
> needed feature but didn't have time. Having a dedicated API for this would
> be maybe
> better instead of relying on registerJob return value.

I agree that a general API for requesting a re-obtain makes sense.
Generally the DelegationTokenProvider would request a re-obtain in response
to some event.  Currently, obtainDelegationTokens() is the main hook: it
fetches tokens and determines when it will be called again.  Other possible
triggers are a background thread or the new registerJob/unregisterJob
methods being proposed.

A quick sketch of a possible generic interface:

public interface DelegationTokenManagerCallback {
    void reobtainDelegationTokens();
}

We could then overload the init method of DelegationTokenProvider with
init(Configuration config, DelegationTokenManagerCallback callback) so that
the DelegationTokenProvider could keep a reference to the callback and
initiate a re-obtain at will (causing a new refresh on the
DefaultDelegationTokenManager's ioExecutor).  The callback logic would need
to dedupe calls in a thread-safe way so that only one re-obtain is
scheduled at a time.
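
To make the deduping concrete, here is one possible shape.  Everything
apart from the interface is an assumption for illustration:
CoalescingReobtainCallback, its constructor arguments and the AtomicBoolean
approach are not part of the FLIP or of Flink.  The interface is repeated
so the snippet stands alone.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;

interface DelegationTokenManagerCallback {
    void reobtainDelegationTokens();
}

// Hypothetical callback that coalesces concurrent re-obtain requests.
class CoalescingReobtainCallback implements DelegationTokenManagerCallback {
    private final Executor ioExecutor; // stand-in for the manager's ioExecutor
    private final Runnable reobtain;   // stand-in for the manager's re-obtain logic
    private final AtomicBoolean pending = new AtomicBoolean(false);

    CoalescingReobtainCallback(Executor ioExecutor, Runnable reobtain) {
        this.ioExecutor = ioExecutor;
        this.reobtain = reobtain;
    }

    @Override
    public void reobtainDelegationTokens() {
        // Only the caller that flips false -> true schedules a run;
        // everyone else piggybacks on the run that is already queued.
        if (pending.compareAndSet(false, true)) {
            ioExecutor.execute(() -> {
                // Clear the flag before running so that a request arriving
                // during the run schedules one more pass instead of being lost.
                pending.set(false);
                reobtain.run();
            });
        }
    }
}
```

With this shape, any number of calls made while a run is still queued
result in a single re-obtain.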

This method could then be utilized by the body of any registerJob, allowing
the method to have a void return value.
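
A sketch of what that might look like inside a provider.  The signatures
are simplified assumptions (a String job id, no Configuration argument) and
CallbackDrivenProvider is a made-up name; the real SPI methods proposed in
the FLIP may differ.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

interface DelegationTokenManagerCallback {
    void reobtainDelegationTokens();
}

// Hypothetical provider: registerJob returns void and asks for a
// re-obtain through the callback it received in init.
class CallbackDrivenProvider {
    private volatile DelegationTokenManagerCallback callback;
    private final Set<String> jobs = ConcurrentHashMap.newKeySet();

    // Simplified stand-in for init(Configuration, DelegationTokenManagerCallback).
    public void init(DelegationTokenManagerCallback callback) {
        this.callback = callback;
    }

    public void registerJob(String jobId) {
        // Request a re-obtain only for a job that was not already known,
        // so repeated registrations of the same job do not trigger more work.
        if (jobs.add(jobId)) {
            callback.reobtainDelegationTokens();
        }
    }

    public void unregisterJob(String jobId) {
        jobs.remove(jobId);
    }
}
```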

That approach is simple and could be extended in the future if you have
broader ideas for other parts of the API.  Would you prefer that we
implement this approach and avoid adding a special case to registerJob?

> Not sure sure how it's planned but new immediate re-obtain scheduling would
> be good to be
> upper bounded. Some retry logic can be aggressive about re-registration.
> Or having a
> cooldown is also fine.

It makes sense to have a cooldown to avoid re-obtaining too often.
However, a job might not be able to run if a re-obtain cannot be initiated
soon after it is registered.  A configurable cooldown with a reasonable
default might be a good choice.
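
One way to reconcile the two concerns is to defer a request that arrives
during the cooldown rather than drop it.  A minimal sketch of the delay
calculation, where ReobtainCooldown is a made-up helper and the cooldown
value would come from a configuration option that does not exist yet:

```java
// Hypothetical helper: how long to wait before the next re-obtain may run.
class ReobtainCooldown {
    static long delayMillis(long nowMillis, long lastReobtainMillis, long cooldownMillis) {
        long earliestAllowed = lastReobtainMillis + cooldownMillis;
        // Zero means the cooldown has already elapsed and the re-obtain
        // can be scheduled immediately.
        return Math.max(0L, earliestAllowed - nowMillis);
    }
}
```

For example, with a 30 second cooldown, a request made 6 seconds after the
last re-obtain would be scheduled 24 seconds out instead of being rejected.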

> Last but not least up until now there was a single thread which played on
> critical path on
> immutable structures. Now we plan to change that which is fine but then I
> would like to see an
> exact plan what kind of threads are doing what and how do we protect
> against
> race/starvation/deadlock. Having an exact look is fine on the PR but this
> is the gist of it
> from my perspective.
>

In the current codebase a single thread creates
the DefaultDelegationTokenManager and builds the immutable structures. Then
DefaultDelegationTokenManager.start is called from the ResourceManager main
thread and each token re-obtain is called on a thread
from DefaultDelegationTokenManager.ioExecutor.  Therefore, fields within a
DelegationTokenProvider must either be immutable or properly synchronized.

The calls to registerJob/unregisterJob in this FLIP will come from the
ResourceManager main thread, calling through to
DefaultDelegationTokenManager and then the providers.  They are assumed to
be non-blocking and to only do book-keeping for the next re-obtain call.
Since this pattern inherently requires updating internal fields, the
DelegationTokenProvider must properly synchronize the methods/fields used
for this book-keeping.  Nothing prevents an implementation of
registerJob/unregisterJob from blocking and starving other work, which is
also true of obtainDelegationTokens today.  The contract can be made very
clear in the javadoc.  Preventing races, starvation, or deadlock within the
provider will therefore depend on proper implementation by the user.
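
As an illustration of the kind of book-keeping meant here, under simplified
assumptions (String job ids, a String principal per job, and a made-up
class name), a provider could keep its per-job state in a concurrent map
that the main thread writes and the ioExecutor thread reads:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical per-job state kept by a provider.
class JobBookkeeping {
    // jobId -> principal; written from the ResourceManager main thread,
    // read from the ioExecutor thread during a re-obtain.
    private final Map<String, String> principalByJob = new ConcurrentHashMap<>();

    // Non-blocking: only records state for the next re-obtain.
    void registerJob(String jobId, String principal) {
        principalByJob.put(jobId, principal);
    }

    void unregisterJob(String jobId) {
        principalByJob.remove(jobId);
    }

    // Called at the start of a re-obtain: copy the distinct principals so
    // the slow token fetching works on a stable snapshot and never holds
    // up registerJob/unregisterJob.
    Set<String> principalsToObtain() {
        return new HashSet<>(principalByJob.values());
    }
}
```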

A larger reworking of DefaultDelegationTokenManager could try to do
everything on a single thread
(registerJob/unregisterJob/obtainDelegationTokens) to simplify this model,
but would require using a special background thread rather than the
ioExecutor.  I haven't considered this in detail, but would be open to it
if it were strongly preferred.

> What I mean here specifically is that even if we schedule the renewal the
> existing way
> at least the providers list manipulation and the originally scheduled
> renewal can race.
> Maybe others since I can just imagine the change.

We don't intend to change the list of providers; it remains immutable.
Whenever a new re-obtain is requested, it should cancel the originally
scheduled renewal using the future, as in
DefaultDelegationTokenManager.stopTokensUpdate, ensuring that only one
update is scheduled at a time.
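
The cancel-then-schedule step could look roughly like the following.  This
is a standalone sketch using plain java.util.concurrent types, not the
actual DefaultDelegationTokenManager code; SingleRenewalScheduler and its
method names are made up.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical helper that keeps at most one renewal scheduled.
class SingleRenewalScheduler {
    private final ScheduledExecutorService scheduler;
    private final Runnable renewal;
    private ScheduledFuture<?> pending; // guarded by "this"

    SingleRenewalScheduler(ScheduledExecutorService scheduler, Runnable renewal) {
        this.scheduler = scheduler;
        this.renewal = renewal;
    }

    // Cancel whatever was scheduled before and schedule the new renewal.
    // Both steps happen under one lock so concurrent callers cannot end
    // up with two live futures.
    synchronized ScheduledFuture<?> reschedule(long delayMillis) {
        if (pending != null) {
            pending.cancel(false); // do not interrupt a renewal already running
        }
        pending = scheduler.schedule(renewal, delayMillis, TimeUnit.MILLISECONDS);
        return pending;
    }

    synchronized void stop() {
        if (pending != null) {
            pending.cancel(false);
            pending = null;
        }
    }
}
```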

I hope I have answered a lot of your questions.  I'm happy to elaborate or
even show a draft PR if that might be easier to trace.

Thanks,
Alan

On Fri, Jun 19, 2026 at 7:50 AM Gabor Somogyi <[email protected]>
wrote:

> Hi Aleksandr,
>
> Thanks for efforts!
>
> I've missed this thread lately but have some thought/questions.
>
> Up until now one cluster per one set of user credentials was the model. I
> think the multi-user
> model better serves the needs so +1. We should mention this on the main
> doc page later.
>
> Up until now DelegationTokenProvider instances were singletons and loaded
> by the
> service loader. Now we plan to add stop function, does that mean we plan
> to change
> the lifecycle?
>
> Having a generic way to ask the delegation token manager to re-obtain is a
> long standing
> needed feature but didn't have time. Having a dedicated API for this would
> be maybe
> better instead of relying on registerJob return value.
>
> Not sure sure how it's planned but new immediate re-obtain scheduling
> would be good to be
> upper bounded. Some retry logic can be aggressive about re-registration.
> Or having a
> cooldown is also fine.
>
> Last but not least up until now there was a single thread which played on
> critical path on
> immutable structures. Now we plan to change that which is fine but then I
> would like to see an
> exact plan what kind of threads are doing what and how do we protect
> against
> race/starvation/deadlock. Having an exact look is fine on the PR but this
> is the gist of it
> from my perspective.
> What I mean here specifically is that even if we schedule the renewal the
> existing way
> at least the providers list manipulation and the originally scheduled
> renewal can race.
> Maybe others since I can just imagine the change.
>
> BR,
> G
>
>
> On 2026/06/05 16:35:15 Aleksandr Savonin wrote:
> > Hi everyone,
> >
> > Alan Sheinberg and I would like to start a discussion on FLIP-588:
> > Support per-job delegation tokens [1].
> > Flink's delegation token framework is currently cluster-scoped, which
> > means a DelegationTokenProvider has no notion of an individual job.
> > This breaks when different jobs on the same cluster need to
> > authenticate as different identities to the same external service.
> > To resolve this, the FLIP adds per-job lifecycle hooks
> > (registerJob/unregisterJob/stop) as default methods on the
> > DelegationTokenProvider SPI, along with the runtime wiring to invoke
> > them on job start and stop.
> > This change is fully backward compatible (new methods are default
> > no-ops). It is worth mentioning that it widens the internal
> > registerJobMaster RPC to carry the job configuration.
> >
> > Looking forward to your feedback.
> >
> > [1]
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP-588*3A*Support*per-job*delegation*tokens__;JSsrKys!!Ayb5sqE7!pujTCGQDxHRMUp32hJP7kWS_heNDLb_73xOFQWmfwladcejJ1XJF028lAWmhEubAIfREamAXhXe0ImcLzn1TBQ9SvZl-ww$
> >
> > --
> > Kind regards,
> > Aleksandr
> >
>
