Hi,

> AFAIU an under registration TM is not added to the registered TMs map until
> RegistrationResponse ..
I think you're right; with a careful design around threading (delegating update broadcasts to the main thread) + a synchronous initial update (that would be nice to avoid) this should be doable.

> Not sure what you mean by "we can't register the TM without providing it with a token" but in unsecure configuration registration must happen w/o tokens.

Exactly as you describe it, this was meant only for the "kerberized / secured" cluster case; in other cases we wouldn't enforce a non-null token in the response.

> I think this is a good idea in general.

+1

If you don't have any more thoughts on the RPC / lifecycle part, can you please reflect it into the FLIP?

D.

On Fri, Jan 28, 2022 at 3:16 PM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:

> > - Make sure DTs issued by single DTMs are monotonically increasing (can be sorted on TM side)
>
> AFAIU an under-registration TM is not added to the registered TMs map until RegistrationResponse is processed, which would contain the initial tokens. If that's true, then how is it possible to have a race with a DTM update, which works on the registered TMs list? To be more specific, "taskExecutors" is the registered map of TMs to which DTM can send updated tokens, but this doesn't contain the under-registration TM while RegistrationResponse is not processed, right?
>
> Of course, if DTM can update while RegistrationResponse is being processed, then some sorting would be required, and in that case I would agree.
>
> > - Scope DT updates by the RM ID and ensure that TM only accepts updates from the current leader
>
> I've planned this initially the mentioned way, so agreed.
>
> > - Return the initial token with the RegistrationResponse, which should make the RPC contract a bit clearer (ensure that we can't register the TM without providing it with a token)
>
> I think this is a good idea in general.
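The "monotonically increasing / sortable on the TM side" requirement from the bullet points above can be sketched in a few lines. This is only an illustration of the ordering rule under discussion, not Flink code; the class, the expiration-based comparison, and all names are hypothetical:

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: the TM keeps only the newest token it has seen.
// If a registration response and a broadcast race each other, comparing
// expiration times gives a deterministic winner.
class TokenStore {
    // A delegation token modeled as an opaque payload plus its expiration.
    static final class Token {
        final byte[] payload;
        final long expirationMillis;

        Token(byte[] payload, long expirationMillis) {
            this.payload = payload;
            this.expirationMillis = expirationMillis;
        }
    }

    private final AtomicReference<Token> current = new AtomicReference<>();

    /** Accepts the update only if it is newer than what we already hold. */
    boolean update(Token incoming) {
        while (true) {
            Token old = current.get();
            if (old != null && old.expirationMillis >= incoming.expirationMillis) {
                return false; // stale update, ignore
            }
            if (current.compareAndSet(old, incoming)) {
                return true;
            }
        }
    }

    Token get() {
        return current.get();
    }
}
```

With this rule in place it would not matter in which order the initial token and a refreshed token arrive; the TM ends up holding the one with the later expiration either way.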
> Not sure what you mean by "we can't register the TM without providing it with a token" but in unsecure configuration registration must happen w/o tokens. All in all, the newly added tokens field must be somehow optional.
>
> G
>
> On Fri, Jan 28, 2022 at 2:22 PM David Morávek <d...@apache.org> wrote:
>
> > We had a long discussion with Chesnay about the possible edge cases and it basically boils down to the following two scenarios:
> >
> > 1) There is a possible race condition between TM registration (the first DT update) and token refresh if they happen simultaneously. Then the registration might beat the refreshed token. This could be easily addressed if DTs could be sorted (eg. by the expiration time) on the TM side. In other words, if there are multiple updates at the same time we need to make sure that we have a deterministic way of choosing the latest one.
> >
> > One idea by Chesnay that popped up during this discussion was whether we could simply return the initial token with the RegistrationResponse to avoid making an extra call during the TM registration.
> >
> > 2) When the RM leadership changes (eg. because the zookeeper session times out) there might be a race condition where the old RM is shutting down and updates the tokens, so that it might again beat the registration token of the new RM. This could be avoided if we scope the token by _ResourceManagerId_ and only accept updates for the current leader (basically we'd have an extra parameter to the _updateDelegationToken_ method).
> >
> > -
> >
> > DTM is way simpler than for example slot management, which could receive updates from the JobMaster that RM might not know about.
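Scenario 2 above (scoping _updateDelegationToken_ by _ResourceManagerId_) boils down to a fencing check on the TM side. A minimal sketch of that check, with hypothetical names and a plain UUID standing in for the real fencing token:

```java
import java.util.Objects;
import java.util.UUID;

// Hypothetical sketch of fencing the token-update RPC by the resource
// manager's id, so a shutting-down old leader cannot overwrite the token
// installed by the new leader. Names are illustrative, not Flink's API.
class FencedTokenReceiver {
    private final UUID currentLeaderId;
    private byte[] token;

    FencedTokenReceiver(UUID currentLeaderId) {
        this.currentLeaderId = currentLeaderId;
    }

    /** The RPC carries the sender's RM id; updates from anyone else are rejected. */
    boolean updateDelegationToken(UUID senderResourceManagerId, byte[] newToken) {
        if (!Objects.equals(currentLeaderId, senderResourceManagerId)) {
            return false; // stale leader, ignore the update
        }
        this.token = newToken;
        return true;
    }

    byte[] getToken() {
        return token;
    }
}
```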
> > So if you want to go down the path you're describing, it should be doable, and we'd propose the following to cover all cases:
> >
> > - Make sure DTs issued by single DTMs are monotonically increasing (can be sorted on TM side)
> > - Scope DT updates by the RM ID and ensure that TM only accepts updates from the current leader
> > - Return the initial token with the RegistrationResponse, which should make the RPC contract a bit clearer (ensure that we can't register the TM without providing it with a token)
> >
> > Any thoughts?
> >
> > On Fri, Jan 28, 2022 at 10:53 AM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
> >
> > > Thanks for investing your time!
> > >
> > > The first 2 bulletpoints are clear. If there is a chance that a TM can go to an inconsistent state then I agree with the 3rd bulletpoint. Just before we agree on that, I would like to learn something new and understand how it is possible that a TM gets corrupted. (In Spark I've never seen such a thing and no mechanism to fix it, but Flink is definitely not Spark.)
> > >
> > > Here is my understanding:
> > > * DTM pushes newly obtained DTs to TMs, and if any exception occurs then a retry after "security.kerberos.tokens.retry-wait" happens. This means DTM keeps retrying until it manages to send the new DTs to all registered TMs.
> > > * New TM registration must fail if "updateDelegationToken" fails
> > > * "updateDelegationToken" fails consistently, like a DB (at least I plan to implement it that way). If DTs arrive on the TM side, then a single "UserGroupInformation.getCurrentUser.addCredentials" call is made, which I've never seen fail.
> > > * I hope all other code parts are not touching existing DTs within the JVM
> > >
> > > I would like to emphasize I'm not against adding it, I just want to see what kind of problems we are facing.
> > > It would make it easier to catch bugs earlier and would help with maintenance.
> > >
> > > All in all I would buy the idea of adding the 3rd bullet if we foresee the need.
> > >
> > > G
> > >
> > > On Fri, Jan 28, 2022 at 10:07 AM David Morávek <d...@apache.org> wrote:
> > >
> > > > Hi Gabor,
> > > >
> > > > This is definitely headed in the right direction +1.
> > > >
> > > > I think we still need to have a safeguard in case some of the TMs get into an inconsistent state though, which will also eliminate the need for implementing a custom retry mechanism (when the _updateDelegationToken_ call fails for some reason).
> > > >
> > > > We already have this safeguard in place for the slot pool (in case there are some slots in an inconsistent state - eg. we haven't freed them for some reason) and for the partition tracker, which could simply be enhanced. This is done via a periodic heartbeat from TaskManagers to the ResourceManager that contains a report about the state of these two components (from the TM perspective), so the RM can reconcile their state if necessary.
> > > >
> > > > I don't think adding an additional field to _TaskExecutorHeartbeatPayload_ should be a concern, as we only heartbeat every ~10s by default and the new field would be small compared to the rest of the existing payload. Also, the heartbeat doesn't need to contain the whole DT, but just some identifier which signals whether it uses the right one; that could be significantly smaller.
> > > >
> > > > This is still a PUSH based approach, as the RM would again call the newly introduced _updateDelegationToken_ when it encounters an inconsistency (eg. due to a temporary network partition / a race condition we didn't test for / some other scenario we didn't think about).
> > > > In practice these inconsistencies are super hard to avoid and reason about (and unfortunately yes, we see them happen from time to time), so reusing the existing mechanism that is designed for this exact problem simplifies things.
> > > >
> > > > To sum this up, we'd have three code paths for calling _updateDelegationToken_:
> > > > 1) When the TM registers, we push the token (if DTM already has it) to it
> > > > 2) When DTM obtains a new token, it broadcasts it to all currently connected TMs
> > > > 3) When a TM gets out of sync, DTM would reconcile its state
> > > >
> > > > WDYT?
> > > >
> > > > Best,
> > > > D.
> > > >
> > > > On Wed, Jan 26, 2022 at 9:03 PM David Morávek <d...@apache.org> wrote:
> > > >
> > > > > Thanks for the update, I'll go over it tomorrow.
> > > > >
> > > > > On Wed, Jan 26, 2022 at 5:33 PM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
> > > > >
> > > > >> Hi All,
> > > > >>
> > > > >> Since it has turned out that DTM can't be added as a member of JobMaster
> > > > >> https://github.com/gaborgsomogyi/flink/blob/8ab75e46013f159778ccfce52463e7bc63e395a9/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L176
> > > > >> I've come up with a better proposal.
> > > > >> David, thanks for pointing this out, you've caught a bug in the early phase!
> > > > >>
> > > > >> Namely, ResourceManager
> > > > >> https://github.com/apache/flink/blob/674bc96662285b25e395fd3dddf9291a602fc183/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L124
> > > > >> is a single-instance class where DTM can be added as a member variable. It has a list of all already registered TMs, and new TM registration is also happening here.
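The three code paths for _updateDelegationToken_ mentioned earlier in the thread (push on registration, broadcast on renewal, reconcile when a heartbeat reports a stale token) can be modeled in a few lines. This is an illustrative toy model only; the class and the token-id-in-heartbeat encoding are assumptions, not the proposed Flink API:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the three push paths discussed in the thread:
// (1) push on registration, (2) broadcast on renewal, (3) reconcile when
// a heartbeat reports an out-of-date token id. All names are hypothetical.
class DelegationTokenManagerSketch {
    private final Map<String, Long> tmTokenIds = new HashMap<>(); // TM id -> token id it holds
    private long currentTokenId;

    // (2) a renewal produced a new token: broadcast it to all registered TMs
    void onNewToken(long tokenId) {
        currentTokenId = tokenId;
        tmTokenIds.replaceAll((tm, old) -> tokenId);
    }

    // (1) a TM registers: hand it the current token right away
    long onRegistration(String tmId) {
        tmTokenIds.put(tmId, currentTokenId);
        return currentTokenId;
    }

    // (3) the heartbeat carries the token id the TM believes it holds;
    // if it is out of sync, push the current one again
    boolean onHeartbeat(String tmId, long reportedTokenId) {
        if (reportedTokenId != currentTokenId) {
            tmTokenIds.put(tmId, currentTokenId);
            return true; // reconciliation triggered
        }
        return false;
    }
}
```

The point of the model is that all three paths converge on the same single update call, which is what makes the heartbeat-based safeguard cheap to add.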
> > > > >> The following can be added, from a logic perspective, to be more specific:
> > > > >> * Create a new DTM instance in ResourceManager
> > > > >> https://github.com/apache/flink/blob/674bc96662285b25e395fd3dddf9291a602fc183/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L124
> > > > >> and start it (re-occurring thread to obtain new tokens)
> > > > >> * Add a new function named "updateDelegationTokens" to TaskExecutorGateway
> > > > >> https://github.com/apache/flink/blob/674bc96662285b25e395fd3dddf9291a602fc183/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutorGateway.java#L54
> > > > >> * Call "updateDelegationTokens" on all registered TMs to propagate new DTs
> > > > >> * In case of a new TM registration, call "updateDelegationTokens" before registration succeeds to set up the new TM properly
> > > > >>
> > > > >> This way:
> > > > >> * only a single DTM would live within a cluster, which is the expected behavior
> > > > >> * DTM is going to be added to a central place where all deployment targets can make use of it
> > > > >> * DTs are going to be pushed to TMs, which would generate less network traffic than a pull based approach (please see my previous mail where I've described both approaches)
> > > > >> * the HA scenario is going to be consistent, because such
> > > > >> https://github.com/apache/flink/blob/674bc96662285b25e395fd3dddf9291a602fc183/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java#L1069
> > > > >> a solution can be added to "updateDelegationTokens"
> > > > >>
> > > > >> @David or all others, please share whether you agree on this or you have a better idea/suggestion.
> > > > >> BR,
> > > > >> G
> > > > >>
> > > > >> On Tue, Jan 25, 2022 at 11:00 AM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
> > > > >>
> > > > >> > First of all, thanks for investing your time and helping me out. As I see it, you have pretty solid knowledge in the RPC area. I would like to rely on your knowledge since I'm learning this part.
> > > > >> >
> > > > >> > > - Do we need to introduce a new RPC method or can we for example piggyback on heartbeats?
> > > > >> >
> > > > >> > I'm fine with either solution, but one thing is important conceptually. There are fundamentally 2 ways tokens can be updated:
> > > > >> > - Push way: when there are new DTs, the JM JVM pushes them to the TM JVMs. This is the preferred one, since only a tiny amount of control logic is needed.
> > > > >> > - Pull way: each TM polls the JM for new tokens, and each TM decides alone whether its DTs need to be updated or not. As you've mentioned, some ID needs to be generated here; it would generate quite some additional network traffic, which can definitely be avoided. As a final thought, in Spark we had this way of DT propagation and we had major issues with it.
> > > > >> >
> > > > >> > So all in all, DTM needs to obtain new tokens and there must be a way to send this data to all TMs from the JM.
> > > > >> >
> > > > >> > > - What delivery semantics are we looking for?
> > > > >> > > (what if we're only able to update a subset of TMs / what happens if we exhaust retries / should we even have the retry mechanism whatsoever) - I have a feeling that somehow leveraging the existing heartbeat mechanism could help to answer these questions
> > > > >> >
> > > > >> > Let's go through these questions one by one.
> > > > >> >
> > > > >> > > What delivery semantics are we looking for?
> > > > >> >
> > > > >> > DTM must receive an exception when at least one TM was not able to get the DTs.
> > > > >> >
> > > > >> > > what if we're only able to update a subset of TMs?
> > > > >> >
> > > > >> > In such a case DTM will reschedule the token obtainment after the "security.kerberos.tokens.retry-wait" time.
> > > > >> >
> > > > >> > > what happens if we exhaust retries?
> > > > >> >
> > > > >> > There is no fixed number of retries. In the default configuration tokens need to be re-obtained after one day. DTM tries to obtain new tokens after 1 day * 0.75 (security.kerberos.tokens.renewal-ratio) = 18 hours. When that fails, it retries after "security.kerberos.tokens.retry-wait", which is 1 hour by default. If it never succeeds, then an authentication error is going to happen on the TM side and the workload is going to stop.
> > > > >> >
> > > > >> > > should we even have the retry mechanism whatsoever?
> > > > >> >
> > > > >> > Yes, because there are always temporary cluster issues.
> > > > >> >
> > > > >> > > What does it mean for the running application (how does this look from the user perspective)? As far as I remember the logs are only collected ("aggregated") after the container is stopped, is that correct?
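The renewal timing described above (obtain at renewal-ratio x lifetime, retry after retry-wait on failure) is simple arithmetic. A tiny sketch with hypothetical method names, using the defaults quoted in the mail (1-day lifetime, 0.75 ratio, 1-hour retry-wait):

```java
import java.time.Duration;

// Sketch of the schedule: the obtain attempt for the next token fires at
// renewalRatio * tokenLifetime after issue; with a 1-day lifetime and a
// ratio of 0.75 that is 18 hours. On failure, the next attempt comes
// retryWait later, with no retry cap. Method names are made up.
class RenewalSchedule {
    static Duration nextRenewalDelay(Duration tokenLifetime, double renewalRatio) {
        return Duration.ofMillis((long) (tokenLifetime.toMillis() * renewalRatio));
    }

    static Duration delayAfterFailure(Duration retryWait) {
        return retryWait; // constant back-off until the token's lifetime runs out
    }
}
```

With the defaults this yields an 18-hour renewal delay and hourly retries, leaving a 6-hour window before the 24-hour lifetime expires, which matches the numbers above.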
> > > > >> > With the default config it works like that, but it can be forced to aggregate at specific intervals. A useful feature is forcing YARN to aggregate logs while the job is still running. For long-running jobs such as streaming jobs, this is invaluable. To do this, yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds must be set to a non-negative value. When this is set, a timer will be set for the given duration, and whenever that timer goes off, log aggregation will run on new files.
> > > > >> >
> > > > >> > > I think this topic should get its own section in the FLIP (having some cross reference to a YARN ticket would be really useful, but I'm not sure if there are any).
> > > > >> >
> > > > >> > I think this is important knowledge, but this FLIP is not touching the already existing behavior. DTs are set on the AM container, which is renewed by YARN until it's not possible anymore. Any kind of new code is not going to change this limitation. BTW, there is no jira for this. If you think it's worth writing this down, then I think the good place is the official security doc area, as a caveat.
> > > > >> >
> > > > >> > > If we split the FLIP into the two parts / sections that I've suggested, I don't really think that you need to explicitly test for each deployment scenario / cluster framework, because the DTM part is completely independent of the deployment target.
> > > > >> > > Basically this is what I'm aiming for with "making it work with the standalone" (as simple as starting a new java process) Flink first (which is also how most people deploy streaming applications on k8s, and the direction we're pushing forward with the auto-scaling / reactive mode initiatives).
> > > > >> >
> > > > >> > I see your point and agree with the main direction. k8s is the megatrend which most people will use sooner or later. Not 100% sure what kind of split you suggest, but in my view the main target is to add this feature and I'm open to any logical work ordering. Please share the specific details and we'll work it out...
> > > > >> >
> > > > >> > G
> > > > >> >
> > > > >> > On Mon, Jan 24, 2022 at 3:04 PM David Morávek <d...@apache.org> wrote:
> > > > >> >
> > > > >> >> > Could you point to a code where you think it could be added exactly? A helping hand is welcome here 🙂
> > > > >> >>
> > > > >> >> I think you can take a look at _ResourceManagerPartitionTracker_ [1], which seems to have somewhat similar properties to the DTM.
> > > > >> >>
> > > > >> >> One topic that needs to be addressed there is how the RPC with the _TaskExecutorGateway_ should look.
> > > > >> >> - Do we need to introduce a new RPC method or can we for example piggyback on heartbeats?
> > > > >> >> - What delivery semantics are we looking for?
(what if we're > only > > > > able > > > > >> to > > > > >> >> update subset of TMs / what happens if we exhaust retries / > > should > > > we > > > > >> even > > > > >> >> have the retry mechanism whatsoever) - I have a feeling that > > > somehow > > > > >> >> leveraging the existing heartbeat mechanism could help to > answer > > > > these > > > > >> >> questions > > > > >> >> > > > > >> >> In short, after DT reaches it's max lifetime then log > aggregation > > > > stops > > > > >> >> > > > > > >> >> > > > > >> >> What does it mean for the running application (how does this > look > > > > like > > > > >> >> from > > > > >> >> the user perspective)? As far as I remember the logs are only > > > > collected > > > > >> >> ("aggregated") after the container is stopped, is that > correct? I > > > > think > > > > >> >> this topic should get its own section in the FLIP (having some > > > cross > > > > >> >> reference to YARN ticket would be really useful, but I'm not > sure > > > if > > > > >> there > > > > >> >> are any). > > > > >> >> > > > > >> >> All deployment modes (per-job, per-app, ...) are planned to be > > > tested > > > > >> and > > > > >> >> > expect to work with the initial implementation however not > all > > > > >> >> deployment > > > > >> >> > targets (k8s, local, ... > > > > >> >> > > > > > >> >> > > > > >> >> If we split the FLIP into two parts / sections that I've > > > suggested, I > > > > >> >> don't > > > > >> >> really think that you need to explicitly test for each > deployment > > > > >> scenario > > > > >> >> / cluster framework, because the DTM part is completely > > independent > > > > of > > > > >> the > > > > >> >> deployment target. 
> > > > >> >> Basically this is what I'm aiming for with "making it work with the standalone" (as simple as starting a new java process) Flink first (which is also how most people deploy streaming applications on k8s, and the direction we're pushing forward with the auto-scaling / reactive mode initiatives).
> > > > >> >>
> > > > >> >> The whole integration with YARN (let's forget about log aggregation for a moment) / k8s-native only boils down to how we make the keytab file local to the JobManager so the DTM can read it, so it's basically built on top of that. The only special thing that needs to be tested there is the "keytab distribution" code path.
> > > > >> >>
> > > > >> >> [1] https://github.com/apache/flink/blob/release-1.14.3/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/ResourceManagerPartitionTracker.java
> > > > >> >>
> > > > >> >> Best,
> > > > >> >> D.
> > > > >> >>
> > > > >> >> On Mon, Jan 24, 2022 at 12:35 PM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
> > > > >> >>
> > > > >> >> > > There is a separate JobMaster for each job within a Flink cluster and each JobMaster only has a partial view of the task managers
> > > > >> >> >
> > > > >> >> > Good point! I've had a deeper look and you're right. We definitely need to find another place.
> > > > >> >> >
> > > > >> >> > > Related per-cluster or per-job keytab:
> > > > >> >> >
> > > > >> >> > In the current code a per-cluster keytab is implemented and I intend to keep it like this within this FLIP.
> > > > >> >> > The reason is simple: tokens on the TM side can be stored within the UserGroupInformation (UGI) structure, which is global. I'm not saying it's impossible to change that, but I think this is a complexity which the initial implementation is not required to contain. Additionally, we've not seen such a need from the user side. If the need arises later on, then another FLIP with this topic can be created and discussed. Proper multi-UGI handling within a single JVM is a topic where several rounds of deep-dives with the Hadoop/YARN guys are required.
> > > > >> >> >
> > > > >> >> > > single DTM instance embedded with the ResourceManager (the Flink component)
> > > > >> >> >
> > > > >> >> > Could you point to a code where you think it could be added exactly? A helping hand is welcome here 🙂
> > > > >> >> >
> > > > >> >> > > Then the single (initial) implementation should work with all the deployment modes out of the box (which is not what the FLIP suggests). Is that correct?
> > > > >> >> >
> > > > >> >> > All deployment modes (per-job, per-app, ...) are planned to be tested and are expected to work with the initial implementation; however, not all deployment targets (k8s, local, ...) are intended to be tested. Per deployment target a new jira needs to be created, where I expect a small amount of code to be added and a relatively expensive testing effort to be required.
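The UGI argument above is the crux of the per-cluster choice: Hadoop's UserGroupInformation credentials are effectively global to the TM JVM. This toy model (not the Hadoop API; the class and method names are made up) shows why two jobs holding different tokens for the same service would clash inside one JVM:

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for a process-global credential store, illustrating the
// "UGI is global" constraint: every job in the JVM reads and writes the
// same map, so per-job tokens for the same service would overwrite each
// other. This is an illustration, not Hadoop's actual implementation.
class GlobalCredentials {
    private static final Map<String, byte[]> CREDS = new HashMap<>();

    static void addCredential(String service, byte[] token) {
        CREDS.put(service, token); // last writer wins, JVM-wide
    }

    static byte[] get(String service) {
        return CREDS.get(service);
    }
}
```

With a single per-cluster keytab there is only one writer, so the global store is not a problem; per-job tokens would need proper multi-UGI handling, which is the complexity the mail defers to a future FLIP.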
> > > > >> >> > > I've taken a look into the prototype and in the "YarnClusterDescriptor" you're injecting a delegation token into the AM [1] (that's obtained using the provided keytab). If I understand this correctly from previous discussion / FLIP, this is to support log aggregation and DT has a limited validity. How is this DT going to be renewed?
> > > > >> >> >
> > > > >> >> > You're clever and touched a limitation which Spark has too. In short, after the DT reaches its max lifetime, log aggregation stops. I've had several deep-dive rounds with the YARN guys during the Spark years because I wanted to fill this gap. They can't provide us any way to re-inject the newly obtained DT, so in the end I gave up on this.
> > > > >> >> >
> > > > >> >> > BR,
> > > > >> >> > G
> > > > >> >> >
> > > > >> >> > On Mon, 24 Jan 2022, 11:00 David Morávek, <d...@apache.org> wrote:
> > > > >> >> >
> > > > >> >> > > Hi Gabor,
> > > > >> >> > >
> > > > >> >> > > There is actually a huge difference between JobManager (process) and JobMaster (job coordinator). The naming is unfortunately a bit misleading here for historical reasons. There is a separate JobMaster for each job within a Flink cluster and each JobMaster only has a partial view of the task managers (depends on where the slots for a particular job are allocated).
> > > > >> >> > > This means that you'll end up with N "DelegationTokenManagers" competing with each other (N = number of running jobs in the cluster).
> > > > >> >> > >
> > > > >> >> > > This makes me think we're mixing two abstraction levels here:
> > > > >> >> > >
> > > > >> >> > > a) Per-cluster delegation tokens
> > > > >> >> > > - Simpler approach; it would involve a single DTM instance embedded with the ResourceManager (the Flink component)
> > > > >> >> > > b) Per-job delegation tokens
> > > > >> >> > > - More complex approach, but could be more flexible from the user side of things.
> > > > >> >> > > - Multiple DTM instances that are bound to the JobMaster lifecycle. Delegation tokens are attached to the particular slots that are executing the job tasks instead of the whole task manager (a TM could be executing multiple jobs with different tokens).
> > > > >> >> > > - The question is which keytab should be used for the clustering framework, to support log aggregation on YARN (an extra keytab, the keytab that comes with the first job?)
> > > > >> >> > >
> > > > >> >> > > I think these are the things that need to be clarified in the FLIP before proceeding.
> > > > >> >> > >
> > > > >> >> > > A follow-up question for getting a better understanding of where this should be headed: are there any use cases where a user may want to use different keytabs with each job, or are we fine with using a cluster-wide keytab?
> > > > >> >> > > If we go with per-cluster keytabs, is it OK that all jobs submitted into this cluster can access it (even the future ones)? Should this be a security concern?
> > > > >> >> > >
> > > > >> >> > > > Presume you thought I would implement a new class with the JobManager name. The plan is not that.
> > > > >> >> > >
> > > > >> >> > > I've never suggested such a thing.
> > > > >> >> > >
> > > > >> >> > > > No. As said earlier, DT handling is planned to be done completely in Flink. DTM has a renewal thread which re-obtains tokens at the proper time when needed.
> > > > >> >> > >
> > > > >> >> > > Then the single (initial) implementation should work with all the deployment modes out of the box (which is not what the FLIP suggests). Is that correct?
> > > > >> >> > >
> > > > >> >> > > If the cluster framework also requires a delegation token for its inner workings (this IMO only applies to YARN), it might need an extra step (injecting the token into the application master container).
> > > > >> >> > >
> > > > >> >> > > Separating the individual layers (actual Flink cluster - basically making this work with a standalone deployment / "cluster framework" - support for YARN log aggregation) in the FLIP would be useful.
> > > > >> >> > >
> > > > >> >> > > > Reading the linked Spark readme could be useful.
> > > > >> >> > > I've read that, but please be patient with the questions; Kerberos is not an easy topic to get into and I've had very little contact with it in the past.
> > > > >> >> > >
> > > > >> >> > > > https://github.com/gaborgsomogyi/flink/blob/8ab75e46013f159778ccfce52463e7bc63e395a9/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L176
> > > > >> >> > >
> > > > >> >> > > I've taken a look into the prototype and in the "YarnClusterDescriptor" you're injecting a delegation token into the AM [1] (that's obtained using the provided keytab). If I understand this correctly from the previous discussion / FLIP, this is to support log aggregation and the DT has a limited validity. How is this DT going to be renewed?
> > > > >> >> > >
> > > > >> >> > > [1] https://github.com/gaborgsomogyi/flink/commit/8ab75e46013f159778ccfce52463e7bc63e395a9#diff-02416e2d6ca99e1456f9c3949f3d7c2ac523d3fe25378620c09632e4aac34e4eR1261
> > > > >> >> > >
> > > > >> >> > > Best,
> > > > >> >> > > D.
> > > > >> >> > > On Fri, Jan 21, 2022 at 9:35 PM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
> > > > >> >> > >
> > > > >> >> > > > Here is the exact class (I'm on mobile so I haven't looked up the exact class name):
> > > > >> >> > > > https://github.com/gaborgsomogyi/flink/blob/8ab75e46013f159778ccfce52463e7bc63e395a9/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L176
> > > > >> >> > > > That keeps track of the TMs where the tokens can be sent.
> > > > >> >> > > >
> > > > >> >> > > > > My feeling would be that we shouldn't really introduce a new component with a custom lifecycle, but rather we should try to incorporate this into existing ones.
> > > > >> >> > > >
> > > > >> >> > > > Can you be more specific? Presume you thought I would implement a new class with the JobManager name. The plan is not that.
> > > > >> >> > > >
> > > > >> >> > > > > If I understand this correctly, this means that we then push the token renewal logic to YARN.
> > > > >> >> > > >
> > > > >> >> > > > No. As said earlier, DT handling is planned to be done completely in Flink. DTM has a renewal thread which re-obtains tokens at the proper time when needed. YARN log aggregation is a totally different feature, where YARN does the renewal.
Log aggregation was an example of why the code can't be 100% reusable for all resource managers. Reading the linked Spark readme could be useful.

G

On Fri, 21 Jan 2022, 21:05 David Morávek, <d...@apache.org> wrote:

> JobManager is the Flink class.

There is no such class in Flink. The closest thing to the JobManager is the ClusterEntrypoint. The cluster entrypoint spawns a new RM Runner & Dispatcher Runner that start participating in the leader election. Once they gain leadership, they spawn the actual underlying instances of these two "main components".

My feeling would be that we shouldn't really introduce a new component with a custom lifecycle, but rather we should try to incorporate this into existing ones.

My biggest concerns would be:

- How would the lifecycle of the new component look with regard to HA setups? If we really decide to introduce a completely new component, how should this work in case of multiple JobManager instances?
- Which components does it talk to / how?
For example, how does the broadcast of a new token to the task managers (TaskManagerGateway) look? Do we simply introduce a new RPC on the ResourceManagerGateway that broadcasts it, or does the new component need to do some kind of bookkeeping of the task managers it needs to notify?

> YARN based HDFS log aggregation would not work by dropping that code. Just to be crystal clear, the actual implementation contains this for exactly this reason.

This is the missing part, +1. If I understand this correctly, this means that we then push the token renewal logic to YARN. How do you plan to implement the renewal logic on k8s?

D.

On Fri, Jan 21, 2022 at 8:37 PM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:

> I think we might both mean something different by the RM.

You feel it well, I've not specified these terms well in the explanation. By RM I meant the resource management framework. JobManager is the Flink class.
This means that inside the JM instance there will be a DTM instance, so they would have the same lifecycle. Hope I've answered the question.

> If we have tokens available on the client side, why do we need to set them into the AM (yarn specific concept) launch context?

YARN based HDFS log aggregation would not work by dropping that code. Just to be crystal clear, the actual implementation contains this for exactly this reason.

G

On Fri, 21 Jan 2022, 20:12 David Morávek, <d...@apache.org> wrote:

Hi Gabor,

> 1. One thing is important, token management is planned to be done generically within Flink and not scattered in RM specific code. JobManager has a DelegationTokenManager which obtains tokens from time to time (if configured properly). JM knows which TaskManagers are in place so it can distribute it to all TMs. That's it basically.

I think we might both mean something different by the RM.
The JobManager is basically just a process encapsulating multiple components, one of which is the ResourceManager, which is the component that manages task manager registrations [1]. There is more or less a single implementation of the RM with pluggable drivers for the active integrations (yarn, k8s).

It would be great if you could share more details of how exactly the DTM is going to fit into the current JM architecture.

> 2. 99.9% of the code is generic but each RM handles tokens differently. A good example is YARN, which obtains tokens on the client side and then sets them on the newly created AM container launch context. This is purely YARN specific and can't be spared. With my actual plans standalone can be changed to use the framework. By using it I mean no RM specific DTM or whatsoever is needed.
If we have tokens available on the client side, why do we need to set them into the AM (yarn specific concept) launch context? Why can't we simply send them to the JM, e.g. as a parameter of the job submission / via a separate RPC call? There might be something I'm missing due to limited knowledge, but handling the token on the "cluster framework" level doesn't seem necessary.

[1]
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/concepts/flink-architecture/#jobmanager

Best,
D.

On Fri, Jan 21, 2022 at 7:48 PM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:

Oh and one more thing. I'm planning to add this feature in small chunks of PRs because security is a super hairy area. That way reviewers can grasp the concept more easily.
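The token broadcast discussed above (a new RPC on the ResourceManagerGateway fanning refreshed tokens out to registered TMs, with the initial tokens shipped at registration time) can be modeled with plain Java. The gateway interface and method names below are hypothetical, not Flink's actual RPC surface:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model of the RM-side bookkeeping and token fan-out discussed in the thread. */
public class TokenBroadcast {

    /** Stand-in for a TaskExecutorGateway method the FLIP would add (name is hypothetical). */
    interface TaskExecutorTokenGateway {
        void updateDelegationTokens(byte[] serializedTokens);
    }

    /** Registered TMs, keyed by an opaque resource id, as the RM already tracks them. */
    private final Map<String, TaskExecutorTokenGateway> registeredTaskExecutors =
            new ConcurrentHashMap<>();

    void registerTaskExecutor(String id, TaskExecutorTokenGateway gateway, byte[] currentTokens) {
        // Shipping the current tokens as part of registration avoids a racy extra
        // round-trip, mirroring the "initial token with RegistrationResponse" idea above.
        gateway.updateDelegationTokens(currentTokens);
        registeredTaskExecutors.put(id, gateway);
    }

    void broadcastNewTokens(byte[] serializedTokens) {
        // On each renewal, push to every TM the RM already knows about; no extra
        // bookkeeping component is needed beyond the existing registration map.
        registeredTaskExecutors.values().forEach(g -> g.updateDelegationTokens(serializedTokens));
    }
}
```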
On Fri, 21 Jan 2022, 18:03 David Morávek, <d...@apache.org> wrote:

Hi Gabor,

thanks for drafting the FLIP, I think having solid Kerberos support is crucial for many enterprise deployments.

I have multiple questions regarding the implementation (note that I have very limited knowledge of Kerberos):

1) If I understand it correctly, we'll only obtain tokens in the job manager and then we'll distribute them via RPC (which needs to be secured).

Can you please outline how the communication will look? Is the DelegationTokenManager going to be a part of the ResourceManager? Can you outline its lifecycle / how it's going to be integrated there?

2) Do we really need YARN / k8s specific implementations? Is it possible to obtain / renew a token in a generic way?
Maybe to rephrase that, is it possible to implement the DelegationTokenManager for standalone Flink? If we're able to solve this point, it could be possible to target all deployment scenarios with a single implementation.

Best,
D.

On Fri, Jan 14, 2022 at 3:47 AM Junfan Zhang <zuston.sha...@gmail.com> wrote:

Hi G

Thanks for explaining in detail. I've gotten your thoughts, and in any case this proposal is a great improvement.

Looking forward to your implementation and I will keep focus on it. Thanks again.

Best
JunFan.
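The deployment-agnostic DelegationTokenManager asked about above usually takes the shape of a provider SPI: one generic manager aggregating per-service token providers, with nothing YARN- or k8s-specific in the manager itself (the pattern Spark follows with its HadoopDelegationTokenProvider). A rough sketch; all names here are illustrative assumptions, not the FLIP's final API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a deployment-agnostic DelegationTokenManager built on a provider SPI. */
public class GenericDtm {

    /** One implementation per external service (HDFS, HBase, Hive, Kafka, ...). */
    interface DelegationTokenProvider {
        String serviceName();
        byte[] obtainToken(); // a real impl would authenticate to the service via Kerberos
    }

    private final List<DelegationTokenProvider> providers = new ArrayList<>();

    void register(DelegationTokenProvider provider) {
        providers.add(provider);
    }

    /** Obtain tokens from every registered provider; nothing here is RM-specific. */
    Map<String, byte[]> obtainTokens() {
        Map<String, byte[]> tokens = new HashMap<>();
        for (DelegationTokenProvider p : providers) {
            tokens.put(p.serviceName(), p.obtainToken());
        }
        return tokens;
    }
}
```

With this split, standalone, YARN, and k8s deployments could all drive the same manager and differ only in how the obtained tokens are shipped to the workers.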
On Jan 13, 2022, 9:20 PM +0800, Gabor Somogyi <gabor.g.somo...@gmail.com>, wrote:

Just to confirm: keeping "security.kerberos.fetch.delegation-token" is added to the doc.

BR,
G

On Thu, Jan 13, 2022 at 1:34 PM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:

Hi JunFan,

> By the way, maybe this should be added in the migration plan or integration section in the FLIP-211.

Going to add this soon.

> Besides, I have a question that the KDC will collapse when the cluster reached 200 nodes you described in the google doc. Do you have any attachment or reference to prove it?
"KDC *may* collapse under some circumstances" is the proper wording.

We have several customers who are executing workloads on Spark/Flink. Most of the time I'm facing their daily issues, which are heavily environment and use-case dependent. I've seen various cases:
* where the mentioned ~1k nodes were working fine
* where KDC thought the number of requests was coming from a DDOS attack, so it discontinued authentication
* where KDC was simply not responding because of the load
* where KDC intermittently had some outage (this was the most nasty thing)

Since you're managing a relatively big cluster, you know that KDC is not only used by Spark/Flink workloads
but the whole company IT infrastructure is bombing it, so it really depends on other factors too whether KDC is reaching its limit or not. Not sure what kind of evidence you are looking for, but I'm not authorized to share any information about our clients' data.

One thing is for sure: the more external system types are used in workloads (for ex. HDFS, HBase, Hive, Kafka) which are authenticating through KDC, the higher the possibility of reaching this threshold when the cluster is big enough.

All in all, this feature is here to help all users never reach this limitation.

BR,
G

On Thu, Jan 13, 2022 at 1:00 PM 张俊帆 <zuston.sha...@gmail.com> wrote:

Hi G

Thanks for your quick reply.
I think reserving the config of *security.kerberos.fetch.delegation-token* and simply disabling the token fetching is a good idea. By the way, maybe this should be added in the migration plan or integration section in FLIP-211.

Besides, I have a question that the KDC will collapse when the cluster reached 200 nodes as you described in the google doc. Do you have any attachment or reference to prove it? Because in our internal per-cluster, the nodes reach > 1000 and KDC looks good. Did I miss or misunderstand something? Please correct me.

Best
JunFan.
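For reference, the opt-out JunFan mentions is a single setting in flink-conf.yaml (key taken verbatim from this thread; treat this as a hedged example, as the default and exact behavior may differ by version):

```yaml
# flink-conf.yaml — keep the existing switch to disable delegation token
# fetching entirely, e.g. when tokens are managed outside of Flink
security.kerberos.fetch.delegation-token: false
```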
On Jan 13, 2022, 5:26 PM +0800, dev@flink.apache.org, wrote:

https://docs.google.com/document/d/1JzMbQ1pCJsLVz8yHrCxroYMRP2GwGwvacLrGyaIx5Yc/edit