1. One important point: token management is planned to be done generically
within Flink and not scattered across RM-specific code. The JobManager has
a DelegationTokenManager which obtains tokens from time to time (if
configured properly). The JM knows which TaskManagers are in place, so it
can distribute the tokens to all TMs. That's basically it. A rough sketch
of the idea follows this point.
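
To make this concrete, here is a minimal sketch of the concept in Java.
All class, method and parameter names below are illustrative assumptions,
not the actual FLIP-211 API:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Illustrative sketch: a manager living in the JobManager that re-obtains
// delegation tokens on a schedule and hands the serialized credentials to
// a callback which pushes them to all registered TaskManagers over RPC.
final class DelegationTokenManagerSketch {

    // Hypothetical provider abstraction; real providers would wrap
    // HDFS, HBase, Kafka, etc.
    interface TokenProvider {
        byte[] obtainTokens() throws Exception;
    }

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final TokenProvider provider;
    private final Consumer<byte[]> distributeToTaskManagers;

    DelegationTokenManagerSketch(TokenProvider provider,
                                 Consumer<byte[]> distributeToTaskManagers) {
        this.provider = provider;
        this.distributeToTaskManagers = distributeToTaskManagers;
    }

    // Re-obtain tokens periodically; the interval would come from
    // configuration and/or the obtained tokens' expiry.
    void start(long renewalIntervalSeconds) {
        scheduler.scheduleAtFixedRate(() -> {
            try {
                distributeToTaskManagers.accept(provider.obtainTokens());
            } catch (Exception e) {
                // a real implementation would log and retry with backoff
            }
        }, 0, renewalIntervalSeconds, TimeUnit.SECONDS);
    }

    void stop() {
        scheduler.shutdownNow();
    }
}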

2. 99.9% of the code is generic, but each RM handles tokens differently. A
good example is YARN, which obtains tokens on the client side and then sets
them on the newly created AM container launch context. This is purely YARN
specific and can't be avoided (see the sketch after this point). With my
current plans, standalone can be changed to use the framework. By using it
I mean that no RM-specific DTM or anything similar is needed.
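
For reference, the YARN-specific part boils down to roughly the following
standard Hadoop/YARN client calls; the Flink wiring around it is omitted,
and the wrapping class/method names are made up for illustration:

import java.nio.ByteBuffer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.util.Records;

final class YarnClientSideTokensSketch {

    // Obtain HDFS delegation tokens on the client side and attach them to
    // the AM container launch context, which is the YARN-specific step.
    static ContainerLaunchContext amContextWithTokens(Configuration conf,
                                                      String renewer) throws Exception {
        Credentials credentials = new Credentials();
        FileSystem.get(conf).addDelegationTokens(renewer, credentials);

        // Serialize the credentials into the form YARN expects.
        DataOutputBuffer dob = new DataOutputBuffer();
        credentials.writeTokenStorageToStream(dob);
        ByteBuffer tokens = ByteBuffer.wrap(dob.getData(), 0, dob.getLength());

        ContainerLaunchContext amContainer =
                Records.newRecord(ContainerLaunchContext.class);
        amContainer.setTokens(tokens);
        return amContainer;
    }
}

The key point is that only this last step differs per RM; everything
before it can live in the generic framework.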

There is a readme linked in the doc describing how it's solved within
Spark; the main concept is the same.

BR,
G

On Fri, 21 Jan 2022, 18:03 David Morávek, <d...@apache.org> wrote:

> Hi Gabor,
>
> thanks for drafting the FLIP, I think having a solid Kerberos support is
> crucial for many enterprise deployments.
>
> I have multiple questions regarding the implementation (note that I have
> very limited knowledge of Kerberos):
>
> 1) If I understand it correctly, we'll only obtain tokens in the job
> manager and then distribute them via RPC (which needs to be secured).
>
> Can you please outline what the communication will look like? Is the
> DelegationTokenManager going to be part of the ResourceManager? Can you
> outline its lifecycle / how it's going to be integrated there?
>
> 2) Do we really need YARN / k8s specific implementations? Is it possible
> to obtain / renew a token in a generic way? Maybe to rephrase that: is it
> possible to implement a DelegationTokenManager for standalone Flink? If
> we're able to solve this point, it could be possible to target all
> deployment scenarios with a single implementation.
>
> Best,
> D.
>
> On Fri, Jan 14, 2022 at 3:47 AM Junfan Zhang <zuston.sha...@gmail.com>
> wrote:
>
> > Hi G
> >
> > Thanks for your detailed explanation. I have understood your thoughts,
> > and in any case this proposal is a great improvement.
> >
> > Looking forward to your implementation; I will keep an eye on it.
> > Thanks again.
> >
> > Best
> > JunFan.
> > On Jan 13, 2022, 9:20 PM +0800, Gabor Somogyi
> > <gabor.g.somo...@gmail.com>, wrote:
> > > Just to confirm: keeping "security.kerberos.fetch.delegation-token"
> > > has been added to the doc.
> > >
> > > BR,
> > > G
> > >
> > >
> > > On Thu, Jan 13, 2022 at 1:34 PM Gabor Somogyi <gabor.g.somo...@gmail.com>
> > > wrote:
> > >
> > > > Hi JunFan,
> > > >
> > > > > By the way, maybe this should be added in the migration plan or
> > > > > integration section in FLIP-211.
> > > >
> > > > Going to add this soon.
> > > >
> > > > > Besides, I have a question about the claim that the KDC will
> > > > > collapse when the cluster reaches 200 nodes, which you described
> > > > > in the google doc. Do you have any attachment or reference to
> > > > > prove it?
> > > >
> > > > "KDC *may* collapse under some circumstances" is the proper wording.
> > > >
> > > > We have several customers who are executing workloads on
> > > > Spark/Flink. Most of the time I'm facing their daily issues, which
> > > > are heavily environment and use-case dependent. I've seen various
> > > > cases:
> > > > * where the mentioned ~1k nodes were working fine
> > > > * where KDC thought the requests were coming from a DDOS attack and
> > > > so discontinued authentication
> > > > * where KDC was simply not responding because of the load
> > > > * where KDC intermittently had outages (this was the nastiest thing)
> > > >
> > > > Since you're managing a relatively big cluster, you know that KDC
> > > > is not only used by Spark/Flink workloads but bombarded by the
> > > > whole company IT infrastructure, so whether KDC reaches its limit
> > > > depends on other factors too. I'm not sure what kind of evidence
> > > > you are looking for, but I'm not authorized to share any
> > > > information about our clients' data.
> > > >
> > > > One thing is for sure: the more external system types
> > > > authenticating through KDC are used in workloads (for example HDFS,
> > > > HBase, Hive, Kafka), the more likely it is to reach this threshold
> > > > when the cluster is big enough.
> > > >
> > > > All in all, this feature is here to help all users never reach this
> > > > limitation.
> > > >
> > > > BR,
> > > > G
> > > >
> > > >
> > > > On Thu, Jan 13, 2022 at 1:00 PM 张俊帆 <zuston.sha...@gmail.com> wrote:
> > > >
> > > > > Hi G
> > > > >
> > > > > Thanks for your quick reply. I think keeping the config
> > > > > *security.kerberos.fetch.delegation-token*
> > > > > and simply disabling the token fetching is a good idea. By the
> > > > > way, maybe this should be added in the migration plan or
> > > > > integration section in FLIP-211.
> > > > >
> > > > > Besides, I have a question about the claim that the KDC will
> > > > > collapse when the cluster reaches 200 nodes, which you described
> > > > > in the google doc. Do you have any attachment or reference to
> > > > > prove it? Because in our internal cluster the node count exceeds
> > > > > 1000 and KDC looks good. Did I miss or misunderstand something?
> > > > > Please correct me.
> > > > >
> > > > > Best
> > > > > JunFan.
> > > > > On Jan 13, 2022, 5:26 PM +0800, dev@flink.apache.org, wrote:
> > > > > >
> > > > > > https://docs.google.com/document/d/1JzMbQ1pCJsLVz8yHrCxroYMRP2GwGwvacLrGyaIx5Yc/edit?fbclid=IwAR0vfeJvAbEUSzHQAAJfnWTaX46L6o7LyXhMfBUCcPrNi-uXNgoOaI8PMDQ
