Hi Till,

We have considered your proxy suggestion in depth and updated the FLIP accordingly. We evaluated two proxy implementations (Nginx and Squid), but according to our analysis and testing neither is suitable for the mentioned use cases. Please take a look at the rejected alternatives section of the FLIP for a detailed explanation.
Thanks for your time in advance!

BR,
G

On Fri, Jun 4, 2021 at 3:31 PM Till Rohrmann <trohrm...@apache.org> wrote:

> As I've said I am not a security expert and that's why I have to ask for
> clarification, Gabor. You are saying that if we configure a truststore for
> the REST endpoint with a single trusted certificate which has been
> generated by the operator of the Flink cluster, then the attacker can
> generate a new certificate, sign it and then talk to the Flink cluster if
> he has access to the node on which the REST endpoint runs? My understanding
> was that you need the corresponding private key which in my proposed setup
> would be under the control of the operator as well (e.g. stored in a
> keystore on the same machine but guarded by some secret). That way (if I am
> not mistaken), only the entity which has access to the keystore is able to
> talk to the Flink cluster.
>
> Maybe we are also getting our wires crossed here and are talking about
> different things.
>
> Thanks for listing the pros and cons of Kerberos. Concerning what other
> authentication mechanisms are used in the industry, I am not 100% sure.
>
> Cheers,
> Till
>
> On Fri, Jun 4, 2021 at 11:09 AM Gabor Somogyi <gabor.g.somo...@gmail.com>
> wrote:
>
>> > I did not mean for the user to sign its own certificates but for the
>> operator of the cluster. Once the user request hits the proxy, it should no
>> longer be under his control. I think I do not fully understand yet why this
>> would not work.
>> I said it's not solving the authentication problem over any proxy. Even
>> if the operator is signing the certificate one can have access to an
>> internal node.
>> Such case anybody can craft certificates which is accepted by the server.
>> When it's accepted a bad guy can cancel jobs causing huge impacts.
>>
>> > Also, I am missing a bit the comparison of Kerberos to other
>> authentication mechanisms and why they were rejected in favour of Kerberos.
>> PROS:
>> * Since it's not depending on cloud provider and/or k8s or bare-metal
>> etc. deployment it's the biggest plus
>> * Centralized with tools and no need to write tons of tools around
>> * There are clients/tools on almost all OS-es and several languages
>> * Super huge users are using it for years in production w/o huge issues
>> * Provides cross-realm trust possibility amongst other features
>> * Several open source components using it which could increase
>> compatibility
>>
>> CONS:
>> * Not everybody using kerberos
>> * It would increase the code footprint but this is true for many features
>> (as a side note I'm here to maintain it)
>>
>> Feel free to add your points because it only represents a single
>> viewpoint.
>> Also if you have any better option for strong authentication please share
>> it and we can consider the pros/cons here.
>>
>> BR,
>> G
>>
>>
>> On Fri, Jun 4, 2021 at 10:32 AM Till Rohrmann <trohrm...@apache.org>
>> wrote:
>>
>>> I did not mean for the user to sign its own certificates but for the
>>> operator of the cluster. Once the user request hits the proxy, it should no
>>> longer be under his control. I think I do not fully understand yet why this
>>> would not work.
>>>
>>> What I would like to avoid is to add more complexity into Flink if there
>>> is an easy solution which fulfills the requirements. That's why I would
>>> like to exercise thoroughly through the different alternatives. Also, I am
>>> missing a bit the comparison of Kerberos to other authentication mechanisms
>>> and why they were rejected in favour of Kerberos.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Fri, Jun 4, 2021 at 10:26 AM Gyula Fóra <gyf...@apache.org> wrote:
>>>
>>>> Hi!
>>>>
>>>> I think there might be possible alternatives but it seems Kerberos on
>>>> the rest endpoint ticks all the right boxes and provides a super clean and
>>>> simple solution for strong authentication.
>>>> >>>> I wouldn’t even consider sidecar proxies etc if we can solve it in such >>>> a simple way as proposed by G. >>>> >>>> Cheers >>>> Gyula >>>> >>>> On Fri, 4 Jun 2021 at 10:03, Till Rohrmann <trohrm...@apache.org> >>>> wrote: >>>> >>>>> I am not saying that we shouldn't add a strong authentication >>>>> mechanism if there are good reasons for it. I primarily would like to >>>>> understand the context a bit better in order to give qualified feedback >>>>> and >>>>> come to a good decision. In order to do this, I have the feeling that we >>>>> haven't fully considered all available options which are on the table, >>>>> tbh. >>>>> >>>>> Does the problem of certificate expiry also apply for self-signed >>>>> certificates? If yes, then this should then also be a problem for the >>>>> internal encryption of Flink's communication. If not, then one could use >>>>> self-signed certificates with a longer validity to solve the mentioned >>>>> issue. >>>>> >>>>> I think you can set up Flink in such a way that you don't have to >>>>> handle all the different certificates. For example, you could deploy Flink >>>>> with a "sidecar proxy" which is responsible for the authentication using >>>>> an >>>>> arbitrary method (e.g. Kerberos) and then bind the REST endpoint to a >>>>> local >>>>> network interface. That way, the REST endpoint would only be available >>>>> through the sidecar proxy. Additionally, one could enable SSL for this >>>>> communication. Would this be a solution for the problem? >>>>> >>>>> Cheers, >>>>> Till >>>>> >>>>> On Thu, Jun 3, 2021 at 10:46 PM Márton Balassi < >>>>> balassi.mar...@gmail.com> wrote: >>>>> >>>>>> That is an interesting idea, Till. >>>>>> >>>>>> The main issue with it is that TLS certificates have an expiration >>>>>> time, usually they get approved for a couple years. 
Forcing our users to >>>>>> restart jobs to reprovision TLS certificates would be weird when we could >>>>>> just implement a single proper strong authentication mechanism instead >>>>>> in a >>>>>> couple hundred lines of code. :-) >>>>>> >>>>>> In many cases it is also impractical to go the TLS mutual route, >>>>>> because the Flink Dashboard can end up on any node in the k8s/Yarn >>>>>> cluster >>>>>> which means that we need a certificate per node (due to the mutual auth), >>>>>> but if we also want to protect the private key of these from users >>>>>> accidentally or intentionally leaking them then we need this per user. As >>>>>> in we end up managing user*machine number certificates and having to >>>>>> renew >>>>>> them periodically, which albeit automatable is unfortunately not yet >>>>>> automated in all large organizations. >>>>>> >>>>>> I fully agree that TLS certificate mutual authentication has its nice >>>>>> properties, especially at very large (multiple thousand node) clusters - >>>>>> but it has its own challenges too. Thanks for bringing it up. >>>>>> >>>>>> Happy to have this added to the rejected alternative list so that we >>>>>> have the full picture documented. >>>>>> >>>>>> On Thu, Jun 3, 2021 at 5:52 PM Till Rohrmann <trohrm...@apache.org> >>>>>> wrote: >>>>>> >>>>>>> I guess the idea would then be to let the proxy do the >>>>>>> authentication job and only forward the request via an SSL mutually >>>>>>> encrypted connection to the Flink cluster. Would this be possible? The >>>>>>> beauty of this setup is in my opinion that this setup should work with >>>>>>> all >>>>>>> kinds of authentication mechanisms. >>>>>>> >>>>>>> Cheers, >>>>>>> Till >>>>>>> >>>>>>> On Thu, Jun 3, 2021 at 3:12 PM Gabor Somogyi < >>>>>>> gabor.g.somo...@gmail.com> wrote: >>>>>>> >>>>>>>> Thanks for giving options to fulfil the need. 
>>>>>>>> >>>>>>>> Users are looking for a solution where users can be identified on >>>>>>>> the whole cluster and restrict access to resources/actions. >>>>>>>> A good example for such an action is cancelling other users running >>>>>>>> jobs. >>>>>>>> >>>>>>>> * SSL does provide mutual authentication but when authentication >>>>>>>> passed there is no user based on restrictions can be made. >>>>>>>> * The less problematic part is that generating/maintaining short >>>>>>>> time valid certificates would be a hard (that's the reason KDC like >>>>>>>> servers >>>>>>>> exist). >>>>>>>> Having long time valid certificates would widen the attack surface >>>>>>>> but since the first concern is there this is just a cosmetic issue. >>>>>>>> >>>>>>>> All in all using TLS certificates is not sufficient in these >>>>>>>> environments unfortunately. >>>>>>>> >>>>>>>> BR, >>>>>>>> G >>>>>>>> >>>>>>>> >>>>>>>> On Thu, Jun 3, 2021 at 12:49 PM Till Rohrmann <trohrm...@apache.org> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Thanks for the information Gabor. If it is about securing the >>>>>>>>> communication between the REST client and the REST server, then Flink >>>>>>>>> already supports enabling mutual SSL authentication [1]. Would this be >>>>>>>>> enough to secure the communication and to pass an audit? >>>>>>>>> >>>>>>>>> [1] >>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/security/security-ssl/#external--rest-connectivity >>>>>>>>> >>>>>>>>> Cheers, >>>>>>>>> Till >>>>>>>>> >>>>>>>>> On Thu, Jun 3, 2021 at 10:33 AM Gabor Somogyi < >>>>>>>>> gabor.g.somo...@gmail.com> wrote: >>>>>>>>> >>>>>>>>>> Hi Till, >>>>>>>>>> >>>>>>>>>> Since I'm working in security area 10+ years let me share my >>>>>>>>>> thought. >>>>>>>>>> I would like to emphasise there are experts better than me but I >>>>>>>>>> have some >>>>>>>>>> basics. >>>>>>>>>> The discussion is open and not trying to tell alone things... 
>>>>>>>>>> >>>>>>>>>> > I mean if an attacker can get access to one of the machines, >>>>>>>>>> then it >>>>>>>>>> should also be possible to obtain the right Kerberos token. >>>>>>>>>> Not necessarily. For example if one gets access to a specific >>>>>>>>>> user's >>>>>>>>>> credentials then it's not possible to compromise other user's >>>>>>>>>> jobs, data, >>>>>>>>>> etc... >>>>>>>>>> Security is like an onion, the more layers has been added the >>>>>>>>>> more time an >>>>>>>>>> attacker needs to proceed. >>>>>>>>>> At the end of the day if one is in, then most probably can find >>>>>>>>>> the way but >>>>>>>>>> this time is normally enough to sysadmins or security experts to >>>>>>>>>> close down the system and minimize the damage. >>>>>>>>>> >>>>>>>>>> The other thing is that all tokens has a timeout and if the token >>>>>>>>>> is >>>>>>>>>> invalid then the attacker can't proceed further. >>>>>>>>>> >>>>>>>>>> > Is Kerberos also the standard authentication protocol for >>>>>>>>>> Kubernetes >>>>>>>>>> deployments? >>>>>>>>>> Kerberos is an industry standard which is cloud/deployment >>>>>>>>>> agnostic and it >>>>>>>>>> can be used in any deployments including k8s. >>>>>>>>>> The main intention is to use kerberos in k8s deployments too >>>>>>>>>> since we're >>>>>>>>>> going this direction as well. >>>>>>>>>> Please see how Spark does this: >>>>>>>>>> >>>>>>>>>> https://spark.apache.org/docs/latest/security.html#secure-interaction-with-kubernetes >>>>>>>>>> >>>>>>>>>> Last but not least the most important reason to add at least one >>>>>>>>>> strong >>>>>>>>>> authentication is that we have users who has >>>>>>>>>> hard requirements on this. They're doing security audits and if >>>>>>>>>> they fail >>>>>>>>>> then it's deal breaking. >>>>>>>>>> That is why we have added kerberos at the first place. 
>>>>>>>>>> Unfortunately we >>>>>>>>>> can't name them in this public list, however >>>>>>>>>> the customers who specifically asked for this were mainly in the >>>>>>>>>> banking >>>>>>>>>> and telco sector. >>>>>>>>>> >>>>>>>>>> BR, >>>>>>>>>> G >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Thu, Jun 3, 2021 at 9:20 AM Till Rohrmann < >>>>>>>>>> trohrm...@apache.org> wrote: >>>>>>>>>> >>>>>>>>>> > Thanks for updating the document Márton. Why is it that banks >>>>>>>>>> will >>>>>>>>>> > consider it more secure if Flink comes with Kerberos >>>>>>>>>> authentication >>>>>>>>>> > (assuming a properly secured setup)? I mean if an attacker can >>>>>>>>>> get access >>>>>>>>>> > to one of the machines, then it should also be possible to >>>>>>>>>> obtain the right >>>>>>>>>> > Kerberos token. >>>>>>>>>> > >>>>>>>>>> > I am not an authentication expert and that's why I wanted to >>>>>>>>>> ask what are >>>>>>>>>> > other authentication protocols other than Kerberos? Why did we >>>>>>>>>> select >>>>>>>>>> > Kerberos and not any other authentication protocol? Maybe you >>>>>>>>>> can list the >>>>>>>>>> > pros and cons for the different protocols. Is Kerberos also the >>>>>>>>>> standard >>>>>>>>>> > authentication protocol for Kubernetes deployments? If not, >>>>>>>>>> what would be >>>>>>>>>> > the answer when deploying on K8s? >>>>>>>>>> > >>>>>>>>>> > Cheers, >>>>>>>>>> > Till >>>>>>>>>> > >>>>>>>>>> > On Wed, Jun 2, 2021 at 12:07 PM Gabor Somogyi < >>>>>>>>>> gabor.g.somo...@gmail.com> >>>>>>>>>> > wrote: >>>>>>>>>> > >>>>>>>>>> >> Hi team, >>>>>>>>>> >> >>>>>>>>>> >> Happy to be here and hope I can provide quality additions in >>>>>>>>>> the future. >>>>>>>>>> >> >>>>>>>>>> >> Thank you all for helpful the suggestions! >>>>>>>>>> >> Considering them the FLIP has been modified and the work >>>>>>>>>> continues on the >>>>>>>>>> >> already existing Jira. 
>>>>>>>>>> >> >>>>>>>>>> >> BR, >>>>>>>>>> >> G >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> >> On Wed, Jun 2, 2021 at 11:23 AM Márton Balassi < >>>>>>>>>> balassi.mar...@gmail.com> >>>>>>>>>> >> wrote: >>>>>>>>>> >> >>>>>>>>>> >>> Thanks, Chesney - I totally missed that. Answered on the >>>>>>>>>> ticket too, let >>>>>>>>>> >>> us continue there then. >>>>>>>>>> >>> >>>>>>>>>> >>> Till, I agree that we should keep this codepath as slim as >>>>>>>>>> possible. It >>>>>>>>>> >>> is an important design decision that we aim to keep the list >>>>>>>>>> of >>>>>>>>>> >>> authentication protocols to a minimum. We believe that this >>>>>>>>>> should not be a >>>>>>>>>> >>> primary concern of Flink and a trusted proxy service (for >>>>>>>>>> example Apache >>>>>>>>>> >>> Knox) should be used to enable a multitude of enduser >>>>>>>>>> authentication >>>>>>>>>> >>> mechanisms. The bare minimum of authentication mechanisms to >>>>>>>>>> support >>>>>>>>>> >>> consequently consist of a single strong authentication >>>>>>>>>> protocol for which >>>>>>>>>> >>> Kerberos is the enterprise solution and HTTP Basic primary >>>>>>>>>> for development >>>>>>>>>> >>> and light-weight scenarios. >>>>>>>>>> >>> >>>>>>>>>> >>> Added the above wording to G's doc. >>>>>>>>>> >>> >>>>>>>>>> >>> >>>>>>>>>> https://docs.google.com/document/d/1NMPeJ9H0G49TGy3AzTVVJVKmYC0okwOtqLTSPnGqzHw/edit >>>>>>>>>> >>> >>>>>>>>>> >>> >>>>>>>>>> >>> >>>>>>>>>> >>> On Tue, Jun 1, 2021 at 11:47 AM Chesnay Schepler < >>>>>>>>>> ches...@apache.org> >>>>>>>>>> >>> wrote: >>>>>>>>>> >>> >>>>>>>>>> >>>> There's a related effort: >>>>>>>>>> >>>> https://issues.apache.org/jira/browse/FLINK-21108 >>>>>>>>>> >>>> >>>>>>>>>> >>>> On 6/1/2021 10:14 AM, Till Rohrmann wrote: >>>>>>>>>> >>>> > Hi Gabor, welcome to the Flink community! >>>>>>>>>> >>>> > >>>>>>>>>> >>>> > Thanks for sharing this proposal with the community >>>>>>>>>> Márton. 
In >>>>>>>>>> >>>> general, I >>>>>>>>>> >>>> > agree that authentication is missing and that this is >>>>>>>>>> required for >>>>>>>>>> >>>> using >>>>>>>>>> >>>> > Flink within an enterprise. The thing I am wondering is >>>>>>>>>> whether this >>>>>>>>>> >>>> > feature strictly needs to be implemented inside of Flink >>>>>>>>>> or whether a >>>>>>>>>> >>>> proxy >>>>>>>>>> >>>> > setup could do the job? Have you considered this option? >>>>>>>>>> If yes, then >>>>>>>>>> >>>> it >>>>>>>>>> >>>> > would be good to list it under the point of rejected >>>>>>>>>> alternatives. >>>>>>>>>> >>>> > >>>>>>>>>> >>>> > I do see the benefit of implementing this feature inside >>>>>>>>>> of Flink if >>>>>>>>>> >>>> many >>>>>>>>>> >>>> > users need it. If not, then it might be easier for the >>>>>>>>>> project to not >>>>>>>>>> >>>> > increase the surface area since it makes the overall >>>>>>>>>> maintenance >>>>>>>>>> >>>> harder. >>>>>>>>>> >>>> > >>>>>>>>>> >>>> > Cheers, >>>>>>>>>> >>>> > Till >>>>>>>>>> >>>> > >>>>>>>>>> >>>> > On Mon, May 31, 2021 at 4:57 PM Márton Balassi < >>>>>>>>>> mbala...@apache.org> >>>>>>>>>> >>>> wrote: >>>>>>>>>> >>>> > >>>>>>>>>> >>>> >> Hi team, >>>>>>>>>> >>>> >> >>>>>>>>>> >>>> >> Firstly I would like to introduce Gabor or G [1] for >>>>>>>>>> short to the >>>>>>>>>> >>>> >> community, he is a Spark committer who has recently >>>>>>>>>> transitioned to >>>>>>>>>> >>>> the >>>>>>>>>> >>>> >> Flink Engineering team at Cloudera and is looking forward >>>>>>>>>> to >>>>>>>>>> >>>> contributing >>>>>>>>>> >>>> >> to Apache Flink. Previously G primarily focused on Spark >>>>>>>>>> Streaming >>>>>>>>>> >>>> and >>>>>>>>>> >>>> >> security. >>>>>>>>>> >>>> >> >>>>>>>>>> >>>> >> Based on requests from our customers G has implemented >>>>>>>>>> Kerberos and >>>>>>>>>> >>>> HTTP >>>>>>>>>> >>>> >> Basic Authentication for the Flink Dashboard and >>>>>>>>>> HistoryServer. >>>>>>>>>> >>>> Previously >>>>>>>>>> >>>> >> lacked an authentication story. 
>>>>>>>>>> >>>> >> >>>>>>>>>> >>>> >> We are looking to contribute this functionality back to >>>>>>>>>> the >>>>>>>>>> >>>> community, we >>>>>>>>>> >>>> >> believe that given Flink's maturity there should be a >>>>>>>>>> common code >>>>>>>>>> >>>> solution >>>>>>>>>> >>>> >> for this general pattern. >>>>>>>>>> >>>> >> >>>>>>>>>> >>>> >> We are looking forward to your feedback on G's design. [2] >>>>>>>>>> >>>> >> >>>>>>>>>> >>>> >> [1] http://gaborsomogyi.com/ >>>>>>>>>> >>>> >> [2] >>>>>>>>>> >>>> >> >>>>>>>>>> >>>> >> >>>>>>>>>> >>>> >>>>>>>>>> https://docs.google.com/document/d/1NMPeJ9H0G49TGy3AzTVVJVKmYC0okwOtqLTSPnGqzHw/edit >>>>>>>>>> >>>> >> >>>>>>>>>> >>>> >>>>>>>>>> >>>> >>>>>>>>>> >>>>>>>>>