Hi Till,

Did you have a chance to take a look at the doc? I have not seen any update yet.

BR,
G

On Wed, Jun 9, 2021 at 1:43 PM Till Rohrmann <trohrm...@apache.org> wrote:

> Thanks for the update Gabor. I'll take a look and respond in the document.
>
> Cheers,
> Till
>
> On Wed, Jun 9, 2021 at 12:59 PM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
>
>> Hi Till,
>>
>> Your proxy suggestion has been considered in depth and the FLIP has been updated accordingly.
>> We've considered two proxy implementations (Nginx and Squid), but according to our analysis and testing they are not suitable for the mentioned use-cases.
>> Please take a look at the rejected alternatives for a detailed explanation.
>>
>> Thanks for your time in advance!
>>
>> BR,
>> G
>>
>> On Fri, Jun 4, 2021 at 3:31 PM Till Rohrmann <trohrm...@apache.org> wrote:
>>
>>> As I've said I am not a security expert and that's why I have to ask for clarification, Gabor. You are saying that if we configure a truststore for the REST endpoint with a single trusted certificate which has been generated by the operator of the Flink cluster, then the attacker can generate a new certificate, sign it and then talk to the Flink cluster if he has access to the node on which the REST endpoint runs? My understanding was that you need the corresponding private key which in my proposed setup would be under the control of the operator as well (e.g. stored in a keystore on the same machine but guarded by some secret). That way (if I am not mistaken), only the entity which has access to the keystore is able to talk to the Flink cluster.
>>>
>>> Maybe we are also getting our wires crossed here and are talking about different things.
>>>
>>> Thanks for listing the pros and cons of Kerberos. Concerning what other authentication mechanisms are used in the industry, I am not 100% sure.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Fri, Jun 4, 2021 at 11:09 AM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
>>>
>>>> > I did not mean for the user to sign its own certificates but for the operator of the cluster. Once the user request hits the proxy, it should no longer be under his control. I think I do not fully understand yet why this would not work.
>>>> I said it's not solving the authentication problem over any proxy. Even if the operator is signing the certificate, one can have access to an internal node.
>>>> In such a case anybody can craft certificates which are accepted by the server. Once they are accepted, a bad guy can cancel jobs, causing huge impacts.
>>>>
>>>> > Also, I am missing a bit the comparison of Kerberos to other authentication mechanisms and why they were rejected in favour of Kerberos.
>>>> PROS:
>>>> * Since it's not depending on cloud provider and/or k8s or bare-metal etc. deployment it's the biggest plus
>>>> * Centralized with tools and no need to write tons of tools around
>>>> * There are clients/tools on almost all OS-es and several languages
>>>> * Super huge users are using it for years in production w/o huge issues
>>>> * Provides cross-realm trust possibility amongst other features
>>>> * Several open source components are using it, which could increase compatibility
>>>>
>>>> CONS:
>>>> * Not everybody is using Kerberos
>>>> * It would increase the code footprint but this is true for many features (as a side note I'm here to maintain it)
>>>>
>>>> Feel free to add your points because it only represents a single viewpoint.
>>>> Also if you have any better option for strong authentication please share it and we can consider the pros/cons here.
>>>>
>>>> BR,
>>>> G
>>>>
>>>> On Fri, Jun 4, 2021 at 10:32 AM Till Rohrmann <trohrm...@apache.org> wrote:
>>>>
>>>>> I did not mean for the user to sign its own certificates but for the operator of the cluster.
>>>>> Once the user request hits the proxy, it should no longer be under his control. I think I do not fully understand yet why this would not work.
>>>>>
>>>>> What I would like to avoid is to add more complexity into Flink if there is an easy solution which fulfills the requirements. That's why I would like to exercise thoroughly through the different alternatives. Also, I am missing a bit the comparison of Kerberos to other authentication mechanisms and why they were rejected in favour of Kerberos.
>>>>>
>>>>> Cheers,
>>>>> Till
>>>>>
>>>>> On Fri, Jun 4, 2021 at 10:26 AM Gyula Fóra <gyf...@apache.org> wrote:
>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> I think there might be possible alternatives but it seems Kerberos on the rest endpoint ticks all the right boxes and provides a super clean and simple solution for strong authentication.
>>>>>>
>>>>>> I wouldn’t even consider sidecar proxies etc if we can solve it in such a simple way as proposed by G.
>>>>>>
>>>>>> Cheers
>>>>>> Gyula
>>>>>>
>>>>>> On Fri, 4 Jun 2021 at 10:03, Till Rohrmann <trohrm...@apache.org> wrote:
>>>>>>
>>>>>>> I am not saying that we shouldn't add a strong authentication mechanism if there are good reasons for it. I primarily would like to understand the context a bit better in order to give qualified feedback and come to a good decision. In order to do this, I have the feeling that we haven't fully considered all available options which are on the table, tbh.
>>>>>>>
>>>>>>> Does the problem of certificate expiry also apply for self-signed certificates? If yes, then this should then also be a problem for the internal encryption of Flink's communication. If not, then one could use self-signed certificates with a longer validity to solve the mentioned issue.
>>>>>>>
>>>>>>> I think you can set up Flink in such a way that you don't have to handle all the different certificates. For example, you could deploy Flink with a "sidecar proxy" which is responsible for the authentication using an arbitrary method (e.g. Kerberos) and then bind the REST endpoint to a local network interface. That way, the REST endpoint would only be available through the sidecar proxy. Additionally, one could enable SSL for this communication. Would this be a solution for the problem?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Till
>>>>>>>
>>>>>>> On Thu, Jun 3, 2021 at 10:46 PM Márton Balassi <balassi.mar...@gmail.com> wrote:
>>>>>>>
>>>>>>>> That is an interesting idea, Till.
>>>>>>>>
>>>>>>>> The main issue with it is that TLS certificates have an expiration time, usually they get approved for a couple years. Forcing our users to restart jobs to reprovision TLS certificates would be weird when we could just implement a single proper strong authentication mechanism instead in a couple hundred lines of code. :-)
>>>>>>>>
>>>>>>>> In many cases it is also impractical to go the TLS mutual route, because the Flink Dashboard can end up on any node in the k8s/Yarn cluster which means that we need a certificate per node (due to the mutual auth), but if we also want to protect the private key of these from users accidentally or intentionally leaking them then we need this per user. As in we end up managing user*machine number certificates and having to renew them periodically, which albeit automatable is unfortunately not yet automated in all large organizations.
>>>>>>>>
>>>>>>>> I fully agree that TLS certificate mutual authentication has its nice properties, especially at very large (multiple thousand node) clusters - but it has its own challenges too. Thanks for bringing it up.
>>>>>>>>
>>>>>>>> Happy to have this added to the rejected alternative list so that we have the full picture documented.
>>>>>>>>
>>>>>>>> On Thu, Jun 3, 2021 at 5:52 PM Till Rohrmann <trohrm...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> I guess the idea would then be to let the proxy do the authentication job and only forward the request via an SSL mutually encrypted connection to the Flink cluster. Would this be possible? The beauty of this setup is in my opinion that this setup should work with all kinds of authentication mechanisms.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Till
>>>>>>>>>
>>>>>>>>> On Thu, Jun 3, 2021 at 3:12 PM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for giving options to fulfil the need.
>>>>>>>>>>
>>>>>>>>>> Users are looking for a solution where users can be identified on the whole cluster and access to resources/actions can be restricted.
>>>>>>>>>> A good example for such an action is cancelling other users' running jobs.
>>>>>>>>>>
>>>>>>>>>> * SSL does provide mutual authentication, but once authentication has passed no user-based restrictions can be made.
>>>>>>>>>> * The less problematic part is that generating/maintaining short time valid certificates would be hard (that's the reason KDC like servers exist).
>>>>>>>>>> Having long time valid certificates would widen the attack surface but since the first concern is there this is just a cosmetic issue.
>>>>>>>>>>
>>>>>>>>>> All in all using TLS certificates is not sufficient in these environments unfortunately.
>>>>>>>>>>
>>>>>>>>>> BR,
>>>>>>>>>> G
>>>>>>>>>>
>>>>>>>>>> On Thu, Jun 3, 2021 at 12:49 PM Till Rohrmann <trohrm...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for the information Gabor. If it is about securing the communication between the REST client and the REST server, then Flink already supports enabling mutual SSL authentication [1]. Would this be enough to secure the communication and to pass an audit?
>>>>>>>>>>>
>>>>>>>>>>> [1] https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/security/security-ssl/#external--rest-connectivity
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Till
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jun 3, 2021 at 10:33 AM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Till,
>>>>>>>>>>>>
>>>>>>>>>>>> Since I've been working in the security area for 10+ years let me share my thoughts.
>>>>>>>>>>>> I would like to emphasise there are experts better than me but I have some basics.
>>>>>>>>>>>> The discussion is open and not trying to tell alone things...
>>>>>>>>>>>>
>>>>>>>>>>>> > I mean if an attacker can get access to one of the machines, then it should also be possible to obtain the right Kerberos token.
>>>>>>>>>>>> Not necessarily. For example if one gets access to a specific user's credentials then it's not possible to compromise other users' jobs, data, etc...
>>>>>>>>>>>> Security is like an onion, the more layers have been added the more time an attacker needs to proceed.
>>>>>>>>>>>> At the end of the day if one is in, then most probably can find the way but this time is normally enough for sysadmins or security experts to close down the system and minimize the damage.
>>>>>>>>>>>>
>>>>>>>>>>>> The other thing is that all tokens have a timeout and if the token is invalid then the attacker can't proceed further.
>>>>>>>>>>>>
>>>>>>>>>>>> > Is Kerberos also the standard authentication protocol for Kubernetes deployments?
>>>>>>>>>>>> Kerberos is an industry standard which is cloud/deployment agnostic and it can be used in any deployments including k8s.
>>>>>>>>>>>> The main intention is to use Kerberos in k8s deployments too since we're going this direction as well.
>>>>>>>>>>>> Please see how Spark does this:
>>>>>>>>>>>>
>>>>>>>>>>>> https://spark.apache.org/docs/latest/security.html#secure-interaction-with-kubernetes
>>>>>>>>>>>>
>>>>>>>>>>>> Last but not least the most important reason to add at least one strong authentication is that we have users who have hard requirements on this. They're doing security audits and if they fail then it's deal breaking.
>>>>>>>>>>>> That is why we have added Kerberos in the first place. Unfortunately we can't name them in this public list, however the customers who specifically asked for this were mainly in the banking and telco sector.
>>>>>>>>>>>>
>>>>>>>>>>>> BR,
>>>>>>>>>>>> G
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jun 3, 2021 at 9:20 AM Till Rohrmann <trohrm...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> > Thanks for updating the document Márton. Why is it that banks will consider it more secure if Flink comes with Kerberos authentication (assuming a properly secured setup)? I mean if an attacker can get access to one of the machines, then it should also be possible to obtain the right Kerberos token.
>>>>>>>>>>>> >
>>>>>>>>>>>> > I am not an authentication expert and that's why I wanted to ask what are other authentication protocols other than Kerberos? Why did we select Kerberos and not any other authentication protocol? Maybe you can list the pros and cons for the different protocols. Is Kerberos also the standard authentication protocol for Kubernetes deployments? If not, what would be the answer when deploying on K8s?
>>>>>>>>>>>> >
>>>>>>>>>>>> > Cheers,
>>>>>>>>>>>> > Till
>>>>>>>>>>>> >
>>>>>>>>>>>> > On Wed, Jun 2, 2021 at 12:07 PM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
>>>>>>>>>>>> >
>>>>>>>>>>>> >> Hi team,
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> Happy to be here and hope I can provide quality additions in the future.
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> Thank you all for the helpful suggestions!
>>>>>>>>>>>> >> Considering them the FLIP has been modified and the work continues on the already existing Jira.
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> BR,
>>>>>>>>>>>> >> G
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> On Wed, Jun 2, 2021 at 11:23 AM Márton Balassi <balassi.mar...@gmail.com> wrote:
>>>>>>>>>>>> >>
>>>>>>>>>>>> >>> Thanks, Chesnay - I totally missed that. Answered on the ticket too, let us continue there then.
>>>>>>>>>>>> >>>
>>>>>>>>>>>> >>> Till, I agree that we should keep this codepath as slim as possible. It is an important design decision that we aim to keep the list of authentication protocols to a minimum.
>>>>>>>>>>>> >>> We believe that this should not be a primary concern of Flink and a trusted proxy service (for example Apache Knox) should be used to enable a multitude of end-user authentication mechanisms. The bare minimum of authentication mechanisms to support consequently consists of a single strong authentication protocol, for which Kerberos is the enterprise solution, and HTTP Basic, primarily for development and light-weight scenarios.
>>>>>>>>>>>> >>>
>>>>>>>>>>>> >>> Added the above wording to G's doc.
>>>>>>>>>>>> >>>
>>>>>>>>>>>> >>> https://docs.google.com/document/d/1NMPeJ9H0G49TGy3AzTVVJVKmYC0okwOtqLTSPnGqzHw/edit
>>>>>>>>>>>> >>>
>>>>>>>>>>>> >>> On Tue, Jun 1, 2021 at 11:47 AM Chesnay Schepler <ches...@apache.org> wrote:
>>>>>>>>>>>> >>>
>>>>>>>>>>>> >>>> There's a related effort:
>>>>>>>>>>>> >>>> https://issues.apache.org/jira/browse/FLINK-21108
>>>>>>>>>>>> >>>>
>>>>>>>>>>>> >>>> On 6/1/2021 10:14 AM, Till Rohrmann wrote:
>>>>>>>>>>>> >>>> > Hi Gabor, welcome to the Flink community!
>>>>>>>>>>>> >>>> >
>>>>>>>>>>>> >>>> > Thanks for sharing this proposal with the community Márton. In general, I agree that authentication is missing and that this is required for using Flink within an enterprise. The thing I am wondering is whether this feature strictly needs to be implemented inside of Flink or whether a proxy setup could do the job? Have you considered this option? If yes, then it would be good to list it under the point of rejected alternatives.
>>>>>>>>>>>> >>>> >
>>>>>>>>>>>> >>>> > I do see the benefit of implementing this feature inside of Flink if many users need it. If not, then it might be easier for the project to not increase the surface area since it makes the overall maintenance harder.
>>>>>>>>>>>> >>>> >
>>>>>>>>>>>> >>>> > Cheers,
>>>>>>>>>>>> >>>> > Till
>>>>>>>>>>>> >>>> >
>>>>>>>>>>>> >>>> > On Mon, May 31, 2021 at 4:57 PM Márton Balassi <mbala...@apache.org> wrote:
>>>>>>>>>>>> >>>> >
>>>>>>>>>>>> >>>> >> Hi team,
>>>>>>>>>>>> >>>> >>
>>>>>>>>>>>> >>>> >> Firstly I would like to introduce Gabor or G [1] for short to the community, he is a Spark committer who has recently transitioned to the Flink Engineering team at Cloudera and is looking forward to contributing to Apache Flink. Previously G primarily focused on Spark Streaming and security.
>>>>>>>>>>>> >>>> >>
>>>>>>>>>>>> >>>> >> Based on requests from our customers G has implemented Kerberos and HTTP Basic Authentication for the Flink Dashboard and HistoryServer. These previously lacked an authentication story.
>>>>>>>>>>>> >>>> >>
>>>>>>>>>>>> >>>> >> We are looking to contribute this functionality back to the community, we believe that given Flink's maturity there should be a common code solution for this general pattern.
>>>>>>>>>>>> >>>> >>
>>>>>>>>>>>> >>>> >> We are looking forward to your feedback on G's design. [2]
>>>>>>>>>>>> >>>> >>
>>>>>>>>>>>> >>>> >> [1] http://gaborsomogyi.com/
>>>>>>>>>>>> >>>> >> [2] https://docs.google.com/document/d/1NMPeJ9H0G49TGy3AzTVVJVKmYC0okwOtqLTSPnGqzHw/edit