Sounds good, Eron! Please go ahead...
On Sat, Jul 28, 2018 at 1:33 AM, Eron Wright <eronwri...@gmail.com> wrote:
> As an update to this thread, Stephan opted to split the internal/external
> configuration (by providing overrides for a common SSL configuration):
> https://github.com/apache/flink/pull/6326
>
> Note that Akka doesn't support hostname verification in its 'classic'
> remoting implementation (though the new Artery implementation apparently
> does), and such verification wouldn't apply to the client certificate
> anyway. So the reality is that one should use a limited truststore (never
> the system truststore) for Akka communication.
>
> On the question of routing external communication thru the YARN resource
> proxy or Mesos/DCOS admin router, the value proposition is:
> a) simplifies service discovery on the part of external clients,
> b) permits single sign-on (SSO) by delegating authentication to a central
> authority,
> c) facilitates access from outside the cluster, via a public address.
> The main challenge is that the Flink client code must support a more
> diverse array of authentication methods, e.g. Kerberos when communicating
> with the YARN proxy.
>
> Given #6326, the next steps would be (unordered):
> a) create an umbrella issue for the overall effort
> b) dive into the authorization work for external communication
> c) implement auto-generation of a certificate for internal communication
> d) implement TLS on queryable state interface (FLINK-5029)
>
> I'll take care of (a) unless there is any objection.
> -Eron
>
>
> On Sun, May 13, 2018 at 5:45 AM Stephan Ewen <ewenstep...@gmail.com>
> wrote:
>
> > Throwing in some more food for thought:
> >
> > An alternative to the above proposed separation of internal and external
> > SSL would be the following:
> >
> > - We separate channel encryption and authentication
> > - We use one common SSL layer (internal and external) that is in both
> > cases only responsible for establishing an encrypted connection
> > - Authentication / authorization internally is done by SASL with
> > username/password or shared secret.
> > - Authentication externally must be through a proxy and authorization
> > based on validating HTTP headers set by the proxy, as discussed above.
> >
> > Advantages:
> > - There is only one certificate needed, which could also be shared
> > across applications
> > - One or two lines in the config authenticate and authorize internal
> > communication
> > - One could possibly still fall back to the other mode by skipping
> >
> > Open Questions / Disadvantages
> > - Given that hostname verification during SSL handshake is not possible
> > in many setups, the encrypted channel is vulnerable to man-in-the-middle
> > attacks without mutual authentication. Not sure how serious that is,
> > because it would need an attacker to have compromised network nodes of
> > the cluster already. Is that not a universal issue in the K8s world?
> >
> > This is anyways a bit hypothetical, because as long as we have Akka
> > beneath the RPC layer, we cannot go with that approach.
> >
> > However, if we want to at least keep the door open towards something like
> > that in the future, we would need to set up configuration in such a way
> > that we have a "common SSL" configuration (keystore, truststore, etc.)
> > and internal/external options that override those. That would anyways be
> > helpful for backwards compatibility.
> >
> > @Eron - what are your thoughts on that?
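
The "common SSL configuration with internal/external overrides" idea above amounts to a fallback lookup: use the scope-specific option if it is set, otherwise the common one. The following is a minimal Python sketch of that resolution order, not Flink code. The function name `resolve_ssl_option` is hypothetical, and the key prefixes are assumptions modelled on Flink's `security.ssl.*` naming that should be checked against the documentation of the Flink version in use.

```python
from typing import Mapping, Optional

# Assumed key prefixes, modelled on Flink's "security.ssl.*" naming.
COMMON_PREFIX = "security.ssl."
SCOPED_PREFIXES = {
    "internal": "security.ssl.internal.",
    "external": "security.ssl.rest.",
}


def resolve_ssl_option(config: Mapping[str, str], scope: str, name: str) -> Optional[str]:
    """Return the scope-specific value if set, otherwise the common value."""
    if scope not in SCOPED_PREFIXES:
        raise ValueError(f"unknown scope: {scope!r}")
    scoped_key = SCOPED_PREFIXES[scope] + name
    if scoped_key in config:
        return config[scoped_key]
    # No override for this scope: fall back to the common SSL setting.
    return config.get(COMMON_PREFIX + name)


# Illustrative configuration: one common keystore/truststore, plus a
# restricted truststore that only internal communication uses.
config = {
    "security.ssl.keystore": "/path/to/common.keystore",
    "security.ssl.truststore": "/path/to/common.truststore",
    "security.ssl.internal.truststore": "/path/to/internal-only.truststore",
}

internal_truststore = resolve_ssl_option(config, "internal", "truststore")
external_truststore = resolve_ssl_option(config, "external", "truststore")
internal_keystore = resolve_ssl_option(config, "internal", "keystore")
```

With this layout an existing configuration that only sets the common keys keeps working, which is the backwards-compatibility benefit mentioned above.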
> >
> >
> > On Sun, May 13, 2018 at 1:40 AM, Stephan Ewen <ewenstep...@gmail.com>
> > wrote:
> >
> > > Thank you for bringing this proposal up. It looks very good and we seem
> > > to be thinking along very similar lines.
> > >
> > > Below are some comments and thoughts on the FLIP.
> > >
> > > *Internal vs. External Connectivity*
> > >
> > > That is a very helpful distinction, let's build on that.
> > >
> > > - I would suggest eventually treating all communication that
> > > potentially comes from users as external, meaning Client-to-Dispatcher,
> > > Client-to-JobManager (trigger savepoint, change parallelism, ...), Web
> > > UI, Queryable State.
> > >
> > > - That leaves communication that is only between
> > > JobManager/TaskManager/ResourceManager/Dispatcher/HistoryServer as
> > > internal.
> > >
> > > - I am somewhat operating under the assumption that all external
> > > communication will eventually be HTTP/REST. That works best with many
> > > setups and is the basis for using service proxies that handle
> > > authentication/authorization.
> > >
> > >
> > > In Flink 1.5 and future versions, we have the following update there:
> > >
> > > - Akka is now strictly internal connectivity; clients (except the
> > > legacy client) do not use it any more.
> > >
> > > - The Blob Server will move to purely internal connectivity in Flink
> > > 1.6, where a POST of a job to the Dispatcher has the jars and the
> > > JobGraph. That is important for Kubernetes setups, where exposing the
> > > BlobServer and querying the blob port causes quite some friction.
> > >
> > > - Treating queryable state as "internal connectivity" is fine for now.
> > > We should treat it as "external" connectivity in the future if we move
> > > it to HTTP/REST.
> > >
> > >
> > > *Internal Connectivity and SSL Mutual Authentication*
> > >
> > > Simply activating SSL mutual authentication for the internal
> > > communication is a really low hanging fruit.
> > >
> > > Activating client authentication for Akka, network stack Netty (and
> > > Blob Server/Client in Flink 1.6) should require no change in the
> > > configurations with respect to Flink 1.4. All processes are, with
> > > respect to internal communication, simultaneously server and client
> > > endpoints. Because of that, they already need KeyStore and TrustStore
> > > files for SSL handshakes, where the TrustStore needs to trust the
> > > KeyStore Certificate.
> > >
> > > I personally favor the suggestion made to have a script that generates
> > > a self-signed certificate and adds it to "conf" and updates the
> > > configuration. That should be picked up by the Yarn and Mesos clients
> > > anyways.
> > >
> > >
> > > *External Connectivity*
> > >
> > > There is a huge surface area and I think we need to give users a way to
> > > plug in their own tools.
> > > From what I see (and after some discussions with Patrick and Gary) I
> > > think it makes sense to look at proxies in a broad way, similar to the
> > > approach Eron outlined.
> > >
> > > The basic approach could be like this:
> > >
> > > - Everything goes through HTTPS, so the proxy can work with HTTP
> > > headers.
> > > - The proxy handles authentication and possibly authorization. The
> > > proxy adds some header, for example a user name, a group id, an
> > > authorization token.
> > > - Flink can configure an implementation of an 'authorizer' or validator
> > > on the headers to decide whether the request is valid.
> > >
> > > - Example 1: The proxy does authentication and adds the user name /
> > > group as a header. The Flink-side authorizer simply checks whether the
> > > name is in the config (simple ACL-style scheme).
> > > - Example 2: The proxy adds a JSON Web Token and the authorizer
> > > validates that token.
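
The two authorizer examples above can be sketched as follows. This is an illustrative Python sketch only, not a proposed Flink interface: the header name `X-Forwarded-User` and all function names are hypothetical, and HS256 (a shared-secret signature) is chosen purely to keep the example within the standard library. A real deployment would more likely rely on a maintained JWT library and whatever signature scheme the proxy actually uses.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Mapping, Optional, Set

USER_HEADER = "X-Forwarded-User"  # hypothetical header name set by the proxy


def authorize_by_acl(headers: Mapping[str, str], allowed_users: Set[str]) -> bool:
    """Example 1: accept only if the proxy-supplied user name is in the ACL.

    Note: the lookup is an exact-match on the header name; a real HTTP stack
    would treat header names case-insensitively.
    """
    user = headers.get(USER_HEADER)
    return user is not None and user in allowed_users


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(segment: str) -> bytes:
    # JWT segments are base64url without padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def sign_hs256(claims: dict, secret: bytes) -> str:
    """Build an HS256-signed token (stands in for the proxy in this sketch)."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode("ascii")
    signature = hmac.new(secret, signing_input, hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url_encode(signature)}"


def verify_hs256(token: str, secret: bytes, now: Optional[float] = None) -> Optional[dict]:
    """Example 2: return the claims if signature and expiry are valid, else None."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        if not isinstance(header, dict) or header.get("alg") != "HS256":
            return None
        signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
        expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
        # Constant-time comparison of the expected and presented signatures.
        if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
            return None
        claims = json.loads(_b64url_decode(payload_b64))
    except (ValueError, TypeError):
        return None  # malformed token
    if not isinstance(claims, dict):
        return None
    exp = claims.get("exp")
    current = time.time() if now is None else now
    if exp is not None and (not isinstance(exp, (int, float)) or current >= exp):
        return None  # expired or unusable expiry claim
    return claims
```

In both cases the Flink endpoint only trusts the header because the connection from the proxy is itself authenticated; without that, any client could set the header directly.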
> > >
> > > For secure connections between the Proxy and the Flink Endpoint I would
> > > follow Eron's suggestion, to use KeyStores and TrustStores separate
> > > from those used for internal communication.
> > >
> > > For Yarn and Mesos, I would like to see if we could handle those again
> > > as a special case of the proxies above:
> > > - DCOS Admin Router forwards the user authentication token, so that
> > > could be another authorizer implementation.
> > > - In YARN we could see if we can implement the IP filter via such an
> > > authorizer.
> > >
> > >
> > > *Hostname Verification*
> > >
> > > For internal communication, and especially in dynamic environments like
> > > Kubernetes, it is very hard to work with certificates and have hostname
> > > verification on.
> > >
> > > If we assume internal communication works strictly with a shared secret
> > > certificate and with client authentication, does hostname verification
> > > actually still add security in that particular setup? My understanding
> > > was that hostname verification is important to ensure that not just any
> > > valid certificate is presented, but the one bound to the server you
> > > want to talk to. If we have anyways one trusted certificate only, isn't
> > > that already implied?
> > >
> > > On the other hand, it is still possible (and potentially valuable) for
> > > users in standalone mode to use keystores and truststores from a PKI,
> > > in which case there may still be an argument in favor of hostname
> > > verification.
> > >
> > > On Thu, May 10, 2018, 02:30 Eron Wright <eronwri...@gmail.com> wrote:
> > >
> > >> Hello,
> > >>
> > >> Given that some SSL enhancement bugs have been posted lately, I took
> > >> some time to revise FLIP-26 which explores how to harden both external
> > >> and internal communication.
> > >>
> > >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=80453255
> > >>
> > >> Some recent related issues:
> > >> - FLINK-9312 - mutual auth for intra-cluster communication
> > >> - FLINK-5030 - original SSL feature work
> > >>
> > >> There's also some recent discussion of how to use Flink SSL effectively
> > >> in a Kubernetes environment. The issue is about hostname verification.
> > >> The proposal that I've put forward in FLIP-26 is to not use hostname
> > >> verification for intra-cluster communication, but rather to rely on a
> > >> cluster-internal certificate and a truststore consisting only of that
> > >> certificate. Meanwhile, a new "external" certificate would be
> > >> configurable for the web/api endpoint and associated with a well-known
> > >> DNS name as provided by a K8s Service resource.
> > >>
> > >> Stephan, is this in line with your thinking re FLINK-9312?
> > >>
> > >> Thanks
> > >> Eron
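
The intra-cluster proposal in this thread (no hostname verification, mutual authentication, and a truststore containing only the cluster-internal certificate) can be illustrated with Python's standard `ssl` module. This is a hedged sketch of the TLS settings involved, not how Flink, Akka, or Netty configure them; the function names and file-path parameters are hypothetical, and the file arguments are optional only so the contexts can be constructed without real certificates.

```python
import ssl
from typing import Optional


def internal_client_context(cluster_cert: Optional[str] = None,
                            own_cert: Optional[str] = None,
                            own_key: Optional[str] = None) -> ssl.SSLContext:
    """Client side of intra-cluster TLS: pinned trust, no hostname check."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # Hostname verification is off, but the peer must still present a
    # certificate that chains to something in this context's truststore.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_REQUIRED
    # Deliberately no load_default_certs(): the system truststore is never used.
    if cluster_cert is not None:
        ctx.load_verify_locations(cafile=cluster_cert)
    if own_cert is not None:
        # Certificate this process presents for mutual authentication.
        ctx.load_cert_chain(certfile=own_cert, keyfile=own_key)
    return ctx


def internal_server_context(cluster_cert: Optional[str] = None,
                            own_cert: Optional[str] = None,
                            own_key: Optional[str] = None) -> ssl.SSLContext:
    """Server side of intra-cluster TLS: client certificates are mandatory."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.verify_mode = ssl.CERT_REQUIRED  # mutual authentication
    if cluster_cert is not None:
        ctx.load_verify_locations(cafile=cluster_cert)
    if own_cert is not None:
        ctx.load_cert_chain(certfile=own_cert, keyfile=own_key)
    return ctx


def external_client_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Client of the external endpoint: hostname verification stays on."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)  # check_hostname defaults to True
    if ca_file is not None:
        ctx.load_verify_locations(cafile=ca_file)
    return ctx
```

The contrast between the two client contexts mirrors the split discussed above: internally, trust comes from the single pinned certificate; externally, it comes from a certificate bound to a well-known DNS name.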