Thank you for bringing this proposal up. It looks very good and we seem to be thinking along very similar lines.
Below are some comments and thoughts on the FLIP.

*Internal vs. External Connectivity*

That is a very helpful distinction, let's build on that.

- I would suggest to eventually treat all communication that potentially comes from users as external, meaning Client-to-Dispatcher, Client-to-JobManager (trigger savepoint, change parallelism, ...), Web UI, Queryable State.
- That leaves communication that happens only between JobManager/TaskManager/ResourceManager/Dispatcher/HistoryServer as internal.
- I am somewhat operating under the assumption that all external communication will eventually be HTTP/REST. That works best with many setups and is the basis for using service proxies that handle authentication/authorization.

In Flink 1.5 and future versions, we have the following updates there:

- Akka is now strictly internal connectivity; the client (except the legacy client) does not use it any more.
- The Blob Server will move to purely internal connectivity in Flink 1.6, where a POST of a job to the Dispatcher carries the jars and the JobGraph. That is important for Kubernetes setups, where exposing the BlobServer and querying the blob port causes quite some friction.
- Treating queryable state as "internal connectivity" is fine for now. We should treat it as "external" connectivity in the future if we move it to HTTP/REST.

*Internal Connectivity and SSL Mutual Authentication*

Simply activating SSL mutual authentication for the internal communication is really low-hanging fruit. Activating client authentication for Akka, the network stack (Netty), and the Blob Server/Client in Flink 1.6 should require no change in the configuration compared to Flink 1.4: all processes are, with respect to internal communication, simultaneously server and client endpoints. Because of that, they already need KeyStore and TrustStore files for SSL handshakes, where the TrustStore needs to trust the KeyStore certificate.

I personally favor the suggestion made to have a script that generates a self-signed certificate, adds it to "conf", and updates the configuration. That should be picked up by the Yarn and Mesos clients anyway.

*External Connectivity*

There is a huge surface area and I think we need to give users a way to plug in their own tools. From what I see (and after some discussions with Patrick and Gary) I think it makes sense to look at proxies in a broad way, similar to the approach Eron outlined. The basic approach could be like this:

- Everything goes through HTTPS, so the proxy can work with HTTP headers.
- The proxy handles authentication and possibly authorization. The proxy adds some header, for example a user name, a group id, an authorization token.
- Flink can configure an implementation of an 'authorizer' or validator on the headers to decide whether the request is valid (a rough sketch of what that could look like follows below).
- Example 1: The proxy does authentication and adds the user name / group as a header. The Flink-side authorizer then simply checks whether the name is in the config (a simple ACL-style scheme).
- Example 2: The proxy adds a JSON Web Token and the authorizer validates that token.

For secure connections between the proxy and the Flink endpoint, I would follow Eron's suggestion to use separate KeyStores and TrustStores from the ones used for internal communication.

For Yarn and Mesos, I would like to see if we could handle those again as a special case of the proxies above:

- DCOS Admin Router forwards the user authentication token, so that could be another authorizer implementation.
- In YARN we could see if we can implement the IP filter via such an authorizer.
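To make the 'authorizer' idea a bit more concrete, here is a rough sketch of what such a plug-in point could look like. The interface and class names are made up for illustration and do not exist in Flink today; it only shows Example 1 (a simple ACL check on a user-name header added by the proxy). Example 2 would be another implementation of the same interface that validates a JSON Web Token instead.

    import java.util.Map;
    import java.util.Set;

    /** Hypothetical plug-in point: decides whether a request that already
     *  passed the authenticating proxy is allowed, based on HTTP headers. */
    interface RequestAuthorizer {
        boolean isAuthorized(Map<String, String> httpHeaders);
    }

    /** Example 1: simple ACL on a user-name header that the proxy adds. */
    class HeaderAclAuthorizer implements RequestAuthorizer {

        private final String userHeader;        // e.g. "X-Authenticated-User" (made-up header name)
        private final Set<String> allowedUsers; // read from the Flink configuration

        HeaderAclAuthorizer(String userHeader, Set<String> allowedUsers) {
            this.userHeader = userHeader;
            this.allowedUsers = allowedUsers;
        }

        @Override
        public boolean isAuthorized(Map<String, String> httpHeaders) {
            String user = httpHeaders.get(userHeader);
            return user != null && allowedUsers.contains(user);
        }
    }

The point is only that Flink validates what the proxy asserts; the actual authentication stays in the proxy.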
*Hostname Verification*

For internal communication, and especially in dynamic environments like Kubernetes, it is very hard to work with certificates and have hostname verification on.

If we assume internal communication works strictly with a shared secret certificate and with client authentication, does hostname verification actually still add security in that particular setup? My understanding is that hostname verification matters to ensure that the peer does not present just some valid certificate, but the one actually bound to the server you want to talk to. If we anyway trust only that one certificate, isn't that already implied? (A rough sketch of what that single-certificate trust setup looks like follows at the very end of this mail.)

On the other hand, it is still possible (and potentially valuable) for users in standalone mode to use keystores and truststores from a PKI, in which case there may still be an argument in favor of hostname verification.

On Thu, May 10, 2018, 02:30 Eron Wright <eronwri...@gmail.com> wrote:

> Hello,
>
> Given that some SSL enhancement bugs have been posted lately, I took some
> time to revise FLIP-26 which explores how to harden both external and
> internal communication.
>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=80453255
>
> Some recent related issues:
> - FLINK-9312 - mutual auth for intra-cluster communication
> - FLINK-5030 - original SSL feature work
>
> There's also some recent discussion of how to use Flink SSL effectively in
> a Kubernetes environment. The issue is about hostname verification. The
> proposal that I've put forward in FLIP-26 is to not use hostname
> verification for intra-cluster communication, but rather to rely on a
> cluster-internal certificate and a truststore consisting only of that
> certificate. Meanwhile, a new "external" certificate would be
> configurable for the web/api endpoint and associated with a well-known DNS
> name as provided by a K8s Service resource.
>
> Stephan, is this in line with your thinking re FLINK-9312?
>
> Thanks
> Eron
>
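To illustrate the point about the pinned internal certificate from the *Hostname Verification* section above, here is a rough sketch using plain JSSE. The file name and password are made up, and this is not how Flink's SSL setup is actually wired; it only shows that when the same self-signed certificate is both the key material and the only entry in the trust store, trust is pinned to exactly that certificate, independent of hostnames.

    import java.io.FileInputStream;
    import java.security.KeyStore;
    import javax.net.ssl.KeyManagerFactory;
    import javax.net.ssl.SSLContext;
    import javax.net.ssl.TrustManagerFactory;

    class InternalSslContextSketch {

        static SSLContext create() throws Exception {
            char[] password = "internal-secret".toCharArray(); // illustrative only

            // One self-signed certificate, used both as the process's own
            // identity (key material) and as the only trusted certificate.
            KeyStore keyStore = KeyStore.getInstance("JKS");
            try (FileInputStream in = new FileInputStream("conf/internal.keystore")) {
                keyStore.load(in, password);
            }

            KeyManagerFactory kmf =
                    KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
            kmf.init(keyStore, password);

            // Trust store = that single certificate; any peer completing the
            // mutually authenticated handshake must hold the matching private key.
            TrustManagerFactory tmf =
                    TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
            tmf.init(keyStore);

            SSLContext context = SSLContext.getInstance("TLS");
            context.init(kmf.getKeyManagers(), tmf.getTrustManagers(), null);
            return context;
            // (Client authentication itself would be enabled on the SSLEngine or
            // server socket created from this context.)
        }
    }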