Thank you Edward and Christophe!

2018-03-29 17:55 GMT+02:00 Edward Alexander Rojas Clavijo <edward.roja...@gmail.com>:

> Hi all,
>
> I did some tests based on the PR Christophe mentioned above, and by making
> a change in the NettyClient to use CanonicalHostName instead of
> HostNameAddress to identify the server, the SSL validation works!!
>
> I created a PR with this change: https://github.com/apache/flink/pull/5789
>
> Regards,
> Edward
>
> 2018-03-28 17:22 GMT+02:00 Edward Alexander Rojas Clavijo <edward.roja...@gmail.com>:
>
>> Hi Till,
>>
>> I just created the JIRA ticket: https://issues.apache.org/jira/browse/FLINK-9103
>>
>> I added the JobManager and TaskManager logs; I hope this helps to resolve
>> the issue.
>>
>> Regards,
>> Edward
>>
>> 2018-03-27 17:48 GMT+02:00 Till Rohrmann <trohrm...@apache.org>:
>>
>>> Hi Edward,
>>>
>>> could you please file a JIRA issue for this problem? It might be as
>>> simple as the TaskManager's network stack using the IP instead of the
>>> hostname, as you suggested. But we have to look into this to be sure. Also,
>>> the logs of the JobManager as well as the TaskManagers could be helpful.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Tue, Mar 27, 2018 at 5:17 PM, Christophe Jolif <cjo...@gmail.com> wrote:
>>>
>>>>
>>>> I suspect this relates to: https://issues.apache.org/jira/browse/FLINK-5030
>>>>
>>>> For which there was a PR at some point, but nothing has been done so
>>>> far. It seems the current code explicitly uses the IP rather than the
>>>> hostname for the Netty SSL configuration.
>>>>
>>>> Without that, I'm really wondering how people are reasonably using SSL
>>>> on a Kubernetes Flink-based cluster, as every time a pod is (re)started it
>>>> can theoretically take a different IP. Or am I missing something?
>>>>
>>>> --
>>>> Christophe
>>>>
>>>> On Tue, Mar 27, 2018 at 3:24 PM, Edward Alexander Rojas Clavijo <edward.roja...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Currently I have a Flink 1.4 cluster running on Kubernetes with an
>>>>> SSL configuration based on
>>>>> https://ci.apache.org/projects/flink/flink-docs-master/ops/security-ssl.html.
>>>>>
>>>>> However, as the IPs of the nodes are dynamic (by the nature of
>>>>> Kubernetes), we are using only the DNS names, which we can control using
>>>>> Kubernetes services. So we add to the Subject Alternative Name (SAN) the
>>>>> flink-jobmanager DNS name and also the DNS names for the task managers,
>>>>> *.flink-taskmanager-svc (each task manager has a DNS name of the form
>>>>> flink-taskmanager-0.flink-taskmanager-svc).
>>>>>
>>>>> Additionally, we set the jobmanager.rpc.address property on all the
>>>>> nodes, and each task manager sets the taskmanager.host property, all
>>>>> matching the names on the certificate.
>>>>>
>>>>> This works well when running a job with parallelism set to 1. The SSL
>>>>> validation succeeds, and the JobManager can communicate with the
>>>>> TaskManager and vice versa.
>>>>>
>>>>> But when we set the parallelism to more than 1 we have exceptions on
>>>>> the SSL validation like this:
>>>>>
>>>>> Caused by: java.security.cert.CertificateException: No subject alternative names matching IP address 172.30.247.163 found
>>>>>     at sun.security.util.HostnameChecker.matchIP(HostnameChecker.java:168)
>>>>>     at sun.security.util.HostnameChecker.match(HostnameChecker.java:94)
>>>>>     at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:455)
>>>>>     at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:436)
>>>>>     at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:252)
>>>>>     at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:136)
>>>>>     at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1601)
>>>>>     ... 21 more
>>>>>
>>>>>
>>>>> From the logs I see the Jobmanager is correctly registering the
>>>>> taskmanagers:
>>>>>
>>>>> org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at flink-taskmanager-1 (akka.ssl.tcp://flink@taiga-flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local:6122/user/taskmanager) as 1a3f59693cec8b3929ed8898edcc2700. Current number of registered hosts is 3. Current number of alive task slots is 6.
>>>>>
>>>>> And also each taskmanager is correctly registered to use the hostname
>>>>> for communication:
>>>>>
>>>>> org.apache.flink.runtime.taskmanager.TaskManager - TaskManager will use hostname/address 'flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local' (172.30.247.163) for communication.
>>>>> ...
>>>>> akka.remote.Remoting - Remoting started; listening on addresses :[akka.ssl.tcp://flink@flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local:6122]
>>>>> ...
>>>>> org.apache.flink.runtime.io.network.netty.NettyConfig - NettyConfig [server address: flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local/172.30.247.163, server port: 6121, ssl enabled: true, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 2 (manual), number of client threads: 2 (manual), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
>>>>> ...
>>>>> org.apache.flink.runtime.taskmanager.TaskManager - TaskManager data connection information: bf4a9b50e57c99c17049adb66d65f685 @ flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local (dataPort=6121)
>>>>>
>>>>>
>>>>>
>>>>> But even with that, it seems like the taskmanagers are using the IP
>>>>> to communicate between them, and the SSL validation fails.
>>>>>
>>>>> Do you know if it's possible to make the taskmanagers use the
>>>>> hostname to communicate instead of the IP?
>>>>> or
>>>>> Do you have any advice to get the SSL configuration to work in this
>>>>> environment?
>>>>>
>>>>> Thanks in advance.
>>>>>
>>>>> Regards,
>>>>> Edward
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Christophe
>>>>
>>>
>>>
>>
>>
>> --
>> *Edward Alexander Rojas Clavijo*
>>
>> *Software Engineer*
>> *Hybrid Cloud*
>> *IBM France*
>
>
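
For readers following the thread, here is a minimal sketch of why the peer identity handed to the SSL engine matters. It is not the code from the PR and not Flink's NettyClient; the class PeerIdentitySketch and the method clientEngine are hypothetical names, and the default JDK SSLContext is an assumption made only so the example runs on its own. The JDK hostname checker compares an IP literal against IP-address SAN entries and a host name against DNS-name SAN entries, which matches the matchIP frame in the stack trace above.

```java
import java.net.InetAddress;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLParameters;

// Hypothetical illustration class; not part of Flink.
final class PeerIdentitySketch {

    // Builds a client-side SSLEngine whose peer identity is `peerIdentity`.
    // With endpoint identification enabled, the JDK checks that identity
    // against the server certificate during the handshake.
    static SSLEngine clientEngine(String peerIdentity, int port) throws Exception {
        // Assumption: the JDK default context stands in for the cluster's key/trust stores.
        SSLContext context = SSLContext.getDefault();
        SSLEngine engine = context.createSSLEngine(peerIdentity, port);
        engine.setUseClientMode(true);
        SSLParameters params = engine.getSSLParameters();
        params.setEndpointIdentificationAlgorithm("HTTPS");
        engine.setSSLParameters(params);
        return engine;
    }

    public static void main(String[] args) throws Exception {
        // Built without any DNS lookup, from the host name and IP shown in the logs above.
        InetAddress server = InetAddress.getByAddress(
                "flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local",
                new byte[] {(byte) 172, 30, (byte) 247, (byte) 163});

        // Identity given as an IP literal: verified against IP-address SAN entries.
        SSLEngine byIp = clientEngine(server.getHostAddress(), 6121);

        // Identity given as a host name: verified against DNS-name SAN entries.
        // The thread mentions the canonical host name; getHostName() is used here
        // only to keep the sketch free of reverse DNS lookups.
        SSLEngine byName = clientEngine(server.getHostName(), 6121);

        System.out.println("peer identity (IP):   " + byIp.getPeerHost());
        System.out.println("peer identity (name): " + byName.getPeerHost());
    }
}
```

With a certificate that lists only DNS names in its SAN, as described in the original message, the first engine would be expected to fail verification and the second to pass, provided the name matches one of the SAN entries.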