By the way, Fabian, is there any chance this issue will be looked into, or the PR considered, for 1.5?
--
Christophe

On Wed, Apr 4, 2018 at 2:41 PM, Fabian Hueske <fhue...@gmail.com> wrote:

> Thank you Edward and Christophe!
>
> 2018-03-29 17:55 GMT+02:00 Edward Alexander Rojas Clavijo <edward.roja...@gmail.com>:
>
>> Hi all,
>>
>> I did some tests based on the PR Christophe mentioned above, and by
>> changing the NettyClient to use CanonicalHostName instead of
>> HostNameAddress to identify the server, the SSL validation works!
>>
>> I created a PR with this change: https://github.com/apache/flink/pull/5789
>>
>> Regards,
>> Edward
>>
>> 2018-03-28 17:22 GMT+02:00 Edward Alexander Rojas Clavijo <edward.roja...@gmail.com>:
>>
>>> Hi Till,
>>>
>>> I just created the JIRA ticket: https://issues.apache.org/jira/browse/FLINK-9103
>>>
>>> I added the JobManager and TaskManager logs. I hope this helps to resolve
>>> the issue.
>>>
>>> Regards,
>>> Edward
>>>
>>> 2018-03-27 17:48 GMT+02:00 Till Rohrmann <trohrm...@apache.org>:
>>>
>>>> Hi Edward,
>>>>
>>>> could you please file a JIRA issue for this problem? It might be as
>>>> simple as the TaskManager's network stack using the IP instead of the
>>>> hostname, as you suggested. But we have to look into this to be sure. Also,
>>>> the logs of the JobManager as well as the TaskManagers could be helpful.
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On Tue, Mar 27, 2018 at 5:17 PM, Christophe Jolif <cjo...@gmail.com> wrote:
>>>>
>>>>> I suspect this relates to: https://issues.apache.org/jira/browse/FLINK-5030
>>>>>
>>>>> For which there was a PR at some point, but nothing has been done so
>>>>> far. It seems the current code explicitly uses the IP rather than the
>>>>> hostname for the Netty SSL configuration.
>>>>>
>>>>> Without that, I'm really wondering how people are reasonably using SSL
>>>>> on a Kubernetes-based Flink cluster, as every time a pod is (re)started it
>>>>> can theoretically take a different IP. Or am I missing something?
>>>>>
>>>>> --
>>>>> Christophe
>>>>>
>>>>> On Tue, Mar 27, 2018 at 3:24 PM, Edward Alexander Rojas Clavijo <edward.roja...@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Currently I have a Flink 1.4 cluster running on Kubernetes with an
>>>>>> SSL configuration based on
>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/ops/security-ssl.html.
>>>>>>
>>>>>> However, as the IPs of the nodes are dynamic (by the nature of
>>>>>> Kubernetes), we are using only the DNS names, which we can control using
>>>>>> Kubernetes services. So we add to the Subject Alternative Name (SAN) the
>>>>>> flink-jobmanager DNS name and also the DNS names for the task managers,
>>>>>> *.flink-taskmanager-svc (each task manager has a DNS name of the form
>>>>>> flink-taskmanager-0.flink-taskmanager-svc).
>>>>>>
>>>>>> Additionally, we set the jobmanager.rpc.address property on all the
>>>>>> nodes, and each task manager sets the taskmanager.host property, all
>>>>>> matching the names on the certificate.
>>>>>>
>>>>>> This works well when running a job with parallelism set to 1. The
>>>>>> SSL validation succeeds and the JobManager can communicate with the task
>>>>>> managers and vice versa.
>>>>>>
>>>>>> But when we set the parallelism to more than 1, we get exceptions in
>>>>>> the SSL validation like this:
>>>>>>
>>>>>> Caused by: java.security.cert.CertificateException: No subject alternative names matching IP address 172.30.247.163 found
>>>>>>     at sun.security.util.HostnameChecker.matchIP(HostnameChecker.java:168)
>>>>>>     at sun.security.util.HostnameChecker.match(HostnameChecker.java:94)
>>>>>>     at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:455)
>>>>>>     at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:436)
>>>>>>     at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:252)
>>>>>>     at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:136)
>>>>>>     at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1601)
>>>>>>     ... 21 more
>>>>>>
>>>>>> From the logs I see the JobManager is correctly registering the
>>>>>> task managers:
>>>>>>
>>>>>> org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at flink-taskmanager-1 (akka.ssl.tcp://flink@taiga-flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local:6122/user/taskmanager) as 1a3f59693cec8b3929ed8898edcc2700. Current number of registered hosts is 3. Current number of alive task slots is 6.
>>>>>>
>>>>>> Each task manager is also correctly registered to use the hostname
>>>>>> for communication:
>>>>>>
>>>>>> org.apache.flink.runtime.taskmanager.TaskManager - TaskManager will use hostname/address 'flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local' (172.30.247.163) for communication.
>>>>>> ...
>>>>>> akka.remote.Remoting - Remoting started; listening on addresses :[akka.ssl.tcp://flink@flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local:6122]
>>>>>> ...
>>>>>> org.apache.flink.runtime.io.network.netty.NettyConfig - NettyConfig [server address: flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local/172.30.247.163, server port: 6121, ssl enabled: true, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 2 (manual), number of client threads: 2 (manual), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
>>>>>> ...
>>>>>> org.apache.flink.runtime.taskmanager.TaskManager - TaskManager data connection information: bf4a9b50e57c99c17049adb66d65f685 @ flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local (dataPort=6121)
>>>>>>
>>>>>> But even with that, it seems the task managers are using the IP to
>>>>>> communicate with each other, and the SSL validation fails.
>>>>>>
>>>>>> Do you know if it's possible to make the task managers use the
>>>>>> hostname instead of the IP to communicate?
>>>>>> or
>>>>>> Do you have any advice for getting the SSL configuration to work in this
>>>>>> environment?
>>>>>>
>>>>>> Thanks in advance.
>>>>>>
>>>>>> Regards,
>>>>>> Edward
>>>>>>
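
The stack trace quoted above ends in HostnameChecker.matchIP, which is the path the JDK takes when the peer identity handed to the SSL engine is an IP literal rather than a DNS name. A minimal Java sketch of that distinction follows. It is not Flink's NettyClient code: the class PeerIdentitySketch and its methods are hypothetical names chosen for illustration, and the host name and IP are simply the ones that appear in the logs in this thread. The sketch only shows which string the engine records as its peer host; it performs no handshake.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.NoSuchAlgorithmException;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLParameters;

/** Hypothetical illustration only; not taken from Flink's NettyClient. */
public class PeerIdentitySketch {

    /** Builds an address with a known name and IP, without any DNS lookup. */
    public static InetAddress sampleAddress() {
        try {
            return InetAddress.getByAddress(
                    "flink-taskmanager-1.flink-taskmanager-svc",
                    new byte[] {(byte) 172, 30, (byte) 247, (byte) 163});
        } catch (UnknownHostException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Identity string when the client names the server by its IP literal. */
    public static String identityFromAddress(InetAddress address) {
        return address.getHostAddress();
    }

    /** Identity string when the client names the server by its host name. */
    public static String identityFromName(InetAddress address) {
        return address.getHostName();
    }

    /** Client-side engine with HTTPS-style endpoint identification enabled. */
    public static SSLEngine clientEngine(String peerHost, int peerPort) {
        try {
            SSLEngine engine = SSLContext.getDefault().createSSLEngine(peerHost, peerPort);
            engine.setUseClientMode(true);
            SSLParameters params = engine.getSSLParameters();
            // With this set, the trust manager checks the certificate against peerHost.
            params.setEndpointIdentificationAlgorithm("HTTPS");
            engine.setSSLParameters(params);
            return engine;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        InetAddress server = sampleAddress();
        int dataPort = 6121; // data port shown in the NettyConfig log line

        // Peer host is an IP literal: the certificate would need an IP SAN.
        System.out.println(clientEngine(identityFromAddress(server), dataPort).getPeerHost());

        // Peer host is a DNS name: the certificate's DNS SANs can match.
        System.out.println(clientEngine(identityFromName(server), dataPort).getPeerHost());
    }
}
```

The sketch leaves out InetAddress.getCanonicalHostName(), which the PR described earlier in the thread reportedly uses, because its result depends on a reverse DNS lookup in the running environment and can fall back to the textual IP when the lookup fails.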