By the way Fabian, is there any chance this issue will be looked into / the
PR considered for 1.5?

--
Christophe

On Wed, Apr 4, 2018 at 2:41 PM, Fabian Hueske <fhue...@gmail.com> wrote:

> Thank you Edward and Christophe!
>
> 2018-03-29 17:55 GMT+02:00 Edward Alexander Rojas Clavijo <
> edward.roja...@gmail.com>:
>
>> Hi all,
>>
>> I did some tests based on the PR Christophe mentioned above. After changing
>> the NettyClient to use CanonicalHostName instead of HostNameAddress to
>> identify the server, the SSL validation works!
>>
>> I created a PR with this change: https://github.com/apache/flink/pull/5789
>>
>> Regards,
>> Edward
>>
>> 2018-03-28 17:22 GMT+02:00 Edward Alexander Rojas Clavijo <
>> edward.roja...@gmail.com>:
>>
>>> Hi Till,
>>>
>>> I just created the JIRA ticket: https://issues.apache.org/jira/browse/FLINK-9103
>>>
>>> I added the JobManager and TaskManager logs. I hope this helps to resolve
>>> the issue.
>>>
>>> Regards,
>>> Edward
>>>
>>> 2018-03-27 17:48 GMT+02:00 Till Rohrmann <trohrm...@apache.org>:
>>>
>>>> Hi Edward,
>>>>
>>>> could you please file a JIRA issue for this problem? It might be as
>>>> simple as the TaskManager's network stack using the IP instead of the
>>>> hostname, as you suggested, but we have to look into this to be sure. The
>>>> logs of the JobManager and the TaskManagers would also be helpful.
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On Tue, Mar 27, 2018 at 5:17 PM, Christophe Jolif <cjo...@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>> I suspect this relates to: https://issues.apache.org/jira/browse/FLINK-5030
>>>>>
>>>>> There was a PR for it at some point, but nothing has been done so far.
>>>>> It seems the current code explicitly uses the IP rather than the
>>>>> hostname for the Netty SSL configuration.
>>>>>
>>>>> Without that, I'm really wondering how people can reasonably use SSL on
>>>>> a Kubernetes-based Flink cluster, as every time a pod is (re)started it
>>>>> can theoretically get a different IP. Or am I missing something?
>>>>>
>>>>> --
>>>>> Christophe
>>>>>
>>>>> On Tue, Mar 27, 2018 at 3:24 PM, Edward Alexander Rojas Clavijo <
>>>>> edward.roja...@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Currently I have a Flink 1.4 cluster running on Kubernetes with an
>>>>>> SSL configuration based on
>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/ops/security-ssl.html.
>>>>>>
>>>>>> However, as the IPs of the nodes are dynamic (by the nature of
>>>>>> Kubernetes), we use only DNS names, which we can control using
>>>>>> Kubernetes services. So we add to the Subject Alternative Name (SAN) the
>>>>>> flink-jobmanager DNS name and also the wildcard DNS name for the task
>>>>>> managers, *.flink-taskmanager-svc (each task manager has a DNS name of
>>>>>> the form flink-taskmanager-0.flink-taskmanager-svc).
>>>>>>
>>>>>> Additionally we set the jobmanager.rpc.address property on all the
>>>>>> nodes and each task manager sets the taskmanager.host property, all
>>>>>> matching the ones on the certificate.
>>>>>>
>>>>>> This works well when running a job with parallelism set to 1. The
>>>>>> SSL validation succeeds and the JobManager can communicate with the
>>>>>> TaskManager and vice versa.
>>>>>>
>>>>>> But when we set the parallelism to more than 1, we get exceptions
>>>>>> during SSL validation like this:
>>>>>>
>>>>>> Caused by: java.security.cert.CertificateException: No subject alternative names matching IP address 172.30.247.163 found
>>>>>>     at sun.security.util.HostnameChecker.matchIP(HostnameChecker.java:168)
>>>>>>     at sun.security.util.HostnameChecker.match(HostnameChecker.java:94)
>>>>>>     at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:455)
>>>>>>     at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:436)
>>>>>>     at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:252)
>>>>>>     at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:136)
>>>>>>     at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1601)
>>>>>>     ... 21 more
>>>>>>
>>>>>>
>>>>>> From the logs I see the Jobmanager is correctly registering the
>>>>>> taskmanagers:
>>>>>>
>>>>>> org.apache.flink.runtime.instance.InstanceManager   - Registered TaskManager at flink-taskmanager-1 (akka.ssl.tcp://flink@taiga-flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local:6122/user/taskmanager) as 1a3f59693cec8b3929ed8898edcc2700. Current number of registered hosts is 3. Current number of alive task slots is 6.
>>>>>>
>>>>>> And also each taskmanager is correctly registered to use the hostname
>>>>>> for communication:
>>>>>>
>>>>>> org.apache.flink.runtime.taskmanager.TaskManager   - TaskManager will use hostname/address 'flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local' (172.30.247.163) for communication.
>>>>>> ...
>>>>>> akka.remote.Remoting   - Remoting started; listening on addresses :[akka.ssl.tcp://flink@flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local:6122]
>>>>>> ...
>>>>>> org.apache.flink.runtime.io.network.netty.NettyConfig   - NettyConfig [server address: flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local/172.30.247.163, server port: 6121, ssl enabled: true, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 2 (manual), number of client threads: 2 (manual), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
>>>>>> ...
>>>>>> org.apache.flink.runtime.taskmanager.TaskManager   - TaskManager data connection information: bf4a9b50e57c99c17049adb66d65f685 @ flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local (dataPort=6121)
>>>>>>
>>>>>>
>>>>>>
>>>>>> But even with that, it seems the taskmanagers are using the IP to
>>>>>> communicate with each other, and the SSL validation fails.
>>>>>>
>>>>>> Do you know if it's possible to make the taskmanagers use the
>>>>>> hostname to communicate instead of the IP?
>>>>>> Or do you have any advice on getting the SSL configuration to work in
>>>>>> this environment?
>>>>>>
>>>>>> Thanks in advance.
>>>>>>
>>>>>> Regards,
>>>>>> Edward
>>>>>>
>>>>>
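
The core of the problem discussed in this thread can be sketched with plain
JDK classes. This is only an illustration, not Flink's NettyClient code: the
class PeerIdentitySketch and the method clientEngine are hypothetical names.
The sketch assumes that endpoint identification is enabled (the stack trace
in the thread suggests it was) and uses getHostName() on an address built
with a known name so that no DNS lookup is needed; Edward's message describes
the actual change as using the canonical host name.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLParameters;

// Hypothetical sketch: the peer host given to the SSLEngine is the identity
// that the server certificate's SAN entries are later checked against.
class PeerIdentitySketch {

    // Builds a client-mode engine identified either by hostname or by IP.
    static SSLEngine clientEngine(SSLContext ctx, InetSocketAddress server,
                                  boolean useHostname) {
        InetAddress addr = server.getAddress();
        // getHostAddress() yields the textual IP; getHostName() yields the name
        // the address was created with (assumption: the name is already known).
        String peerHost = useHostname ? addr.getHostName() : addr.getHostAddress();
        SSLEngine engine = ctx.createSSLEngine(peerHost, server.getPort());
        engine.setUseClientMode(true);
        // Ask the trust manager to match the certificate against peerHost.
        SSLParameters params = engine.getSSLParameters();
        params.setEndpointIdentificationAlgorithm("HTTPS");
        engine.setSSLParameters(params);
        return engine;
    }

    public static void main(String[] args) throws Exception {
        // Name and IP taken from the logs quoted in this thread.
        InetAddress addr = InetAddress.getByAddress(
            "flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local",
            new byte[] {(byte) 172, 30, (byte) 247, (byte) 163});
        InetSocketAddress server = new InetSocketAddress(addr, 6121);
        SSLContext ctx = SSLContext.getDefault();

        // Identity is the IP: a certificate with only DNS SANs cannot match it.
        System.out.println(clientEngine(ctx, server, false).getPeerHost());
        // Identity is the hostname: DNS SAN entries can be matched against it.
        System.out.println(clientEngine(ctx, server, true).getPeerHost());
    }
}
```

With the IP as peer host, a certificate that lists only DNS names in its SAN
has nothing to match against, which is consistent with the "No subject
alternative names matching IP address" exception above.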
