[jira] [Comment Edited] (FLINK-36370) Flink 1.18 fails with Empty server certificate chain when High Availability and mTLS both enabled

Aniruddh J (Jira) Mon, 14 Oct 2024 08:11:20 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-36370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17889255#comment-17889255
 ]


Aniruddh J edited comment on FLINK-36370 at 10/14/24 3:09 PM:
--------------------------------------------------------------

Hi [~mapohl] 
{quote}Hi [~aniruddhj], that sounds like it's a Kubernetes Operator issue and 
how it's setting up the communication with Flink? Correct me if I'm wrong here.
{quote}
I am assuming the same, yes. 
----
[~mapohl] [~gyfora] please find additional observations and data below:

 
 - From our initial observation, the client setup fails (on the operator 
side) because the *observeInternal* method in the 
*AbstractFlinkDeploymentObserver* class does not succeed in its first check, 
which tests whether the JM deployment is ready: 
[https://github.com/apache/flink-kubernetes-operator/blob/b081b75b72ddde643710e869b95b214912882363/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/deployment/AbstractFlinkDeploymentObserver.java#L64]

 
 - We used ksniff to capture a tcpdump in order to better understand which 
services/pods communicate with Flink. There seems to be an IP (10.254.20.2) 
trying to communicate with Flink on port 8081. When I tried to track this IP 
across my Kubernetes environment, I couldn't find it. (Wireshark screenshot 
below)

!Screenshot 2024-10-03 at 11.42.52 AM.png|width=1059,height=320!

 
 - When we enabled only HA, with the mTLS setting turned off, something like 
below

{code:java}
high-availability.type: kubernetes
high-availability.storageDir: 'file:///mnt/pv/ha'
security.ssl.rest.authentication-enabled: 'false'{code}

we observed a similar pattern where an IP (10.254.13.81) is trying to connect 
to Flink (similar to the case above with both HA and mTLS enabled), but since 
client authentication is not required, the connection seems to go through 
without any error?

!Screenshot 2024-10-14 at 10.41.00 AM.png|width=1095,height=446!

 
 - I am assuming that in both cases (HA only, and HA with mTLS enabled) this is 
something that happens within Flink, before the client is successfully set up 
on the operator side. But when only HA is enabled, the issue goes away once 
the client is successfully set up, which does not happen when both HA and 
mTLS are enabled? 

Attaching logs for both flink-main-container (operand pod) and 
flink-kubernetes-operator (operator pod) in case they help in troubleshooting 
the problem.

[^flink-ssl-66c8dfbcc7-l725q-flink-main-container.log]

[^flink-kubernetes-operator-54b9b99bd5-hkh8q-flink-kubernetes-operator.log]
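
One way to track down an unidentified source address such as 10.254.20.2 is to 
match it against the pod IPs, host (node) IPs and service cluster IPs that 
Kubernetes reports. The following is only a minimal sketch of that lookup, not 
something taken from the attached logs: the function name find_ip_owners is 
hypothetical, and it assumes the two inputs are the parsed JSON output of 
"kubectl get pods -A -o json" and "kubectl get svc -A -o json".

```python
from typing import Any, Dict, List, Tuple

Owner = Tuple[str, str, str]  # (kind, namespace, name)


def find_ip_owners(ip: str, pods: Dict[str, Any], services: Dict[str, Any]) -> List[Owner]:
    """Return every pod, node or service that reports the given IP.

    `pods` and `services` are assumed to be parsed `kubectl ... -o json`
    list objects (a dict with an "items" array). Hypothetical helper.
    """
    owners: List[Owner] = []
    seen_nodes = set()

    for pod in pods.get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})

        # A pod can report a single podIP and/or a list of podIPs.
        pod_ips = {status.get("podIP")}
        pod_ips.update(entry.get("ip") for entry in status.get("podIPs") or [])
        if ip in pod_ips:
            owners.append(("pod", meta.get("namespace", ""), meta.get("name", "")))

        # hostIP is the address of the node the pod is scheduled on.
        node = pod.get("spec", {}).get("nodeName", "")
        if status.get("hostIP") == ip and node not in seen_nodes:
            seen_nodes.add(node)
            owners.append(("node", "", node))

    for svc in services.get("items", []):
        meta = svc.get("metadata", {})
        spec = svc.get("spec", {})
        cluster_ips = {spec.get("clusterIP")}
        cluster_ips.update(spec.get("clusterIPs") or [])
        if ip in cluster_ips:
            owners.append(("service", meta.get("namespace", ""), meta.get("name", "")))

    return owners
```

If the lookup comes back empty, as it did in the manual search described above, 
the address may belong to something that is neither a pod nor a service (this 
is an assumption, not something the captures confirm), so node-level interfaces 
would be the next place to look.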

 



> Flink 1.18 fails with Empty server certificate chain when High Availability 
> and mTLS both enabled
> -------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-36370
>                 URL: https://issues.apache.org/jira/browse/FLINK-36370
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator, Runtime / Coordination
>    Affects Versions: kubernetes-operator-1.7.0, 1.18.1
>            Reporter: Aniruddh J
>            Priority: Major
>         Attachments: Screenshot 2024-10-03 at 11.42.52 AM.png, Screenshot 
> 2024-10-14 at 10.41.00 AM.png, flink-cert-issue.log, 
> flink-kubernetes-operator-54b9b99bd5-hkh8q-flink-kubernetes-operator.log, 
> flink-ssl-66c8dfbcc7-l725q-flink-main-container.log
>
>
> Hi, in my Kubernetes cluster I have flink-kubernetes-operator v1.7.0 and 
> apache-flink v1.18.1 installed. In the FlinkDeployment CR, when I enable 
> Kubernetes high availability services together with mTLS, something like below:
> {code:java}
> high-availability.type: kubernetes
> high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
> high-availability.storageDir: 'file:///mnt/pv/ha'
> security.ssl.rest.authentication-enabled: 'true'{code}
> I am ending up with *SSLHandshakeException with empty client certificate*
>  
> Both of them work fine when enabled individually, though. After enabling 
> *{{-Djavax.net.debug=all}}* I observed the client-server communication and 
> figured out that 
> [https://github.com/apache/flink/blob/release-1.18/flink-runtime/src/main/java/org/apache/flink/runtime/rest/RestClient.java]
>  is where the client gets set up, and that this happens from the operator side at 
> [https://github.com/apache/flink-kubernetes-operator/blob/b081b75b72ddde643710e869b95b214912882363/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L750]
>  (correct me here please)
>  
> When we enable both mTLS and HA, the client doesn't seem to get set up. 
> Not only that, it doesn't follow the same path of client creation. Below is 
> the part of the SSL handshake log just before the error (the entire SSL 
> handshake log is attached):
> {code:java}
> javax.net.ssl|DEBUG|53|flink-rest-server-netty-worker-thread-1|2024-09-19 
> 15:16:12.508 GMT|null:-1|Produced CertificateRequest handshake message (
> "CertificateRequest":
> { "certificate types": [ecdsa_sign, rsa_sign, dss_sign] "supported signature 
> algorithms": [ecdsa_secp256r1_sha256, .., rsa_sha224, dsa_sha224, ecdsa_sha1, 
> rsa_pkcs1_sha1, dsa_sha1] "certificate authorities": [CN=FlinkCA, O=Apache 
> Flink] }
> javax.net.ssl|DEBUG|53|flink-rest-server-netty-worker-thread-1|2024-09-19 
> 15:16:12.512 GMT|null:-1|Raw read (
> 0000: 1603030007 0B 000003000000 ............
> )
> javax.net.ssl|DEBUG|53|flink-rest-server-netty-worker-thread-1|2024-09-19 
> 15:16:12.513 GMT|null:-1|READ: TLSv1.2 handshake, length = 7
> javax.net.ssl|DEBUG|53|flink-rest-server-netty-worker-thread-1|2024-09-19 
> 15:16:12.513 GMT|null:-1|Consuming client Certificate handshake message (
> "Certificates": <empty list>
> )
> javax.net.ssl|ERROR|53|flink-rest-server-netty-worker-thread-1|2024-09-19 
> 15:16:12.514 GMT|null:-1|Fatal (BAD_CERTIFICATE): Empty server certificate 
> chain (
> "throwable" : {
> javax.net.ssl.SSLHandshakeException: Empty server certificate chain
> {code}
> At first glance, it seems that when the Flink server requests a certificate 
> from the client, the client doesn't send anything back since it does not 
> have a certificate matching the CA?
>  
> Some client is sending a REST request to Flink server which the netty library 
> is handling but until we figure out the client we don't know whether it's the 
> truststore on client that's a problem or something else we don't see here.
>  
> *Note: The certificates for Flink are self-signed certificates.*
> Thanks!
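
The "Raw read" bytes in the handshake log above (1603030007 0B 000003000000) 
can be decoded by hand to confirm what the JDK reports. The sketch below 
follows the TLS 1.2 record and handshake layout (1-byte content type, 2-byte 
version, 2-byte record length, then a 1-byte handshake type, 3-byte message 
length and 3-byte certificate list length). The function name 
decode_certificate_record is hypothetical, and it only handles a single 
unencrypted record that holds one Certificate message.

```python
from typing import Any, Dict


def decode_certificate_record(hex_dump: str) -> Dict[str, Any]:
    """Decode one plaintext TLS record that carries a Certificate message.

    Hypothetical helper for reading javax.net.debug "Raw read" dumps.
    """
    raw = bytes.fromhex("".join(hex_dump.split()))

    # TLS record header: content type (1), version (2), length (2).
    content_type = raw[0]
    version = (raw[1], raw[2])
    record_length = int.from_bytes(raw[3:5], "big")
    body = raw[5:5 + record_length]

    # Handshake header: message type (1), message length (3).
    handshake_type = body[0]
    message_length = int.from_bytes(body[1:4], "big")

    # Certificate body starts with the certificate list length (3).
    certificate_list_length = int.from_bytes(body[4:7], "big")

    return {
        "is_handshake": content_type == 0x16,
        "is_tls12": version == (3, 3),
        "record_length": record_length,
        "is_certificate_message": handshake_type == 0x0B,
        "message_length": message_length,
        "certificate_list_length": certificate_list_length,
        "certificate_list_empty": certificate_list_length == 0,
    }
```

For the bytes in the log this gives a TLS 1.2 handshake record of length 7 
containing a Certificate message whose certificate list has length 0, which 
matches the "Certificates": <empty list> line: the connecting client answered 
the CertificateRequest but presented no certificate at all.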



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
