[ 
https://issues.apache.org/jira/browse/FLINK-36513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Fan updated FLINK-36513:
----------------------------
    Description: 
Many CI failures are caused by the "Install cert-manager" step, for example:

[https://github.com/apache/flink-kubernetes-operator/actions/runs/11292388626/job/31408603383]

[https://github.com/apache/flink-kubernetes-operator/actions/runs/11294831791/job/31436330397]
h1. Root cause:

I checked the raw log [1]. The failure reason is: _Unable to connect to the 
server: dial tcp 140.82.113.3:443: i/o timeout._

!image-2024-10-12-10-30-38-781.png|width=2702,height=982!

 

CI code: 
[https://github.com/apache/flink-kubernetes-operator/blob/d2c01737c745979c6aadb670334565ee11aa2f4a/.github/workflows/ci.yml#L227]

 

This step downloads cert-manager.yaml from GitHub, and 140.82.113.3:443 is a 
GitHub IP address and port. So the failed download of cert-manager.yaml is the 
root cause.

 
h1. Solution:
 * Solution 1: Introduce a retry mechanism
 ** Download cert-manager.yaml first, with retries
 ** Then run kubectl apply -f on the local file
 * Solution 2: Commit cert-manager.yaml to the flink-kubernetes-operator repo 
directly
 ** The CI uses a fixed version, so the cert-manager.yaml is immutable IIUC.
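
A rough sketch of how the download-then-apply idea with retries might look as a shell step. This is only an illustration, not the actual fix: the helper names (retry, install_cert_manager), the attempt count, the delay, and the local file name are all assumptions, and the cert-manager.yaml URL is left as a parameter because it should be taken from the existing step in ci.yml.

```shell
#!/usr/bin/env bash
# Sketch only: helper names, attempt count and delay are assumptions,
# not taken from the flink-kubernetes-operator CI.

# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Command failed after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "Attempt ${attempt} failed, retrying in ${delay}s: $*" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}

# install_cert_manager URL [LOCAL_FILE]
# Downloads the manifest with retries, then applies the local copy,
# so kubectl itself never has to reach GitHub.
install_cert_manager() {
  local url="$1" file="${2:-cert-manager.yaml}"
  retry 5 10 curl --fail --silent --show-error --location \
    --output "$file" "$url" || return 1
  kubectl apply -f "$file"
}
```

The CI step would then call install_cert_manager with the same cert-manager.yaml URL that ci.yml passes to kubectl apply -f today. Keeping the download separate from the apply means a network timeout is retried without re-running kubectl.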

 

[1] 
[https://github.com/apache/flink-kubernetes-operator/commit/d2c01737c745979c6aadb670334565ee11aa2f4a/checks/31436330397/logs|https://productionresultssa9.blob.core.windows.net/actions-results/1c3ad627-91b6-4db3-a9ad-453109617470/workflow-job-run-3290c02c-bc49-582e-d1d1-63220f3fe3ce/logs/job/job-logs.txt?rsct=text%2Fplain&se=2024-10-12T02%3A29%3A01Z&sig=InSwCX86huA086rqGjAXM836sM8%2Bb8zk5%2FfeVJgmpsM%3D&ske=2024-10-12T13%3A37%3A00Z&skoid=ca7593d4-ee42-46cd-af88-8b886a2f84eb&sks=b&skt=2024-10-12T01%3A37%3A00Z&sktid=398a6654-997b-47e9-b12b-9515b896b4de&skv=2024-08-04&sp=r&spr=https&sr=b&st=2024-10-12T02%3A18%3A56Z&sv=2024-08-04]

 

  was:
A lot of CI failures are caused by Install cert-manager, such as:

[https://github.com/apache/flink-kubernetes-operator/actions/runs/11292388626/job/31408603383]

[https://github.com/apache/flink-kubernetes-operator/actions/runs/11294831791/job/31436330397]
h1. Root cause:

I checked the raw log[1], the failure reason is : _Unable to connect to the 
server: dial tcp 140.82.113.3:443: i/o timeout._

!image-2024-10-12-10-30-38-781.png|width=2702,height=982!

 

CI code: 
[https://github.com/apache/flink-kubernetes-operator/blob/d2c01737c745979c6aadb670334565ee11aa2f4a/.github/workflows/ci.yml#L227]

 

It needs to download cert-manager.yaml from github, and 140.82.113.3:443 is the 
github ip+port. So download cert-manager.yaml is the root cause.

 
h1. Solution:
 * Solution1: Introducing retry mechanism

 ** Download cert-manager.yaml first with retry mechanism
 ** Then kubectl apply -f local file
 * Solution2: Put the cert-manager.yaml on the flink-kubernetes-operator repo 
directly
 ** I saw it's a fixed version, so the cert-manager.yaml is immutable IIUC.

 

[1] 
https://productionresultssa9.blob.core.windows.net/actions-results/1c3ad627-91b6-4db3-a9ad-453109617470/workflow-job-run-3290c02c-bc49-582e-d1d1-63220f3fe3ce/logs/job/job-logs.txt?rsct=text%2Fplain&se=2024-10-12T02%3A29%3A01Z&sig=InSwCX86huA086rqGjAXM836sM8%2Bb8zk5%2FfeVJgmpsM%3D&ske=2024-10-12T13%3A37%3A00Z&skoid=ca7593d4-ee42-46cd-af88-8b886a2f84eb&sks=b&skt=2024-10-12T01%3A37%3A00Z&sktid=398a6654-997b-47e9-b12b-9515b896b4de&skv=2024-08-04&sp=r&spr=https&sr=b&st=2024-10-12T02%3A18%3A56Z&sv=2024-08-04

 


> A lot of CI failures are caused by Install cert-manager
> -------------------------------------------------------
>
>                 Key: FLINK-36513
>                 URL: https://issues.apache.org/jira/browse/FLINK-36513
>             Project: Flink
>          Issue Type: Technical Debt
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.9.0
>            Reporter: Rui Fan
>            Assignee: Rui Fan
>            Priority: Major
>         Attachments: image-2024-10-12-10-30-38-781.png



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
