Rui Fan created FLINK-36513:
-------------------------------
Summary: A lot of CI failures are caused by Install cert-manager
Key: FLINK-36513
URL: https://issues.apache.org/jira/browse/FLINK-36513
Project: Flink
Issue Type: Technical Debt
Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.9.0
Reporter: Rui Fan
Assignee: Rui Fan
Attachments: image-2024-10-12-10-30-38-781.png
Many CI runs fail in the "Install cert-manager" step, for example:
[https://github.com/apache/flink-kubernetes-operator/actions/runs/11292388626/job/31408603383]
[https://github.com/apache/flink-kubernetes-operator/actions/runs/11294831791/job/31436330397]
h1. Root cause:
I checked the raw log [1]. The failure reason is: _Unable to connect to the
server: dial tcp 140.82.113.3:443: i/o timeout._
!image-2024-10-12-10-30-38-781.png|width=2702,height=982!
CI code:
[https://github.com/apache/flink-kubernetes-operator/blob/d2c01737c745979c6aadb670334565ee11aa2f4a/.github/workflows/ci.yml#L227]
This step downloads cert-manager.yaml from GitHub, and 140.82.113.3:443 is a
GitHub IP address and port. So the download of cert-manager.yaml timing out is
the root cause.
h1. Solution:
* Solution 1: Introduce a retry mechanism
** Download cert-manager.yaml first, with retries
** Then run kubectl apply -f on the local file
* Solution 2: Commit cert-manager.yaml to the flink-kubernetes-operator repo
directly
** The CI uses a fixed version, so the cert-manager.yaml is immutable IIUC.
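A minimal sketch of what the retry approach could look like as a shell helper in the CI step. The function name retry, the attempt count, the delay, and the CERT_MANAGER_URL variable are all hypothetical placeholders for illustration, not what the workflow currently contains:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the existing ci.yml):
# run a command up to max_attempts times, sleeping delay_seconds
# between failed attempts. Returns 0 on the first success, 1 otherwise.
retry() {
  local max_attempts
  local delay_seconds
  local attempt
  max_attempts="$1"
  delay_seconds="$2"
  shift 2
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Command failed after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "Attempt ${attempt} failed; retrying in ${delay_seconds}s..." >&2
    attempt=$((attempt + 1))
    sleep "$delay_seconds"
  done
}

# Assumed usage inside the CI step (CERT_MANAGER_URL is a placeholder for
# the pinned cert-manager.yaml release URL):
#   retry 5 10 curl -fsSL -o cert-manager.yaml "$CERT_MANAGER_URL"
#   kubectl apply -f cert-manager.yaml
```

Downloading to a local file first means a network timeout only fails the download command, which can be retried before kubectl apply runs. curl's built-in --retry and --retry-delay options might also be enough here, without a custom helper.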
[1]
https://productionresultssa9.blob.core.windows.net/actions-results/1c3ad627-91b6-4db3-a9ad-453109617470/workflow-job-run-3290c02c-bc49-582e-d1d1-63220f3fe3ce/logs/job/job-logs.txt?rsct=text%2Fplain&se=2024-10-12T02%3A29%3A01Z&sig=InSwCX86huA086rqGjAXM836sM8%2Bb8zk5%2FfeVJgmpsM%3D&ske=2024-10-12T13%3A37%3A00Z&skoid=ca7593d4-ee42-46cd-af88-8b886a2f84eb&sks=b&skt=2024-10-12T01%3A37%3A00Z&sktid=398a6654-997b-47e9-b12b-9515b896b4de&skv=2024-08-04&sp=r&spr=https&sr=b&st=2024-10-12T02%3A18%3A56Z&sv=2024-08-04
--
This message was sent by Atlassian Jira
(v8.20.10#820010)