Rui Fan created FLINK-36513:
-------------------------------

             Summary: A lot of CI failures are caused by Install cert-manager
                 Key: FLINK-36513
                 URL: https://issues.apache.org/jira/browse/FLINK-36513
             Project: Flink
          Issue Type: Technical Debt
          Components: Kubernetes Operator
    Affects Versions: kubernetes-operator-1.9.0
            Reporter: Rui Fan
            Assignee: Rui Fan
         Attachments: image-2024-10-12-10-30-38-781.png

Many CI failures are caused by the "Install cert-manager" step, for example:

[https://github.com/apache/flink-kubernetes-operator/actions/runs/11292388626/job/31408603383]

[https://github.com/apache/flink-kubernetes-operator/actions/runs/11294831791/job/31436330397]

h1. Root cause:

I checked the raw log [1]; the failure reason is:

_Unable to connect to the server: dial tcp 140.82.113.3:443: i/o timeout._

!image-2024-10-12-10-30-38-781.png|width=2702,height=982!

CI code: [https://github.com/apache/flink-kubernetes-operator/blob/d2c01737c745979c6aadb670334565ee11aa2f4a/.github/workflows/ci.yml#L227]

The step needs to download cert-manager.yaml from GitHub, and 140.82.113.3:443 is a GitHub IP address and port. So the timeout while downloading cert-manager.yaml is the root cause.

h1. Solution:

* Solution1: Introduce a retry mechanism
** Download cert-manager.yaml first, with retries
** Then run kubectl apply -f on the local file
* Solution2: Put cert-manager.yaml in the flink-kubernetes-operator repo directly
** The CI uses a fixed version, so the cert-manager.yaml is immutable IIUC.

[1] https://productionresultssa9.blob.core.windows.net/actions-results/1c3ad627-91b6-4db3-a9ad-453109617470/workflow-job-run-3290c02c-bc49-582e-d1d1-63220f3fe3ce/logs/job/job-logs.txt?rsct=text%2Fplain&se=2024-10-12T02%3A29%3A01Z&sig=InSwCX86huA086rqGjAXM836sM8%2Bb8zk5%2FfeVJgmpsM%3D&ske=2024-10-12T13%3A37%3A00Z&skoid=ca7593d4-ee42-46cd-af88-8b886a2f84eb&sks=b&skt=2024-10-12T01%3A37%3A00Z&sktid=398a6654-997b-47e9-b12b-9515b896b4de&skv=2024-08-04&sp=r&spr=https&sr=b&st=2024-10-12T02%3A18%3A56Z&sv=2024-08-04

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
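
Solution1 above could be sketched roughly as follows. This is only an illustration, not the actual CI change: the helper names {{retry}} and {{install_cert_manager}}, the {{CERT_MANAGER_URL}} variable, and the attempt count and delay are all assumptions and do not come from ci.yml.

```shell
# Hypothetical helper: run a command until it succeeds or attempts run out.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
  local max_attempts="$1"
  local delay="$2"
  shift 2
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Command failed after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "Attempt ${attempt} failed; retrying in ${delay}s..." >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}

# Hypothetical wrapper: download the manifest with retries, then apply the
# local copy so that kubectl itself never has to reach GitHub.
# The URL is passed in by the caller; it is not hard-coded here.
install_cert_manager() {
  local url="$1"
  local file="${2:-cert-manager.yaml}"
  # -f makes curl return non-zero on HTTP errors so that retry can see them.
  retry 5 10 curl -fsSL -o "$file" "$url" || return 1
  kubectl apply -f "$file"
}
```

In a workflow step this might be called as {{install_cert_manager "$CERT_MANAGER_URL"}}, with the variable set to whatever URL the CI currently applies. curl also has a built-in {{--retry}} option, which may be enough on its own if a separate helper is not wanted.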