[ https://issues.apache.org/jira/browse/FLINK-36513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rui Fan updated FLINK-36513:
----------------------------
    Description: updated. The previous text (was:) is the same apart from the image size (width=2702,height=982) and minor markup differences. The current issue follows.

> A lot of CI failures are caused by Install cert-manager
> -------------------------------------------------------
>
>                 Key: FLINK-36513
>                 URL: https://issues.apache.org/jira/browse/FLINK-36513
>             Project: Flink
>          Issue Type: Technical Debt
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.9.0
>            Reporter: Rui Fan
>            Assignee: Rui Fan
>            Priority: Major
>         Attachments: image-2024-10-12-10-30-38-781.png
>
>
> A lot of CI failures are caused by the "Install cert-manager" step, such as:
> [https://github.com/apache/flink-kubernetes-operator/actions/runs/11292388626/job/31408603383]
> [https://github.com/apache/flink-kubernetes-operator/actions/runs/11294831791/job/31436330397]
>
> h1. Root cause:
> I checked the raw log[1]; the failure reason is: _Unable to connect to the server: dial tcp 140.82.113.3:443: i/o timeout._
> !image-2024-10-12-10-30-38-781.png|width=1354,height=492!
>
> CI code:
> [https://github.com/apache/flink-kubernetes-operator/blob/d2c01737c745979c6aadb670334565ee11aa2f4a/.github/workflows/ci.yml#L227]
>
> The step needs to download cert-manager.yaml from GitHub, and 140.82.113.3:443 is a GitHub IP and port. So downloading cert-manager.yaml is the root cause.
>
> h1. Solution:
> * Solution1: Introduce a retry mechanism
> ** Download cert-manager.yaml first, with retries
> ** Then run kubectl apply -f on the local file
> * Solution2: Put cert-manager.yaml in the flink-kubernetes-operator repo directly
> ** I saw it's a fixed version, so cert-manager.yaml is immutable IIUC.
>
> [1] [https://github.com/apache/flink-kubernetes-operator/commit/d2c01737c745979c6aadb670334565ee11aa2f4a/checks/31436330397/logs]
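To make Solution1 concrete, here is a minimal Python sketch of "download first with retries, then kubectl apply -f the local file". It is only an illustration, not the change made in the operator's ci.yml. The function names (fetch_url, download_with_retry, apply_manifest), the default of 5 attempts and the 10 second delay are all assumptions, and the manifest URL is left as a parameter because the message does not quote it.

```python
import subprocess
import time
import urllib.request
from pathlib import Path
from typing import Callable


def fetch_url(url: str, timeout: float = 30.0) -> bytes:
    """Download url and return the response body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_with_retry(
    url: str,
    dest: Path,
    attempts: int = 5,            # assumed value, not taken from the issue
    delay_seconds: float = 10.0,  # assumed value, not taken from the issue
    fetch: Callable[[str], bytes] = fetch_url,
    sleep: Callable[[float], None] = time.sleep,
) -> Path:
    """Save url to dest, retrying when the download raises OSError."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            data = fetch(url)
        except OSError as error:
            # urllib.error.URLError and socket timeouts are OSError subclasses.
            if attempt == attempts:
                raise
            print(f"download attempt {attempt}/{attempts} failed: {error}; retrying")
            sleep(delay_seconds)
        else:
            dest.write_bytes(data)
            return dest
    raise AssertionError("unreachable")


def apply_manifest(path: Path) -> None:
    """Apply the already-downloaded manifest; kubectl no longer touches the network for it."""
    subprocess.run(["kubectl", "apply", "-f", str(path)], check=True)
```

A CI step could then call apply_manifest(download_with_retry(url, Path("cert-manager.yaml"))), where url is whatever the workflow currently passes to kubectl. Separating the download from the apply means a transient timeout while reaching GitHub costs one retry instead of failing the whole job.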