[ https://issues.apache.org/jira/browse/FLINK-35192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839545#comment-17839545 ]
chenyuzhi edited comment on FLINK-35192 at 4/22/24 8:18 AM:
------------------------------------------------------------

The YAML spec:
{code:yaml}
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "3"
    meta.helm.sh/release-name: flink-kubernetes-operator
    meta.helm.sh/release-namespace: streamfly
  creationTimestamp: "2024-03-13T02:55:09Z"
  generation: 3
  labels:
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: flink-kubernetes-operator
    app.kubernetes.io/version: 1.6.1-GDC1.0.2
    helm.sh/chart: flink-kubernetes-operator-1.6.1-GDC1.0.2
  name: flink-kubernetes-operator
  namespace: streamfly
  resourceVersion: "8064936654"
  uid: 00418b62-820f-4e4a-a138-1ff81f605787
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/name: flink-kubernetes-operator
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/default-container: flink-kubernetes-operator
      creationTimestamp: null
      labels:
        app.kubernetes.io/name: flink-kubernetes-operator
    spec:
      containers:
      - command:
        - /docker-entrypoint.sh
        - operator
        env:
        - name: OPERATOR_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: HOST_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        - name: POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: OPERATOR_NAME
          value: flink-kubernetes-operator
        - name: FLINK_CONF_DIR
          value: /opt/flink/conf
        - name: FLINK_PLUGINS_DIR
          value: /opt/flink/plugins
        - name: LOG_CONFIG
          value: -Dlog4j.configurationFile=/opt/flink/conf/log4j-operator.properties
        - name: JVM_ARGS
          value: -Xmx32g -Xms32g -XX:+UseG1GC
        - name: TZ
          value: Asia/Shanghai
        image: ncr.nie.netease.com/v1-gdcstreaming/gdc-flink-kubernetes-operator:1.6.1-GDC1.0.2
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: health-port
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: flink-kubernetes-operator
        ports:
        - containerPort: 9999
          name: metrics
          protocol: TCP
        - containerPort: 8085
          name: health-port
          protocol: TCP
        resources:
          limits:
            cpu: "10"
            memory: 35Gi
          requests:
            cpu: "10"
            memory: 35Gi
        securityContext: {}
        startupProbe:
          failureThreshold: 30
          httpGet:
            path: /
            port: health-port
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /opt/flink/conf
          name: flink-operator-config-volume
        - mountPath: /opt/scheduler/keytab
          name: flink-operator-keytab-volume
        - mountPath: /flink-data
          name: flink-operator-logs-volume
      - command:
        - /docker-entrypoint.sh
        - webhook
        env:
        - name: WEBHOOK_KEYSTORE_PASSWORD
          valueFrom:
            secretKeyRef:
              key: password
              name: flink-operator-webhook-secret
        - name: WEBHOOK_KEYSTORE_FILE
          value: /certs/keystore.p12
        - name: WEBHOOK_KEYSTORE_TYPE
          value: pkcs12
        - name: WEBHOOK_SERVER_PORT
          value: "9443"
        - name: LOG_CONFIG
          value: -Dlog4j.configurationFile=/opt/flink/conf/log4j-operator.properties
        - name: JVM_ARGS
        - name: FLINK_CONF_DIR
          value: /opt/flink/conf
        - name: FLINK_PLUGINS_DIR
          value: /opt/flink/plugins
        - name: OPERATOR_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        image: ncr.nie.netease.com/v1-gdcstreaming/gdc-flink-kubernetes-operator:1.6.1-GDC1.0.2
        imagePullPolicy: IfNotPresent
        name: flink-webhook
        resources: {}
        securityContext: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /certs
          name: keystore
          readOnly: true
        - mountPath: /opt/flink/conf
          name: flink-operator-config-volume
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: ncr-pull-secret
      nodeSelector:
        node-role.kubernetes.io/edge: ""
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        runAsGroup: 0
        runAsUser: 0
      serviceAccount: flink-operator
      serviceAccountName: flink-operator
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          items:
          - key: flink-conf.yaml
            path: flink-conf.yaml
          - key: log4j-operator.properties
            path: log4j-operator.properties
          - key: log4j-console.properties
            path: log4j-console.properties
          name: flink-operator-config
        name: flink-operator-config-volume
      - hostPath:
          path: /cfs/flink/keytab
          type: Directory
        name: flink-operator-keytab-volume
      - hostPath:
          path: /home/k8s/logs
          type: DirectoryOrCreate
        name: flink-operator-logs-volume
      - name: keystore
        secret:
          defaultMode: 420
          items:
          - key: keystore.p12
            path: keystore.p12
          secretName: webhook-server-cert
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: "2024-03-13T02:55:09Z"
    lastUpdateTime: "2024-03-19T06:48:09Z"
    message: ReplicaSet "flink-kubernetes-operator-7756945f7" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2024-04-19T04:08:21Z"
    lastUpdateTime: "2024-04-19T04:08:21Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 3
  readyReplicas: 2
  replicas: 2
  updatedReplicas: 2
{code}

I also tried to analyze the heap memory using MAT. Here are the results of the analysis:

!screenshot-2.png!

It seems to point to a JDK bug in the [deleteOnExit API|https://bugs.openjdk.org/browse/JDK-4513817]. However, that bug should show up as a shortage of heap memory, whereas according to the JVM memory metrics there was still enough heap memory when the process exited abnormally.
It's strange.
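
A note on the headroom in the spec above: the operator container is limited to 35Gi while JVM_ARGS sets a 32g heap, which leaves roughly 3 GiB for everything outside the heap (metaspace, thread stacks, direct buffers, GC structures and other native allocations). Slowly growing RSS with a flat heap would be consistent with growth in that non-heap part. One way this could be narrowed down is HotSpot Native Memory Tracking. The fragment below is only a sketch: it assumes that /docker-entrypoint.sh in this custom image passes JVM_ARGS to the java command line, which has not been verified here.

{code:yaml}
# Fragment of the operator container's env list; only JVM_ARGS changes.
# Assumption: the entrypoint script appends JVM_ARGS to the java command line.
- name: JVM_ARGS
  # -XX:NativeMemoryTracking=summary is a standard HotSpot flag; it adds a small runtime overhead.
  value: -Xmx32g -Xms32g -XX:+UseG1GC -XX:NativeMemoryTracking=summary
{code}

With the flag set, running {{jcmd <pid> VM.native_memory summary}} inside the container prints the JVM's own per-category memory accounting, which can then be compared against the pod RSS. Memory allocated by native libraries outside the JVM is not covered by that report.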
> operator oom
> ------------
>
> Key: FLINK-35192
> URL: https://issues.apache.org/jira/browse/FLINK-35192
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: kubernetes-operator-1.6.1
> Environment: jdk: openjdk11
> operator version: 1.6.1
> Reporter: chenyuzhi
> Priority: Major
> Attachments: image-2024-04-22-15-47-49-455.png, image-2024-04-22-15-52-51-600.png, image-2024-04-22-15-58-23-269.png, image-2024-04-22-15-58-42-850.png, screenshot-1.png, screenshot-2.png
>
>
> The Kubernetes operator Docker process was killed by the kernel because it ran out of memory (at 2024.04.03 18:16):
> !image-2024-04-22-15-47-49-455.png!
> Metrics:
> The pod memory (RSS) has been increasing slowly over the past 7 days:
> !screenshot-1.png!
> However, the JVM memory metrics of the operator show no obvious anomaly:
> !image-2024-04-22-15-58-23-269.png!
> !image-2024-04-22-15-58-42-850.png!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)