Hi there,

I’m trying to run my Flink job on a Kubernetes cluster, but when I give the job a larger parallelism (128), I get the error “java.util.concurrent.TimeoutException: The heartbeat of JobManager with id 56ad1a5ded99f9f16ec1c786ad299159 timed out.” The job is then cancelled.

We have confirmed that it is not a network issue, since:

* We encounter this error every time we run the job with the larger parallelism (128), but it runs fine with smaller parallelism (32/64).
* We are using the k8s cluster in our production environment, and no other containers have network problems.
* When we set “heartbeat.timeout” to a larger value such as 300s, the error no longer occurs.

My settings and environment:

* Flink 1.12.5 with Java 8, Scala 2.11
* JobManager start command:

    $JAVA_HOME/bin/java -classpath $FLINK_CLASSPATH -Xmx15703474176 -Xms15703474176 -XX:MaxMetaspaceSize=268435456 -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintPromotionFailure -XX:+PrintGCCause -XX:+PrintHeapAtGC -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 -Dlog.file=/opt/flink/log/jobmanager.log -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties -Dlog4j.configurationFile=file:/opt/flink/conf/log4j-console.properties org.apache.flink.kubernetes.entrypoint.KubernetesApplicationClusterEntrypoint -D jobmanager.memory.off-heap.size=134217728b -D jobmanager.memory.jvm-overhead.min=1073741824b -D jobmanager.memory.jvm-metaspace.size=268435456b -D jobmanager.memory.heap.size=15703474176b -D jobmanager.memory.jvm-overhead.max=1073741824b

* TaskManager start command:

    $JAVA_HOME/bin/java -classpath $FLINK_CLASSPATH -Xmx1664299798 -Xms1664299798 -XX:MaxDirectMemorySize=493921243 -XX:MaxMetaspaceSize=268435456 -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintPromotionFailure -XX:+PrintGCCause -XX:+PrintHeapAtGC -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 -Dlog.file=/opt/flink/log/taskmanager.log -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties -Dlog4j.configurationFile=file:/opt/flink/conf/log4j-console.properties org.apache.flink.kubernetes.taskmanager.KubernetesTaskExecutorRunner -D taskmanager.memory.framework.off-heap.size=134217728b -D taskmanager.memory.network.max=359703515b -D taskmanager.memory.network.min=359703515b -D taskmanager.memory.framework.heap.size=134217728b -D taskmanager.memory.managed.size=1438814063b -D taskmanager.cpu.cores=1.0 -D taskmanager.memory.task.heap.size=1530082070b -D taskmanager.memory.task.off-heap.size=0b -D taskmanager.memory.jvm-metaspace.size=268435456b -D taskmanager.memory.jvm-overhead.max=429496736b -D taskmanager.memory.jvm-overhead.min=429496736b --configDir /opt/flink/conf -Djobmanager.rpc.address='10.50.132.154' -Dpipeline.classpaths='file:usrlib/flink-playground-clickcountjob-print.jar' -Djobmanager.memory.off-heap.size='134217728b' -Dweb.tmpdir='/tmp/flink-web-07190d10-c6ea-4b1a-9eee-b2d0b2711a76' -Drest.address='10.50.132.154' -Djobmanager.memory.jvm-overhead.max='1073741824b' -Djobmanager.memory.jvm-overhead.min='1073741824b' -Dtaskmanager.resource-id='stream-3111167f634e41349f7195961cdb0c6c-taskmanager-1-17' -Dexecution.target='embedded' -Dpipeline.jars='file:/opt/flink/usrlib/flink-playground-clickcountjob-print.jar' -Djobmanager.memory.jvm-metaspace.size='268435456b' -Djobmanager.memory.heap.size='15703474176b'

Is this expected behavior? Could you give me some guidance on how to troubleshoot this issue?

BRs
Chenyu
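
P.S. For reference, the heartbeat workaround mentioned above could be written roughly as below. This is only a sketch under two assumptions that are not stated in the message: that the option is set in flink-conf.yaml rather than passed as a -D dynamic property, and that the value is given in milliseconds (300s = 300000 ms), which is how heartbeat.timeout is defined in Flink 1.12 as far as I know.

```yaml
# Sketch of the workaround, assuming flink-conf.yaml is used.
# heartbeat.timeout is taken to be in milliseconds; the Flink default is 50000 (50s).
heartbeat.timeout: 300000   # 300s, the value under which the error stopped appearing
```

Raising the timeout only hides the symptom, so the question of why heartbeats are delayed at parallelism 128 still stands.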