Hi all,

I'm intermittently observing an issue, which has been hard to reproduce, where task managers are unable to register with the Flink cluster. We provision only the number of task managers required to run a given application, so the loss of any one of them puts the job into a crash loop in which it fails to acquire the required task slots.

The failure occurs after a job has been running for a while and after there have been job and task manager restarts. We run in Kubernetes, so pod disruptions occur from time to time; however, we are running with the high availability setup [0].
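For reference, our HA configuration follows [0]; a minimal sketch of the relevant flink-conf.yaml options is below (the cluster id matches the ConfigMap names in the logs further down, and the storage path is a placeholder):

```
# flink-conf.yaml (sketch; storage path is a placeholder)
kubernetes.cluster-id: streaming-sales-model-staging
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.storageDir: s3://<ha-bucket>/flink/recovery
```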
Has anyone encountered this before? Any suggestions? Below are some error messages pulled from the task managers failing to re-register.

```
] - Starting DefaultLeaderRetrievalService with KubernetesLeaderRetrievalDriver{configMapName='streaming-sales-model-staging-resourcemanager-leader'}.
2021-08-16 13:15:10,112 INFO org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - Starting DefaultLeaderElectionService with KubernetesLeaderElectionDriver{configMapName='streaming-sales-model-staging-resourcemanager-leader'}.
2021-08-16 13:15:10,205 INFO org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - New leader elected 7f504af8-5f00-494e-8ecb-c3e67688bae8 for streaming-sales-model-staging-restserver-leader.
2021-08-16 13:15:10,205 INFO org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - New leader elected 7f504af8-5f00-494e-8ecb-c3e67688bae8 for streaming-sales-model-staging-resourcemanager-leader.
2021-08-16 13:15:10,205 INFO org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - New leader elected 7f504af8-5f00-494e-8ecb-c3e67688bae8 for streaming-sales-model-staging-dispatcher-leader.
2021-08-16 13:15:10,211 INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Starting DefaultLeaderRetrievalService with KubernetesLeaderRetrievalDriver{configMapName='streaming-sales-model-staging-dispatcher-leader'}.
2021-08-16 13:16:26,103 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever [] - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@flink-jobmanager:6123/user/rpc/dispatcher_1.
2021-08-16 13:16:30,978 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever [] - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@flink-jobmanager:6123/user/rpc/dispatcher_1.
```

```
2021-08-15 14:02:21,078 ERROR org.apache.kafka.common.utils.KafkaThread [] - Uncaught exception in thread 'kafka-producer-network-thread | trickle-producer-monorail_sales_facts_non_recent_v0_1-1629035259075':
java.lang.NoClassDefFoundError: org/apache/kafka/clients/NetworkClient$1
    at org.apache.kafka.clients.NetworkClient.processDisconnection(NetworkClient.java:748) ~[?:?]
    at org.apache.kafka.clients.NetworkClient.handleDisconnections(NetworkClient.java:899) ~[?:?]
    at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:560) ~[?:?]
    at org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:324) ~[?:?]
    at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:239) ~[?:?]
    at java.lang.Thread.run(Unknown Source) [?:?]
Caused by: java.lang.ClassNotFoundException: org.apache.kafka.clients.NetworkClient$1
    at java.net.URLClassLoader.findClass(Unknown Source) ~[?:?]
    at java.lang.ClassLoader.loadClass(Unknown Source) ~[?:?]
    at org.apache.flink.util.FlinkUserCodeClassLoader.loadClassWithoutExceptionHandling(FlinkUserCodeClassLoader.java:64) ~[flink-dist_2.12-1.13.1-shopify-3491d59.jar:1.13.1-shopify-3491d59]
    at org.apache.flink.util.ChildFirstClassLoader.loadClassWithoutExceptionHandling(ChildFirstClassLoader.java:74) ~[flink-dist_2.12-1.13.1-shopify-3491d59.jar:1.13.1-shopify-3491d59]
    at org.apache.flink.util.FlinkUserCodeClassLoader.loadClass(FlinkUserCodeClassLoader.java:48) ~[flink-dist_2.12-1.13.1-shopify-3491d59.jar:1.13.1-shopify-3491d59]
    at java.lang.ClassLoader.loadClass(Unknown Source) ~[?:?]
    ... 6 more
```

```
connection to [null] failed with java.net.ConnectException: Connection refused: flink-jobmanager/10.28.65.100:6123
2021-08-16 13:14:59,668 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@flink-jobmanager:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@flink-jobmanager:6123]] Caused by: [java.net.ConnectException: Connection refused: flink-jobmanager/10.28.65.100:6123]
2021-08-16 13:14:59,669 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp://flink@flink-jobmanager:6123/user/rpc/resourcemanager_0, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@flink-jobmanager:6123/user/rpc/resourcemanager_0.
```

```
2021-08-15 16:55:13,222 ERROR org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.rejectedExecution [] - Failed to submit a listener notification task. Event loop shut down?
java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak. In the first case 'taskmanager.memory.jvm-metaspace.size' configuration option should be increased. If the error persists (usually in cluster after several job (re-)submissions) then there is probably a class loading leak in user code or some of its dependencies which has to be investigated and fixed. The task executor has to be shutdown...
```
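Based on the metaspace error above, our working (and unconfirmed) theory is a class loading leak: the Kafka producer's network thread appears to survive a job restart and then hits the already-closed user-code classloader (the NoClassDefFoundError), and repeated restarts eventually exhaust metaspace. The two changes we're considering are sketched below: giving the task managers more metaspace headroom and resolving the Kafka classes parent-first. The values are guesses we haven't validated, and the parent-first option only helps if the Kafka classes are also available on the Flink lib/ classpath rather than only in the user jar.

```
# flink-conf.yaml (sketch; values are guesses, not validated)

# More metaspace headroom for the task manager JVM (the 1.13 default is 256m).
taskmanager.memory.jvm-metaspace.size: 512m

# Resolve Kafka classes through the parent classloader so job restarts don't
# re-load them per user-code classloader; requires the Kafka classes on lib/.
classloader.parent-first-patterns.additional: org.apache.kafka
```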
[0] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/resource-providers/standalone/kubernetes/#high-availability-with-standalone-kubernetes