Hi,
I have a job that kept restarting because of a bug, and it eventually brought down the task manager with an OutOfMemoryError in Metaspace. Please ignore the memory leak for a moment. The problem is that the task manager hangs instead of restarting. We are running Flink on Kubernetes, so a stuck task manager reduces the cluster's overall slot capacity. Any idea how to make the task manager restart on OOM?

Here are the task manager logs:

2021-06-24 08:59:30,437 INFO org.apache.kafka.clients.producer.KafkaProducer [] - [Producer clientId=producer-2] Closing the Kafka producer with timeoutMillis = 0 ms.
2021-06-24 08:59:30,437 INFO org.apache.kafka.clients.producer.KafkaProducer [] - [Producer clientId=producer-2] Proceeding to force close the producer since pending requests could not be completed within timeout 0 ms.
2021-06-24 08:59:32,200 ERROR org.apache.flink.runtime.util.FatalExitExceptionHandler [] - FATAL: Thread 'AsyncOperations-thread-8' produced an uncaught exception. Stopping the process...
java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak. In the first case 'taskmanager.memory.jvm-metaspace.size' configuration option should be increased. If the error persists (usually in cluster after several job (re-)submissions) then there is probably a class loading leak in user code or some of its dependencies which has to be investigated and fixed. The task executor has to be shutdown...
2021-06-24 08:59:41,759 INFO org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager [] - Shutting down TaskExecutorLocalStateStoresManager.
2021-06-24 08:59:46,560 INFO org.apache.flink.runtime.blob.TransientBlobCache [] - Shutting down BLOB cache
2021-06-24 08:59:45,414 INFO org.apache.flink.runtime.blob.PermanentBlobCache [] - Shutting down BLOB cache
2021-06-24 08:59:59,383 INFO org.apache.flink.runtime.filecache.FileCache [] - removed file cache directory /tmp/flink-dist-cache-bb7818d6-a691-4121-9fdb-670b7d4182c5

Fritz