I found this configuration option, which controls JVM exit on OOM. Let me try it out:

taskmanager.jvm-exit-on-oom: true
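
For reference, this is roughly how I expect it to look in flink-conf.yaml. Treat it as a sketch rather than a verified fix: the metaspace value is only an illustrative placeholder (the option name comes from the error message in the log below), and as far as I understand the docs, jvm-exit-on-oom applies when a task thread throws an OutOfMemoryError, so it may not cover every case.

```yaml
# flink-conf.yaml (sketch; values are assumptions, not recommendations)

# Kill the TaskManager JVM when a task thread throws an OutOfMemoryError,
# so that Kubernetes can restart the pod instead of leaving it hung.
taskmanager.jvm-exit-on-oom: true

# Optional: raise the metaspace limit if the job really needs more room
# to load classes. 512m is a placeholder value for illustration only.
# taskmanager.memory.jvm-metaspace.size: 512m
```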

> On Jun 24, 2021, at 8:07 AM, Fritz Budiyanto <fbudi...@icloud.com> wrote:
>
> Hi,
>
> I have a job which kept on restarting due to a bug, and it brought down the
> task manager with it due to OOM Metaspace. Please ignore the memory leak for
> a moment; the problem here is that the task manager does not restart and
> hangs, which reduces the overall slot capacity. We are running Flink in
> Kubernetes, and a stuck task manager is a problem because it reduces cluster
> capacity.
>
> Any idea how to make the task manager restart upon OOM? Here are the logs of
> the task manager:
>
> 2021-06-24 08:59:30,437 INFO org.apache.kafka.clients.producer.KafkaProducer [] - [Producer clientId=producer-2] Closing the Kafka producer with timeoutMillis = 0 ms.
> 2021-06-24 08:59:30,437 INFO org.apache.kafka.clients.producer.KafkaProducer [] - [Producer clientId=producer-2] Proceeding to force close the producer since pending requests could not be completed within timeout 0 ms.
> 2021-06-24 08:59:32,200 ERROR org.apache.flink.runtime.util.FatalExitExceptionHandler [] - FATAL: Thread 'AsyncOperations-thread-8' produced an uncaught exception. Stopping the process...
> java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has
> occurred. This can mean two things: either the job requires a larger size of
> JVM metaspace to load classes or there is a class loading leak. In the first
> case 'taskmanager.memory.jvm-metaspace.size' configuration option should be
> increased. If the error persists (usually in cluster after several job
> (re-)submissions) then there is probably a class loading leak in user code or
> some of its dependencies which has to be investigated and fixed. The task
> executor has to be shutdown...
> 2021-06-24 08:59:41,759 INFO org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager [] - Shutting down TaskExecutorLocalStateStoresManager.
> 2021-06-24 08:59:46,560 INFO org.apache.flink.runtime.blob.TransientBlobCache [] - Shutting down BLOB cache
> 2021-06-24 08:59:45,414 INFO org.apache.flink.runtime.blob.PermanentBlobCache [] - Shutting down BLOB cache
> 2021-06-24 08:59:59,383 INFO org.apache.flink.runtime.filecache.FileCache [] - removed file cache directory /tmp/flink-dist-cache-bb7818d6-a691-4121-9fdb-670b7d4182c5
>
> --
> Fritz