I found this configuration option, which controls whether the JVM exits on OOM. Let me try it out:

taskmanager.jvm-exit-on-oom: true
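
For reference, a minimal sketch of how this might look in flink-conf.yaml, together with the metaspace option named in the error message below. The 512m value is only an illustrative assumption, not a recommendation, and I have not yet confirmed that this setting resolves the hang during shutdown.

```yaml
# Sketch only: kill the TaskManager process when a task thread throws an OutOfMemoryError.
taskmanager.jvm-exit-on-oom: true

# Option named in the OOM message; the size here is an assumed example value.
taskmanager.memory.jvm-metaspace.size: 512m
```

If the process exits, Kubernetes should be able to restart the pod, provided the pod's restart policy allows it.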


> On Jun 24, 2021, at 8:07 AM, Fritz Budiyanto <fbudi...@icloud.com> wrote:
> 
> Hi,
> 
> 
> I have a job that kept restarting due to a bug, and it brought down the 
> task manager with it due to a Metaspace OOM. Please ignore the memory leak for 
> a moment; the problem here is that the task manager does not restart but hangs, 
> which reduces the overall slot capacity. We are running Flink on Kubernetes, 
> and a stuck task manager is a problem because it reduces cluster capacity.
> Any idea how to make the task manager restart upon OOM? Here are the task 
> manager logs:
> 
> 
> 2021-06-24 08:59:30,437 INFO  org.apache.kafka.clients.producer.KafkaProducer 
>              [] - [Producer clientId=producer-2] Closing the Kafka producer 
> with timeoutMillis = 0 ms.
> 2021-06-24 08:59:30,437 INFO  org.apache.kafka.clients.producer.KafkaProducer 
>              [] - [Producer clientId=producer-2] Proceeding to force close 
> the producer since pending requests could not be completed within timeout 0 
> ms.
> 2021-06-24 08:59:32,200 ERROR 
> org.apache.flink.runtime.util.FatalExitExceptionHandler      [] - FATAL: 
> Thread 'AsyncOperations-thread-8' produced an uncaught exception. Stopping 
> the process...
> java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has 
> occurred. This can mean two things: either the job requires a larger size of 
> JVM metaspace to load classes or there is a class loading leak. In the first 
> case 'taskmanager.memory.jvm-metaspace.size' configuration option should be 
> increased. If the error persists (usually in cluster after several job 
> (re-)submissions) then there is probably a class loading leak in user code or 
> some of its dependencies which has to be investigated and fixed. The task 
> executor has to be shutdown...
> 2021-06-24 08:59:41,759 INFO  
> org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager [] - 
> Shutting down TaskExecutorLocalStateStoresManager.
> 2021-06-24 08:59:46,560 INFO  
> org.apache.flink.runtime.blob.TransientBlobCache             [] - Shutting 
> down BLOB cache
> 2021-06-24 08:59:45,414 INFO  
> org.apache.flink.runtime.blob.PermanentBlobCache             [] - Shutting 
> down BLOB cache
> 2021-06-24 08:59:59,383 INFO  org.apache.flink.runtime.filecache.FileCache    
>              [] - removed file cache directory 
> /tmp/flink-dist-cache-bb7818d6-a691-4121-9fdb-670b7d4182c5
> 
> —
> Fritz
> 
