Hi,

I have a job that kept restarting because of a bug, and it took the task
manager down with it through an OOM in Metaspace. Please ignore the memory leak
for a moment; the problem here is that the task manager does not restart but
hangs instead. We are running Flink on Kubernetes, and a stuck task manager is
a problem because it reduces the number of slots available to the cluster.
Any idea how to make the task manager restart upon OOM? Here are the task
manager logs:


2021-06-24 08:59:30,437 INFO  org.apache.kafka.clients.producer.KafkaProducer              [] - [Producer clientId=producer-2] Closing the Kafka producer with timeoutMillis = 0 ms.
2021-06-24 08:59:30,437 INFO  org.apache.kafka.clients.producer.KafkaProducer              [] - [Producer clientId=producer-2] Proceeding to force close the producer since pending requests could not be completed within timeout 0 ms.
2021-06-24 08:59:32,200 ERROR org.apache.flink.runtime.util.FatalExitExceptionHandler      [] - FATAL: Thread 'AsyncOperations-thread-8' produced an uncaught exception. Stopping the process...
java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak. In the first case 'taskmanager.memory.jvm-metaspace.size' configuration option should be increased. If the error persists (usually in cluster after several job (re-)submissions) then there is probably a class loading leak in user code or some of its dependencies which has to be investigated and fixed. The task executor has to be shutdown...
2021-06-24 08:59:41,759 INFO  org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager [] - Shutting down TaskExecutorLocalStateStoresManager.
2021-06-24 08:59:46,560 INFO  org.apache.flink.runtime.blob.TransientBlobCache             [] - Shutting down BLOB cache
2021-06-24 08:59:45,414 INFO  org.apache.flink.runtime.blob.PermanentBlobCache             [] - Shutting down BLOB cache
2021-06-24 08:59:59,383 INFO  org.apache.flink.runtime.filecache.FileCache                 [] - removed file cache directory /tmp/flink-dist-cache-bb7818d6-a691-4121-9fdb-670b7d4182c5

—
Fritz
