Hi there,
I have a problem running a Flink job in job cluster mode with Flink 1.11.1 (I also tried 1.11.2). The same job runs fine in session cluster mode, and also in job cluster mode with Flink 1.10.0. The job starts and runs for quite some time, but it runs a lot slower than in session cluster mode and crashes after about an hour. In the Flink dashboard I can see that the JVM heap stays at a constantly high level and slowly approaches the limit (4.13 GB in my case), which it reaches shortly before the job crashes. There is also some G1_Old_Generation garbage collection going on that I do not observe in session mode.

GC values after running for about 45 min:

Collector            Count   Time
G1_Young_Generation  1,250   107,937
G1_Old_Generation    322     2,432,362

Compared to the GC values of the same job in session cluster mode (after the same runtime):

Collector            Count   Time
G1_Young_Generation  1,920   20,575
G1_Old_Generation    0       0

So my vague guess is that it is something memory related, possibly a configuration issue. To simplify the setup, only one JobManager and one TaskManager are used. The TaskManager is configured with taskmanager.memory.process.size: 10000m, which should be totally fine for the server. The JobManager has a defined heap size of 1600m.

Has anybody experienced something like this before?

Also, is there a way to export the currently loaded configuration parameters of the JobManager and TaskManagers in a cluster? For example, I can't see the current memory process size of the TaskManager in the Flink dashboard. This would let me compare the running and crashing setups more easily (I am using Docker and environment variables for configuration at the moment, which makes debugging a bit harder).

Thanks.
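For reference, one thing I found while digging: Flink's REST API exposes the effective cluster configuration via GET /jobmanager/config on the web UI port (8081 by default), which returns a JSON array of key/value pairs. A minimal sketch that filters out the memory-related entries from such a response (the sample values below are made up for illustration, not taken from a real cluster):

```python
import json

# Illustrative sample of the JSON shape returned by
# GET http://<jobmanager-host>:8081/jobmanager/config
# (values here are placeholders, not from a real cluster)
sample_response = json.dumps([
    {"key": "taskmanager.memory.process.size", "value": "10000m"},
    {"key": "jobmanager.heap.size", "value": "1600m"},
    {"key": "parallelism.default", "value": "1"},
])

def memory_settings(config_json: str) -> dict:
    """Return only memory/heap-related entries of the cluster config."""
    entries = json.loads(config_json)
    return {
        e["key"]: e["value"]
        for e in entries
        if "memory" in e["key"] or "heap" in e["key"]
    }

print(memory_settings(sample_response))
# -> {'taskmanager.memory.process.size': '10000m', 'jobmanager.heap.size': '1600m'}
```

In practice you would fetch the JSON with curl or urllib from both the working and the crashing cluster and diff the two filtered dicts; but I have not yet checked whether this endpoint reflects settings injected through Docker environment variables, so that part is an assumption.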