Hi there,
I have a problem running a Flink job in job cluster mode with Flink 1.11.1 (I also tried 1.11.2). The same job runs fine in session cluster mode, and also in job cluster mode with Flink 1.10.0. The job starts and runs for quite some time, but it runs a lot slower than in session cluster mode and crashes after about an hour. In the Flink dashboard I can see that the JVM heap stays at a constantly high level and slowly approaches the limit (4.13 GB in my case), which it reaches shortly before the job crashes. There is also some G1_Old_Generation garbage collection going on that I do not observe in session mode.

GC values after running for about 45 min:

Collector            Count   Time
G1_Young_Generation  1,250   107,937
G1_Old_Generation    322     2,432,362

Compared to the GC values of the same job in session cluster mode (after the same runtime):

Collector            Count   Time
G1_Young_Generation  1,920   20,575
G1_Old_Generation    0       0

So my vague guess is that it is something memory related, possibly a configuration issue. To simplify the setup, only one JobManager and one TaskManager are used. The TaskManager is configured with taskmanager.memory.process.size: 10000m, which should be totally fine for the server. The JobManager has a defined heap size of 1600m.

Has anybody experienced something like this before?

Also, is there a way to export the currently loaded configuration parameters of the JobManager and TaskManagers in a cluster? For example, I can't see the current memory process size of the TaskManager in the Flink dashboard. This would let me compare the running and crashing setups more easily (I am using Docker and environment variables for configuration at the moment, which makes debugging a bit harder).

Thanks.
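For reference, one thing I found while digging: Flink's REST API exposes the effective cluster configuration via GET /jobmanager/config on the web UI port (8081 by default), which returns a JSON array of key/value pairs. A minimal sketch that filters out the memory-related entries from such a response (the sample values below are made up for illustration, not taken from a real cluster):

```python
import json

# Illustrative sample of the JSON shape returned by
# GET http://<jobmanager-host>:8081/jobmanager/config
# (values here are placeholders, not from a real cluster)
sample_response = json.dumps([
    {"key": "taskmanager.memory.process.size", "value": "10000m"},
    {"key": "jobmanager.heap.size", "value": "1600m"},
    {"key": "parallelism.default", "value": "1"},
])

def memory_settings(config_json: str) -> dict:
    """Return only memory/heap-related entries of the cluster config."""
    entries = json.loads(config_json)
    return {
        e["key"]: e["value"]
        for e in entries
        if "memory" in e["key"] or "heap" in e["key"]
    }

print(memory_settings(sample_response))
# -> {'taskmanager.memory.process.size': '10000m', 'jobmanager.heap.size': '1600m'}
```

In practice you would fetch the JSON with curl or urllib from both the working and the crashing cluster and diff the two filtered dicts; but I have not yet checked whether this endpoint reflects settings injected through Docker environment variables, so that part is an assumption.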