It's me again. This is a strange issue, and I hope I managed to find the right keywords. I have 8 machines: 1 for the JM, the other 7 are TMs with 64 GB of memory each.
When running my job like so:

    $FLINK_HOME/bin/flink run -m yarn-cluster -yjm 16384 -ytm 40960 -yn 7 .....

the job completes without any problems. When running it like so:

    $FLINK_HOME/bin/flink run -m yarn-cluster -yjm 16385 -ytm 40960 -yn 7 .....

(note the one extra MB of memory for the JM), the execution stalls, continuously reporting:

    .....
    TaskManager status (6/7)
    TaskManager status (6/7)
    TaskManager status (6/7)
    .....

I did some poking around, but I couldn't find any direct correlation with the code. The JM log says:

    .....
    16:49:01,893 INFO org.apache.flink.yarn.ApplicationMaster$ - JVM Options:
    16:49:01,893 INFO org.apache.flink.yarn.ApplicationMaster$ - -Xmx12289M
    .....

but then continues to report

    .....
    16:52:59,311 INFO org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1 - The user requested 7 containers, 6 running. 1 containers missing
    16:52:59,831 INFO org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1 - The user requested 7 containers, 6 running. 1 containers missing
    16:53:00,351 INFO org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1 - The user requested 7 containers, 6 running. 1 containers missing
    .....

forever, until I cancel the job. If you have any ideas, I'm happy to try them out.

Thanks in advance for any hints!

Cheers.
Robert

--
My GPG Key ID: 336E2680