All but one TMs connect when JM has more than 16G of memory

Robert Schmidtke Wed, 30 Sep 2015 08:09:49 -0700

It's me again. This is a strange issue, and I hope I managed to find the
right keywords. I have 8 machines: 1 for the JM, and the other 7 are TMs
with 64G of memory each.


When running my job like so:

$FLINK_HOME/bin/flink run -m yarn-cluster -yjm 16384 -ytm 40960 -yn 7 .....

The job completes without any problems. When running it like so:

$FLINK_HOME/bin/flink run -m yarn-cluster -yjm 16385 -ytm 40960 -yn 7 .....

(note the one extra megabyte of memory for the JM), the execution stalls,
continuously reporting:

.....
TaskManager status (6/7)
TaskManager status (6/7)
TaskManager status (6/7)
.....

I did some poking around, but I couldn't find anything in the code that
would explain this behavior.

The JM log says:

.....
16:49:01,893 INFO  org.apache.flink.yarn.ApplicationMaster$  -  JVM Options:
16:49:01,893 INFO  org.apache.flink.yarn.ApplicationMaster$  -     -Xmx12289M
.....

but then continues to report

.....
16:52:59,311 INFO  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1    - The user requested 7 containers, 6 running. 1 containers missing
16:52:59,831 INFO  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1    - The user requested 7 containers, 6 running. 1 containers missing
16:53:00,351 INFO  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1    - The user requested 7 containers, 6 running. 1 containers missing
.....

forever until I cancel the job.
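
As a side note on the -Xmx12289M line in the JM log: the value is consistent
with a quarter of the requested 16385 MB being held back for non-heap use.
The sketch below only reproduces that arithmetic. The 0.25 ratio is an
assumption inferred from the two numbers, not something the log states, and
the function name is made up for illustration.

```python
# Assumed fraction of the container reserved for non-heap memory.
# This 0.25 is inferred from the log values, not read from any config.
ASSUMED_CUTOFF_RATIO = 0.25


def jvm_heap_mb(container_mb: int, cutoff_ratio: float = ASSUMED_CUTOFF_RATIO) -> int:
    """Return the heap size left after reserving cutoff_ratio of the container."""
    cutoff_mb = int(container_mb * cutoff_ratio)  # truncates the fractional MB
    return container_mb - cutoff_mb


if __name__ == "__main__":
    # The two -yjm values from the commands above.
    for requested in (16384, 16385):
        print(requested, "->", jvm_heap_mb(requested))
```

Under that assumption, 16385 gives 12289, which matches the log, so the JM
heap itself appears to be sized as expected in the failing run.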

If you have any ideas I'm happy to try them out. Thanks in advance for any
hints! Cheers.

Robert
-- 
My GPG Key ID: 336E2680
