Hi,

I am running Flink on a cluster with 24 workers, each with 16 cores. Starting the cluster works fine, and the web interface confirms that 384 slots are available. Executing my code with parallelism 24 works, but with a higher parallelism, e.g. 384, the job submission never completes. Submitting from the web interface neither starts the job nor reports any error.

I also tried starting four 1-slot taskmanagers on each machine and executing with parallelism 96, with the same result. The code is not very complicated: the logical graph has only 3 steps.

Attached is a file with the jstack of the CliFrontend (which is using CPU), the jstack of the StandaloneSessionClusterEntrypoint, and the jstack of the TaskManagerRunner on a remote machine (cloud-12). All the jstacks are from this last scenario, executed from the command line.

My relevant conf is as follows:
queryable-state.enable: true
jobmanager.rpc.address: cloud-11
jobmanager.rpc.port: 6123
taskmanager.heap.mb: 28672
jobmanager.heap.mb: 14240
taskmanager.memory.fraction: 0.7
taskmanager.network.numberOfBuffers: 16384
taskmanager.network.bufferSizeInBytes: 16384
taskmanager.memory.task.off-heap.size: 4000m
taskmanager.memory.managed.size: 10000m
#taskmanager.numberOfTaskSlots: 16 #for normal setup
taskmanager.numberOfTaskSlots: 1 #for when setting multiple taskmanagers per machine.

Am I doing something wrong?

Thanks in advance!

jstack.jstack <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t2502/jstack.jstack>