Hi Saliya,

Would you happen to have pdsh (parallel distributed shell) installed? If so, the TaskManager startup in start-cluster.sh will run in parallel.
As to running 24 TaskManagers together, are these running across multiple NUMA nodes? I filed FLINK-3163 (https://issues.apache.org/jira/browse/FLINK-3163) last year, as I have seen that even with only two NUMA nodes performance improves when TaskManagers are bound to a node, for both memory and CPU. I think we can improve the configuration of task slots as we do with memory, where the latter can be a fixed measure or a fraction relative to total memory.

Greg

On Sat, Jul 9, 2016 at 3:44 AM, Saliya Ekanayake <esal...@gmail.com> wrote:

> Hi,
>
> The current start/stop scripts SSH worker nodes each time they appear in
> the slaves file. When spawning multiple TMs (like 24 per node), this is
> very inefficient.
>
> I've changed the scripts to do one SSH per node and spawn a given N number
> of TMs afterwards. I can make a pull request if this seems usable to
> others. For now, I assume slaves file will indicate the number of TMs per
> slave in "IP N" format.
>
> Thank you,
> Saliya
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
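
To make the "IP N" slaves-file idea from the quoted message concrete, here is a rough Python sketch of the parsing and the one-SSH-per-node step. It is only an illustration, not the actual patch or the existing Flink scripts: the function names `parse_slaves` and `build_ssh_commands` are hypothetical, `start_cmd` is a placeholder for whatever command starts a single TaskManager on the worker, and treating `#` as a comment marker and a missing N as 1 are my assumptions.

```python
from collections import OrderedDict


def parse_slaves(text):
    """Parse 'HOST N' lines into an ordered host -> TaskManager count map.

    Assumptions (not from the original scripts): '#' starts a comment,
    a missing N means 1, and repeated hosts have their counts summed.
    """
    counts = OrderedDict()
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # skip blank and comment-only lines
        parts = line.split()
        host = parts[0]
        n = int(parts[1]) if len(parts) > 1 else 1
        counts[host] = counts.get(host, 0) + n
    return counts


def build_ssh_commands(counts, start_cmd="taskmanager.sh start"):
    """Build one ssh argv per host; the remote shell loop runs start_cmd N times.

    start_cmd is a placeholder, not a confirmed Flink invocation.
    """
    commands = []
    for host, n in counts.items():
        remote = "for i in $(seq 1 {}); do {}; done".format(n, start_cmd)
        commands.append(["ssh", "-n", host, remote])
    return commands


if __name__ == "__main__":
    example = "10.0.0.1 24\n10.0.0.2 24\n"
    for argv in build_ssh_commands(parse_slaves(example)):
        print(" ".join(argv))
```

With this shape, each worker is contacted once regardless of N, and the per-node loop could later be extended to bind each TaskManager to a NUMA node.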