I am running some jobs now. I'll stop and restart using pdsh to see what the issue was.
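For reference, here is roughly the startup logic I'll be exercising, with the
quoted form Greg suggests below. This is a paraphrase for testing, not the
literal start-cluster.sh; the variable names (SLAVES, FLINK_SSH_OPTS,
FLINK_BIN_DIR) follow the Flink scripts' conventions.

    # If pdsh is available, start all TaskManagers in one parallel call;
    # otherwise fall back to one ssh per entry in the slaves file.
    if command -v pdsh >/dev/null 2>&1; then
        # Quoting here is the defensive form Greg proposes below.
        PDSH_SSH_ARGS_APPEND="${FLINK_SSH_OPTS}" \
            pdsh -w "$(IFS=,; echo "${SLAVES[*]}")" \
            "nohup /bin/bash -l \"${FLINK_BIN_DIR}/taskmanager.sh\" start"
    else
        # Serial fallback: the per-slave ssh loop discussed below.
        for slave in "${SLAVES[@]}"; do
            ssh -n $FLINK_SSH_OPTS "$slave" \
                "nohup /bin/bash -l \"${FLINK_BIN_DIR}/taskmanager.sh\" start &"
        done
    fi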
On Mon, Jul 11, 2016 at 12:15 PM, Greg Hogan <c...@greghogan.com> wrote:

> I'd definitely be interested to hear any insight into what failed when
> starting the TaskManagers with pdsh. Did the command fail, fall back to
> standard ssh, or hit a parse error on the slaves file?
>
> I'm wondering if we need to escape
>     PDSH_SSH_ARGS_APPEND=$FLINK_SSH_OPTS
> as
>     PDSH_SSH_ARGS_APPEND="${FLINK_SSH_OPTS}"
>
> On Mon, Jul 11, 2016 at 12:02 AM, Saliya Ekanayake <esal...@gmail.com> wrote:
>
>> pdsh is available on the head node only, but when I tried to run
>> *start-cluster* from the head node (note the JobManager node is not the
>> head node) it didn't work, which is why I modified the scripts.
>>
>> Yes, exactly, this is what I was trying to do. My research has focused
>> on these NUMA-related issues, and binding a process to a socket (CPU)
>> and its threads to individual cores has shown great advantage. I
>> actually have Java code that automatically binds processes and threads
>> (user-configurable as well). For Flink, I've done this manually with a
>> shell script that scans the TMs on a node and pins them appropriately.
>> This approach is OK, but it would be better if the support were
>> integrated into Flink.
>>
>> On Sun, Jul 10, 2016 at 8:33 PM, Greg Hogan <c...@greghogan.com> wrote:
>>
>>> Hi Saliya,
>>>
>>> Would you happen to have pdsh (parallel distributed shell) installed?
>>> If so, the TaskManager startup in start-cluster.sh will run in
>>> parallel.
>>>
>>> As to running 24 TaskManagers together, are these running across
>>> multiple NUMA nodes? I filed FLINK-3163
>>> (https://issues.apache.org/jira/browse/FLINK-3163) last year, as I
>>> have seen that even with only two NUMA nodes, performance is improved
>>> by binding TaskManagers (both memory and CPU). I think we can improve
>>> the configuration of task slots as we do with memory, where the latter
>>> can be a fixed measure or a fraction relative to total memory.
>>>
>>> Greg
>>>
>>> On Sat, Jul 9, 2016 at 3:44 AM, Saliya Ekanayake <esal...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> The current start/stop scripts SSH into a worker node once for each
>>>> time it appears in the slaves file. When spawning multiple TMs per
>>>> node (like 24), this is very inefficient.
>>>>
>>>> I've changed the scripts to do one SSH per node and then spawn the
>>>> given number N of TMs. I can make a pull request if this seems useful
>>>> to others. For now, I assume the slaves file indicates the number of
>>>> TMs per slave in "IP N" format.
>>>>
>>>> Thank you,
>>>> Saliya

--
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
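P.S. For anyone who wants to experiment with the one-SSH-per-node startup
before a pull request, here is a rough sketch. It is hypothetical (the actual
patch may differ) and assumes the "IP N" slaves format plus the standard
taskmanager.sh; FLINK_CONF_DIR, FLINK_SSH_OPTS, and FLINK_BIN_DIR are the
usual Flink script variables.

    # Read "IP N" pairs from the slaves file, contact each host once, and
    # start N TaskManagers there (N defaults to 1 if omitted). The trailing
    # ampersand parallelizes across hosts.
    while read -r host count; do
        [ -z "$host" ] && continue    # skip blank lines
        count=${count:-1}
        ssh -n $FLINK_SSH_OPTS "$host" \
            "for i in \$(seq 1 $count); do nohup /bin/bash -l \"${FLINK_BIN_DIR}/taskmanager.sh\" start; done" &
    done < "${FLINK_CONF_DIR}/slaves"
    wait
    # To pin each TM to a NUMA node, the remote start command could be
    # prefixed with something like "numactl --cpunodebind=<node>
    # --membind=<node>", along the lines of the pinning script described
    # above.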