I meant that I'll check when the current jobs are done and will let you know.
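In the meantime, for anyone following along, the change I made to start-cluster.sh is roughly along these lines (a simplified sketch of the idea, not the exact patch; the FLINK_* variables are the ones the existing scripts use, but the loop itself is illustrative):

    # slaves now lists one "IP N" pair per line: the worker host,
    # then the number of TaskManagers to spawn on it
    while read HOST COUNT; do
        [ -z "$HOST" ] && continue    # skip blank lines
        COUNT=${COUNT:-1}             # default to one TM per host
        # one ssh per host; the remote loop starts COUNT TMs.
        # -n keeps ssh from consuming the while loop's stdin.
        ssh -n $FLINK_SSH_OPTS "$HOST" \
            "for i in \$(seq 1 $COUNT); do \"$FLINK_BIN_DIR/taskmanager.sh\" start; done" &
    done < "$FLINK_CONF_DIR/slaves"
    wait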
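Greg, regarding the pinning script I mentioned below: it boils down to something like the following (an illustrative sketch, not the actual script; it assumes numactl is installed and that NUM_TMS holds the per-node TM count). Since the NUMA policy set by numactl is inherited by child processes, it carries over to the daemonized JVM:

    # spread the node's TaskManagers round-robin over the NUMA sockets,
    # binding each TM's CPUs and memory to its socket
    SOCKETS=$(numactl --hardware | awk '/^available:/ {print $2}')
    for i in $(seq 0 $((NUM_TMS - 1))); do
        NODE=$((i % SOCKETS))
        numactl --cpunodebind=$NODE --membind=$NODE \
            "$FLINK_BIN_DIR/taskmanager.sh" start
    done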
On Mon, Jul 11, 2016 at 12:19 PM, Saliya Ekanayake <esal...@gmail.com> wrote:

> I am running some jobs now. I'll stop and restart using pdsh to see again
> what the issue was.
>
> On Mon, Jul 11, 2016 at 12:15 PM, Greg Hogan <c...@greghogan.com> wrote:
>
>> I'd definitely be interested to hear any insight into what failed when
>> starting the TaskManagers with pdsh. Did the command fail, fall back to
>> standard ssh, or was there a parse error on the slaves file?
>>
>> I'm wondering if we need to escape
>>   PDSH_SSH_ARGS_APPEND=$FLINK_SSH_OPTS
>> as
>>   PDSH_SSH_ARGS_APPEND="${FLINK_SSH_OPTS}"
>>
>> On Mon, Jul 11, 2016 at 12:02 AM, Saliya Ekanayake <esal...@gmail.com>
>> wrote:
>>
>>> pdsh is available on the head node only, but when I tried to run
>>> *start-cluster* from the head node (note that the JobManager node is
>>> not the head node) it didn't work, which is why I modified the scripts.
>>>
>>> Yes, exactly, this is what I was trying to do. My research area has
>>> been these NUMA-related issues, and binding a process to a socket (CPU)
>>> and then its threads to individual cores has shown a great advantage. I
>>> actually have Java code that automatically (and user-configurably)
>>> binds processes and threads. For Flink, I've done this manually using a
>>> shell script that scans the TMs on a node and pins them appropriately.
>>> This approach is OK, but it would be better if the support were
>>> integrated into Flink.
>>>
>>> On Sun, Jul 10, 2016 at 8:33 PM, Greg Hogan <c...@greghogan.com> wrote:
>>>
>>>> Hi Saliya,
>>>>
>>>> Would you happen to have pdsh (parallel distributed shell) installed?
>>>> If so, the TaskManager startup in start-cluster.sh will run in
>>>> parallel.
>>>>
>>>> As to running 24 TaskManagers together, are these running across
>>>> multiple NUMA nodes? I filed FLINK-3163
>>>> (https://issues.apache.org/jira/browse/FLINK-3163) last year, as I
>>>> have seen that even with only two NUMA nodes, performance is improved
>>>> by binding TaskManagers, both memory and CPU. I think we can improve
>>>> the configuration of task slots as we do with memory, where the latter
>>>> can be a fixed measure or a fraction relative to total memory.
>>>>
>>>> Greg
>>>>
>>>> On Sat, Jul 9, 2016 at 3:44 AM, Saliya Ekanayake <esal...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> The current start/stop scripts SSH into worker nodes once for each
>>>>> time they appear in the slaves file. When spawning multiple TMs
>>>>> (like 24 per node), this is very inefficient.
>>>>>
>>>>> I've changed the scripts to do one SSH per node and then spawn a
>>>>> given number N of TMs. I can make a pull request if this seems
>>>>> usable to others. For now, I assume the slaves file will indicate
>>>>> the number of TMs per slave in "IP N" format.
>>>>>
>>>>> Thank you,
>>>>> Saliya
>>>>>
>>>>> --
>>>>> Saliya Ekanayake
>>>>> Ph.D. Candidate | Research Assistant
>>>>> School of Informatics and Computing | Digital Science Center
>>>>> Indiana University, Bloomington
>>>>>
>>>
>>> --
>>> Saliya Ekanayake
>>> Ph.D. Candidate | Research Assistant
>>> School of Informatics and Computing | Digital Science Center
>>> Indiana University, Bloomington
>>>
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington

--
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington