Yes, I have password-less SSH to the job manager node.

On Mon, Jul 11, 2016 at 4:53 PM, Greg Hogan <c...@greghogan.com> wrote:
> pdsh is only used for starting taskmanagers. How did you work around this?
> You are able to passwordless-ssh to the jobmanager?
>
> The error looks to be from config.sh:318 in rotateLogFile. The way we
> generate the taskmanager index assumes that taskmanagers are started
> sequentially (flink-daemon.sh:108).
>
> On Mon, Jul 11, 2016 at 2:59 PM, Saliya Ekanayake <esal...@gmail.com>
> wrote:
>
>> Looking at what happens with pdsh, there are two things that go wrong.
>>
>> 1. pdsh is installed on a node other than the one where the job manager
>> would run, so invoking start-cluster from there does not spawn a job
>> manager. The job manager is created only if I run start-cluster from the
>> node I specify as the job manager's node.
>>
>> 2. If the slaves file has the same IP more than once, the following
>> error occurs while trying to move log files. For example, I had node
>> j-020 specified twice in my slaves file.
>>
>> j-020: mv: cannot move
>> `/N/u/sekanaya/sali/software/flink-1.0.3/log/flink-sekanaya-taskmanager-26-j-020.log'
>> to
>> `/N/u/sekanaya/sali/software/flink-1.0.3/log/flink-sekanaya-taskmanager-26-j-020.log.1':
>> No such file or directory
>> j-020: mv: cannot move
>> `/N/u/sekanaya/sali/software/flink-1.0.3/log/flink-sekanaya-taskmanager-26-j-020.out'
>> to
>> `/N/u/sekanaya/sali/software/flink-1.0.3/log/flink-sekanaya-taskmanager-26-j-020.out.1':
>> No such file or directory
>>
>> On Mon, Jul 11, 2016 at 12:19 PM, Saliya Ekanayake <esal...@gmail.com>
>> wrote:
>>
>>> I meant, I'll check when the current jobs are done and will let you know.
>>>
>>> On Mon, Jul 11, 2016 at 12:19 PM, Saliya Ekanayake <esal...@gmail.com>
>>> wrote:
>>>
>>>> I am running some jobs now. I'll stop and restart using pdsh to see
>>>> what the issue was.
>>>>
>>>> On Mon, Jul 11, 2016 at 12:15 PM, Greg Hogan <c...@greghogan.com>
>>>> wrote:
>>>>
>>>>> I'd definitely be interested to hear any insight into what failed
>>>>> when starting the taskmanagers with pdsh.
>>>>> Did the command fail, fall back to standard ssh, or was there a
>>>>> parse error on the slaves file?
>>>>>
>>>>> I'm wondering if we need to escape
>>>>>   PDSH_SSH_ARGS_APPEND=$FLINK_SSH_OPTS
>>>>> as
>>>>>   PDSH_SSH_ARGS_APPEND="${FLINK_SSH_OPTS}"
>>>>>
>>>>> On Mon, Jul 11, 2016 at 12:02 AM, Saliya Ekanayake <esal...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> pdsh is available on the head node only, but when I tried to run
>>>>>> start-cluster from the head node (note the job manager node is not
>>>>>> the head node) it didn't work, which is why I modified the scripts.
>>>>>>
>>>>>> Yes, exactly, this is what I was trying to do. My research area has
>>>>>> been these NUMA-related issues, and binding a process to a socket
>>>>>> (CPU) and then its threads to individual cores has shown great
>>>>>> advantage. I actually have Java code that automatically (and
>>>>>> user-configurably) binds processes and threads. For Flink, I've done
>>>>>> this manually with a shell script that scans the TMs on a node and
>>>>>> pins them appropriately. This approach is OK, but it would be better
>>>>>> if the support were integrated into Flink.
>>>>>>
>>>>>> On Sun, Jul 10, 2016 at 8:33 PM, Greg Hogan <c...@greghogan.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Saliya,
>>>>>>>
>>>>>>> Would you happen to have pdsh (parallel distributed shell)
>>>>>>> installed? If so, the TaskManager startup in start-cluster.sh will
>>>>>>> run in parallel.
>>>>>>>
>>>>>>> As to running 24 TaskManagers together, are these running across
>>>>>>> multiple NUMA nodes? I had filed FLINK-3163
>>>>>>> (https://issues.apache.org/jira/browse/FLINK-3163) last year, as I
>>>>>>> have seen that even with only two NUMA nodes performance is
>>>>>>> improved by binding TaskManagers, both memory and CPU. I think we
>>>>>>> can improve configuration of task slots as we do with memory, where
>>>>>>> the latter can be a fixed measure or a fraction relative to total
>>>>>>> memory.
>>>>>>>
>>>>>>> Greg
>>>>>>>
>>>>>>> On Sat, Jul 9, 2016 at 3:44 AM, Saliya Ekanayake <esal...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> The current start/stop scripts SSH into a worker node once for
>>>>>>>> each time it appears in the slaves file. When spawning multiple
>>>>>>>> TMs (like 24 per node), this is very inefficient.
>>>>>>>>
>>>>>>>> I've changed the scripts to do one SSH per node and spawn a given
>>>>>>>> number N of TMs afterwards. I can make a pull request if this
>>>>>>>> seems usable to others. For now, I assume the slaves file
>>>>>>>> indicates the number of TMs per slave in "IP N" format.
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> Saliya
>>>>>>>>
>>>>>>>> --
>>>>>>>> Saliya Ekanayake
>>>>>>>> Ph.D. Candidate | Research Assistant
>>>>>>>> School of Informatics and Computing | Digital Science Center
>>>>>>>> Indiana University, Bloomington

--
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
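The rotateLogFile failure discussed above can be reproduced in isolation. The sketch below is not Flink's config.sh; it only mimics the count-based index idea (a new taskmanager takes index = number of existing logs for the host, which matches the "started sequentially" assumption Greg mentions). The script name and log naming are illustrative; with sequential starts the indices are distinct, but when pdsh starts two taskmanagers for the same hostname concurrently, both can compute the same index and then race on the `mv` in the rotate step.

```shell
#!/bin/sh
# Standalone sketch of count-based taskmanager index assignment (not Flink's
# actual config.sh/flink-daemon.sh code; names are illustrative).
LOG_DIR=$(mktemp -d)
host=j-020

start_tm() {
    # hypothetical index derivation: count of existing logs for this host
    index=$(ls "$LOG_DIR" | grep -c "taskmanager-.*-$host\.log$")
    log="$LOG_DIR/flink-demo-taskmanager-$index-$host.log"
    # rotate: shift any existing log aside before writing the new one;
    # two concurrent starts that pick the same index race right here
    [ -f "$log" ] && mv "$log" "$log.1"
    echo "taskmanager $index started" > "$log"
}

start_tm   # sequential start: index 0
start_tm   # sequential start: index 1
ls "$LOG_DIR"
```

Run sequentially, the two logs get distinct indices; the duplicate-IP failure in the thread corresponds to two of these invocations interleaving on the same host.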
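On the PDSH_SSH_ARGS_APPEND question: in POSIX shell, the right-hand side of a variable assignment is not subject to word splitting, so the unquoted and quoted spellings should assign the same value even when FLINK_SSH_OPTS holds multiple words; quoting remains the defensive choice for readability. A standalone check (the FLINK_SSH_OPTS value is just an example, not a Flink default):

```shell
#!/bin/sh
# Assignment context does not word-split: both forms keep the multi-word
# value intact. Splitting only bites when the variable is later expanded
# unquoted as a command argument.
FLINK_SSH_OPTS="-o StrictHostKeyChecking=no -p 2222"

PDSH_SSH_ARGS_APPEND=$FLINK_SSH_OPTS          # unquoted assignment
unquoted=$PDSH_SSH_ARGS_APPEND

PDSH_SSH_ARGS_APPEND="${FLINK_SSH_OPTS}"      # quoted assignment
quoted=$PDSH_SSH_ARGS_APPEND

[ "$unquoted" = "$quoted" ] && echo "identical"
```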
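The manual per-socket pinning described in the thread (scan the TMs on a node, bind each to a socket) can be sketched roughly as below. The socket count, cores per socket, and the pgrep pattern are assumptions about the machine, not Flink settings, and the actual `taskset` call is left commented out so the sketch is safe to run anywhere.

```shell
#!/bin/sh
# Sketch of round-robin socket pinning for taskmanager JVMs. Hardware layout
# values are assumptions; adjust to the output of lscpu/numactl --hardware.
CORES_PER_SOCKET=12
SOCKETS=2

core_range() {
    # core_range INDEX -> "first-last" core span for the socket this TM gets
    socket=$(( $1 % SOCKETS ))
    first=$(( socket * CORES_PER_SOCKET ))
    echo "$first-$(( first + CORES_PER_SOCKET - 1 ))"
}

i=0
for pid in $(pgrep -f 'taskmanager.TaskManager' 2>/dev/null); do
    echo "would pin pid $pid to cores $(core_range $i)"
    # taskset -cp "$(core_range $i)" "$pid"   # actual pinning (util-linux)
    i=$((i + 1))
done
```

Note that `taskset` only sets CPU affinity; binding memory as well (as FLINK-3163 suggests) would mean launching the JVM under `numactl --cpunodebind --membind` instead of pinning it after the fact.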
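The proposed "IP N" slaves format is straightforward to parse. This standalone sketch (a hypothetical helper, not the patched start-cluster.sh) reads one "host count" pair per line, defaults a missing count to 1, and shows where the single ssh per host would spawn N taskmanagers:

```shell
#!/bin/sh
# Sketch of parsing a slaves file in "IP N" format; host names below are
# examples, and the taskmanager.sh path in the comment is illustrative.
parse_slaves() {
    while read -r host count; do
        # skip blank lines and comments
        case "$host" in ""|\#*) continue ;; esac
        : "${count:=1}"   # default to one taskmanager if no count given
        echo "$host $count"
        # real version: one ssh per host, spawning $count taskmanagers:
        #   ssh "$host" "for i in \$(seq 1 $count); do bin/taskmanager.sh start; done"
    done
}

result=$(parse_slaves <<'EOF'
j-020 24
j-021 24
# comment lines are ignored
j-022
EOF
)
echo "$result"
```

Spawning all N taskmanagers inside one ssh session is what removes the per-TM connection overhead the first message complains about.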