pdsh is only used for starting taskmanagers, so how did you work around this? Are you able to SSH to the jobmanager without a password?
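For context, a start script that prefers pdsh but falls back to a per-host ssh loop can be sketched roughly as follows. This is an illustrative sketch, not Flink's actual start-cluster.sh; the hostnames and the "start-taskmanager" command are placeholders, and the commands are echoed rather than executed.

```shell
# Sketch: fan a start command out to all slaves with pdsh when it is
# available, otherwise fall back to one ssh connection per host.
launch_all() {
    local use_pdsh=$1 hosts=$2 cmd=$3
    if [ "$use_pdsh" = "yes" ]; then
        # pdsh takes a comma-separated host list and runs in parallel.
        echo "pdsh -w $(echo "$hosts" | tr ' ' ',') $cmd"
    else
        # Sequential fallback: one ssh per host.
        for h in $hosts; do
            echo "ssh $h $cmd"
        done
    fi
}

launch_all yes "j-020 j-021" start-taskmanager
# -> pdsh -w j-020,j-021 start-taskmanager
launch_all no  "j-020 j-021" start-taskmanager
# -> ssh j-020 start-taskmanager
# -> ssh j-021 start-taskmanager
```

The parallel path matters because the sequential ssh loop pays one connection handshake per slaves-file entry.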
The error looks to be from config.sh:318 in rotateLogFile. The way we generate the taskmanager index assumes that taskmanagers are started sequentially (flink-daemon.sh:108).

On Mon, Jul 11, 2016 at 2:59 PM, Saliya Ekanayake <esal...@gmail.com> wrote:

> Looking at what happens with pdsh, there are two things that go wrong.
>
> 1. pdsh is installed on a node other than where the jobmanager would run, so invoking *start-cluster* from there does not spawn a jobmanager. Only if I run start-cluster from the node I specify as the jobmanager's node will one be created.
>
> 2. If the slaves file has the same IP more than once, the following error occurs while moving log files. For example, I had node j-020 specified twice in my slaves file:
>
> j-020: mv: cannot move
> `/N/u/sekanaya/sali/software/flink-1.0.3/log/flink-sekanaya-taskmanager-26-j-020.log'
> to
> `/N/u/sekanaya/sali/software/flink-1.0.3/log/flink-sekanaya-taskmanager-26-j-020.log.1':
> No such file or directory
> j-020: mv: cannot move
> `/N/u/sekanaya/sali/software/flink-1.0.3/log/flink-sekanaya-taskmanager-26-j-020.out'
> to
> `/N/u/sekanaya/sali/software/flink-1.0.3/log/flink-sekanaya-taskmanager-26-j-020.out.1':
> No such file or directory
>
> On Mon, Jul 11, 2016 at 12:19 PM, Saliya Ekanayake <esal...@gmail.com> wrote:
>
>> I meant, I'll check when the current jobs are done and will let you know.
>>
>> On Mon, Jul 11, 2016 at 12:19 PM, Saliya Ekanayake <esal...@gmail.com> wrote:
>>
>>> I am running some jobs now. I'll stop and restart using pdsh to see what the issue was.
>>>
>>> On Mon, Jul 11, 2016 at 12:15 PM, Greg Hogan <c...@greghogan.com> wrote:
>>>
>>>> I'd definitely be interested to hear any insight into what failed when starting the taskmanagers with pdsh. Did the command fail, fall back to standard ssh, or hit a parse error on the slaves file?
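The sequential-startup assumption mentioned above can be sketched as follows: if each new taskmanager's index is derived from a count of already-started instances, two taskmanagers launched concurrently (via pdsh, or via a duplicated slaves entry) can compute the same index, and the second log rotation then fails because the file was already moved. This is an illustrative reconstruction, not Flink's actual config.sh or flink-daemon.sh code:

```shell
# Illustrative only: the next taskmanager index is a count of entries
# already recorded, which is unique only if starts are sequential.
PIDFILE=$(mktemp)

next_index() {
    wc -l < "$PIDFILE" | tr -d ' '
}

for _ in 1 2 3; do
    idx=$(next_index)        # concurrent starters could both read the
    echo pid >> "$PIDFILE"   # same count here and pick the same index
    echo "index=$idx"
done
rm -f "$PIDFILE"

# When two starters share an index they rotate the same log file; the
# loser sees "mv: cannot move ... No such file or directory".
touch tm-0.log
mv tm-0.log tm-0.log.1                      # first rotation succeeds
mv tm-0.log tm-0.log.1 2>/dev/null || echo "second rotation fails"
rm -f tm-0.log.1
```

Run sequentially, the loop prints index=0, index=1, index=2; the duplicated `mv` reproduces the shape of the error reported below for j-020.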
>>>> I'm wondering if we need to escape
>>>> PDSH_SSH_ARGS_APPEND=$FLINK_SSH_OPTS
>>>> as
>>>> PDSH_SSH_ARGS_APPEND="${FLINK_SSH_OPTS}"
>>>>
>>>> On Mon, Jul 11, 2016 at 12:02 AM, Saliya Ekanayake <esal...@gmail.com> wrote:
>>>>
>>>>> pdsh is available on the head node only, but when I tried to run *start-cluster* from the head node (note the jobmanager node is not the head node) it didn't work, which is why I modified the scripts.
>>>>>
>>>>> Yes, exactly, this is what I was trying to do. My research area has been these NUMA-related issues, and binding a process to a socket (CPU) and its threads to individual cores has shown great advantage. I actually have Java code that automatically (and user-configurably) binds processes and threads. For Flink, I've done this manually with a shell script that scans the TMs on a node and pins them appropriately. This approach is OK, but it would be better if the support were integrated into Flink.
>>>>>
>>>>> On Sun, Jul 10, 2016 at 8:33 PM, Greg Hogan <c...@greghogan.com> wrote:
>>>>>
>>>>>> Hi Saliya,
>>>>>>
>>>>>> Would you happen to have pdsh (parallel distributed shell) installed? If so, the TaskManager startup in start-cluster.sh will run in parallel.
>>>>>>
>>>>>> As to running 24 TaskManagers together, are these running across multiple NUMA nodes? I filed FLINK-3163 (https://issues.apache.org/jira/browse/FLINK-3163) last year, as I have seen that even with only two NUMA nodes performance is improved by binding TaskManagers, both memory and CPU. I think we can improve configuration of task slots as we do with memory, where the latter can be a fixed measure or a fraction relative to total memory.
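On the escaping question quoted above: in POSIX shells, an expansion on the right-hand side of a variable assignment (including an environment assignment prefixed to a command) is not subject to word splitting, so the quotes are defensive style rather than strictly required; splitting happens only when the unquoted expansion appears as a command argument. A quick demonstration, with invented variable names:

```shell
# Invented names for demonstration; not Flink's scripts.
OPTS="-o StrictHostKeyChecking=no -p 2222"

# Assignment context: the unquoted expansion is NOT word-split.
COPY=$OPTS

count_args() { echo $#; }

split=$(count_args $OPTS)    # unquoted argument: splits into 4 words
whole=$(count_args "$OPTS")  # quoted argument: stays 1 word
echo "split=$split whole=$whole"
# -> split=4 whole=1
```

So `PDSH_SSH_ARGS_APPEND="${FLINK_SSH_OPTS}"` is harmless but, for the assignment itself, not necessary; how pdsh later splits the appended arguments when invoking ssh is a separate question.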
>>>>>> Greg
>>>>>>
>>>>>> On Sat, Jul 9, 2016 at 3:44 AM, Saliya Ekanayake <esal...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> The current start/stop scripts SSH into each worker node once per appearance in the slaves file. When spawning multiple TMs (like 24 per node), this is very inefficient.
>>>>>>>
>>>>>>> I've changed the scripts to do one SSH per node and spawn a given number N of TMs afterwards. I can make a pull request if this seems usable to others. For now, I assume the slaves file indicates the number of TMs per slave in "IP N" format.
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Saliya
>>>>>>>
>>>>>>> --
>>>>>>> Saliya Ekanayake
>>>>>>> Ph.D. Candidate | Research Assistant
>>>>>>> School of Informatics and Computing | Digital Science Center
>>>>>>> Indiana University, Bloomington
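The "IP N" slaves format proposed in the original message could be parsed along these lines. This is a sketch under assumptions: the defaulting to one TM, the comment handling, and the "start-taskmanager" command are invented, and the ssh commands are echoed rather than executed.

```shell
# Sketch: read "host [N]" lines and emit one ssh per host that
# starts N taskmanagers in a single remote session.
parse_slaves() {
    while read -r host count _; do
        # Skip blank lines and comments; default the count to 1.
        case "$host" in ""|"#"*) continue ;; esac
        count=${count:-1}
        echo "ssh $host \"for i in \$(seq 1 $count); do start-taskmanager; done\""
    done
}

parse_slaves <<'EOF'
j-020 24
j-021
EOF
# -> ssh j-020 "for i in $(seq 1 24); do start-taskmanager; done"
# -> ssh j-021 "for i in $(seq 1 1); do start-taskmanager; done"
```

Keeping the plain one-host-per-line format valid (count defaulting to 1) would make the change backward compatible with existing slaves files.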