Hi Ralph

I rebuilt Open MPI 1.4.2 with the debug option on both chefli and squid_0. The results are interesting!

I wrote a small HelloMPI app which basically just calls usleep to pause for 5 seconds.
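(For reference, HelloMPI is essentially just the following - a minimal sketch of it; the exact message text is my own choice:)

    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("Hello from rank %d\n", rank);
        fflush(stdout);                /* make sure the line reaches the xterm */
        usleep(5 * 1000 * 1000);       /* pause 5 seconds so the window stays visible */
        MPI_Finalize();
        return 0;
    }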
Now, calling it as i did before, no MPI errors appear anymore, only the display problem:

jody@chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca plm_rsh_agent "ssh -Y" --xterm 0 ./HelloMPI
/usr/bin/xterm Xt error: Can't open display: localhost:10.0

When i do the same call *with* the debug options, the xterm appears and shows the output of HelloMPI! I attach the output in ompidbg_1.txt. (It also works if i call with '-np 4' and '--xterm 0,1,2,3'.)

Calling hostname the same way does not open an xterm (cf. ompidbg_2.txt). If i use the hold option, the xterm appears with the output of 'hostname' (cf. ompidbg_3.txt). The xterm opens after the line "launch complete for job..." has been written (line 59).

I just found that everything works as expected if i use the '--leave-session-attached' option (without the debug options):

jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y" --leave-session-attached --xterm 0,1,2,3! ./HelloMPI

The xterms are also opened if i do not use the '!' hold option.

What does *not* work is:

jody@aim-triops ~/share/neander $ mpirun -np 2 -host squid_0 -mca plm_rsh_agent "ssh -Y" --leave-session-attached xterm
xterm Xt error: Can't open display:
xterm: DISPLAY is not set
xterm Xt error: Can't open display:
xterm: DISPLAY is not set

But then again, this call works (i.e. an xterm is opened) if all the debug options are used (ompidbg_4.txt). Here the '--leave-session-attached' option is necessary - without it, no xterm.

From these results i would say that there is no basic mishandling of 'ssh', though i have no idea what difference the '--leave-session-attached' option or the debug options make internally.

I hope these observations are helpful
Jody

On Fri, Apr 29, 2011 at 12:08 AM, jody <jody....@gmail.com> wrote:
> Hi Ralph
>
> Thank you for your suggestions.
> I'll be happy to help you.
> I'm not sure if i'll get around to this tomorrow,
> but i certainly will do so on Monday.
>
> Thanks
> Jody
>
> On Thu, Apr 28, 2011 at 11:53 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> Hi Jody
>>
>> I'm not sure when I'll get a chance to work on this - got a deadline to
>> meet. I do have a couple of suggestions, if you wouldn't mind helping debug
>> the problem?
>>
>> It looks to me like the problem is that mpirun is crashing or terminating
>> early for some reason - hence the failures to send msgs to it, and the
>> "lifeline lost" error that leads to the termination of the daemon. If you
>> build a debug version of the code (i.e., --enable-debug on configure), you
>> can get a lot of debug info that traces the behavior.
>>
>> If you could then run your program with
>>
>> -mca plm_base_verbose 5 -mca odls_base_verbose 5 --leave-session-attached
>>
>> and send it to me, we'll see what ORTE thinks it is doing.
>>
>> You could also take a look at the code for implementing the xterm option.
>> You'll find it in
>>
>> orte/mca/odls/base/odls_base_default_fns.c
>>
>> around line 1115. The xterm command syntax is defined in
>>
>> orte/mca/odls/base/odls_base_open.c
>>
>> around line 233 and following. Note that we use "xterm -T" as the cmd.
>> Perhaps you can spot an error in the way we treat xterm?
>>
>> Also, remember that you have to specify that you want us to "hold" the xterm
>> window open even after the process terminates. If you don't specify it, the
>> window automatically closes upon completion of the process.
>> So a fast-running cmd like "hostname" might disappear so quickly that it
>> causes a race condition problem.
>>
>> You might want to try a spinner application - i.e., output something and
>> then sit in a loop or sleep for some period of time. Or, use the "hold"
>> option to keep the window open - you designate "hold" by putting a '!'
>> before the rank, e.g., "mpirun -np 2 -xterm \!2 hostname"
>>
>>
>> On Apr 28, 2011, at 8:38 AM, jody wrote:
>>
>>> Hi
>>>
>>> Unfortunately this does not solve my problem.
>>> While i can do
>>> ssh -Y squid_0 xterm
>>> and this will open an xterm on my machine (chefli),
>>> i run into problems with the -xterm option of Open MPI:
>>>
>>> jody@chefli ~/share/neander $ mpirun -np 4 -mca plm_rsh_agent "ssh -Y" -host squid_0 --xterm 1 hostname
>>> squid_0
>>> [squid_0:28046] [[35219,0],1]->[[35219,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 8]
>>> [squid_0:28046] [[35219,0],1] routed:binomial: Connection to lifeline [[35219,0],0] lost
>>> [squid_0:28046] [[35219,0],1]->[[35219,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 8]
>>> [squid_0:28046] [[35219,0],1] routed:binomial: Connection to lifeline [[35219,0],0] lost
>>> /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>
>>> By the way, when i look at the DISPLAY variable in an xterm window opened via squid_0,
>>> i also get "localhost:11.0".
>>>
>>> Actually, the difference with using "-mca plm_rsh_agent" is that the lines
>>> with the warnings about "xauth" and "untrusted X" do not appear:
>>>
>>> jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -xterm 1 hostname
>>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>> squid_0
>>> [squid_0:28337] [[34926,0],1]->[[34926,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 8]
>>> [squid_0:28337] [[34926,0],1] routed:binomial: Connection to lifeline [[34926,0],0] lost
>>> [squid_0:28337] [[34926,0],1]->[[34926,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 8]
>>> [squid_0:28337] [[34926,0],1] routed:binomial: Connection to lifeline [[34926,0],0] lost
>>> /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>
>>>
>>> I have doubts that the "-Y" is passed correctly:
>>> jody@triops ~/share/neander $ mpirun -np -mca plm_rsh_agent "ssh -Y" -host squid_0 xterm
>>> xterm Xt error: Can't open display:
>>> xterm: DISPLAY is not set
>>> xterm Xt error: Can't open display:
>>> xterm: DISPLAY is not set
>>>
>>>
>>> ---> as a matter of fact, i noticed that the xterm option doesn't work locally either:
>>> mpirun -np 4 -xterm 1 /usr/bin/printenv
>>> prints everything onto the console.
>>>
>>> Do you have any other suggestions i could try?
>>>
>>> Thank You
>>> Jody
>>>
>>> On Thu, Apr 28, 2011 at 3:06 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>> Should be able to just set
>>>>
>>>> -mca plm_rsh_agent "ssh -Y"
>>>>
>>>> on your cmd line, I believe
>>>>
>>>> On Apr 28, 2011, at 12:53 AM, jody wrote:
>>>>
>>>>> Hi Ralph
>>>>>
>>>>> Is there an easy way i could modify the Open MPI code so that it would use
>>>>> the -Y option for ssh when connecting to remote machines?
>>>>>
>>>>> Thank You
>>>>> Jody
>>>>>
>>>>> On Thu, Apr 7, 2011 at 4:01 PM, jody <jody....@gmail.com> wrote:
>>>>>> Hi Ralph
>>>>>> thank you for your suggestions. After some fiddling, i found that after my
>>>>>> last update (gentoo) my sshd_config had been overwritten
>>>>>> (X11Forwarding was set to 'no').
>>>>>>
>>>>>> After correcting that, i can now open remote terminals with 'ssh -Y'
>>>>>> and with 'ssh -X'
>>>>>> (but with '-X' i still get those xauth warnings).
>>>>>>
>>>>>> But the xterm option still doesn't work:
>>>>>> jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -xterm 1,2 printenv | grep WORLD_RANK
>>>>>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>> /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>>>> /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>>>> OMPI_COMM_WORLD_RANK=0
>>>>>> [aim-squid_0:09856] [[54132,0],1]->[[54132,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 8]
>>>>>> [aim-squid_0:09856] [[54132,0],1] routed:binomial: Connection to lifeline [[54132,0],0] lost
>>>>>>
>>>>>> So it looks like the two processes on squid_0 can't open the display this way,
>>>>>> but one of them writes its output to the console...
>>>>>> Surprisingly, they are trying 'localhost:11.0', whereas when i use 'ssh -Y'
>>>>>> the DISPLAY variable is set to 'localhost:10.0'.
>>>>>>
>>>>>> So in what way would OMPI have to be adapted so that -xterm would work?
>>>>>>
>>>>>> Thank You
>>>>>> Jody
>>>>>>
>>>>>> On Wed, Apr 6, 2011 at 8:32 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>> Here's a little more info - it's for Cygwin, but I don't see anything
>>>>>>> Cygwin-specific in the answers:
>>>>>>> http://x.cygwin.com/docs/faq/cygwin-x-faq.html#q-ssh-no-x11forwarding
>>>>>>>
>>>>>>> On Apr 6, 2011, at 12:30 PM, Ralph Castain wrote:
>>>>>>>
>>>>>>> Sorry Jody - I should have read your note more carefully to see that you
>>>>>>> already tried -Y. :-(
>>>>>>> Not sure what to suggest...
>>>>>>>
>>>>>>> On Apr 6, 2011, at 12:29 PM, Ralph Castain wrote:
>>>>>>>
>>>>>>> Like I said, I'm no expert. However, a quick "google" revealed this result:
>>>>>>>
>>>>>>> When trying to set up X11 forwarding over an ssh session to a remote server
>>>>>>> with the -X switch, I was getting an error like "Warning: No xauth
>>>>>>> data; using fake authentication data for X11 forwarding."
>>>>>>>
>>>>>>> When doing something like
>>>>>>> ssh -Xl root 10.1.1.9
>>>>>>> to a remote server, the authentication worked, but I got an error message like:
>>>>>>>
>>>>>>> jason@badman ~/bin $ ssh -Xl root 10.1.1.9
>>>>>>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>> Last login: Wed Apr 14 18:18:39 2010 from 10.1.1.5
>>>>>>> [root@RHEL ~]#
>>>>>>> and any X programs I ran would not display on my local system.
>>>>>>>
>>>>>>> Turns out the solution is to use the -Y switch instead:
>>>>>>>
>>>>>>> ssh -Yl root 10.1.1.9
>>>>>>>
>>>>>>> and that worked fine.
>>>>>>>
>>>>>>> See if that works for you - if it does, we may have to modify OMPI to
>>>>>>> accommodate.
>>>>>>>
>>>>>>> On Apr 6, 2011, at 9:19 AM, jody wrote:
>>>>>>>
>>>>>>> Hi Ralph
>>>>>>> No, after the above error message mpirun has exited.
>>>>>>>
>>>>>>> But i also noticed that it is possible to ssh into squid_0 and open an xterm there:
>>>>>>>
>>>>>>> jody@chefli ~/share/neander $ ssh -Y squid_0
>>>>>>> Last login: Wed Apr 6 17:14:02 CEST 2011 from chefli.uzh.ch on pts/0
>>>>>>> jody@squid_0 ~ $ xterm
>>>>>>> xterm Xt error: Can't open display:
>>>>>>> xterm: DISPLAY is not set
>>>>>>> jody@squid_0 ~ $ export DISPLAY=130.60.126.74:0.0
>>>>>>> jody@squid_0 ~ $ xterm
>>>>>>> xterm Xt error: Can't open display: 130.60.126.74:0.0
>>>>>>> jody@squid_0 ~ $ export DISPLAY=chefli.uzh.ch:0.0
>>>>>>> jody@squid_0 ~ $ xterm
>>>>>>> xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>> jody@squid_0 ~ $ exit
>>>>>>> logout
>>>>>>>
>>>>>>> Same thing with ssh -X, but here i get the same warning/error messages
>>>>>>> as with mpirun:
>>>>>>>
>>>>>>> jody@chefli ~/share/neander $ ssh -X squid_0
>>>>>>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>> Last login: Wed Apr 6 17:12:31 CEST 2011 from chefli.uzh.ch on ssh
>>>>>>>
>>>>>>> So perhaps the whole problem is linked to that xauth thing.
>>>>>>> Do you have a suggestion how this can be solved?
>>>>>>>
>>>>>>> Thank You
>>>>>>> Jody
>>>>>>>
>>>>>>> On Wed, Apr 6, 2011 at 4:41 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>
>>>>>>> If I read your error messages correctly, it looks like mpirun is crashing -
>>>>>>> the daemon is complaining that it lost the socket connection back to mpirun,
>>>>>>> and hence will abort.
>>>>>>>
>>>>>>> Are you seeing mpirun still alive?
>>>>>>>
>>>>>>>
>>>>>>> On Apr 5, 2011, at 4:46 AM, jody wrote:
>>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> On my workstation and the cluster i set up Open MPI (v 1.4.2) so that
>>>>>>> it works in "text-mode":
>>>>>>>
>>>>>>> $ mpirun -np 4 -x DISPLAY -host squid_0 printenv | grep WORLD_RANK
>>>>>>> OMPI_COMM_WORLD_RANK=0
>>>>>>> OMPI_COMM_WORLD_RANK=1
>>>>>>> OMPI_COMM_WORLD_RANK=2
>>>>>>> OMPI_COMM_WORLD_RANK=3
>>>>>>>
>>>>>>> but when i use the -xterm option to mpirun, it doesn't work:
>>>>>>>
>>>>>>> $ mpirun -np 4 -x DISPLAY -host squid_0 -xterm 1,2 printenv | grep WORLD_RANK
>>>>>>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>> OMPI_COMM_WORLD_RANK=0
>>>>>>> [squid_0:05266] [[55607,0],1]->[[55607,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 8]
>>>>>>> [squid_0:05266] [[55607,0],1] routed:binomial: Connection to lifeline [[55607,0],0] lost
>>>>>>> /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>> /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>>
>>>>>>> (strange: one of the processes wrote its output to the console)
>>>>>>>
>>>>>>> No matter whether i set the DISPLAY variable to the full hostname of the workstation,
>>>>>>> to the IP address of the workstation, or simply to ":0.0", it doesn't work.
>>>>>>>
>>>>>>> But i do have xauth data (as far as i know):
>>>>>>>
>>>>>>> On the remote (squid_0):
>>>>>>> jody@squid_0 ~ $ xauth list
>>>>>>> chefli/unix:10  MIT-MAGIC-COOKIE-1  5293e179bc7b2036d87cbcdf14891d0c
>>>>>>> chefli/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>> chefli.uzh.ch:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>
>>>>>>> On the workstation:
>>>>>>> $ xauth list
>>>>>>> chefli/unix:10  MIT-MAGIC-COOKIE-1  5293e179bc7b2036d87cbcdf14891d0c
>>>>>>> chefli/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>> localhost.localdomain/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>> chefli.uzh.ch/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>
>>>>>>> In sshd_config on the workstation i have 'X11Forwarding yes'.
>>>>>>> I have also done
>>>>>>> xhost + squid_0
>>>>>>> on the workstation.
>>>>>>>
>>>>>>> How can i get the -xterm option running?
>>>>>>>
>>>>>>> Thank You
>>>>>>> Jody
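PS, for anyone reading along with the attached logs below: the xterm wrapping Ralph describes boils down to something like this little stand-alone test (a sketch of the mechanism only, not the actual ORTE code from odls_base_default_fns.c; the '-hold' flag and the window title here are my own choices):

    /* xtermwrap.c - wrap a command in an xterm, roughly:
     *   xterm -T <title> [-hold] -e <cmd> <args...>
     * The window can only appear if DISPLAY (and X authorization)
     * are present in the environment of the process doing the exec. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        char *xargv[64];
        int i = 0, j;
        int hold = 1;                   /* my stand-in for the '!' hold option */

        if (argc < 2) {
            fprintf(stderr, "usage: %s cmd [args...]\n", argv[0]);
            return 1;
        }
        if (getenv("DISPLAY") == NULL)
            fprintf(stderr, "warning: DISPLAY is not set\n");

        xargv[i++] = "xterm";
        xargv[i++] = "-T";
        xargv[i++] = "test window";     /* window title */
        if (hold)
            xargv[i++] = "-hold";       /* keep the window open after cmd exits */
        xargv[i++] = "-e";
        for (j = 1; j < argc && i < 63; j++)
            xargv[i++] = argv[j];
        xargv[i] = NULL;

        execvp("xterm", xargv);         /* replaces this process; returns only on error */
        perror("execvp xterm");
        return 1;
    }

The point being: whether the window appears depends entirely on what DISPLAY (and xauth data) the exec'ing process sees, which is what all the "Can't open display" errors above are about.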
jody@chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca plm_rsh_agent "ssh -Y" -mca plm_base_verbose 5 -mca odls_base_verbose 5 --leave-session-attached --xterm 0 ./HelloMPI [chefli:02420] mca:base:select:( plm) Querying component [rsh] [chefli:02420] mca:base:select:( plm) Query of component [rsh] set priority to 10 [chefli:02420] mca:base:select:( plm) Querying component [slurm] [chefli:02420] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module [chefli:02420] mca:base:select:( plm) Selected component [rsh] [chefli:02420] plm:base:set_hnp_name: initial bias 2420 nodename hash 72192778 [chefli:02420] plm:base:set_hnp_name: final jobfam 40499 [chefli:02420] [[40499,0],0] plm:base:receive start comm [chefli:02420] mca:base:select:( odls) Querying component [default] [chefli:02420] mca:base:select:( odls) Query of component [default] set priority to 1 [chefli:02420] mca:base:select:( odls) Selected component [default] [chefli:02420] [[40499,0],0] plm:rsh: setting up job [40499,1] [chefli:02420] [[40499,0],0] plm:base:setup_job for job [40499,1] [chefli:02420] [[40499,0],0] plm:rsh: local shell: 0 (bash) [chefli:02420] [[40499,0],0] plm:rsh: assuming same remote shell as local shell [chefli:02420] [[40499,0],0] plm:rsh: remote shell: 0 (bash) [chefli:02420] [[40499,0],0] plm:rsh: final template argv: /usr/bin/ssh -Y -X <template> orted -mca ess env -mca orte_ess_jobid 2654142464 -mca orte_ess_vpid <template> -mca orte_ess_num_procs 2 --hnp-uri "2654142464.0;tcp://192.168.0.14:39093" -mca plm_base_verbose 5 -mca odls_base_verbose 5 --xterm 0 -mca plm_rsh_agent "ssh -Y" [chefli:02420] [[40499,0],0] plm:rsh: launching on node squid_0 [chefli:02420] [[40499,0],0] plm:rsh: recording launch of daemon [[40499,0],1] [chefli:02420] [[40499,0],0] plm:rsh: executing: (//usr/bin/ssh) [/usr/bin/ssh -Y -X squid_0 orted -mca ess env -mca orte_ess_jobid 2654142464 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri "2654142464.0;tcp://192.168.0.14:39093" -mca plm_base_verbose 5 -mca odls_base_verbose 5 --xterm 0 -mca plm_rsh_agent "ssh -Y"] [squid_0:19442] mca:base:select:( odls) Querying component [default] [squid_0:19442] mca:base:select:( odls) Query of component [default] set priority to 1 [squid_0:19442] mca:base:select:( odls) Selected component [default] [chefli:02420] [[40499,0],0] plm:base:daemon_callback [chefli:02420] [[40499,0],0] plm:base:orted_report_launch from daemon [[40499,0],1] [chefli:02420] [[40499,0],0] plm:base:orted_report_launch completed for daemon [[40499,0],1] [chefli:02420] [[40499,0],0] plm:base:daemon_callback completed [chefli:02420] [[40499,0],0] plm:base:launch_apps for job [40499,1] [chefli:02420] [[40499,0],0] plm:base:report_launched for job [40499,1] [chefli:02420] [[40499,0],0] odls:constructing child list [chefli:02420] [[40499,0],0] odls:construct_child_list unpacking data to launch job [40499,1] [chefli:02420] [[40499,0],0] odls:construct_child_list adding new jobdat for job [40499,1] [chefli:02420] [[40499,0],0] odls:construct_child_list unpacking 1 app_contexts [chefli:02420] [[40499,0],0] odls:constructing child list - checking proc 0 on node 1 with daemon 1 [chefli:02420] [[40499,0],0] odls:construct:child: num_participating 1 [chefli:02420] [[40499,0],0] odls:launch found 12 processors for 0 children and set oversubscribed to false [chefli:02420] [[40499,0],0] odls:launch reporting job [40499,1] launch status [chefli:02420] [[40499,0],0] odls:launch setting waitpids [chefli:02420] [[40499,0],0] plm:base:app_report_launch 
from daemon [[40499,0],0] [chefli:02420] [[40499,0],0] plm:base:app_report_launch completed processing [squid_0:19442] [[40499,0],1] odls:constructing child list [squid_0:19442] [[40499,0],1] odls:construct_child_list unpacking data to launch job [40499,1] [squid_0:19442] [[40499,0],1] odls:construct_child_list adding new jobdat for job [40499,1] [squid_0:19442] [[40499,0],1] odls:construct_child_list unpacking 1 app_contexts [squid_0:19442] [[40499,0],1] odls:constructing child list - checking proc 0 on node 1 with daemon 1 [squid_0:19442] [[40499,0],1] odls:constructing child list - found proc 0 for me! [squid_0:19442] [[40499,0],1] odls:construct:child: num_participating 1 [squid_0:19442] [[40499,0],1] odls:launch found 4 processors for 1 children and set oversubscribed to false [squid_0:19442] [[40499,0],1] odls:launch reporting job [40499,1] launch status [squid_0:19442] [[40499,0],1] odls:launch setting waitpids [chefli:02420] [[40499,0],0] plm:base:app_report_launch reissuing non-blocking recv [chefli:02420] [[40499,0],0] plm:base:app_report_launch from daemon [[40499,0],1] [chefli:02420] [[40499,0],0] plm:base:app_report_launched for proc [[40499,1],0] from daemon [[40499,0],1]: pid 19446 state 2 exit 0 [chefli:02420] [[40499,0],0] plm:base:app_report_launch completed processing [chefli:02420] [[40499,0],0] plm:base:report_launched all apps reported [chefli:02420] [[40499,0],0] plm:base:launch wiring up iof [chefli:02420] [[40499,0],0] plm:base:launch completed for job [40499,1] [squid_0:19442] [[40499,0],1] odls: registering sync on child [[40499,1],0] [squid_0:19442] [[40499,0],1] odls:sync nidmap requested for job [40499,1] [squid_0:19442] [[40499,0],1] odls: sending sync ack to child [[40499,1],0] with 144 bytes of data [squid_0:19442] [[40499,0],1] odls: sending contact info to HNP [squid_0:19442] [[40499,0],1] odls: collecting data from child [[40499,1],0] [squid_0:19442] [[40499,0],1] odls: executing collective [squid_0:19442] [[40499,0],1] odls: daemon collective called [squid_0:19442] [[40499,0],1] odls: daemon collective for job [40499,1] from [[40499,0],1] type 2 num_collected 1 num_participating 1 num_contributors 1 [squid_0:19442] [[40499,0],1] odls: daemon collective not the HNP - sending to parent [[40499,0],0] [squid_0:19442] [[40499,0],1] odls: collective completed [chefli:02420] [[40499,0],0] odls: daemon collective called [chefli:02420] [[40499,0],0] odls: daemon collective for job [40499,1] from [[40499,0],1] type 2 num_collected 1 num_participating 1 num_contributors 1 [chefli:02420] [[40499,0],0] odls: daemon collective HNP - xcasting to job [40499,1] [squid_0:19442] [[40499,0],1] odls: sending message to tag 15 on child [[40499,1],0] [squid_0:19442] [[40499,0],1] odls: collecting data from child [[40499,1],0] [squid_0:19442] [[40499,0],1] odls: executing collective [squid_0:19442] [[40499,0],1] odls: daemon collective called [squid_0:19442] [[40499,0],1] odls: daemon collective for job [40499,1] from [[40499,0],1] type 1 num_collected 1 num_participating 1 num_contributors 1 [squid_0:19442] [[40499,0],1] odls: daemon collective not the HNP - sending to parent [[40499,0],0] [squid_0:19442] [[40499,0],1] odls: collective completed [chefli:02420] [[40499,0],0] odls: daemon collective called [chefli:02420] [[40499,0],0] odls: daemon collective for job [40499,1] from [[40499,0],1] type 1 num_collected 1 num_participating 1 num_contributors 1 [chefli:02420] [[40499,0],0] odls: daemon collective HNP - xcasting to job [40499,1] [squid_0:19442] [[40499,0],1] odls: 
sending message to tag 17 on child [[40499,1],0] [squid_0:19442] [[40499,0],1] odls: collecting data from child [[40499,1],0] [squid_0:19442] [[40499,0],1] odls: executing collective [squid_0:19442] [[40499,0],1] odls: daemon collective called [squid_0:19442] [[40499,0],1] odls: daemon collective for job [40499,1] from [[40499,0],1] type 1 num_collected 1 num_participating 1 num_contributors 1 [squid_0:19442] [[40499,0],1] odls: daemon collective not the HNP - sending to parent [[40499,0],0] [squid_0:19442] [[40499,0],1] odls: collective completed [chefli:02420] [[40499,0],0] odls: daemon collective called [chefli:02420] [[40499,0],0] odls: daemon collective for job [40499,1] from [[40499,0],1] type 1 num_collected 1 num_participating 1 num_contributors 1 [chefli:02420] [[40499,0],0] odls: daemon collective HNP - xcasting to job [40499,1] [squid_0:19442] [[40499,0],1] odls: sending message to tag 17 on child [[40499,1],0] [squid_0:19442] [[40499,0],1] odls: registering sync on child [[40499,1],0] [squid_0:19442] [[40499,0],1] odls: sending sync ack to child [[40499,1],0] with 0 bytes of data [chefli:02420] [[40499,0],0] plm:base:receive got message from [[40499,0],1] [chefli:02420] [[40499,0],0] plm:base:receive got update_proc_state for job [40499,1] [chefli:02420] [[40499,0],0] plm:base:receive got update_proc_state for proc [[40499,1],0] curnt state 4 new state 80 exit_code 0 [chefli:02420] [[40499,0],0] plm:base:check_job_completed for job [40499,1] - num_terminated 1 num_procs 1 [chefli:02420] [[40499,0],0] plm:base:check_job_completed declared job [40499,1] normally terminated - checking all jobs [chefli:02420] [[40499,0],0] plm:base:check_job_completed all jobs terminated - waking up [chefli:02420] [[40499,0],0] plm:base:orted_cmd sending orted_exit commands [chefli:02420] [[40499,0],0] odls:kill_local_proc working on job [WILDCARD] [chefli:02420] [[40499,0],0] plm:base:check_job_completed for job [40499,0] - num_terminated 1 num_procs 2 [squid_0:19442] [[40499,0],1] odls:wait_local_proc child process 19446 terminated [squid_0:19442] [[40499,0],1] odls:notify_iof_complete for child [[40499,1],0] [squid_0:19442] [[40499,0],1] odls:waitpid_fired checking abort file /tmp/openmpi-sessions-jody@squid_0_0/2654142465/0/abort [chefli:02420] [[40499,0],0] plm:base:receive got message from [[40499,0],1] [chefli:02420] [[40499,0],0] plm:base:receive got update_proc_state for job [40499,0] [chefli:02420] [[40499,0],0] plm:base:receive got update_proc_state for proc [[40499,0],1] curnt state 4 new state 80 exit_code 0 [chefli:02420] [[40499,0],0] plm:base:check_job_completed for job [40499,0] - num_terminated 2 num_procs 2 [chefli:02420] [[40499,0],0] plm:base:check_job_completed declared job [40499,0] normally terminated - checking all jobs [chefli:02420] [[40499,0],0] plm:base:receive stop comm [squid_0:19442] [[40499,0],1] odls:waitpid_fired child process [[40499,1],0] terminated normally [squid_0:19442] [[40499,0],1] odls:proc_complete reporting all procs in [40499,1] terminated [squid_0:19442] [[40499,0],1] odls:kill_local_proc working on job [WILDCARD]
jody@chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca plm_rsh_agent "ssh -Y" -mca plm_base_verbose 5 -mca odls_base_verbose 5 --leave-session-attached --xterm 0 hostname [chefli:02476] mca:base:select:( plm) Querying component [rsh] [chefli:02476] mca:base:select:( plm) Query of component [rsh] set priority to 10 [chefli:02476] mca:base:select:( plm) Querying component [slurm] [chefli:02476] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module [chefli:02476] mca:base:select:( plm) Selected component [rsh] [chefli:02476] plm:base:set_hnp_name: initial bias 2476 nodename hash 72192778 [chefli:02476] plm:base:set_hnp_name: final jobfam 40683 [chefli:02476] [[40683,0],0] plm:base:receive start comm [chefli:02476] mca:base:select:( odls) Querying component [default] [chefli:02476] mca:base:select:( odls) Query of component [default] set priority to 1 [chefli:02476] mca:base:select:( odls) Selected component [default] [chefli:02476] [[40683,0],0] plm:rsh: setting up job [40683,1] [chefli:02476] [[40683,0],0] plm:base:setup_job for job [40683,1] [chefli:02476] [[40683,0],0] plm:rsh: local shell: 0 (bash) [chefli:02476] [[40683,0],0] plm:rsh: assuming same remote shell as local shell [chefli:02476] [[40683,0],0] plm:rsh: remote shell: 0 (bash) [chefli:02476] [[40683,0],0] plm:rsh: final template argv: /usr/bin/ssh -Y -X <template> orted -mca ess env -mca orte_ess_jobid 2666201088 -mca orte_ess_vpid <template> -mca orte_ess_num_procs 2 --hnp-uri "2666201088.0;tcp://192.168.0.14:53879" -mca plm_base_verbose 5 -mca odls_base_verbose 5 --xterm 0 -mca plm_rsh_agent "ssh -Y" [chefli:02476] [[40683,0],0] plm:rsh: launching on node squid_0 [chefli:02476] [[40683,0],0] plm:rsh: recording launch of daemon [[40683,0],1] [chefli:02476] [[40683,0],0] plm:rsh: executing: (//usr/bin/ssh) [/usr/bin/ssh -Y -X squid_0 orted -mca ess env -mca orte_ess_jobid 2666201088 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri "2666201088.0;tcp://192.168.0.14:53879" -mca plm_base_verbose 5 -mca odls_base_verbose 5 --xterm 0 -mca plm_rsh_agent "ssh -Y"] [squid_0:19579] mca:base:select:( odls) Querying component [default] [squid_0:19579] mca:base:select:( odls) Query of component [default] set priority to 1 [squid_0:19579] mca:base:select:( odls) Selected component [default] [chefli:02476] [[40683,0],0] plm:base:daemon_callback [chefli:02476] [[40683,0],0] plm:base:orted_report_launch from daemon [[40683,0],1] [chefli:02476] [[40683,0],0] plm:base:orted_report_launch completed for daemon [[40683,0],1] [chefli:02476] [[40683,0],0] plm:base:daemon_callback completed [chefli:02476] [[40683,0],0] plm:base:launch_apps for job [40683,1] [chefli:02476] [[40683,0],0] plm:base:report_launched for job [40683,1] [chefli:02476] [[40683,0],0] odls:constructing child list [chefli:02476] [[40683,0],0] odls:construct_child_list unpacking data to launch job [40683,1] [chefli:02476] [[40683,0],0] odls:construct_child_list adding new jobdat for job [40683,1] [chefli:02476] [[40683,0],0] odls:construct_child_list unpacking 1 app_contexts [chefli:02476] [[40683,0],0] odls:constructing child list - checking proc 0 on node 1 with daemon 1 [chefli:02476] [[40683,0],0] odls:construct:child: num_participating 1 [chefli:02476] [[40683,0],0] odls:launch found 12 processors for 0 children and set oversubscribed to false [chefli:02476] [[40683,0],0] odls:launch reporting job [40683,1] launch status [chefli:02476] [[40683,0],0] odls:launch setting waitpids [chefli:02476] [[40683,0],0] plm:base:app_report_launch 
from daemon [[40683,0],0] [chefli:02476] [[40683,0],0] plm:base:app_report_launch completed processing [squid_0:19579] [[40683,0],1] odls:constructing child list [squid_0:19579] [[40683,0],1] odls:construct_child_list unpacking data to launch job [40683,1] [squid_0:19579] [[40683,0],1] odls:construct_child_list adding new jobdat for job [40683,1] [squid_0:19579] [[40683,0],1] odls:construct_child_list unpacking 1 app_contexts [squid_0:19579] [[40683,0],1] odls:constructing child list - checking proc 0 on node 1 with daemon 1 [squid_0:19579] [[40683,0],1] odls:constructing child list - found proc 0 for me! [squid_0:19579] [[40683,0],1] odls:construct:child: num_participating 1 [squid_0:19579] [[40683,0],1] odls:launch found 4 processors for 1 children and set oversubscribed to false [squid_0:19579] [[40683,0],1] odls:launch reporting job [40683,1] launch status [squid_0:19579] [[40683,0],1] odls:launch setting waitpids [chefli:02476] [[40683,0],0] plm:base:app_report_launch reissuing non-blocking recv [chefli:02476] [[40683,0],0] plm:base:app_report_launch from daemon [[40683,0],1] [chefli:02476] [[40683,0],0] plm:base:app_report_launched for proc [[40683,1],0] from daemon [[40683,0],1]: pid 19583 state 2 exit 0 [chefli:02476] [[40683,0],0] plm:base:app_report_launch completed processing [chefli:02476] [[40683,0],0] plm:base:report_launched all apps reported [chefli:02476] [[40683,0],0] plm:base:launch wiring up iof [chefli:02476] [[40683,0],0] plm:base:launch completed for job [40683,1] [squid_0:19579] [[40683,0],1] odls:wait_local_proc child process 19583 terminated [squid_0:19579] [[40683,0],1] odls:waitpid_fired checking abort file /tmp/openmpi-sessions-jody@squid_0_0/2666201089/0/abort [squid_0:19579] [[40683,0],1] odls:waitpid_fired child process [[40683,1],0] terminated normally [squid_0:19579] [[40683,0],1] odls:notify_iof_complete for child [[40683,1],0] [chefli:02476] [[40683,0],0] plm:base:receive got message from [[40683,0],1] [chefli:02476] [[40683,0],0] plm:base:receive got update_proc_state for job [40683,1] [chefli:02476] [[40683,0],0] plm:base:receive got update_proc_state for proc [[40683,1],0] curnt state 2 new state 80 exit_code 0 [chefli:02476] [[40683,0],0] plm:base:check_job_completed for job [40683,1] - num_terminated 1 num_procs 1 [chefli:02476] [[40683,0],0] plm:base:check_job_completed declared job [40683,1] normally terminated - checking all jobs [chefli:02476] [[40683,0],0] plm:base:check_job_completed all jobs terminated - waking up [chefli:02476] [[40683,0],0] plm:base:orted_cmd sending orted_exit commands [chefli:02476] [[40683,0],0] odls:kill_local_proc working on job [WILDCARD] [chefli:02476] [[40683,0],0] plm:base:check_job_completed for job [40683,0] - num_terminated 1 num_procs 2 [squid_0:19579] [[40683,0],1] odls:proc_complete reporting all procs in [40683,1] terminated [chefli:02476] [[40683,0],0] plm:base:receive got message from [[40683,0],1] [chefli:02476] [[40683,0],0] plm:base:receive got update_proc_state for job [40683,0] [chefli:02476] [[40683,0],0] plm:base:receive got update_proc_state for proc [[40683,0],1] curnt state 4 new state 80 exit_code 0 [chefli:02476] [[40683,0],0] plm:base:check_job_completed for job [40683,0] - num_terminated 2 num_procs 2 [chefli:02476] [[40683,0],0] plm:base:check_job_completed declared job [40683,0] normally terminated - checking all jobs [chefli:02476] [[40683,0],0] plm:base:receive stop comm [squid_0:19579] [[40683,0],1] odls:kill_local_proc working on job [WILDCARD]
jody@chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca plm_rsh_agent "ssh -Y" -mca plm_base_verbose 5 -mca odls_base_verbose 5 --leave-session-attached --xterm 0! hostname [chefli:02487] mca:base:select:( plm) Querying component [rsh] [chefli:02487] mca:base:select:( plm) Query of component [rsh] set priority to 10 [chefli:02487] mca:base:select:( plm) Querying component [slurm] [chefli:02487] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module [chefli:02487] mca:base:select:( plm) Selected component [rsh] [chefli:02487] plm:base:set_hnp_name: initial bias 2487 nodename hash 72192778 [chefli:02487] plm:base:set_hnp_name: final jobfam 40688 [chefli:02487] [[40688,0],0] plm:base:receive start comm [chefli:02487] mca:base:select:( odls) Querying component [default] [chefli:02487] mca:base:select:( odls) Query of component [default] set priority to 1 [chefli:02487] mca:base:select:( odls) Selected component [default] [chefli:02487] [[40688,0],0] plm:rsh: setting up job [40688,1] [chefli:02487] [[40688,0],0] plm:base:setup_job for job [40688,1] [chefli:02487] [[40688,0],0] plm:rsh: local shell: 0 (bash) [chefli:02487] [[40688,0],0] plm:rsh: assuming same remote shell as local shell [chefli:02487] [[40688,0],0] plm:rsh: remote shell: 0 (bash) [chefli:02487] [[40688,0],0] plm:rsh: final template argv: /usr/bin/ssh -Y -X <template> orted -mca ess env -mca orte_ess_jobid 2666528768 -mca orte_ess_vpid <template> -mca orte_ess_num_procs 2 --hnp-uri "2666528768.0;tcp://192.168.0.14:36402" -mca plm_base_verbose 5 -mca odls_base_verbose 5 --xterm 0! -mca plm_rsh_agent "ssh -Y" [chefli:02487] [[40688,0],0] plm:rsh: launching on node squid_0 [chefli:02487] [[40688,0],0] plm:rsh: recording launch of daemon [[40688,0],1] [chefli:02487] [[40688,0],0] plm:rsh: executing: (//usr/bin/ssh) [/usr/bin/ssh -Y -X squid_0 orted -mca ess env -mca orte_ess_jobid 2666528768 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri "2666528768.0;tcp://192.168.0.14:36402" -mca plm_base_verbose 5 -mca odls_base_verbose 5 --xterm 0! 
-mca plm_rsh_agent "ssh -Y"] [squid_0:19613] mca:base:select:( odls) Querying component [default] [squid_0:19613] mca:base:select:( odls) Query of component [default] set priority to 1 [squid_0:19613] mca:base:select:( odls) Selected component [default] [chefli:02487] [[40688,0],0] plm:base:daemon_callback [chefli:02487] [[40688,0],0] plm:base:orted_report_launch from daemon [[40688,0],1] [chefli:02487] [[40688,0],0] plm:base:orted_report_launch completed for daemon [[40688,0],1] [chefli:02487] [[40688,0],0] plm:base:daemon_callback completed [chefli:02487] [[40688,0],0] plm:base:launch_apps for job [40688,1] [chefli:02487] [[40688,0],0] plm:base:report_launched for job [40688,1] [chefli:02487] [[40688,0],0] odls:constructing child list [chefli:02487] [[40688,0],0] odls:construct_child_list unpacking data to launch job [40688,1] [chefli:02487] [[40688,0],0] odls:construct_child_list adding new jobdat for job [40688,1] [chefli:02487] [[40688,0],0] odls:construct_child_list unpacking 1 app_contexts [chefli:02487] [[40688,0],0] odls:constructing child list - checking proc 0 on node 1 with daemon 1 [chefli:02487] [[40688,0],0] odls:construct:child: num_participating 1 [chefli:02487] [[40688,0],0] odls:launch found 12 processors for 0 children and set oversubscribed to false [chefli:02487] [[40688,0],0] odls:launch reporting job [40688,1] launch status [chefli:02487] [[40688,0],0] odls:launch setting waitpids [chefli:02487] [[40688,0],0] plm:base:app_report_launch from daemon [[40688,0],0] [chefli:02487] [[40688,0],0] plm:base:app_report_launch completed processing [squid_0:19613] [[40688,0],1] odls:constructing child list [squid_0:19613] [[40688,0],1] odls:construct_child_list unpacking data to launch job [40688,1] [squid_0:19613] [[40688,0],1] odls:construct_child_list adding new jobdat for job [40688,1] [squid_0:19613] [[40688,0],1] odls:construct_child_list unpacking 1 app_contexts [squid_0:19613] [[40688,0],1] odls:constructing child list - checking proc 0 on node 1 with daemon 1 [squid_0:19613] [[40688,0],1] odls:constructing child list - found proc 0 for me! 
[squid_0:19613] [[40688,0],1] odls:construct:child: num_participating 1 [squid_0:19613] [[40688,0],1] odls:launch found 4 processors for 1 children and set oversubscribed to false [squid_0:19613] [[40688,0],1] odls:launch reporting job [40688,1] launch status [squid_0:19613] [[40688,0],1] odls:launch setting waitpids [chefli:02487] [[40688,0],0] plm:base:app_report_launch reissuing non-blocking recv [chefli:02487] [[40688,0],0] plm:base:app_report_launch from daemon [[40688,0],1] [chefli:02487] [[40688,0],0] plm:base:app_report_launched for proc [[40688,1],0] from daemon [[40688,0],1]: pid 19617 state 2 exit 0 [chefli:02487] [[40688,0],0] plm:base:app_report_launch completed processing [chefli:02487] [[40688,0],0] plm:base:report_launched all apps reported [chefli:02487] [[40688,0],0] plm:base:launch wiring up iof [chefli:02487] [[40688,0],0] plm:base:launch completed for job [40688,1] [squid_0:19613] [[40688,0],1] odls:wait_local_proc child process 19617 terminated [squid_0:19613] [[40688,0],1] odls:waitpid_fired checking abort file /tmp/openmpi-sessions-jody@squid_0_0/2666528769/0/abort [squid_0:19613] [[40688,0],1] odls:waitpid_fired child process [[40688,1],0] terminated normally [squid_0:19613] [[40688,0],1] odls:notify_iof_complete for child [[40688,1],0] [squid_0:19613] [[40688,0],1] odls:proc_complete reporting all procs in [40688,1] terminated [chefli:02487] [[40688,0],0] plm:base:receive got message from [[40688,0],1] [chefli:02487] [[40688,0],0] plm:base:receive got update_proc_state for job [40688,1] [chefli:02487] [[40688,0],0] plm:base:receive got update_proc_state for proc [[40688,1],0] curnt state 2 new state 80 exit_code 0 [chefli:02487] [[40688,0],0] plm:base:check_job_completed for job [40688,1] - num_terminated 1 num_procs 1 [chefli:02487] [[40688,0],0] plm:base:check_job_completed declared job [40688,1] normally terminated - checking all jobs [chefli:02487] [[40688,0],0] plm:base:check_job_completed all jobs terminated - waking up [chefli:02487] [[40688,0],0] plm:base:orted_cmd sending orted_exit commands [chefli:02487] [[40688,0],0] odls:kill_local_proc working on job [WILDCARD] [chefli:02487] [[40688,0],0] plm:base:check_job_completed for job [40688,0] - num_terminated 1 num_procs 2 [chefli:02487] [[40688,0],0] plm:base:receive got message from [[40688,0],1] [chefli:02487] [[40688,0],0] plm:base:receive got update_proc_state for job [40688,0] [chefli:02487] [[40688,0],0] plm:base:receive got update_proc_state for proc [[40688,0],1] curnt state 4 new state 80 exit_code 0 [chefli:02487] [[40688,0],0] plm:base:check_job_completed for job [40688,0] - num_terminated 2 num_procs 2 [chefli:02487] [[40688,0],0] plm:base:check_job_completed declared job [40688,0] normally terminated - checking all jobs [squid_0:19613] [[40688,0],1] odls:kill_local_proc working on job [WILDCARD] [chefli:02487] [[40688,0],0] plm:base:receive stop comm
jody@chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca plm_rsh_agent "ssh -Y" -mca plm_base_verbose 5 -mca odls_base_verbose 5 --leave-session-attached xterm [chefli:02619] mca:base:select:( plm) Querying component [rsh] [chefli:02619] mca:base:select:( plm) Query of component [rsh] set priority to 10 [chefli:02619] mca:base:select:( plm) Querying component [slurm] [chefli:02619] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module [chefli:02619] mca:base:select:( plm) Selected component [rsh] [chefli:02619] plm:base:set_hnp_name: initial bias 2619 nodename hash 72192778 [chefli:02619] plm:base:set_hnp_name: final jobfam 40316 [chefli:02619] [[40316,0],0] plm:base:receive start comm [chefli:02619] mca:base:select:( odls) Querying component [default] [chefli:02619] mca:base:select:( odls) Query of component [default] set priority to 1 [chefli:02619] mca:base:select:( odls) Selected component [default] [chefli:02619] [[40316,0],0] plm:rsh: setting up job [40316,1] [chefli:02619] [[40316,0],0] plm:base:setup_job for job [40316,1] [chefli:02619] [[40316,0],0] plm:rsh: local shell: 0 (bash) [chefli:02619] [[40316,0],0] plm:rsh: assuming same remote shell as local shell [chefli:02619] [[40316,0],0] plm:rsh: remote shell: 0 (bash) [chefli:02619] [[40316,0],0] plm:rsh: final template argv: /usr/bin/ssh -Y <template> orted -mca ess env -mca orte_ess_jobid 2642149376 -mca orte_ess_vpid <template> -mca orte_ess_num_procs 2 --hnp-uri "2642149376.0;tcp://192.168.0.14:57848" -mca plm_base_verbose 5 -mca odls_base_verbose 5 -mca plm_rsh_agent "ssh -Y" [chefli:02619] [[40316,0],0] plm:rsh: launching on node squid_0 [chefli:02619] [[40316,0],0] plm:rsh: recording launch of daemon [[40316,0],1] [chefli:02619] [[40316,0],0] plm:rsh: executing: (//usr/bin/ssh) [/usr/bin/ssh -Y squid_0 orted -mca ess env -mca orte_ess_jobid 2642149376 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri "2642149376.0;tcp://192.168.0.14:57848" -mca plm_base_verbose 5 -mca odls_base_verbose 5 -mca plm_rsh_agent "ssh -Y"] [squid_0:20023] mca:base:select:( odls) Querying component [default] [squid_0:20023] mca:base:select:( odls) Query of component [default] set priority to 1 [squid_0:20023] mca:base:select:( odls) Selected component [default] [chefli:02619] [[40316,0],0] plm:base:daemon_callback [chefli:02619] [[40316,0],0] plm:base:orted_report_launch from daemon [[40316,0],1] [chefli:02619] [[40316,0],0] plm:base:orted_report_launch completed for daemon [[40316,0],1] [chefli:02619] [[40316,0],0] plm:base:daemon_callback completed [chefli:02619] [[40316,0],0] plm:base:launch_apps for job [40316,1] [chefli:02619] [[40316,0],0] plm:base:report_launched for job [40316,1] [chefli:02619] [[40316,0],0] odls:constructing child list [chefli:02619] [[40316,0],0] odls:construct_child_list unpacking data to launch job [40316,1] [chefli:02619] [[40316,0],0] odls:construct_child_list adding new jobdat for job [40316,1] [chefli:02619] [[40316,0],0] odls:construct_child_list unpacking 1 app_contexts [chefli:02619] [[40316,0],0] odls:constructing child list - checking proc 0 on node 1 with daemon 1 [chefli:02619] [[40316,0],0] odls:construct:child: num_participating 1 [chefli:02619] [[40316,0],0] odls:launch found 12 processors for 0 children and set oversubscribed to false [chefli:02619] [[40316,0],0] odls:launch reporting job [40316,1] launch status [chefli:02619] [[40316,0],0] odls:launch setting waitpids [chefli:02619] [[40316,0],0] plm:base:app_report_launch from daemon [[40316,0],0] [chefli:02619] 
[[40316,0],0] plm:base:app_report_launch completed processing [squid_0:20023] [[40316,0],1] odls:constructing child list [squid_0:20023] [[40316,0],1] odls:construct_child_list unpacking data to launch job [40316,1] [squid_0:20023] [[40316,0],1] odls:construct_child_list adding new jobdat for job [40316,1] [squid_0:20023] [[40316,0],1] odls:construct_child_list unpacking 1 app_contexts [squid_0:20023] [[40316,0],1] odls:constructing child list - checking proc 0 on node 1 with daemon 1 [squid_0:20023] [[40316,0],1] odls:constructing child list - found proc 0 for me! [squid_0:20023] [[40316,0],1] odls:construct:child: num_participating 1 [squid_0:20023] [[40316,0],1] odls:launch found 4 processors for 1 children and set oversubscribed to false [chefli:02619] [[40316,0],0] plm:base:app_report_launch reissuing non-blocking recv [chefli:02619] [[40316,0],0] plm:base:app_report_launch from daemon [[40316,0],1] [chefli:02619] [[40316,0],0] plm:base:app_report_launched for proc [[40316,1],0] from daemon [[40316,0],1]: pid 20027 state 2 exit 0 [chefli:02619] [[40316,0],0] plm:base:app_report_launch completed processing [chefli:02619] [[40316,0],0] plm:base:report_launched all apps reported [chefli:02619] [[40316,0],0] plm:base:launch wiring up iof [chefli:02619] [[40316,0],0] plm:base:launch completed for job [40316,1] [squid_0:20023] [[40316,0],1] odls:launch reporting job [40316,1] launch status [squid_0:20023] [[40316,0],1] odls:launch setting waitpids [chefli:02619] [[40316,0],0] plm:base:receive got message from [[40316,0],1] [squid_0:20023] [[40316,0],1] odls:wait_local_proc child process 20027 terminated [squid_0:20023] [[40316,0],1] odls:waitpid_fired checking abort file /tmp/openmpi-sessions-jody@squid_0_0/2642149377/0/abort [squid_0:20023] [[40316,0],1] odls:waitpid_fired child process [[40316,1],0] terminated normally [squid_0:20023] [[40316,0],1] odls:notify_iof_complete for child [[40316,1],0] [chefli:02619] [[40316,0],0] plm:base:receive got update_proc_state for job [40316,1] [squid_0:20023] [[40316,0],1] odls:proc_complete reporting all procs in [40316,1] terminated [chefli:02619] [[40316,0],0] plm:base:receive got update_proc_state for proc [[40316,1],0] curnt state 2 new state 80 exit_code 0 [chefli:02619] [[40316,0],0] plm:base:check_job_completed for job [40316,1] - num_terminated 1 num_procs 1 [chefli:02619] [[40316,0],0] plm:base:check_job_completed declared job [40316,1] normally terminated - checking all jobs [chefli:02619] [[40316,0],0] plm:base:check_job_completed all jobs terminated - waking up [chefli:02619] [[40316,0],0] plm:base:orted_cmd sending orted_exit commands [chefli:02619] [[40316,0],0] odls:kill_local_proc working on job [WILDCARD] [chefli:02619] [[40316,0],0] plm:base:check_job_completed for job [40316,0] - num_terminated 1 num_procs 2 [chefli:02619] [[40316,0],0] plm:base:receive got message from [[40316,0],1] [chefli:02619] [[40316,0],0] plm:base:receive got update_proc_state for job [40316,0] [chefli:02619] [[40316,0],0] plm:base:receive got update_proc_state for proc [[40316,0],1] curnt state 4 new state 80 exit_code 0 [chefli:02619] [[40316,0],0] plm:base:check_job_completed for job [40316,0] - num_terminated 2 num_procs 2 [chefli:02619] [[40316,0],0] plm:base:check_job_completed declared job [40316,0] normally terminated - checking all jobs [chefli:02619] [[40316,0],0] plm:base:receive stop comm [squid_0:20023] [[40316,0],1] odls:kill_local_proc working on job [WILDCARD]