Hi Ralph
I rebuilt Open MPI 1.4.2 with the debug option on both chefli and squid_0.
The results are interesting!
I wrote a small HelloMPI app which basically calls usleep to pause
for 5 seconds.
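For context, a minimal sketch of such a test app (this is an assumption about what HelloMPI roughly looks like, not the actual source): each rank prints a line and then sleeps, so the xterm stays open long enough to be seen.

```c
/* hello_mpi.c - sketch of a HelloMPI-style test app.
 * Build:  mpicc hello_mpi.c -o HelloMPI
 * Run:    mpirun -np 4 ./HelloMPI
 */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello from rank %d of %d\n", rank, size);
    fflush(stdout);           /* make sure the output reaches the xterm */

    usleep(5 * 1000 * 1000);  /* pause 5 seconds before the window closes */

    MPI_Finalize();
    return 0;
}
```

A long enough sleep sidesteps the race condition Ralph mentions below, where a fast-running command like 'hostname' exits before the xterm can display anything.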
When i call it as i did before, no MPI errors appear anymore, only the
display problems:
jody@chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca
plm_rsh_agent "ssh -Y" --xterm 0 ./HelloMPI
/usr/bin/xterm Xt error: Can't open display: localhost:10.0
When i do the same call *with* the debug option, the xterm appears and
shows the output of HelloMPI!
I attach the output in ompidbg_1.txt. (It also works if i call with
'-np 4' and '--xterm 0,1,2,3'.)
Calling hostname the same way does not open an xterm (cf. ompidbg_2.txt).
If i use the hold option, the xterm appears with the output of
'hostname' (cf. ompidbg_3.txt).
The xterm opens after the line "launch complete for job..." has been
written (line 59).
I just found that everything works as expected if i use the
'--leave-session-attached' option (without the debug options):
jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -mca
plm_rsh_agent "ssh -Y" --leave-session-attached --xterm 0,1,2,3!
./HelloMPI
The xterms are also opened if i do not use the '!' hold option.
What does *not* work is
jody@aim-triops ~/share/neander $ mpirun -np 2 -host squid_0 -mca
plm_rsh_agent "ssh -Y" --leave-session-attached xterm
xterm Xt error: Can't open display:
xterm: DISPLAY is not set
xterm Xt error: Can't open display:
xterm: DISPLAY is not set
But then again, this call works (i.e. an xterm is opened) if all the
debug options are used (ompidbg_4.txt).
Here the '--leave-session-attached' option is necessary - without it, no
xterm appears.
From these results i would say that there is no basic mishandling of
'ssh', though i have no idea what internal differences the use of the
'--leave-session-attached' option or the debug options make.
I hope these observations are helpful.
Jody
On Fri, Apr 29, 2011 at 12:08 AM, jody <[email protected]> wrote:
> Hi Ralph
>
> Thank you for your suggestions.
> I'll be happy to help you.
> I'm not sure if i'll get around to this tomorrow,
> but i certainly will do so on Monday.
>
> Thanks
> Jody
>
> On Thu, Apr 28, 2011 at 11:53 PM, Ralph Castain <[email protected]> wrote:
>> Hi Jody
>>
>> I'm not sure when I'll get a chance to work on this - got a deadline to
>> meet. I do have a couple of suggestions, if you wouldn't mind helping debug
>> the problem?
>>
>> It looks to me like the problem is that mpirun is crashing or terminating
>> early for some reason - hence the failures to send msgs to it, and the
>> "lifeline lost" error that leads to the termination of the daemon. If you
>> build a debug version of the code (i.e., --enable-debug on configure), you
>> can get a lot of debug info that traces the behavior.
>>
>> If you could then run your program with
>>
>> -mca plm_base_verbose 5 -mca odls_base_verbose 5 --leave-session-attached
>>
>> and send it to me, we'll see what ORTE thinks it is doing.
>>
>> You could also take a look at the code for implementing the xterm option.
>> You'll find it in
>>
>> orte/mca/odls/base/odls_base_default_fns.c
>>
>> around line 1115. The xterm command syntax is defined in
>>
>> orte/mca/odls/base/odls_base_open.c
>>
>> around line 233 and following. Note that we use "xterm -T" as the cmd.
>> Perhaps you can spot an error in the way we treat xterm?
>>
>> Also, remember that you have to specify that you want us to "hold" the xterm
>> window open even after the process terminates. If you don't specify it, the
>> window automatically closes upon completion of the process. So a
>> fast-running cmd like "hostname" might disappear so quickly that it causes a
>> race condition problem.
>>
>> You might want to try a spinner application - i.e., output something and
>> then sit in a loop or sleep for some period of time. Or, use the "hold"
>> option to keep the window open - you designate "hold" by putting a '!'
>> before the rank, e.g., "mpirun -np 2 -xterm \!2 hostname"
>>
>>
>> On Apr 28, 2011, at 8:38 AM, jody wrote:
>>
>>> Hi
>>>
>>> Unfortunately this does not solve my problem.
>>> While i can do
>>> ssh -Y squid_0 xterm
>>> and this will open an xterm on my machine (chefli),
>>> i run into problems with the -xterm option of openmpi:
>>>
>>> jody@chefli ~/share/neander $ mpirun -np 4 -mca plm_rsh_agent "ssh
>>> -Y" -host squid_0 --xterm 1 hostname
>>> squid_0
>>> [squid_0:28046] [[35219,0],1]->[[35219,0],0]
>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>> [sd = 8]
>>> [squid_0:28046] [[35219,0],1] routed:binomial: Connection to
>>> lifeline [[35219,0],0] lost
>>> [squid_0:28046] [[35219,0],1]->[[35219,0],0]
>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>> [sd = 8]
>>> [squid_0:28046] [[35219,0],1] routed:binomial: Connection to
>>> lifeline [[35219,0],0] lost
>>> /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>
>>> By the way, when i look at the DISPLAY variable in the xterm window
>>> opened via squid_0, it is also set to "localhost:11.0"
>>>
>>> Actually, the difference when using the "-mca plm_rsh_agent" option is
>>> that the lines with the warnings about "xauth" and "untrusted X" do not
>>> appear:
>>>
>>> jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -xterm 1
>>> hostname
>>> Warning: untrusted X11 forwarding setup failed: xauth key data not
>>> generated
>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>> squid_0
>>> [squid_0:28337] [[34926,0],1]->[[34926,0],0]
>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>> [sd = 8]
>>> [squid_0:28337] [[34926,0],1] routed:binomial: Connection to
>>> lifeline [[34926,0],0] lost
>>> [squid_0:28337] [[34926,0],1]->[[34926,0],0]
>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>> [sd = 8]
>>> [squid_0:28337] [[34926,0],1] routed:binomial: Connection to
>>> lifeline [[34926,0],0] lost
>>> /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>
>>>
>>> I have doubts that the "-Y" is passed correctly:
>>> jody@triops ~/share/neander $ mpirun -np -mca plm_rsh_agent "ssh
>>> -Y" -host squid_0 xterm
>>> xterm Xt error: Can't open display:
>>> xterm: DISPLAY is not set
>>> xterm Xt error: Can't open display:
>>> xterm: DISPLAY is not set
>>>
>>>
>>> ---> as a matter of fact i noticed that the xterm option doesn't work
>>> locally:
>>> mpirun -np 4 -xterm 1 /usr/bin/printenv
>>> prints everything onto the console.
>>>
>>> Do you have any other suggestions i could try?
>>>
>>> Thank You
>>> Jody
>>>
>>> On Thu, Apr 28, 2011 at 3:06 PM, Ralph Castain <[email protected]> wrote:
>>>> Should be able to just set
>>>>
>>>> -mca plm_rsh_agent "ssh -Y"
>>>>
>>>> on your cmd line, I believe
>>>>
>>>> On Apr 28, 2011, at 12:53 AM, jody wrote:
>>>>
>>>>> Hi Ralph
>>>>>
>>>>> Is there an easy way i could modify the OpenMPI code so that it would use
>>>>> the -Y option for ssh when connecting to remote machines?
>>>>>
>>>>> Thank You
>>>>> Jody
>>>>>
>>>>> On Thu, Apr 7, 2011 at 4:01 PM, jody <[email protected]> wrote:
>>>>>> Hi Ralph
>>>>>> thank you for your suggestions. After some fiddling, i found that after
>>>>>> my
>>>>>> last update (gentoo) my sshd_config had been overwritten
>>>>>> (X11Forwarding was set to 'no').
>>>>>>
>>>>>> After correcting that, i can now open remote terminals with 'ssh -Y'
>>>>>> and with 'ssh -X'
>>>>>> (but with '-X' i still get those xauth warnings)
>>>>>>
>>>>>> But the xterm option still doesn't work:
>>>>>> jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -xterm 1,2
>>>>>> printenv | grep WORLD_RANK
>>>>>> Warning: untrusted X11 forwarding setup failed: xauth key data not
>>>>>> generated
>>>>>> Warning: No xauth data; using fake authentication data for X11
>>>>>> forwarding.
>>>>>> /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>>>> /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>>>> OMPI_COMM_WORLD_RANK=0
>>>>>> [aim-squid_0:09856] [[54132,0],1]->[[54132,0],0]
>>>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>>>> [sd = 8]
>>>>>> [aim-squid_0:09856] [[54132,0],1] routed:binomial: Connection to
>>>>>> lifeline [[54132,0],0] lost
>>>>>>
>>>>>> So it looks like the two processes from squid_0 can't open the display
>>>>>> this way,
>>>>>> but one of them writes the output to the console...
>>>>>> Surprisingly, they are trying 'localhost:11.0' whereas when i use 'ssh
>>>>>> -Y' the
>>>>>> DISPLAY variable is set to 'localhost:10.0'
>>>>>>
>>>>>> So in what way would OMPI have to be adapted, so -xterm would work?
>>>>>>
>>>>>> Thank You
>>>>>> Jody
>>>>>>
>>>>>> On Wed, Apr 6, 2011 at 8:32 PM, Ralph Castain <[email protected]> wrote:
>>>>>>> Here's a little more info - it's for Cygwin, but I don't see anything
>>>>>>> Cygwin-specific in the answers:
>>>>>>> http://x.cygwin.com/docs/faq/cygwin-x-faq.html#q-ssh-no-x11forwarding
>>>>>>>
>>>>>>> On Apr 6, 2011, at 12:30 PM, Ralph Castain wrote:
>>>>>>>
>>>>>>> Sorry Jody - I should have read your note more carefully to see that you
>>>>>>> already tried -Y. :-(
>>>>>>> Not sure what to suggest...
>>>>>>>
>>>>>>> On Apr 6, 2011, at 12:29 PM, Ralph Castain wrote:
>>>>>>>
>>>>>>> Like I said, I'm no expert. However, a quick "google" revealed this
>>>>>>> result:
>>>>>>>
>>>>>>> When trying to set up x11 forwarding over an ssh session to a remote
>>>>>>> server
>>>>>>> with the -X switch, I was getting an error like Warning: No xauth
>>>>>>> data; using fake authentication data for X11 forwarding.
>>>>>>>
>>>>>>> When doing something like:
>>>>>>> ssh -Xl root 10.1.1.9 to a remote server, the authentication worked,
>>>>>>> but I
>>>>>>> got an error message like:
>>>>>>>
>>>>>>>
>>>>>>> jason@badman ~/bin $ ssh -Xl root 10.1.1.9
>>>>>>> Warning: untrusted X11 forwarding setup failed: xauth key data not
>>>>>>> generated
>>>>>>> Warning: No xauth data; using fake authentication data for X11
>>>>>>> forwarding.
>>>>>>> Last login: Wed Apr 14 18:18:39 2010 from 10.1.1.5
>>>>>>> [root@RHEL ~]#
>>>>>>> and any X programs I ran would not display on my local system.
>>>>>>>
>>>>>>> Turns out the solution is to use the -Y switch instead.
>>>>>>>
>>>>>>> ssh -Yl root 10.1.1.9
>>>>>>>
>>>>>>> and that worked fine.
>>>>>>>
>>>>>>> See if that works for you - if it does, we may have to modify OMPI to
>>>>>>> accommodate.
>>>>>>>
>>>>>>> On Apr 6, 2011, at 9:19 AM, jody wrote:
>>>>>>>
>>>>>>> Hi Ralph
>>>>>>> No, after the above error message mpirun has exited.
>>>>>>>
>>>>>>> But i also noticed that it is possible to ssh into squid_0 and open
>>>>>>> an xterm there:
>>>>>>>
>>>>>>> jody@chefli ~/share/neander $ ssh -Y squid_0
>>>>>>> Last login: Wed Apr 6 17:14:02 CEST 2011 from chefli.uzh.ch on pts/0
>>>>>>> jody@squid_0 ~ $ xterm
>>>>>>> xterm Xt error: Can't open display:
>>>>>>> xterm: DISPLAY is not set
>>>>>>> jody@squid_0 ~ $ export DISPLAY=130.60.126.74:0.0
>>>>>>> jody@squid_0 ~ $ xterm
>>>>>>> xterm Xt error: Can't open display: 130.60.126.74:0.0
>>>>>>> jody@squid_0 ~ $ export DISPLAY=chefli.uzh.ch:0.0
>>>>>>> jody@squid_0 ~ $ xterm
>>>>>>> xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>> jody@squid_0 ~ $ exit
>>>>>>> logout
>>>>>>>
>>>>>>> same thing with ssh -X, but here i get the same warning/error message
>>>>>>> as with mpirun:
>>>>>>>
>>>>>>> jody@chefli ~/share/neander $ ssh -X squid_0
>>>>>>> Warning: untrusted X11 forwarding setup failed: xauth key data not
>>>>>>> generated
>>>>>>> Warning: No xauth data; using fake authentication data for X11
>>>>>>> forwarding.
>>>>>>> Last login: Wed Apr 6 17:12:31 CEST 2011 from chefli.uzh.ch on ssh
>>>>>>>
>>>>>>> So perhaps the whole problem is linked to that xauth-thing.
>>>>>>> Do you have a suggestion how this can be solved?
>>>>>>>
>>>>>>> Thank You
>>>>>>> Jody
>>>>>>>
>>>>>>> On Wed, Apr 6, 2011 at 4:41 PM, Ralph Castain <[email protected]> wrote:
>>>>>>>
>>>>>>> If I read your error messages correctly, it looks like mpirun is
>>>>>>> crashing -
>>>>>>> the daemon is complaining that it lost the socket connection back to
>>>>>>> mpirun,
>>>>>>> and hence will abort.
>>>>>>>
>>>>>>> Are you seeing mpirun still alive?
>>>>>>>
>>>>>>>
>>>>>>> On Apr 5, 2011, at 4:46 AM, jody wrote:
>>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> On my workstation and the cluster i set up OpenMPI (v 1.4.2) so that
>>>>>>>
>>>>>>> it works in "text-mode":
>>>>>>>
>>>>>>> $ mpirun -np 4 -x DISPLAY -host squid_0 printenv | grep WORLD_RANK
>>>>>>>
>>>>>>> OMPI_COMM_WORLD_RANK=0
>>>>>>>
>>>>>>> OMPI_COMM_WORLD_RANK=1
>>>>>>>
>>>>>>> OMPI_COMM_WORLD_RANK=2
>>>>>>>
>>>>>>> OMPI_COMM_WORLD_RANK=3
>>>>>>>
>>>>>>> but when i use the -xterm option to mpirun, it doesn't work
>>>>>>>
>>>>>>> $ mpirun -np 4 -x DISPLAY -host squid_0 -xterm 1,2 printenv | grep
>>>>>>> WORLD_RANK
>>>>>>>
>>>>>>> Warning: untrusted X11 forwarding setup failed: xauth key data not
>>>>>>> generated
>>>>>>>
>>>>>>> Warning: No xauth data; using fake authentication data for X11
>>>>>>> forwarding.
>>>>>>>
>>>>>>> OMPI_COMM_WORLD_RANK=0
>>>>>>>
>>>>>>> [squid_0:05266] [[55607,0],1]->[[55607,0],0]
>>>>>>>
>>>>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>>>>>
>>>>>>> [sd = 8]
>>>>>>>
>>>>>>> [squid_0:05266] [[55607,0],1] routed:binomial: Connection to
>>>>>>>
>>>>>>> lifeline [[55607,0],0] lost
>>>>>>>
>>>>>>> /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>>
>>>>>>> /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>>
>>>>>>> (strange: somebody wrote his message to the console)
>>>>>>>
>>>>>>> No matter whether i set the DISPLAY variable to the full hostname of
>>>>>>>
>>>>>>> the workstation,
>>>>>>>
>>>>>>> to the IP-Adress of the workstation or simply to ":0.0", it doesn't work
>>>>>>>
>>>>>>> But i do have xauth data (as far as i know):
>>>>>>>
>>>>>>> On the remote (squid_0):
>>>>>>>
>>>>>>> jody@squid_0 ~ $ xauth list
>>>>>>>
>>>>>>> chefli/unix:10 MIT-MAGIC-COOKIE-1 5293e179bc7b2036d87cbcdf14891d0c
>>>>>>>
>>>>>>> chefli/unix:0 MIT-MAGIC-COOKIE-1 146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>
>>>>>>> chefli.uzh.ch:0 MIT-MAGIC-COOKIE-1 146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>
>>>>>>> on the workstation:
>>>>>>>
>>>>>>> $ xauth list
>>>>>>>
>>>>>>> chefli/unix:10 MIT-MAGIC-COOKIE-1 5293e179bc7b2036d87cbcdf14891d0c
>>>>>>>
>>>>>>> chefli/unix:0 MIT-MAGIC-COOKIE-1 146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>
>>>>>>> localhost.localdomain/unix:0 MIT-MAGIC-COOKIE-1
>>>>>>>
>>>>>>> 146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>
>>>>>>> chefli.uzh.ch/unix:0 MIT-MAGIC-COOKIE-1
>>>>>>> 146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>
>>>>>>> In sshd_config on the workstation i have 'X11Forwarding yes'
>>>>>>>
>>>>>>> I have also done
>>>>>>>
>>>>>>> xhost + squid_0
>>>>>>>
>>>>>>> on the workstation.
>>>>>>>
>>>>>>>
>>>>>>> How can i get the -xterm option running?
>>>>>>>
>>>>>>> Thank You
>>>>>>>
>>>>>>> Jody
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>>
>>>>>>> users mailing list
>>>>>>>
>>>>>>> [email protected]
>>>>>>>
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>
jody@chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca plm_rsh_agent
"ssh -Y" -mca plm_base_verbose 5 -mca odls_base_verbose 5
--leave-session-attached --xterm 0 ./HelloMPI
[chefli:02420] mca:base:select:( plm) Querying component [rsh]
[chefli:02420] mca:base:select:( plm) Query of component [rsh] set priority to
10
[chefli:02420] mca:base:select:( plm) Querying component [slurm]
[chefli:02420] mca:base:select:( plm) Skipping component [slurm]. Query failed
to return a module
[chefli:02420] mca:base:select:( plm) Selected component [rsh]
[chefli:02420] plm:base:set_hnp_name: initial bias 2420 nodename hash 72192778
[chefli:02420] plm:base:set_hnp_name: final jobfam 40499
[chefli:02420] [[40499,0],0] plm:base:receive start comm
[chefli:02420] mca:base:select:( odls) Querying component [default]
[chefli:02420] mca:base:select:( odls) Query of component [default] set
priority to 1
[chefli:02420] mca:base:select:( odls) Selected component [default]
[chefli:02420] [[40499,0],0] plm:rsh: setting up job [40499,1]
[chefli:02420] [[40499,0],0] plm:base:setup_job for job [40499,1]
[chefli:02420] [[40499,0],0] plm:rsh: local shell: 0 (bash)
[chefli:02420] [[40499,0],0] plm:rsh: assuming same remote shell as local shell
[chefli:02420] [[40499,0],0] plm:rsh: remote shell: 0 (bash)
[chefli:02420] [[40499,0],0] plm:rsh: final template argv:
/usr/bin/ssh -Y -X <template> orted -mca ess env -mca orte_ess_jobid
2654142464 -mca orte_ess_vpid <template> -mca orte_ess_num_procs 2 --hnp-uri
"2654142464.0;tcp://192.168.0.14:39093" -mca plm_base_verbose 5 -mca
odls_base_verbose 5 --xterm 0 -mca plm_rsh_agent "ssh -Y"
[chefli:02420] [[40499,0],0] plm:rsh: launching on node squid_0
[chefli:02420] [[40499,0],0] plm:rsh: recording launch of daemon [[40499,0],1]
[chefli:02420] [[40499,0],0] plm:rsh: executing: (//usr/bin/ssh) [/usr/bin/ssh
-Y -X squid_0 orted -mca ess env -mca orte_ess_jobid 2654142464 -mca
orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri
"2654142464.0;tcp://192.168.0.14:39093" -mca plm_base_verbose 5 -mca
odls_base_verbose 5 --xterm 0 -mca plm_rsh_agent "ssh -Y"]
[squid_0:19442] mca:base:select:( odls) Querying component [default]
[squid_0:19442] mca:base:select:( odls) Query of component [default] set
priority to 1
[squid_0:19442] mca:base:select:( odls) Selected component [default]
[chefli:02420] [[40499,0],0] plm:base:daemon_callback
[chefli:02420] [[40499,0],0] plm:base:orted_report_launch from daemon
[[40499,0],1]
[chefli:02420] [[40499,0],0] plm:base:orted_report_launch completed for daemon
[[40499,0],1]
[chefli:02420] [[40499,0],0] plm:base:daemon_callback completed
[chefli:02420] [[40499,0],0] plm:base:launch_apps for job [40499,1]
[chefli:02420] [[40499,0],0] plm:base:report_launched for job [40499,1]
[chefli:02420] [[40499,0],0] odls:constructing child list
[chefli:02420] [[40499,0],0] odls:construct_child_list unpacking data to launch
job [40499,1]
[chefli:02420] [[40499,0],0] odls:construct_child_list adding new jobdat for
job [40499,1]
[chefli:02420] [[40499,0],0] odls:construct_child_list unpacking 1 app_contexts
[chefli:02420] [[40499,0],0] odls:constructing child list - checking proc 0 on
node 1 with daemon 1
[chefli:02420] [[40499,0],0] odls:construct:child: num_participating 1
[chefli:02420] [[40499,0],0] odls:launch found 12 processors for 0 children and
set oversubscribed to false
[chefli:02420] [[40499,0],0] odls:launch reporting job [40499,1] launch status
[chefli:02420] [[40499,0],0] odls:launch setting waitpids
[chefli:02420] [[40499,0],0] plm:base:app_report_launch from daemon
[[40499,0],0]
[chefli:02420] [[40499,0],0] plm:base:app_report_launch completed processing
[squid_0:19442] [[40499,0],1] odls:constructing child list
[squid_0:19442] [[40499,0],1] odls:construct_child_list unpacking data to
launch job [40499,1]
[squid_0:19442] [[40499,0],1] odls:construct_child_list adding new jobdat for
job [40499,1]
[squid_0:19442] [[40499,0],1] odls:construct_child_list unpacking 1 app_contexts
[squid_0:19442] [[40499,0],1] odls:constructing child list - checking proc 0 on
node 1 with daemon 1
[squid_0:19442] [[40499,0],1] odls:constructing child list - found proc 0 for
me!
[squid_0:19442] [[40499,0],1] odls:construct:child: num_participating 1
[squid_0:19442] [[40499,0],1] odls:launch found 4 processors for 1 children and
set oversubscribed to false
[squid_0:19442] [[40499,0],1] odls:launch reporting job [40499,1] launch status
[squid_0:19442] [[40499,0],1] odls:launch setting waitpids
[chefli:02420] [[40499,0],0] plm:base:app_report_launch reissuing non-blocking
recv
[chefli:02420] [[40499,0],0] plm:base:app_report_launch from daemon
[[40499,0],1]
[chefli:02420] [[40499,0],0] plm:base:app_report_launched for proc
[[40499,1],0] from daemon [[40499,0],1]: pid 19446 state 2 exit 0
[chefli:02420] [[40499,0],0] plm:base:app_report_launch completed processing
[chefli:02420] [[40499,0],0] plm:base:report_launched all apps reported
[chefli:02420] [[40499,0],0] plm:base:launch wiring up iof
[chefli:02420] [[40499,0],0] plm:base:launch completed for job [40499,1]
[squid_0:19442] [[40499,0],1] odls: registering sync on child [[40499,1],0]
[squid_0:19442] [[40499,0],1] odls:sync nidmap requested for job [40499,1]
[squid_0:19442] [[40499,0],1] odls: sending sync ack to child [[40499,1],0]
with 144 bytes of data
[squid_0:19442] [[40499,0],1] odls: sending contact info to HNP
[squid_0:19442] [[40499,0],1] odls: collecting data from child [[40499,1],0]
[squid_0:19442] [[40499,0],1] odls: executing collective
[squid_0:19442] [[40499,0],1] odls: daemon collective called
[squid_0:19442] [[40499,0],1] odls: daemon collective for job [40499,1] from
[[40499,0],1] type 2 num_collected 1 num_participating 1 num_contributors 1
[squid_0:19442] [[40499,0],1] odls: daemon collective not the HNP - sending to
parent [[40499,0],0]
[squid_0:19442] [[40499,0],1] odls: collective completed
[chefli:02420] [[40499,0],0] odls: daemon collective called
[chefli:02420] [[40499,0],0] odls: daemon collective for job [40499,1] from
[[40499,0],1] type 2 num_collected 1 num_participating 1 num_contributors 1
[chefli:02420] [[40499,0],0] odls: daemon collective HNP - xcasting to job
[40499,1]
[squid_0:19442] [[40499,0],1] odls: sending message to tag 15 on child
[[40499,1],0]
[squid_0:19442] [[40499,0],1] odls: collecting data from child [[40499,1],0]
[squid_0:19442] [[40499,0],1] odls: executing collective
[squid_0:19442] [[40499,0],1] odls: daemon collective called
[squid_0:19442] [[40499,0],1] odls: daemon collective for job [40499,1] from
[[40499,0],1] type 1 num_collected 1 num_participating 1 num_contributors 1
[squid_0:19442] [[40499,0],1] odls: daemon collective not the HNP - sending to
parent [[40499,0],0]
[squid_0:19442] [[40499,0],1] odls: collective completed
[chefli:02420] [[40499,0],0] odls: daemon collective called
[chefli:02420] [[40499,0],0] odls: daemon collective for job [40499,1] from
[[40499,0],1] type 1 num_collected 1 num_participating 1 num_contributors 1
[chefli:02420] [[40499,0],0] odls: daemon collective HNP - xcasting to job
[40499,1]
[squid_0:19442] [[40499,0],1] odls: sending message to tag 17 on child
[[40499,1],0]
[squid_0:19442] [[40499,0],1] odls: collecting data from child [[40499,1],0]
[squid_0:19442] [[40499,0],1] odls: executing collective
[squid_0:19442] [[40499,0],1] odls: daemon collective called
[squid_0:19442] [[40499,0],1] odls: daemon collective for job [40499,1] from
[[40499,0],1] type 1 num_collected 1 num_participating 1 num_contributors 1
[squid_0:19442] [[40499,0],1] odls: daemon collective not the HNP - sending to
parent [[40499,0],0]
[squid_0:19442] [[40499,0],1] odls: collective completed
[chefli:02420] [[40499,0],0] odls: daemon collective called
[chefli:02420] [[40499,0],0] odls: daemon collective for job [40499,1] from
[[40499,0],1] type 1 num_collected 1 num_participating 1 num_contributors 1
[chefli:02420] [[40499,0],0] odls: daemon collective HNP - xcasting to job
[40499,1]
[squid_0:19442] [[40499,0],1] odls: sending message to tag 17 on child
[[40499,1],0]
[squid_0:19442] [[40499,0],1] odls: registering sync on child [[40499,1],0]
[squid_0:19442] [[40499,0],1] odls: sending sync ack to child [[40499,1],0]
with 0 bytes of data
[chefli:02420] [[40499,0],0] plm:base:receive got message from [[40499,0],1]
[chefli:02420] [[40499,0],0] plm:base:receive got update_proc_state for job
[40499,1]
[chefli:02420] [[40499,0],0] plm:base:receive got update_proc_state for proc
[[40499,1],0] curnt state 4 new state 80 exit_code 0
[chefli:02420] [[40499,0],0] plm:base:check_job_completed for job [40499,1] -
num_terminated 1 num_procs 1
[chefli:02420] [[40499,0],0] plm:base:check_job_completed declared job
[40499,1] normally terminated - checking all jobs
[chefli:02420] [[40499,0],0] plm:base:check_job_completed all jobs terminated -
waking up
[chefli:02420] [[40499,0],0] plm:base:orted_cmd sending orted_exit commands
[chefli:02420] [[40499,0],0] odls:kill_local_proc working on job [WILDCARD]
[chefli:02420] [[40499,0],0] plm:base:check_job_completed for job [40499,0] -
num_terminated 1 num_procs 2
[squid_0:19442] [[40499,0],1] odls:wait_local_proc child process 19446
terminated
[squid_0:19442] [[40499,0],1] odls:notify_iof_complete for child [[40499,1],0]
[squid_0:19442] [[40499,0],1] odls:waitpid_fired checking abort file
/tmp/openmpi-sessions-jody@squid_0_0/2654142465/0/abort
[chefli:02420] [[40499,0],0] plm:base:receive got message from [[40499,0],1]
[chefli:02420] [[40499,0],0] plm:base:receive got update_proc_state for job
[40499,0]
[chefli:02420] [[40499,0],0] plm:base:receive got update_proc_state for proc
[[40499,0],1] curnt state 4 new state 80 exit_code 0
[chefli:02420] [[40499,0],0] plm:base:check_job_completed for job [40499,0] -
num_terminated 2 num_procs 2
[chefli:02420] [[40499,0],0] plm:base:check_job_completed declared job
[40499,0] normally terminated - checking all jobs
[chefli:02420] [[40499,0],0] plm:base:receive stop comm
[squid_0:19442] [[40499,0],1] odls:waitpid_fired child process [[40499,1],0]
terminated normally
[squid_0:19442] [[40499,0],1] odls:proc_complete reporting all procs in
[40499,1] terminated
[squid_0:19442] [[40499,0],1] odls:kill_local_proc working on job [WILDCARD]
jody@chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca plm_rsh_agent
"ssh -Y" -mca plm_base_verbose 5 -mca odls_base_verbose 5
--leave-session-attached --xterm 0 hostname
[chefli:02476] mca:base:select:( plm) Querying component [rsh]
[chefli:02476] mca:base:select:( plm) Query of component [rsh] set priority to
10
[chefli:02476] mca:base:select:( plm) Querying component [slurm]
[chefli:02476] mca:base:select:( plm) Skipping component [slurm]. Query failed
to return a module
[chefli:02476] mca:base:select:( plm) Selected component [rsh]
[chefli:02476] plm:base:set_hnp_name: initial bias 2476 nodename hash 72192778
[chefli:02476] plm:base:set_hnp_name: final jobfam 40683
[chefli:02476] [[40683,0],0] plm:base:receive start comm
[chefli:02476] mca:base:select:( odls) Querying component [default]
[chefli:02476] mca:base:select:( odls) Query of component [default] set
priority to 1
[chefli:02476] mca:base:select:( odls) Selected component [default]
[chefli:02476] [[40683,0],0] plm:rsh: setting up job [40683,1]
[chefli:02476] [[40683,0],0] plm:base:setup_job for job [40683,1]
[chefli:02476] [[40683,0],0] plm:rsh: local shell: 0 (bash)
[chefli:02476] [[40683,0],0] plm:rsh: assuming same remote shell as local shell
[chefli:02476] [[40683,0],0] plm:rsh: remote shell: 0 (bash)
[chefli:02476] [[40683,0],0] plm:rsh: final template argv:
/usr/bin/ssh -Y -X <template> orted -mca ess env -mca orte_ess_jobid
2666201088 -mca orte_ess_vpid <template> -mca orte_ess_num_procs 2 --hnp-uri
"2666201088.0;tcp://192.168.0.14:53879" -mca plm_base_verbose 5 -mca
odls_base_verbose 5 --xterm 0 -mca plm_rsh_agent "ssh -Y"
[chefli:02476] [[40683,0],0] plm:rsh: launching on node squid_0
[chefli:02476] [[40683,0],0] plm:rsh: recording launch of daemon [[40683,0],1]
[chefli:02476] [[40683,0],0] plm:rsh: executing: (//usr/bin/ssh) [/usr/bin/ssh
-Y -X squid_0 orted -mca ess env -mca orte_ess_jobid 2666201088 -mca
orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri
"2666201088.0;tcp://192.168.0.14:53879" -mca plm_base_verbose 5 -mca
odls_base_verbose 5 --xterm 0 -mca plm_rsh_agent "ssh -Y"]
[squid_0:19579] mca:base:select:( odls) Querying component [default]
[squid_0:19579] mca:base:select:( odls) Query of component [default] set
priority to 1
[squid_0:19579] mca:base:select:( odls) Selected component [default]
[chefli:02476] [[40683,0],0] plm:base:daemon_callback
[chefli:02476] [[40683,0],0] plm:base:orted_report_launch from daemon
[[40683,0],1]
[chefli:02476] [[40683,0],0] plm:base:orted_report_launch completed for daemon
[[40683,0],1]
[chefli:02476] [[40683,0],0] plm:base:daemon_callback completed
[chefli:02476] [[40683,0],0] plm:base:launch_apps for job [40683,1]
[chefli:02476] [[40683,0],0] plm:base:report_launched for job [40683,1]
[chefli:02476] [[40683,0],0] odls:constructing child list
[chefli:02476] [[40683,0],0] odls:construct_child_list unpacking data to launch
job [40683,1]
[chefli:02476] [[40683,0],0] odls:construct_child_list adding new jobdat for
job [40683,1]
[chefli:02476] [[40683,0],0] odls:construct_child_list unpacking 1 app_contexts
[chefli:02476] [[40683,0],0] odls:constructing child list - checking proc 0 on
node 1 with daemon 1
[chefli:02476] [[40683,0],0] odls:construct:child: num_participating 1
[chefli:02476] [[40683,0],0] odls:launch found 12 processors for 0 children and
set oversubscribed to false
[chefli:02476] [[40683,0],0] odls:launch reporting job [40683,1] launch status
[chefli:02476] [[40683,0],0] odls:launch setting waitpids
[chefli:02476] [[40683,0],0] plm:base:app_report_launch from daemon
[[40683,0],0]
[chefli:02476] [[40683,0],0] plm:base:app_report_launch completed processing
[squid_0:19579] [[40683,0],1] odls:constructing child list
[squid_0:19579] [[40683,0],1] odls:construct_child_list unpacking data to
launch job [40683,1]
[squid_0:19579] [[40683,0],1] odls:construct_child_list adding new jobdat for
job [40683,1]
[squid_0:19579] [[40683,0],1] odls:construct_child_list unpacking 1 app_contexts
[squid_0:19579] [[40683,0],1] odls:constructing child list - checking proc 0 on
node 1 with daemon 1
[squid_0:19579] [[40683,0],1] odls:constructing child list - found proc 0 for
me!
[squid_0:19579] [[40683,0],1] odls:construct:child: num_participating 1
[squid_0:19579] [[40683,0],1] odls:launch found 4 processors for 1 children and
set oversubscribed to false
[squid_0:19579] [[40683,0],1] odls:launch reporting job [40683,1] launch status
[squid_0:19579] [[40683,0],1] odls:launch setting waitpids
[chefli:02476] [[40683,0],0] plm:base:app_report_launch reissuing non-blocking
recv
[chefli:02476] [[40683,0],0] plm:base:app_report_launch from daemon
[[40683,0],1]
[chefli:02476] [[40683,0],0] plm:base:app_report_launched for proc
[[40683,1],0] from daemon [[40683,0],1]: pid 19583 state 2 exit 0
[chefli:02476] [[40683,0],0] plm:base:app_report_launch completed processing
[chefli:02476] [[40683,0],0] plm:base:report_launched all apps reported
[chefli:02476] [[40683,0],0] plm:base:launch wiring up iof
[chefli:02476] [[40683,0],0] plm:base:launch completed for job [40683,1]
[squid_0:19579] [[40683,0],1] odls:wait_local_proc child process 19583
terminated
[squid_0:19579] [[40683,0],1] odls:waitpid_fired checking abort file
/tmp/openmpi-sessions-jody@squid_0_0/2666201089/0/abort
[squid_0:19579] [[40683,0],1] odls:waitpid_fired child process [[40683,1],0]
terminated normally
[squid_0:19579] [[40683,0],1] odls:notify_iof_complete for child [[40683,1],0]
[chefli:02476] [[40683,0],0] plm:base:receive got message from [[40683,0],1]
[chefli:02476] [[40683,0],0] plm:base:receive got update_proc_state for job
[40683,1]
[chefli:02476] [[40683,0],0] plm:base:receive got update_proc_state for proc
[[40683,1],0] curnt state 2 new state 80 exit_code 0
[chefli:02476] [[40683,0],0] plm:base:check_job_completed for job [40683,1] -
num_terminated 1 num_procs 1
[chefli:02476] [[40683,0],0] plm:base:check_job_completed declared job
[40683,1] normally terminated - checking all jobs
[chefli:02476] [[40683,0],0] plm:base:check_job_completed all jobs terminated -
waking up
[chefli:02476] [[40683,0],0] plm:base:orted_cmd sending orted_exit commands
[chefli:02476] [[40683,0],0] odls:kill_local_proc working on job [WILDCARD]
[chefli:02476] [[40683,0],0] plm:base:check_job_completed for job [40683,0] -
num_terminated 1 num_procs 2
[squid_0:19579] [[40683,0],1] odls:proc_complete reporting all procs in
[40683,1] terminated
[chefli:02476] [[40683,0],0] plm:base:receive got message from [[40683,0],1]
[chefli:02476] [[40683,0],0] plm:base:receive got update_proc_state for job
[40683,0]
[chefli:02476] [[40683,0],0] plm:base:receive got update_proc_state for proc
[[40683,0],1] curnt state 4 new state 80 exit_code 0
[chefli:02476] [[40683,0],0] plm:base:check_job_completed for job [40683,0] -
num_terminated 2 num_procs 2
[chefli:02476] [[40683,0],0] plm:base:check_job_completed declared job
[40683,0] normally terminated - checking all jobs
[chefli:02476] [[40683,0],0] plm:base:receive stop comm
[squid_0:19579] [[40683,0],1] odls:kill_local_proc working on job [WILDCARD]
jody@chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca plm_rsh_agent "ssh -Y" -mca plm_base_verbose 5 -mca odls_base_verbose 5 --leave-session-attached --xterm 0! hostname
[chefli:02487] mca:base:select:( plm) Querying component [rsh]
[chefli:02487] mca:base:select:( plm) Query of component [rsh] set priority to
10
[chefli:02487] mca:base:select:( plm) Querying component [slurm]
[chefli:02487] mca:base:select:( plm) Skipping component [slurm]. Query failed
to return a module
[chefli:02487] mca:base:select:( plm) Selected component [rsh]
[chefli:02487] plm:base:set_hnp_name: initial bias 2487 nodename hash 72192778
[chefli:02487] plm:base:set_hnp_name: final jobfam 40688
[chefli:02487] [[40688,0],0] plm:base:receive start comm
[chefli:02487] mca:base:select:( odls) Querying component [default]
[chefli:02487] mca:base:select:( odls) Query of component [default] set
priority to 1
[chefli:02487] mca:base:select:( odls) Selected component [default]
[chefli:02487] [[40688,0],0] plm:rsh: setting up job [40688,1]
[chefli:02487] [[40688,0],0] plm:base:setup_job for job [40688,1]
[chefli:02487] [[40688,0],0] plm:rsh: local shell: 0 (bash)
[chefli:02487] [[40688,0],0] plm:rsh: assuming same remote shell as local shell
[chefli:02487] [[40688,0],0] plm:rsh: remote shell: 0 (bash)
[chefli:02487] [[40688,0],0] plm:rsh: final template argv:
    /usr/bin/ssh -Y -X <template> orted -mca ess env -mca orte_ess_jobid 2666528768 -mca orte_ess_vpid <template> -mca orte_ess_num_procs 2 --hnp-uri "2666528768.0;tcp://192.168.0.14:36402" -mca plm_base_verbose 5 -mca odls_base_verbose 5 --xterm 0! -mca plm_rsh_agent "ssh -Y"
[chefli:02487] [[40688,0],0] plm:rsh: launching on node squid_0
[chefli:02487] [[40688,0],0] plm:rsh: recording launch of daemon [[40688,0],1]
[chefli:02487] [[40688,0],0] plm:rsh: executing: (//usr/bin/ssh) [/usr/bin/ssh -Y -X squid_0 orted -mca ess env -mca orte_ess_jobid 2666528768 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri "2666528768.0;tcp://192.168.0.14:36402" -mca plm_base_verbose 5 -mca odls_base_verbose 5 --xterm 0! -mca plm_rsh_agent "ssh -Y"]
[squid_0:19613] mca:base:select:( odls) Querying component [default]
[squid_0:19613] mca:base:select:( odls) Query of component [default] set
priority to 1
[squid_0:19613] mca:base:select:( odls) Selected component [default]
[chefli:02487] [[40688,0],0] plm:base:daemon_callback
[chefli:02487] [[40688,0],0] plm:base:orted_report_launch from daemon
[[40688,0],1]
[chefli:02487] [[40688,0],0] plm:base:orted_report_launch completed for daemon
[[40688,0],1]
[chefli:02487] [[40688,0],0] plm:base:daemon_callback completed
[chefli:02487] [[40688,0],0] plm:base:launch_apps for job [40688,1]
[chefli:02487] [[40688,0],0] plm:base:report_launched for job [40688,1]
[chefli:02487] [[40688,0],0] odls:constructing child list
[chefli:02487] [[40688,0],0] odls:construct_child_list unpacking data to launch
job [40688,1]
[chefli:02487] [[40688,0],0] odls:construct_child_list adding new jobdat for
job [40688,1]
[chefli:02487] [[40688,0],0] odls:construct_child_list unpacking 1 app_contexts
[chefli:02487] [[40688,0],0] odls:constructing child list - checking proc 0 on
node 1 with daemon 1
[chefli:02487] [[40688,0],0] odls:construct:child: num_participating 1
[chefli:02487] [[40688,0],0] odls:launch found 12 processors for 0 children and
set oversubscribed to false
[chefli:02487] [[40688,0],0] odls:launch reporting job [40688,1] launch status
[chefli:02487] [[40688,0],0] odls:launch setting waitpids
[chefli:02487] [[40688,0],0] plm:base:app_report_launch from daemon
[[40688,0],0]
[chefli:02487] [[40688,0],0] plm:base:app_report_launch completed processing
[squid_0:19613] [[40688,0],1] odls:constructing child list
[squid_0:19613] [[40688,0],1] odls:construct_child_list unpacking data to
launch job [40688,1]
[squid_0:19613] [[40688,0],1] odls:construct_child_list adding new jobdat for
job [40688,1]
[squid_0:19613] [[40688,0],1] odls:construct_child_list unpacking 1 app_contexts
[squid_0:19613] [[40688,0],1] odls:constructing child list - checking proc 0 on
node 1 with daemon 1
[squid_0:19613] [[40688,0],1] odls:constructing child list - found proc 0 for
me!
[squid_0:19613] [[40688,0],1] odls:construct:child: num_participating 1
[squid_0:19613] [[40688,0],1] odls:launch found 4 processors for 1 children and
set oversubscribed to false
[squid_0:19613] [[40688,0],1] odls:launch reporting job [40688,1] launch status
[squid_0:19613] [[40688,0],1] odls:launch setting waitpids
[chefli:02487] [[40688,0],0] plm:base:app_report_launch reissuing non-blocking
recv
[chefli:02487] [[40688,0],0] plm:base:app_report_launch from daemon
[[40688,0],1]
[chefli:02487] [[40688,0],0] plm:base:app_report_launched for proc
[[40688,1],0] from daemon [[40688,0],1]: pid 19617 state 2 exit 0
[chefli:02487] [[40688,0],0] plm:base:app_report_launch completed processing
[chefli:02487] [[40688,0],0] plm:base:report_launched all apps reported
[chefli:02487] [[40688,0],0] plm:base:launch wiring up iof
[chefli:02487] [[40688,0],0] plm:base:launch completed for job [40688,1]
[squid_0:19613] [[40688,0],1] odls:wait_local_proc child process 19617
terminated
[squid_0:19613] [[40688,0],1] odls:waitpid_fired checking abort file
/tmp/openmpi-sessions-jody@squid_0_0/2666528769/0/abort
[squid_0:19613] [[40688,0],1] odls:waitpid_fired child process [[40688,1],0]
terminated normally
[squid_0:19613] [[40688,0],1] odls:notify_iof_complete for child [[40688,1],0]
[squid_0:19613] [[40688,0],1] odls:proc_complete reporting all procs in
[40688,1] terminated
[chefli:02487] [[40688,0],0] plm:base:receive got message from [[40688,0],1]
[chefli:02487] [[40688,0],0] plm:base:receive got update_proc_state for job
[40688,1]
[chefli:02487] [[40688,0],0] plm:base:receive got update_proc_state for proc
[[40688,1],0] curnt state 2 new state 80 exit_code 0
[chefli:02487] [[40688,0],0] plm:base:check_job_completed for job [40688,1] -
num_terminated 1 num_procs 1
[chefli:02487] [[40688,0],0] plm:base:check_job_completed declared job
[40688,1] normally terminated - checking all jobs
[chefli:02487] [[40688,0],0] plm:base:check_job_completed all jobs terminated -
waking up
[chefli:02487] [[40688,0],0] plm:base:orted_cmd sending orted_exit commands
[chefli:02487] [[40688,0],0] odls:kill_local_proc working on job [WILDCARD]
[chefli:02487] [[40688,0],0] plm:base:check_job_completed for job [40688,0] -
num_terminated 1 num_procs 2
[chefli:02487] [[40688,0],0] plm:base:receive got message from [[40688,0],1]
[chefli:02487] [[40688,0],0] plm:base:receive got update_proc_state for job
[40688,0]
[chefli:02487] [[40688,0],0] plm:base:receive got update_proc_state for proc
[[40688,0],1] curnt state 4 new state 80 exit_code 0
[chefli:02487] [[40688,0],0] plm:base:check_job_completed for job [40688,0] -
num_terminated 2 num_procs 2
[chefli:02487] [[40688,0],0] plm:base:check_job_completed declared job
[40688,0] normally terminated - checking all jobs
[squid_0:19613] [[40688,0],1] odls:kill_local_proc working on job [WILDCARD]
[chefli:02487] [[40688,0],0] plm:base:receive stop comm
jody@chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca plm_rsh_agent "ssh -Y" -mca plm_base_verbose 5 -mca odls_base_verbose 5 --leave-session-attached xterm
[chefli:02619] mca:base:select:( plm) Querying component [rsh]
[chefli:02619] mca:base:select:( plm) Query of component [rsh] set priority to
10
[chefli:02619] mca:base:select:( plm) Querying component [slurm]
[chefli:02619] mca:base:select:( plm) Skipping component [slurm]. Query failed
to return a module
[chefli:02619] mca:base:select:( plm) Selected component [rsh]
[chefli:02619] plm:base:set_hnp_name: initial bias 2619 nodename hash 72192778
[chefli:02619] plm:base:set_hnp_name: final jobfam 40316
[chefli:02619] [[40316,0],0] plm:base:receive start comm
[chefli:02619] mca:base:select:( odls) Querying component [default]
[chefli:02619] mca:base:select:( odls) Query of component [default] set
priority to 1
[chefli:02619] mca:base:select:( odls) Selected component [default]
[chefli:02619] [[40316,0],0] plm:rsh: setting up job [40316,1]
[chefli:02619] [[40316,0],0] plm:base:setup_job for job [40316,1]
[chefli:02619] [[40316,0],0] plm:rsh: local shell: 0 (bash)
[chefli:02619] [[40316,0],0] plm:rsh: assuming same remote shell as local shell
[chefli:02619] [[40316,0],0] plm:rsh: remote shell: 0 (bash)
[chefli:02619] [[40316,0],0] plm:rsh: final template argv:
    /usr/bin/ssh -Y <template> orted -mca ess env -mca orte_ess_jobid 2642149376 -mca orte_ess_vpid <template> -mca orte_ess_num_procs 2 --hnp-uri "2642149376.0;tcp://192.168.0.14:57848" -mca plm_base_verbose 5 -mca odls_base_verbose 5 -mca plm_rsh_agent "ssh -Y"
[chefli:02619] [[40316,0],0] plm:rsh: launching on node squid_0
[chefli:02619] [[40316,0],0] plm:rsh: recording launch of daemon [[40316,0],1]
[chefli:02619] [[40316,0],0] plm:rsh: executing: (//usr/bin/ssh) [/usr/bin/ssh -Y squid_0 orted -mca ess env -mca orte_ess_jobid 2642149376 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri "2642149376.0;tcp://192.168.0.14:57848" -mca plm_base_verbose 5 -mca odls_base_verbose 5 -mca plm_rsh_agent "ssh -Y"]
[squid_0:20023] mca:base:select:( odls) Querying component [default]
[squid_0:20023] mca:base:select:( odls) Query of component [default] set
priority to 1
[squid_0:20023] mca:base:select:( odls) Selected component [default]
[chefli:02619] [[40316,0],0] plm:base:daemon_callback
[chefli:02619] [[40316,0],0] plm:base:orted_report_launch from daemon
[[40316,0],1]
[chefli:02619] [[40316,0],0] plm:base:orted_report_launch completed for daemon
[[40316,0],1]
[chefli:02619] [[40316,0],0] plm:base:daemon_callback completed
[chefli:02619] [[40316,0],0] plm:base:launch_apps for job [40316,1]
[chefli:02619] [[40316,0],0] plm:base:report_launched for job [40316,1]
[chefli:02619] [[40316,0],0] odls:constructing child list
[chefli:02619] [[40316,0],0] odls:construct_child_list unpacking data to launch
job [40316,1]
[chefli:02619] [[40316,0],0] odls:construct_child_list adding new jobdat for
job [40316,1]
[chefli:02619] [[40316,0],0] odls:construct_child_list unpacking 1 app_contexts
[chefli:02619] [[40316,0],0] odls:constructing child list - checking proc 0 on
node 1 with daemon 1
[chefli:02619] [[40316,0],0] odls:construct:child: num_participating 1
[chefli:02619] [[40316,0],0] odls:launch found 12 processors for 0 children and
set oversubscribed to false
[chefli:02619] [[40316,0],0] odls:launch reporting job [40316,1] launch status
[chefli:02619] [[40316,0],0] odls:launch setting waitpids
[chefli:02619] [[40316,0],0] plm:base:app_report_launch from daemon
[[40316,0],0]
[chefli:02619] [[40316,0],0] plm:base:app_report_launch completed processing
[squid_0:20023] [[40316,0],1] odls:constructing child list
[squid_0:20023] [[40316,0],1] odls:construct_child_list unpacking data to
launch job [40316,1]
[squid_0:20023] [[40316,0],1] odls:construct_child_list adding new jobdat for
job [40316,1]
[squid_0:20023] [[40316,0],1] odls:construct_child_list unpacking 1 app_contexts
[squid_0:20023] [[40316,0],1] odls:constructing child list - checking proc 0 on
node 1 with daemon 1
[squid_0:20023] [[40316,0],1] odls:constructing child list - found proc 0 for
me!
[squid_0:20023] [[40316,0],1] odls:construct:child: num_participating 1
[squid_0:20023] [[40316,0],1] odls:launch found 4 processors for 1 children and
set oversubscribed to false
[chefli:02619] [[40316,0],0] plm:base:app_report_launch reissuing non-blocking
recv
[chefli:02619] [[40316,0],0] plm:base:app_report_launch from daemon
[[40316,0],1]
[chefli:02619] [[40316,0],0] plm:base:app_report_launched for proc
[[40316,1],0] from daemon [[40316,0],1]: pid 20027 state 2 exit 0
[chefli:02619] [[40316,0],0] plm:base:app_report_launch completed processing
[chefli:02619] [[40316,0],0] plm:base:report_launched all apps reported
[chefli:02619] [[40316,0],0] plm:base:launch wiring up iof
[chefli:02619] [[40316,0],0] plm:base:launch completed for job [40316,1]
[squid_0:20023] [[40316,0],1] odls:launch reporting job [40316,1] launch status
[squid_0:20023] [[40316,0],1] odls:launch setting waitpids
[chefli:02619] [[40316,0],0] plm:base:receive got message from [[40316,0],1]
[squid_0:20023] [[40316,0],1] odls:wait_local_proc child process 20027
terminated
[squid_0:20023] [[40316,0],1] odls:waitpid_fired checking abort file
/tmp/openmpi-sessions-jody@squid_0_0/2642149377/0/abort
[squid_0:20023] [[40316,0],1] odls:waitpid_fired child process [[40316,1],0]
terminated normally
[squid_0:20023] [[40316,0],1] odls:notify_iof_complete for child [[40316,1],0]
[chefli:02619] [[40316,0],0] plm:base:receive got update_proc_state for job
[40316,1]
[squid_0:20023] [[40316,0],1] odls:proc_complete reporting all procs in
[40316,1] terminated
[chefli:02619] [[40316,0],0] plm:base:receive got update_proc_state for proc
[[40316,1],0] curnt state 2 new state 80 exit_code 0
[chefli:02619] [[40316,0],0] plm:base:check_job_completed for job [40316,1] -
num_terminated 1 num_procs 1
[chefli:02619] [[40316,0],0] plm:base:check_job_completed declared job
[40316,1] normally terminated - checking all jobs
[chefli:02619] [[40316,0],0] plm:base:check_job_completed all jobs terminated -
waking up
[chefli:02619] [[40316,0],0] plm:base:orted_cmd sending orted_exit commands
[chefli:02619] [[40316,0],0] odls:kill_local_proc working on job [WILDCARD]
[chefli:02619] [[40316,0],0] plm:base:check_job_completed for job [40316,0] -
num_terminated 1 num_procs 2
[chefli:02619] [[40316,0],0] plm:base:receive got message from [[40316,0],1]
[chefli:02619] [[40316,0],0] plm:base:receive got update_proc_state for job
[40316,0]
[chefli:02619] [[40316,0],0] plm:base:receive got update_proc_state for proc
[[40316,0],1] curnt state 4 new state 80 exit_code 0
[chefli:02619] [[40316,0],0] plm:base:check_job_completed for job [40316,0] -
num_terminated 2 num_procs 2
[chefli:02619] [[40316,0],0] plm:base:check_job_completed declared job
[40316,0] normally terminated - checking all jobs
[chefli:02619] [[40316,0],0] plm:base:receive stop comm
[squid_0:20023] [[40316,0],1] odls:kill_local_proc working on job [WILDCARD]