Hi Jody,

I'm not sure when I'll get a chance to work on this, as I have a deadline to meet. I do have a couple of suggestions, though, if you wouldn't mind helping debug the problem.
It looks to me like mpirun is crashing or terminating early for some reason. That would explain both the failures to send msgs to it and the "lifeline lost" error that leads to the termination of the daemon.

If you build a debug version of the code (i.e., --enable-debug on configure), you can get a lot of debug info that traces the behavior. If you could then run your program with

    -mca plm_base_verbose 5 -mca odls_base_verbose 5 --leave-session-attached

and send me the output, we'll see what ORTE thinks it is doing.

You could also take a look at the code that implements the xterm option. You'll find it in orte/mca/odls/base/odls_base_default_fns.c around line 1115. The xterm command syntax is defined in orte/mca/odls/base/odls_base_open.c around line 233 and following. Note that we use "xterm -T" as the cmd. Perhaps you can spot an error in the way we treat xterm?

Also, remember that you have to specify that you want us to "hold" the xterm window open even after the process terminates. If you don't specify it, the window automatically closes upon completion of the process. So a fast-running cmd like "hostname" might disappear so quickly that it causes a race condition problem.

You might want to try a spinner application, i.e., one that outputs something and then sits in a loop or sleeps for some period of time. Or use the "hold" option to keep the window open. You designate "hold" by putting a '!' before the rank, e.g.,

    mpirun -np 2 -xterm \!2 hostname

On Apr 28, 2011, at 8:38 AM, jody wrote:

> Hi
>
> Unfortunately this does not solve my problem.
> While i can do
> ssh -Y squid_0 xterm
> and this will open an xterm on my machine (chefli),
> i run into problems with the -xterm option of openmpi:
>
> jody@chefli ~/share/neander $ mpirun -np 4 -mca plm_rsh_agent "ssh -Y" -host squid_0 --xterm 1 hostname
> squid_0
> [squid_0:28046] [[35219,0],1]->[[35219,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 8]
> [squid_0:28046] [[35219,0],1] routed:binomial: Connection to lifeline [[35219,0],0] lost
> [squid_0:28046] [[35219,0],1]->[[35219,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 8]
> [squid_0:28046] [[35219,0],1] routed:binomial: Connection to lifeline [[35219,0],0] lost
> /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>
> By the way when i look at the DISPLAY variable in the xterm window
> opened via squid_0,
> i also have the display variable "localhost:11.0"
>
> Actually, the difference with using the "-mca plm_rsh_agent" is that
> the lines with the warnings about "xauth" and "untrusted X" do not
> appear:
>
> jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -xterm 1 hostname
> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
> Warning: No xauth data; using fake authentication data for X11 forwarding.
> squid_0
> [squid_0:28337] [[34926,0],1]->[[34926,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 8]
> [squid_0:28337] [[34926,0],1] routed:binomial: Connection to lifeline [[34926,0],0] lost
> [squid_0:28337] [[34926,0],1]->[[34926,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 8]
> [squid_0:28337] [[34926,0],1] routed:binomial: Connection to lifeline [[34926,0],0] lost
> /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>
>
> I have doubts that the "-Y" is passed correctly:
> jody@triops ~/share/neander $ mpirun -np -mca plm_rsh_agent "ssh -Y" -host squid_0 xterm
> xterm Xt error: Can't open display:
> xterm: DISPLAY is not set
> xterm Xt error: Can't open display:
> xterm: DISPLAY is not set
>
>
> ---> as a matter of fact i noticed that the xterm option doesn't work locally:
> mpirun -np 4 -xterm 1 /usr/bin/printenv
> prints everything onto the console.
>
> Do you have any other suggestions i could try?
>
> Thank You
> Jody
>
> On Thu, Apr 28, 2011 at 3:06 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> Should be able to just set
>>
>> -mca plm_rsh_agent "ssh -Y"
>>
>> on your cmd line, I believe
>>
>> On Apr 28, 2011, at 12:53 AM, jody wrote:
>>
>>> Hi Ralph
>>>
>>> Is there an easy way i could modify the OpenMPI code so that it would use
>>> the -Y option for ssh when connecting to remote machines?
>>>
>>> Thank You
>>> Jody
>>>
>>> On Thu, Apr 7, 2011 at 4:01 PM, jody <jody....@gmail.com> wrote:
>>>> Hi Ralph
>>>> thank you for your suggestions. After some fiddling, i found that after my
>>>> last update (gentoo) my sshd_config had been overwritten
>>>> (X11Forwarding was set to 'no').
>>>>
>>>> After correcting that, i can now open remote terminals with 'ssh -Y'
>>>> and with 'ssh -X'
>>>> (but with '-X' i still get those xauth warnings)
>>>>
>>>> But the xterm option still doesn't work:
>>>> jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -xterm 1,2 printenv | grep WORLD_RANK
>>>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>> /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>> /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>> OMPI_COMM_WORLD_RANK=0
>>>> [aim-squid_0:09856] [[54132,0],1]->[[54132,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 8]
>>>> [aim-squid_0:09856] [[54132,0],1] routed:binomial: Connection to lifeline [[54132,0],0] lost
>>>>
>>>> So it looks like the two processes from squid_0 can't open the display
>>>> this way,
>>>> but one of them writes the output to the console...
>>>> Surprisingly, they are trying 'localhost:11.0' whereas when i use 'ssh -Y'
>>>> the DISPLAY variable is set to 'localhost:10.0'
>>>>
>>>> So in what way would OMPI have to be adapted, so -xterm would work?
>>>>
>>>> Thank You
>>>> Jody
>>>>
>>>> On Wed, Apr 6, 2011 at 8:32 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>> Here's a little more info - it's for Cygwin, but I don't see anything
>>>>> Cygwin-specific in the answers:
>>>>> http://x.cygwin.com/docs/faq/cygwin-x-faq.html#q-ssh-no-x11forwarding
>>>>>
>>>>> On Apr 6, 2011, at 12:30 PM, Ralph Castain wrote:
>>>>>
>>>>> Sorry Jody - I should have read your note more carefully to see that you
>>>>> already tried -Y. :-(
>>>>> Not sure what to suggest...
>>>>>
>>>>> On Apr 6, 2011, at 12:29 PM, Ralph Castain wrote:
>>>>>
>>>>> Like I said, I'm not expert.
>>>>> However, a quick "google" revealed this
>>>>> result:
>>>>>
>>>>> When trying to set up x11 forwarding over an ssh session to a remote
>>>>> server
>>>>> with the -X switch, I was getting an error like Warning: No xauth
>>>>> data; using fake authentication data for X11 forwarding.
>>>>>
>>>>> When doing something like:
>>>>> ssh -Xl root 10.1.1.9 to a remote server, the authentication worked, but I
>>>>> got an error message like:
>>>>>
>>>>>
>>>>> jason@badman ~/bin $ ssh -Xl root 10.1.1.9
>>>>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>> Last login: Wed Apr 14 18:18:39 2010 from 10.1.1.5
>>>>> [root@RHEL ~]#
>>>>> and any X programs I ran would not display on my local system..
>>>>>
>>>>> Turns out the solution is to use the -Y switch instead.
>>>>>
>>>>> ssh -Yl root 10.1.1.9
>>>>>
>>>>> and that worked fine.
>>>>>
>>>>> See if that works for you - if it does, we may have to modify OMPI to
>>>>> accommodate.
>>>>>
>>>>> On Apr 6, 2011, at 9:19 AM, jody wrote:
>>>>>
>>>>> Hi Ralph
>>>>> No, after the above error message mpirun has exited.
>>>>>
>>>>> But i also noticed that it is not possible to ssh into squid_0 and open an xterm there:
>>>>>
>>>>> jody@chefli ~/share/neander $ ssh -Y squid_0
>>>>> Last login: Wed Apr 6 17:14:02 CEST 2011 from chefli.uzh.ch on pts/0
>>>>> jody@squid_0 ~ $ xterm
>>>>> xterm Xt error: Can't open display:
>>>>> xterm: DISPLAY is not set
>>>>> jody@squid_0 ~ $ export DISPLAY=130.60.126.74:0.0
>>>>> jody@squid_0 ~ $ xterm
>>>>> xterm Xt error: Can't open display: 130.60.126.74:0.0
>>>>> jody@squid_0 ~ $ export DISPLAY=chefli.uzh.ch:0.0
>>>>> jody@squid_0 ~ $ xterm
>>>>> xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>> jody@squid_0 ~ $ exit
>>>>> logout
>>>>>
>>>>> same thing with ssh -X, but here i get the same warning/error message
>>>>> as with mpirun:
>>>>>
>>>>> jody@chefli ~/share/neander $ ssh -X squid_0
>>>>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>> Last login: Wed Apr 6 17:12:31 CEST 2011 from chefli.uzh.ch on ssh
>>>>>
>>>>> So perhaps the whole problem is linked to that xauth-thing.
>>>>> Do you have a suggestion how this can be solved?
>>>>>
>>>>> Thank You
>>>>> Jody
>>>>>
>>>>> On Wed, Apr 6, 2011 at 4:41 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>
>>>>> If I read your error messages correctly, it looks like mpirun is crashing -
>>>>> the daemon is complaining that it lost the socket connection back to
>>>>> mpirun, and hence will abort.
>>>>>
>>>>> Are you seeing mpirun still alive?
>>>>>
>>>>>
>>>>> On Apr 5, 2011, at 4:46 AM, jody wrote:
>>>>>
>>>>> Hi
>>>>>
>>>>> On my workstation and the cluster i set up OpenMPI (v 1.4.2) so that
>>>>> it works in "text-mode":
>>>>>
>>>>> $ mpirun -np 4 -x DISPLAY -host squid_0 printenv | grep WORLD_RANK
>>>>> OMPI_COMM_WORLD_RANK=0
>>>>> OMPI_COMM_WORLD_RANK=1
>>>>> OMPI_COMM_WORLD_RANK=2
>>>>> OMPI_COMM_WORLD_RANK=3
>>>>>
>>>>> but when i use the -xterm option to mpirun, it doesn't work
>>>>>
>>>>> $ mpirun -np 4 -x DISPLAY -host squid_0 -xterm 1,2 printenv | grep WORLD_RANK
>>>>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>> OMPI_COMM_WORLD_RANK=0
>>>>> [squid_0:05266] [[55607,0],1]->[[55607,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 8]
>>>>> [squid_0:05266] [[55607,0],1] routed:binomial: Connection to lifeline [[55607,0],0] lost
>>>>> /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>> /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>
>>>>> (strange: somebody wrote his message to the console)
>>>>>
>>>>> No matter whether i set the DISPLAY variable to the full hostname of
>>>>> the workstation,
>>>>> to the IP address of the workstation or simply to ":0.0", it doesn't work
>>>>>
>>>>> But i do have xauth data (as far as i know):
>>>>>
>>>>> On the remote (squid_0):
>>>>> jody@squid_0 ~ $ xauth list
>>>>> chefli/unix:10 MIT-MAGIC-COOKIE-1 5293e179bc7b2036d87cbcdf14891d0c
>>>>> chefli/unix:0 MIT-MAGIC-COOKIE-1 146c7f438fab79deb8a8a7df242b6f4b
>>>>> chefli.uzh.ch:0 MIT-MAGIC-COOKIE-1 146c7f438fab79deb8a8a7df242b6f4b
>>>>>
>>>>> on the workstation:
>>>>> $ xauth list
>>>>> chefli/unix:10 MIT-MAGIC-COOKIE-1 5293e179bc7b2036d87cbcdf14891d0c
>>>>> chefli/unix:0 MIT-MAGIC-COOKIE-1 146c7f438fab79deb8a8a7df242b6f4b
>>>>> localhost.localdomain/unix:0 MIT-MAGIC-COOKIE-1 146c7f438fab79deb8a8a7df242b6f4b
>>>>> chefli.uzh.ch/unix:0 MIT-MAGIC-COOKIE-1 146c7f438fab79deb8a8a7df242b6f4b
>>>>>
>>>>> In sshd_config on the workstation i have 'X11Forwarding yes'
>>>>>
>>>>> I have also done
>>>>> xhost + squid_0
>>>>> on the workstation.
>>>>>
>>>>>
>>>>> How can i get the -xterm option running?
>>>>>
>>>>> Thank You
>>>>> Jody
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
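
As a follow-up to the "spinner application" suggestion in the reply at the top of this thread, here is a minimal sketch of what such a test program might look like. It is only an illustration, not part of Open MPI: the file name spinner.py and the helper names describe_process, parse_hold_seconds and spin are made up for this example. The only thing taken from the thread is the OMPI_COMM_WORLD_RANK environment variable, which appears in the printenv output quoted above. The program prints one line and then stays alive for the number of seconds given as its first argument, so an xterm window would not close immediately.

```python
import os
import socket
import sys
import time


def describe_process(env=None):
    """Return a one-line label: the rank exported by the launcher plus the host name."""
    env = os.environ if env is None else env
    # OMPI_COMM_WORLD_RANK is the variable seen in the printenv output in this thread.
    rank = env.get("OMPI_COMM_WORLD_RANK", "unknown")
    return f"rank {rank} on {socket.gethostname()}"


def parse_hold_seconds(argv):
    """Read the hold time from the first argument; use 0 if it is missing or invalid."""
    try:
        return max(0.0, float(argv[1]))
    except (IndexError, ValueError):
        return 0.0


def spin(seconds, tick=1.0, sleep=time.sleep):
    """Stay alive for roughly `seconds`, sleeping in short ticks; return the tick count."""
    ticks = 0
    remaining = seconds
    while remaining > 0:
        step = min(tick, remaining)
        sleep(step)
        remaining -= step
        ticks += 1
    return ticks


def main(argv):
    # Print something first so there is visible output, then keep the process alive.
    print(describe_process(), flush=True)
    spin(parse_hold_seconds(argv))


if __name__ == "__main__":
    main(sys.argv)
```

Assuming python3 is available on the remote host, it could be tried in place of hostname, for example "mpirun -np 2 -xterm 1 python3 spinner.py 60". Whether this changes the behavior reported above is untested; it only removes the fast-exit race as a possible cause.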