Hi Ralph

Thank you for your suggestions. I'll be happy to help you. I'm not sure if I'll get around to this tomorrow, but I certainly will do so on Monday.
Thanks
Jody

On Thu, Apr 28, 2011 at 11:53 PM, Ralph Castain <r...@open-mpi.org> wrote:
> Hi Jody
>
> I'm not sure when I'll get a chance to work on this - got a deadline to meet.
> I do have a couple of suggestions, if you wouldn't mind helping debug the
> problem?
>
> It looks to me like the problem is that mpirun is crashing or terminating
> early for some reason - hence the failures to send msgs to it, and the
> "lifeline lost" error that leads to the termination of the daemon. If you
> build a debug version of the code (i.e., --enable-debug on configure), you
> can get a lot of debug info that traces the behavior.
>
> If you could then run your program with
>
> -mca plm_base_verbose 5 -mca odls_base_verbose 5 --leave-session-attached
>
> and send the output to me, we'll see what ORTE thinks it is doing.
>
> You could also take a look at the code for implementing the xterm option.
> You'll find it in
>
> orte/mca/odls/base/odls_base_default_fns.c
>
> around line 1115. The xterm command syntax is defined in
>
> orte/mca/odls/base/odls_base_open.c
>
> around line 233 and following. Note that we use "xterm -T" as the cmd.
> Perhaps you can spot an error in the way we treat xterm?
>
> Also, remember that you have to specify that you want us to "hold" the xterm
> window open even after the process terminates. If you don't specify it, the
> window automatically closes upon completion of the process. So a fast-running
> cmd like "hostname" might disappear so quickly that it causes a race
> condition problem.
>
> You might want to try a spinner application - i.e., output something and
> then sit in a loop or sleep for some period of time. Or, use the "hold"
> option to keep the window open - you designate "hold" by putting a '!' before
> the rank, e.g., "mpirun -np 2 -xterm \!2 hostname"
>
>
> On Apr 28, 2011, at 8:38 AM, jody wrote:
>
>> Hi
>>
>> Unfortunately this does not solve my problem.
>> While I can do
>> ssh -Y squid_0 xterm
>> and this will open an xterm on my machine (chefli),
>> I run into problems with the -xterm option of openmpi:
>>
>> jody@chefli ~/share/neander $ mpirun -np 4 -mca plm_rsh_agent "ssh -Y" -host squid_0 --xterm 1 hostname
>> squid_0
>> [squid_0:28046] [[35219,0],1]->[[35219,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 8]
>> [squid_0:28046] [[35219,0],1] routed:binomial: Connection to lifeline [[35219,0],0] lost
>> [squid_0:28046] [[35219,0],1]->[[35219,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 8]
>> [squid_0:28046] [[35219,0],1] routed:binomial: Connection to lifeline [[35219,0],0] lost
>> /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>
>> By the way, when I look at the DISPLAY variable in the xterm window
>> opened via squid_0,
>> I also have the display variable "localhost:11.0"
>>
>> Actually, the difference with using the "-mca plm_rsh_agent" is that
>> the lines with the warnings about "xauth" and "untrusted X" do not
>> appear:
>>
>> jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -xterm 1 hostname
>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>> squid_0
>> [squid_0:28337] [[34926,0],1]->[[34926,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 8]
>> [squid_0:28337] [[34926,0],1] routed:binomial: Connection to lifeline [[34926,0],0] lost
>> [squid_0:28337] [[34926,0],1]->[[34926,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 8]
>> [squid_0:28337] [[34926,0],1] routed:binomial: Connection to lifeline [[34926,0],0] lost
>> /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>
>>
>> I have doubts that the "-Y" is passed correctly:
>> jody@triops ~/share/neander $ mpirun -np -mca plm_rsh_agent "ssh -Y" -host squid_0 xterm
>> xterm Xt error: Can't open display:
>> xterm: DISPLAY is not set
>> xterm Xt error: Can't open display:
>> xterm: DISPLAY is not set
>>
>>
>> ---> as a matter of fact I noticed that the xterm option doesn't work
>> locally:
>> mpirun -np 4 -xterm 1 /usr/bin/printenv
>> prints everything onto the console.
>>
>> Do you have any other suggestions I could try?
>>
>> Thank You
>> Jody
>>
>> On Thu, Apr 28, 2011 at 3:06 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>> Should be able to just set
>>>
>>> -mca plm_rsh_agent "ssh -Y"
>>>
>>> on your cmd line, I believe
>>>
>>> On Apr 28, 2011, at 12:53 AM, jody wrote:
>>>
>>>> Hi Ralph
>>>>
>>>> Is there an easy way I could modify the OpenMPI code so that it would use
>>>> the -Y option for ssh when connecting to remote machines?
>>>>
>>>> Thank You
>>>> Jody
>>>>
>>>> On Thu, Apr 7, 2011 at 4:01 PM, jody <jody....@gmail.com> wrote:
>>>>> Hi Ralph
>>>>> Thank you for your suggestions. After some fiddling, I found that after my
>>>>> last update (gentoo) my sshd_config had been overwritten
>>>>> (X11Forwarding was set to 'no').
>>>>>
>>>>> After correcting that, I can now open remote terminals with 'ssh -Y'
>>>>> and with 'ssh -X'
>>>>> (but with '-X' I still get those xauth warnings)
>>>>>
>>>>> But the xterm option still doesn't work:
>>>>> jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -xterm 1,2 printenv | grep WORLD_RANK
>>>>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>> /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>>> /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>>> OMPI_COMM_WORLD_RANK=0
>>>>> [aim-squid_0:09856] [[54132,0],1]->[[54132,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 8]
>>>>> [aim-squid_0:09856] [[54132,0],1] routed:binomial: Connection to lifeline [[54132,0],0] lost
>>>>>
>>>>> So it looks like the two processes from squid_0 can't open the display
>>>>> this way,
>>>>> but one of them writes the output to the console...
>>>>> Surprisingly, they are trying 'localhost:11.0' whereas when I use
>>>>> 'ssh -Y' the
>>>>> DISPLAY variable is set to 'localhost:10.0'
>>>>>
>>>>> So in what way would OMPI have to be adapted, so -xterm would work?
>>>>>
>>>>> Thank You
>>>>> Jody
>>>>>
>>>>> On Wed, Apr 6, 2011 at 8:32 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>> Here's a little more info - it's for Cygwin, but I don't see anything
>>>>>> Cygwin-specific in the answers:
>>>>>> http://x.cygwin.com/docs/faq/cygwin-x-faq.html#q-ssh-no-x11forwarding
>>>>>>
>>>>>> On Apr 6, 2011, at 12:30 PM, Ralph Castain wrote:
>>>>>>
>>>>>> Sorry Jody - I should have read your note more carefully to see that you
>>>>>> already tried -Y. :-(
>>>>>> Not sure what to suggest...
>>>>>>
>>>>>> On Apr 6, 2011, at 12:29 PM, Ralph Castain wrote:
>>>>>>
>>>>>> Like I said, I'm not an expert. However, a quick "google" revealed this
>>>>>> result:
>>>>>>
>>>>>> When trying to set up x11 forwarding over an ssh session to a remote
>>>>>> server
>>>>>> with the -X switch, I was getting an error like Warning: No xauth
>>>>>> data; using fake authentication data for X11 forwarding.
>>>>>>
>>>>>> When doing something like:
>>>>>> ssh -Xl root 10.1.1.9 to a remote server, the authentication worked, but
>>>>>> I
>>>>>> got an error message like:
>>>>>>
>>>>>>
>>>>>> jason@badman ~/bin $ ssh -Xl root 10.1.1.9
>>>>>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>> Last login: Wed Apr 14 18:18:39 2010 from 10.1.1.5
>>>>>> [root@RHEL ~]#
>>>>>> and any X programs I ran would not display on my local system.
>>>>>>
>>>>>> Turns out the solution is to use the -Y switch instead.
>>>>>>
>>>>>> ssh -Yl root 10.1.1.9
>>>>>>
>>>>>> and that worked fine.
>>>>>>
>>>>>> See if that works for you - if it does, we may have to modify OMPI to
>>>>>> accommodate.
>>>>>>
>>>>>> On Apr 6, 2011, at 9:19 AM, jody wrote:
>>>>>>
>>>>>> Hi Ralph
>>>>>> No, after the above error message mpirun has exited.
>>>>>>
>>>>>> But I also noticed that it is not possible to ssh into squid_0 and open an xterm there:
>>>>>>
>>>>>> jody@chefli ~/share/neander $ ssh -Y squid_0
>>>>>> Last login: Wed Apr 6 17:14:02 CEST 2011 from chefli.uzh.ch on pts/0
>>>>>> jody@squid_0 ~ $ xterm
>>>>>> xterm Xt error: Can't open display:
>>>>>> xterm: DISPLAY is not set
>>>>>> jody@squid_0 ~ $ export DISPLAY=130.60.126.74:0.0
>>>>>> jody@squid_0 ~ $ xterm
>>>>>> xterm Xt error: Can't open display: 130.60.126.74:0.0
>>>>>> jody@squid_0 ~ $ export DISPLAY=chefli.uzh.ch:0.0
>>>>>> jody@squid_0 ~ $ xterm
>>>>>> xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>> jody@squid_0 ~ $ exit
>>>>>> logout
>>>>>>
>>>>>> Same thing with ssh -X, but here I get the same warning/error message
>>>>>> as with mpirun:
>>>>>>
>>>>>> jody@chefli ~/share/neander $ ssh -X squid_0
>>>>>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>> Last login: Wed Apr 6 17:12:31 CEST 2011 from chefli.uzh.ch on ssh
>>>>>>
>>>>>> So perhaps the whole problem is linked to that xauth-thing.
>>>>>> Do you have a suggestion how this can be solved?
>>>>>>
>>>>>> Thank You
>>>>>> Jody
>>>>>>
>>>>>> On Wed, Apr 6, 2011 at 4:41 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>
>>>>>> If I read your error messages correctly, it looks like mpirun is
>>>>>> crashing -
>>>>>> the daemon is complaining that it lost the socket connection back to
>>>>>> mpirun,
>>>>>> and hence will abort.
>>>>>>
>>>>>> Are you seeing mpirun still alive?
>>>>>>
>>>>>>
>>>>>> On Apr 5, 2011, at 4:46 AM, jody wrote:
>>>>>>
>>>>>> Hi
>>>>>>
>>>>>> On my workstation and the cluster I set up OpenMPI (v 1.4.2) so that
>>>>>> it works in "text-mode":
>>>>>>
>>>>>> $ mpirun -np 4 -x DISPLAY -host squid_0 printenv | grep WORLD_RANK
>>>>>> OMPI_COMM_WORLD_RANK=0
>>>>>> OMPI_COMM_WORLD_RANK=1
>>>>>> OMPI_COMM_WORLD_RANK=2
>>>>>> OMPI_COMM_WORLD_RANK=3
>>>>>>
>>>>>> but when I use the -xterm option to mpirun, it doesn't work:
>>>>>>
>>>>>> $ mpirun -np 4 -x DISPLAY -host squid_0 -xterm 1,2 printenv | grep WORLD_RANK
>>>>>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>> OMPI_COMM_WORLD_RANK=0
>>>>>> [squid_0:05266] [[55607,0],1]->[[55607,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 8]
>>>>>> [squid_0:05266] [[55607,0],1] routed:binomial: Connection to lifeline [[55607,0],0] lost
>>>>>> /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>> /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>
>>>>>> (strange: somebody wrote his message to the console)
>>>>>>
>>>>>> No matter whether I set the DISPLAY variable to the full hostname of
>>>>>> the workstation,
>>>>>> to the IP address of the workstation or simply to ":0.0", it doesn't work.
>>>>>>
>>>>>> But I do have xauth data (as far as I know):
>>>>>>
>>>>>> On the remote (squid_0):
>>>>>>
>>>>>> jody@squid_0 ~ $ xauth list
>>>>>> chefli/unix:10 MIT-MAGIC-COOKIE-1 5293e179bc7b2036d87cbcdf14891d0c
>>>>>> chefli/unix:0 MIT-MAGIC-COOKIE-1 146c7f438fab79deb8a8a7df242b6f4b
>>>>>> chefli.uzh.ch:0 MIT-MAGIC-COOKIE-1 146c7f438fab79deb8a8a7df242b6f4b
>>>>>>
>>>>>> on the workstation:
>>>>>>
>>>>>> $ xauth list
>>>>>> chefli/unix:10 MIT-MAGIC-COOKIE-1 5293e179bc7b2036d87cbcdf14891d0c
>>>>>> chefli/unix:0 MIT-MAGIC-COOKIE-1 146c7f438fab79deb8a8a7df242b6f4b
>>>>>> localhost.localdomain/unix:0 MIT-MAGIC-COOKIE-1 146c7f438fab79deb8a8a7df242b6f4b
>>>>>> chefli.uzh.ch/unix:0 MIT-MAGIC-COOKIE-1 146c7f438fab79deb8a8a7df242b6f4b
>>>>>>
>>>>>> In sshd_config on the workstation I have 'X11Forwarding yes'
>>>>>>
>>>>>> I have also done
>>>>>>
>>>>>> xhost + squid_0
>>>>>>
>>>>>> on the workstation.
>>>>>>
>>>>>>
>>>>>> How can I get the -xterm option running?
>>>>>>
>>>>>> Thank You
>>>>>> Jody
>>>>>>
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> us...@open-mpi.org
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
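
A minimal sketch of the "spinner application" Ralph suggests earlier in the thread: print something, then sleep, so that a window opened with -xterm stays up long enough to inspect. The function name spin and the file name spinner.sh are hypothetical; OMPI_COMM_WORLD_RANK and DISPLAY are the variables already visible in the printenv output and error messages quoted in this thread.

```shell
#!/bin/sh
# spinner.sh (hypothetical name): print some identifying output, then wait.

# spin SECONDS: report rank, host and DISPLAY, then sleep for SECONDS.
spin() {
    # OMPI_COMM_WORLD_RANK is set by mpirun for each process; fall back to
    # "unknown" when the script is run outside of mpirun.
    echo "rank=${OMPI_COMM_WORLD_RANK:-unknown} host=$(uname -n) DISPLAY=${DISPLAY:-unset}"
    sleep "$1"
}

# The number of seconds to wait is taken from the first argument. The default
# of 0 means "do not wait", so pass e.g. 30 when launching under mpirun.
spin "${1:-0}"
```

Saved as spinner.sh and made executable, it could stand in for hostname in the commands quoted in this thread, for example "mpirun -np 2 -host squid_0 -xterm 1 ./spinner.sh 30". This is untested here, and 30 is an arbitrary number of seconds.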