Hi Ralph

Thank you for your suggestions.
I'll be happy to help you.
I'm not sure if I'll get around to this tomorrow,
but I certainly will do so on Monday.

Thanks
  Jody

On Thu, Apr 28, 2011 at 11:53 PM, Ralph Castain <r...@open-mpi.org> wrote:
> Hi Jody
>
> I'm not sure when I'll get a chance to work on this - got a deadline to meet. 
> I do have a couple of suggestions, if you wouldn't mind helping debug the 
> problem?
>
> It looks to me like the problem is that mpirun is crashing or terminating 
> early for some reason - hence the failures to send msgs to it, and the 
> "lifeline lost" error that leads to the termination of the daemon. If you 
> build a debug version of the code (i.e., --enable-debug on configure), you 
> can get a lot of debug info that traces the behavior.
>
> If you could then run your program with
>
>  -mca plm_base_verbose 5 -mca odls_base_verbose 5 --leave-session-attached
>
> and send it to me, we'll see what ORTE thinks it is doing.
>
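One way to capture that output, as a rough sketch: a small POSIX shell wrapper that adds the three options quoted above and writes stdout and stderr to a log file. The function name run_debug, the MPIRUN and OMPI_DEBUG_LOG variables and the default log name are made up for this example; only the mpirun options come from the message. It assumes Open MPI has already been rebuilt with --enable-debug.

```shell
# Hypothetical wrapper: run_debug, MPIRUN and OMPI_DEBUG_LOG are illustrative
# names, not part of Open MPI. Only the three mpirun options come from the thread.
run_debug() {
    # Log file that receives the combined stdout/stderr of the run.
    log="${OMPI_DEBUG_LOG:-ompi-debug.log}"
    # MPIRUN may point at a specific mpirun binary (e.g. the debug build).
    "${MPIRUN:-mpirun}" \
        -mca plm_base_verbose 5 -mca odls_base_verbose 5 \
        --leave-session-attached "$@" > "$log" 2>&1
}
```

For example, "run_debug -np 4 -host squid_0 -xterm 1 hostname" should leave the trace in ompi-debug.log, ready to attach to a reply.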
> You could also take a look at the code for implementing the xterm option. 
> You'll find it in
>
> orte/mca/odls/base/odls_base_default_fns.c
>
> around line 1115. The xterm command syntax is defined in
>
> orte/mca/odls/base/odls_base_open.c
>
> around line 233 and following. Note that we use "xterm -T" as the cmd. 
> Perhaps you can spot an error in the way we treat xterm?
>
> Also, remember that you have to specify that you want us to "hold" the xterm 
> window open even after the process terminates. If you don't specify it, the 
> window automatically closes upon completion of the process. So a fast-running 
> cmd like "hostname" might disappear so quickly that it causes a race 
> condition problem.
>
> You might want to try a spinner application - i.e., output something and 
> then sit in a loop or sleep for some period of time. Or, use the "hold" 
> option to keep the window open - you designate "hold" by putting a '!' before 
> the rank, e.g., "mpirun -np 2 -xterm \!2 hostname"
>
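A spinner of the kind described above could look roughly like this. It is only a sketch, assuming a POSIX shell on the remote host; the script name spinner.sh and the function name spin are invented for the example. OMPI_COMM_WORLD_RANK is the variable that shows up in the printenv output elsewhere in this thread.

```shell
#!/bin/sh
# spinner.sh: hypothetical test program, not part of Open MPI.
# Prints a line identifying the process, then stays alive for a while
# so that the xterm window does not vanish immediately.

spin() {
    # $1 = number of seconds to stay alive (defaults to 30)
    secs="${1:-30}"
    # OMPI_COMM_WORLD_RANK is only set when launched through mpirun;
    # "unknown" is printed when the script is run by hand.
    echo "host=$(uname -n) rank=${OMPI_COMM_WORLD_RANK:-unknown} display=${DISPLAY:-unset}"
    sleep "$secs"
}

# Only spin when a duration is given on the command line,
# e.g. ./spinner.sh 30
if [ "$#" -gt 0 ]; then
    spin "$1"
fi
```

It would then replace "hostname" in the failing command, for instance "mpirun -np 4 -host squid_0 -xterm 1 ./spinner.sh 30", assuming the script is executable and reachable on squid_0.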
>
> On Apr 28, 2011, at 8:38 AM, jody wrote:
>
>> Hi
>>
>> Unfortunately this does not solve my problem.
>> While i can do
>>  ssh -Y squid_0 xterm
>> and this will open an xterm on my machine (chefli),
>> i run into problems with the -xterm option of openmpi:
>>
>>  jody@chefli ~/share/neander $ mpirun -np 4  -mca plm_rsh_agent "ssh -Y" -host squid_0 --xterm 1 hostname
>>  squid_0
>>  [squid_0:28046] [[35219,0],1]->[[35219,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 8]
>>  [squid_0:28046] [[35219,0],1] routed:binomial: Connection to lifeline [[35219,0],0] lost
>>  [squid_0:28046] [[35219,0],1]->[[35219,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 8]
>>  [squid_0:28046] [[35219,0],1] routed:binomial: Connection to lifeline [[35219,0],0] lost
>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>
>> By the way when i look at the DISPLAY variable in the xterm window
>> opened via squid_0,
>> i also have the display variable "localhost:11.0"
>>
>> Actually, the difference with using the "-mca plm_rsh_agent" is that
>> the lines with the warnings about "xauth" and "untrusted X" do not
>> appear:
>>
>>  jody@chefli ~/share/neander $ mpirun -np 4   -host squid_0 -xterm 1 hostname
>>  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>  squid_0
>>  [squid_0:28337] [[34926,0],1]->[[34926,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 8]
>>  [squid_0:28337] [[34926,0],1] routed:binomial: Connection to lifeline [[34926,0],0] lost
>>  [squid_0:28337] [[34926,0],1]->[[34926,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 8]
>>  [squid_0:28337] [[34926,0],1] routed:binomial: Connection to lifeline [[34926,0],0] lost
>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>
>>
>> I have doubts that the "-Y" is passed correctly:
>>   jody@triops ~/share/neander $ mpirun -np   -mca plm_rsh_agent "ssh -Y" -host squid_0 xterm
>>  xterm Xt error: Can't open display:
>>  xterm:  DISPLAY is not set
>>  xterm Xt error: Can't open display:
>>  xterm:  DISPLAY is not set
>>
>>
>> ---> as a matter of fact i noticed that the xterm option doesn't work 
>> locally:
>>  mpirun -np 4    -xterm 1 /usr/bin/printenv
>> prints everything onto the console.
>>
>> Do you have any other suggestions i could try?
>>
>> Thank You
>> Jody
>>
>> On Thu, Apr 28, 2011 at 3:06 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>> Should be able to just set
>>>
>>> -mca plm_rsh_agent "ssh -Y"
>>>
>>> on your cmd line, I believe
>>>
>>> On Apr 28, 2011, at 12:53 AM, jody wrote:
>>>
>>>> Hi Ralph
>>>>
>>>> Is there an easy way i could modify the OpenMPI code so that it would use
>>>> the -Y option for ssh when connecting to remote machines?
>>>>
>>>> Thank You
>>>>   Jody
>>>>
>>>> On Thu, Apr 7, 2011 at 4:01 PM, jody <jody....@gmail.com> wrote:
>>>>> Hi Ralph
>>>>> thank you for your suggestions. After some fiddling, i found that after my
>>>>> last update (gentoo) my sshd_config had been overwritten
>>>>> (X11Forwarding was set to 'no').
>>>>>
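Since a silently overwritten sshd_config turned out to be part of the problem, a quick check like the following may help after future updates. This is a sketch only: the function name x11_forwarding_setting is invented, the default path /etc/ssh/sshd_config is an assumption about the installation, and the parsing is deliberately naive (it reports the first uncommented "X11Forwarding value" line and ignores Match blocks and included files).

```shell
# Hypothetical helper, not part of OpenSSH or Open MPI.
# Prints the first uncommented X11Forwarding value found in an sshd_config
# file, lower-cased, or "unset" if the keyword does not appear.
x11_forwarding_setting() {
    # $1 = path to the config file (default path is an assumption)
    awk 'tolower($1) == "x11forwarding" { print tolower($2); found = 1; exit }
         END { if (!found) print "unset" }' "${1:-/etc/ssh/sshd_config}"
}
```

Running "x11_forwarding_setting" on the machine that runs sshd should print "yes" when forwarding is enabled; "no" or "unset" would match the situation described above.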
>>>>> After correcting that, i can now open remote terminals with 'ssh -Y'
>>>>> and with 'ssh -X'
>>>>> (but with '-X' i still get those xauth warnings)
>>>>>
>>>>> But the xterm option still doesn't work:
>>>>>  jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -xterm 1,2 printenv | grep WORLD_RANK
>>>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>>>  OMPI_COMM_WORLD_RANK=0
>>>>>  [aim-squid_0:09856] [[54132,0],1]->[[54132,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 8]
>>>>>  [aim-squid_0:09856] [[54132,0],1] routed:binomial: Connection to lifeline [[54132,0],0] lost
>>>>>
>>>>> So it looks like the two processes from squid_0 can't open the display
>>>>> this way, but one of them writes the output to the console...
>>>>> Surprisingly, they are trying 'localhost:11.0' whereas when i use
>>>>> 'ssh -Y' the DISPLAY variable is set to 'localhost:10.0'
>>>>>
>>>>> So in what way would OMPI have to be adapted, so -xterm would work?
>>>>>
>>>>> Thank You
>>>>>  Jody
>>>>>
>>>>> On Wed, Apr 6, 2011 at 8:32 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>> Here's a little more info - it's for Cygwin, but I don't see anything
>>>>>> Cygwin-specific in the answers:
>>>>>> http://x.cygwin.com/docs/faq/cygwin-x-faq.html#q-ssh-no-x11forwarding
>>>>>>
>>>>>> On Apr 6, 2011, at 12:30 PM, Ralph Castain wrote:
>>>>>>
>>>>>> Sorry Jody - I should have read your note more carefully to see that you
>>>>>> already tried -Y. :-(
>>>>>> Not sure what to suggest...
>>>>>>
>>>>>> On Apr 6, 2011, at 12:29 PM, Ralph Castain wrote:
>>>>>>
>>>>>> Like I said, I'm no expert. However, a quick "google" revealed this
>>>>>> result:
>>>>>>
>>>>>> When trying to set up x11 forwarding over an ssh session to a remote
>>>>>> server with the -X switch, I was getting an error like Warning: No xauth
>>>>>> data; using fake authentication data for X11 forwarding.
>>>>>>
>>>>>> When doing something like:
>>>>>> ssh -Xl root 10.1.1.9 to a remote server, the authentication worked, but
>>>>>> I got an error message like:
>>>>>>
>>>>>>
>>>>>> jason@badman ~/bin $ ssh -Xl root 10.1.1.9
>>>>>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>> Last login: Wed Apr 14 18:18:39 2010 from 10.1.1.5
>>>>>> [root@RHEL ~]#
>>>>>> and any X programs I ran would not display on my local system.
>>>>>>
>>>>>> Turns out the solution is to use the -Y switch instead.
>>>>>>
>>>>>> ssh -Yl root 10.1.1.9
>>>>>>
>>>>>> and that worked fine.
>>>>>>
>>>>>> See if that works for you - if it does, we may have to modify OMPI to
>>>>>> accommodate.
>>>>>>
>>>>>> On Apr 6, 2011, at 9:19 AM, jody wrote:
>>>>>>
>>>>>> Hi Ralph
>>>>>> No, after the above error message mpirun has exited.
>>>>>>
>>>>>> But i also noticed that it is not possible to ssh into squid_0 and open an xterm there:
>>>>>>
>>>>>>  jody@chefli ~/share/neander $ ssh -Y squid_0
>>>>>>  Last login: Wed Apr  6 17:14:02 CEST 2011 from chefli.uzh.ch on pts/0
>>>>>>  jody@squid_0 ~ $ xterm
>>>>>>  xterm Xt error: Can't open display:
>>>>>>  xterm:  DISPLAY is not set
>>>>>>  jody@squid_0 ~ $ export DISPLAY=130.60.126.74:0.0
>>>>>>  jody@squid_0 ~ $ xterm
>>>>>>  xterm Xt error: Can't open display: 130.60.126.74:0.0
>>>>>>  jody@squid_0 ~ $ export DISPLAY=chefli.uzh.ch:0.0
>>>>>>  jody@squid_0 ~ $ xterm
>>>>>>  xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>  jody@squid_0 ~ $ exit
>>>>>>  logout
>>>>>>
>>>>>> same thing with ssh -X, but here i get the same warning/error message
>>>>>> as with mpirun:
>>>>>>
>>>>>>  jody@chefli ~/share/neander $ ssh -X squid_0
>>>>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>  Last login: Wed Apr  6 17:12:31 CEST 2011 from chefli.uzh.ch on ssh
>>>>>>
>>>>>> So perhaps the whole problem is linked to that xauth-thing.
>>>>>> Do you have a suggestion how this can be solved?
>>>>>>
>>>>>> Thank You
>>>>>>  Jody
>>>>>>
>>>>>> On Wed, Apr 6, 2011 at 4:41 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>
>>>>>> If I read your error messages correctly, it looks like mpirun is
>>>>>> crashing - the daemon is complaining that it lost the socket connection
>>>>>> back to mpirun, and hence will abort.
>>>>>>
>>>>>> Are you seeing mpirun still alive?
>>>>>>
>>>>>>
>>>>>> On Apr 5, 2011, at 4:46 AM, jody wrote:
>>>>>>
>>>>>> Hi
>>>>>>
>>>>>> On my workstation and the cluster i set up OpenMPI (v 1.4.2) so that
>>>>>> it works in "text-mode":
>>>>>>
>>>>>>  $ mpirun -np 4  -x DISPLAY -host squid_0   printenv | grep WORLD_RANK
>>>>>>  OMPI_COMM_WORLD_RANK=0
>>>>>>  OMPI_COMM_WORLD_RANK=1
>>>>>>  OMPI_COMM_WORLD_RANK=2
>>>>>>  OMPI_COMM_WORLD_RANK=3
>>>>>>
>>>>>> but when i use the -xterm option to mpirun, it doesn't work:
>>>>>>
>>>>>>  $ mpirun -np 4  -x DISPLAY -host squid_0 -xterm 1,2  printenv | grep WORLD_RANK
>>>>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>  OMPI_COMM_WORLD_RANK=0
>>>>>>  [squid_0:05266] [[55607,0],1]->[[55607,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 8]
>>>>>>  [squid_0:05266] [[55607,0],1] routed:binomial: Connection to lifeline [[55607,0],0] lost
>>>>>>  /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>  /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>
>>>>>> (strange: somebody wrote his message to the console)
>>>>>>
>>>>>> No matter whether i set the DISPLAY variable to the full hostname of
>>>>>> the workstation, to the IP address of the workstation or simply to
>>>>>> ":0.0", it doesn't work.
>>>>>>
>>>>>> But i do have xauth data (as far as i know):
>>>>>>
>>>>>> On the remote (squid_0):
>>>>>>
>>>>>>  jody@squid_0 ~ $ xauth list
>>>>>>  chefli/unix:10  MIT-MAGIC-COOKIE-1  5293e179bc7b2036d87cbcdf14891d0c
>>>>>>  chefli/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>  chefli.uzh.ch:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>
>>>>>> on the workstation:
>>>>>>
>>>>>>  $ xauth list
>>>>>>  chefli/unix:10  MIT-MAGIC-COOKIE-1  5293e179bc7b2036d87cbcdf14891d0c
>>>>>>  chefli/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>  localhost.localdomain/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>  chefli.uzh.ch/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>
>>>>>> In sshd_config on the workstation i have 'X11Forwarding yes'
>>>>>>
>>>>>> I have also done
>>>>>>
>>>>>>   xhost + squid_0
>>>>>>
>>>>>> on the workstation.
>>>>>>
>>>>>> How can i get the -xterm option running?
>>>>>>
>>>>>> Thank You
>>>>>>  Jody
>>>>>>
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> us...@open-mpi.org
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>
>
>
