[OMPI users] openmpi 1.2.8 on Xgrid noob issue

2011-08-04 Thread Christopher Jones
Hi there,

I'm currently trying to set up a small Xgrid between two Mac Pros (one 
quad-core and one with two dual-core CPUs), directly connected via an ethernet 
cable. I've set up Xgrid using password authentication (rather than Kerberos), 
and from what I can tell in the Xgrid Admin tool it seems to be working. 
However, when I try a simple hello world program, I get this error:

chris-joness-mac-pro:~ chrisjones$ mpirun -np 4 ./test_hello
mpirun noticed that job rank 0 with PID 381 on node xgrid-node-0 exited on 
signal 15 (Terminated).
1 additional process aborted (not shown)
2011-08-04 10:02:16.329 mpirun[350:903] *** Terminating app due to uncaught 
exception 'NSInvalidArgumentException', reason: '*** 
-[NSKVONotifying_XGConnection<0x1001325a0> finalize]: called when collecting 
not enabled'
*** Call stack at first throw:
(
0   CoreFoundation      0x7fff814237b4 __exceptionPreprocess + 180
1   libobjc.A.dylib     0x7fff84fe8f03 objc_exception_throw + 45
2   CoreFoundation      0x7fff8143e631 -[NSObject(NSObject) finalize] + 129
3   mca_pls_xgrid.so    0x0001002a9ce3 -[PlsXGridClient dealloc] + 419
4   mca_pls_xgrid.so    0x0001002a9837 orte_pls_xgrid_finalize + 40
5   libopen-rte.0.dylib 0x00010002d0f9 orte_pls_base_close + 249
6   libopen-rte.0.dylib 0x000100012027 orte_system_finalize + 119
7   libopen-rte.0.dylib 0x0001e968 orte_finalize + 40
8   mpirun              0x000111ff orterun + 2042
9   mpirun              0x00010a03 main + 27
10  mpirun              0x000109e0 start + 52
11  ???                 0x0004 0x0 + 4
)
terminate called after throwing an instance of 'NSException'
[chris-joness-mac-pro:00350] *** Process received signal ***
[chris-joness-mac-pro:00350] Signal: Abort trap (6)
[chris-joness-mac-pro:00350] Signal code:  (0)
[chris-joness-mac-pro:00350] [ 0] 2   libSystem.B.dylib   0x7fff81ca51ba _sigtramp + 26
[chris-joness-mac-pro:00350] [ 1] 3   ???                 0x0001000cd400 0x0 + 4295808000
[chris-joness-mac-pro:00350] [ 2] 4   libstdc++.6.dylib   0x7fff830965d2 __tcf_0 + 0
[chris-joness-mac-pro:00350] [ 3] 5   libobjc.A.dylib     0x7fff84fecb39 _objc_terminate + 100
[chris-joness-mac-pro:00350] [ 4] 6   libstdc++.6.dylib   0x7fff83094ae1 _ZN10__cxxabiv111__terminateEPFvvE + 11
[chris-joness-mac-pro:00350] [ 5] 7   libstdc++.6.dylib   0x7fff83094b16 _ZN10__cxxabiv112__unexpectedEPFvvE + 0
[chris-joness-mac-pro:00350] [ 6] 8   libstdc++.6.dylib   0x7fff83094bfc _ZL23__gxx_exception_cleanup19_Unwind_Reason_CodeP17_Unwind_Exception + 0
[chris-joness-mac-pro:00350] [ 7] 9   libobjc.A.dylib     0x7fff84fe8fa2 object_getIvar + 0
[chris-joness-mac-pro:00350] [ 8] 10  CoreFoundation      0x7fff8143e631 -[NSObject(NSObject) finalize] + 129
[chris-joness-mac-pro:00350] [ 9] 11  mca_pls_xgrid.so    0x0001002a9ce3 -[PlsXGridClient dealloc] + 419
[chris-joness-mac-pro:00350] [10] 12  mca_pls_xgrid.so    0x0001002a9837 orte_pls_xgrid_finalize + 40
[chris-joness-mac-pro:00350] [11] 13  libopen-rte.0.dylib 0x00010002d0f9 orte_pls_base_close + 249
[chris-joness-mac-pro:00350] [12] 14  libopen-rte.0.dylib 0x000100012027 orte_system_finalize + 119
[chris-joness-mac-pro:00350] [13] 15  libopen-rte.0.dylib 0x0001e968 orte_finalize + 40
[chris-joness-mac-pro:00350] [14] 16  mpirun              0x000111ff orterun + 2042
[chris-joness-mac-pro:00350] [15] 17  mpirun              0x00010a03 main + 27
[chris-joness-mac-pro:00350] [16] 18  mpirun              0x000109e0 start + 52
[chris-joness-mac-pro:00350] [17] 19  ???                 0x0004 0x0 + 4
[chris-joness-mac-pro:00350] *** End of error message ***
Abort trap


I've seen this error in a previous posting, and the issue seems to have 
something to do with forcing everything to use Kerberos (SSO). However, I 
noticed that on the computer being used as an agent, this option is grayed out 
in the Xgrid sharing configuration (I have no idea why). Is it absolutely 
necessary to use SSO to get Open MPI to run with Xgrid, or am I missing 
something in the password setup? The Kerberos option seems much more 
complicated, and I may want to switch to just using Open MPI with ssh.
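For what it's worth, running over plain ssh does not require Xgrid at all: in the Open MPI 1.2 series the process starter can be selected with the pls MCA parameter. A hedged sketch (the hostnames are placeholders, not from this setup):

```shell
# Force the rsh/ssh process starter instead of the Xgrid one
# ("pls" is the name of the launcher framework in Open MPI 1.2.x).
mpirun --mca pls rsh -np 4 -host mac1.local,mac2.local ./test_hello
```

This sidesteps the Xgrid agent entirely, so it also works when SSO/Kerberos is not configured.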

Many thanks,
Chris


Chris Jones
Post-doctoral Research Assistant,

Department of Microbiology
Swedish University of Agricultural Sciences
Uppsala, Sweden
phone:

[OMPI users] Open MPI via SSH noob issue

2011-08-09 Thread Christopher Jones
> Two different NICs are on the same subnet -- that doesn't seem like a good 
> idea...?  I think this topic has come up on the users list before, and, IIRC, 
> the general consensus is "don't do that" because it's not clear as to which 
> NIC Linux will actually send outgoing traffic across bound for the 
> 192.168.1.x subnet.
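A common way to remove that ambiguity is to pin Open MPI's TCP transport to a single interface. A sketch (the interface name is an assumption; check ifconfig for the real one):

```shell
# Restrict the TCP BTL to one NIC so it is unambiguous which interface
# carries MPI traffic on the shared 192.168.1.x subnet.
mpirun --mca btl_tcp_if_include eth0 -np 2 -host localhost,remotehost ./mpi-test
```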
>
>
>
> On Aug 4, 2011, at 1:59 PM, Keith Manville wrote:
>
>> I am having trouble running my MPI program on multiple nodes. I can
>> run multiple processes on a single node, and I can spawn processes on
>> remote nodes, but when I call Send from a remote node, the node
>> never returns, even though there is an appropriate Recv waiting. I'm
>> pretty sure this is an issue with my configuration, not my code. I've
>> tried some other sample programs I found and had the same problem of
>> hanging on a send from one host to another.
>>
>> Here's an in depth description:
>>
>> I wrote a quick test program where each process with rank > 0 sends an
>> int to the master (rank 0), and the master receives until it gets
>> something from every other process.
>>
>> My test program works fine when I run multiple processes on a single machine.
>>
>> either the local node:
>>
>> $ ./mpirun -n 4 ./mpi-test
>> Hi I'm localhost:2
>> Hi I'm localhost:1
>> localhost:1 sending 11...
>> localhost:2 sending 12...
>> localhost:2 sent 12
>> localhost:1 sent 11
>> Hi I'm localhost:0
>> localhost:0 received 11 from 1
>> localhost:0 received 12 from 2
>> Hi I'm localhost:3
>> localhost:3 sending 13...
>> localhost:3 sent 13
>> localhost:0 received 13 from 3
>> all workers checked in!
>>
>> or a remote one:
>>
>> $ ./mpirun -np 2 -host remotehost ./mpi-test
>> Hi I'm remotehost:0
>> remotehost:0 received 11 from 1
>> all workers checked in!
>> Hi I'm remotehost:1
>> remotehost:1 sending 11...
>> remotehost:1 sent 11
>>
>> But when I try to run the master locally and the worker(s) remotely
>> (this is the way I am actually interested in running it), Send never
>> returns and it hangs indefinitely.
>>
>> $ ./mpirun -np 2 -host localhost,remotehost ./mpi-test
>> Hi I'm localhost:0
>> Hi I'm remotehost:1
>> remotehost:1 sending 11...
>>
>> Just to see if it would work, I tried spawning the master on the
>> remotehost and the worker on the localhost.
>>
>> $ ./mpirun -np 2 -host remotehost,localhost ./mpi-test
>> Hi I'm localhost:1
>> localhost:1 sending 11...
>> localhost:1 sent 11
>> Hi I'm remotehost:0
>> remotehost:0 received 0 from 1
>> all workers checked in!
>>
>> It doesn't hang on Send, but the wrong value is received.
>>
>> Any idea what's going on? I've attached my code, my config.log,
>> ifconfig output, and ompi_info output.
>>
>> Thanks,
>> Keith
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
>
> --
>
> Message: 4
> Date: Thu, 4 Aug 2011 20:48:30 -0400
> From: Jeff Squyres 
> Subject: Re: [OMPI users] openmpi 1.2.8 on Xgrid noob issue
> To: Open MPI Users 
> Message-ID: 
> Content-Type: text/plain; charset=us-ascii
>
> I'm afraid our Xgrid support has lagged, and Apple hasn't shown much interest 
> in MPI + Xgrid support -- much less HPC.  :-\
>
> Have you seen the FAQ items about Xgrid?
>
>http://www.open-mpi.org/faq/?category=osx#xgrid-howto
>
>

Re: [OMPI users] Open MPI via SSH noob issue

2011-08-10 Thread Christopher Jones
Hi,

Thanks for the quick response. I managed to compile 1.5.3 on both computers 
using gcc-4.2, with the proper flags set (this took a bit of playing with, but 
I did eventually get it to compile). Once that was done, I installed it to a 
different directory from 1.2.8 (/opt/local/openmpi/), specified the PATH and 
LD_LIBRARY_PATH for the new version on each computer, and then got the 
hello_world program to run again so it could call each process, like before. 
However, I'm still in the same place - ring_c freezes up. I tried changing the 
hostname in the host file (just for poops and giggles - I see the response 
stating it doesn't matter), but to no avail. I made sure the firewall is off on 
both computers.

I'm hoping I'm not doing something overly dumb here, but I'm still a bit 
stuck... I see in the FAQ that there were some issues with Nehalem processors - 
I have two Xeons in one box and a Nehalem in the other. Could this make any 
difference?
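One way to narrow a hang like this down (a sketch; the verbosity knob is a standard Open MPI MCA parameter, not something from this thread) is to have the TCP transport log what it is doing while ring_c is stuck:

```shell
# Ask the TCP BTL layer to report which IP addresses it considers and
# which connections it attempts; the log usually shows whether a peer
# advertised only 127.0.0.1.
/opt/local/openmpi/bin/mpirun --mca btl_base_verbose 30 \
    -np 8 -hostfile hostfile ./ring_c
```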

Thanks again,
Chris

On Aug 9, 2011, at 6:50 PM, Jeff Squyres wrote:

> No, Open MPI doesn't use the names in the hostfile to figure out which TCP/IP 
> addresses to use (for example).  Each process ends up publishing a list of IP 
> addresses at which it can be connected, and OMPI does routability 
> computations to figure out which is the "best" address to contact a given 
> peer on.
>
> If you're just starting with Open MPI, can you upgrade?  1.2.8 is pretty 
> ancient.  Open MPI 1.4.3 is the most recent stable release; 1.5.3 is our 
> "feature" series, but it's also relatively stable (new releases are coming in 
> both the 1.4.x and 1.5.x series soon, FWIW).
>
>
> On Aug 9, 2011, at 12:14 PM, David Warren wrote:
>
>> I don't know if this is it, but if you use the name localhost, won't 
>> processes on both machines try to talk to 127.0.0.1? I believe you need to 
>> use the real hostname in your host file. I think that your two tests work 
>> because there is no interprocess communication, just stdout.
>>
>> On 08/08/11 23:46, Christopher Jones wrote:
>>> Hi again,
>>>
>>> I changed the subject of my previous posting to reflect a new problem 
>>> encountered when I changed my strategy to using SSH instead of Xgrid on two 
>>> mac pros. I've set up a login-less ssh communication between the two macs 
>>> (connected via direct ethernet, both running openmpi 1.2.8 on OSX 10.6.8) 
>>> per the instructions on the FAQ. I can type in 'ssh computer-name.local' on 
>>> either computer and connect without a password prompt. From what I can see, 
>>> the ssh-agent is up and running - the following is listed in my ENV:
>>>
>>> SSH_AUTH_SOCK=/tmp/launch-5FoCc1/Listeners
>>> SSH_AGENT_PID=61058
>>>
>>> My host file simply lists 'localhost' and 
>>> 'chrisjones2@allana-welshs-mac-pro.local'. When I run a simple hello_world 
>>> test, I get what seems like a reasonable output:
>>>
>>> chris-joness-mac-pro:~ chrisjones$ mpirun -np 8 -hostfile hostfile 
>>> ./test_hello
>>> Hello world from process 0 of 8
>>> Hello world from process 1 of 8
>>> Hello world from process 2 of 8
>>> Hello world from process 3 of 8
>>> Hello world from process 4 of 8
>>> Hello world from process 7 of 8
>>> Hello world from process 5 of 8
>>> Hello world from process 6 of 8
>>>
>>> I can also run hostname and get what seems to be an ok response (unless I'm 
>>> wrong about this):
>>>
>>> chris-joness-mac-pro:~ chrisjones$ mpirun -np 8 -hostfile hostfile hostname
>>> allana-welshs-mac-pro.local
>>> allana-welshs-mac-pro.local
>>> allana-welshs-mac-pro.local
>>> allana-welshs-mac-pro.local
>>> quadcore.mikrob.slu.se
>>> quadcore.mikrob.slu.se
>>> quadcore.mikrob.slu.se
>>> quadcore.mikrob.slu.se
>>>
>>>
>>> However, when I run the ring_c test, it freezes:
>>>
>>> chris-joness-mac-pro:~ chrisjones$ mpirun -np 8 -hostfile hostfile ./ring_c
>>> Process 0 sending 10 to 1, tag 201 (8 processes in ring)
>>> Process 0 sent to 1
>>> Process 0 decremented value: 9
>>>
>>> (I noted that processors on both computers are active).
>>>
>>> ring_c was compiled separately on each computer, however both have the same 
>>> version of openmpi and OSX. I've gone through the FAQ and searched the user 
>>> forum, but I can't quite seem to get this problem unstuck.
>>>
>>> Many thanks for your time,
>>> Chris
>

Re: [OMPI users] Open MPI via SSH noob issue

2011-08-11 Thread Christopher Jones
Hi again,

Ok - I see what I missed in the FAQ, sorry about that... my understanding of 
the shell is a bit minimal, to say the least. I now have my .bashrc files 
configured like this on both computers:

export PATH=/opt/local/openmpi/bin:${PATH}
export LD_LIBRARY_PATH=/opt/local/openmpi/lib:${LD_LIBRARY_PATH}
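As a sanity check, the intended form of those exports can be verified locally (a minimal sketch; note the `${VAR}` spelling, and that LD_LIBRARY_PATH should extend itself rather than PATH):

```shell
# Prepend the 1.5.3 installation to both search paths.
export PATH=/opt/local/openmpi/bin:${PATH}
export LD_LIBRARY_PATH=/opt/local/openmpi/lib:${LD_LIBRARY_PATH}

# The first PATH entry should now be the new bin directory.
echo "$PATH" | cut -d: -f1
```

After putting the exports in .bashrc on both machines, something like `ssh allana-welshs-mac-pro.local 'which mpirun'` can confirm that non-interactive logins also pick up /opt/local/openmpi/bin.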

However, I am now running into a new issue that is still cryptic to me:

quadcore:~ chrisjones$ /opt/local/openmpi/bin/mpirun -np 8 -hostfile hostfile ./ring_c
Process 0 sending 10 to 1, tag 201 (8 processes in ring)
[quadcore.mikrob.slu.se][[53435,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 127.0.0.1 failed: Connection refused (61)

This may be superfluous, but I can connect to localhost (ssh localhost) with 
no password prompt... Is there an ssh port I need to change somewhere?
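Given the connect() to 127.0.0.1 failure, one thing worth trying (a sketch using the machine names from earlier in this thread; the slots counts are assumptions matching the core counts) is a hostfile that names both machines explicitly instead of using 'localhost':

```shell
# Hostfile listing both machines by their real names, so neither side
# ends up publishing the loopback address 127.0.0.1 to its peer.
cat > hostfile <<'EOF'
quadcore.mikrob.slu.se slots=4
allana-welshs-mac-pro.local slots=4
EOF
cat hostfile
```

If loopback still shows up, the TCP BTL can also be pinned to the wired interface with something like `--mca btl_tcp_if_include en0` (en0 is an assumption; check ifconfig for the actual interface name on each Mac).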

Again, thanks for your patience and help.

Chris


Have you set up your shell startup files such that they point to the new OMPI 
installation (/opt/local/openmpi/) even for non-interactive logins?

