[OMPI users] openmpi 1.2.8 on Xgrid noob issue
Hi there,

I'm currently trying to set up a small Xgrid between two Mac Pros (a single quad-core and one with two dual-core CPUs), where both are directly connected via an ethernet cable. I've set up Xgrid using password authentication (rather than Kerberos), and from what I can tell in the Xgrid Admin tool it seems to be working. However, once I try a simple hello world program, I get this error:

chris-joness-mac-pro:~ chrisjones$ mpirun -np 4 ./test_hello
mpirun noticed that job rank 0 with PID 381 on node xgrid-node-0 exited on signal 15 (Terminated).
1 additional process aborted (not shown)
2011-08-04 10:02:16.329 mpirun[350:903] *** Terminating app due to uncaught exception 'NSInvalidArgumentException', reason: '*** -[NSKVONotifying_XGConnection<0x1001325a0> finalize]: called when collecting not enabled'
*** Call stack at first throw:
(
0   CoreFoundation       0x7fff814237b4 __exceptionPreprocess + 180
1   libobjc.A.dylib      0x7fff84fe8f03 objc_exception_throw + 45
2   CoreFoundation       0x7fff8143e631 -[NSObject(NSObject) finalize] + 129
3   mca_pls_xgrid.so     0x0001002a9ce3 -[PlsXGridClient dealloc] + 419
4   mca_pls_xgrid.so     0x0001002a9837 orte_pls_xgrid_finalize + 40
5   libopen-rte.0.dylib  0x00010002d0f9 orte_pls_base_close + 249
6   libopen-rte.0.dylib  0x000100012027 orte_system_finalize + 119
7   libopen-rte.0.dylib  0x0001e968 orte_finalize + 40
8   mpirun               0x000111ff orterun + 2042
9   mpirun               0x00010a03 main + 27
10  mpirun               0x000109e0 start + 52
11  ???                  0x0004 0x0 + 4
)
terminate called after throwing an instance of 'NSException'
[chris-joness-mac-pro:00350] *** Process received signal ***
[chris-joness-mac-pro:00350] Signal: Abort trap (6)
[chris-joness-mac-pro:00350] Signal code: (0)
[chris-joness-mac-pro:00350] [ 0] 2 libSystem.B.dylib 0x7fff81ca51ba _sigtramp + 26
[chris-joness-mac-pro:00350] [ 1] 3 ??? 0x0001000cd400 0x0 + 4295808000
[chris-joness-mac-pro:00350] [ 2] 4 libstdc++.6.dylib 0x7fff830965d2 __tcf_0 + 0
[chris-joness-mac-pro:00350] [ 3] 5 libobjc.A.dylib 0x7fff84fecb39 _objc_terminate + 100
[chris-joness-mac-pro:00350] [ 4] 6 libstdc++.6.dylib 0x7fff83094ae1 _ZN10__cxxabiv111__terminateEPFvvE + 11
[chris-joness-mac-pro:00350] [ 5] 7 libstdc++.6.dylib 0x7fff83094b16 _ZN10__cxxabiv112__unexpectedEPFvvE + 0
[chris-joness-mac-pro:00350] [ 6] 8 libstdc++.6.dylib 0x7fff83094bfc _ZL23__gxx_exception_cleanup19_Unwind_Reason_CodeP17_Unwind_Exception + 0
[chris-joness-mac-pro:00350] [ 7] 9 libobjc.A.dylib 0x7fff84fe8fa2 object_getIvar + 0
[chris-joness-mac-pro:00350] [ 8] 10 CoreFoundation 0x7fff8143e631 -[NSObject(NSObject) finalize] + 129
[chris-joness-mac-pro:00350] [ 9] 11 mca_pls_xgrid.so 0x0001002a9ce3 -[PlsXGridClient dealloc] + 419
[chris-joness-mac-pro:00350] [10] 12 mca_pls_xgrid.so 0x0001002a9837 orte_pls_xgrid_finalize + 40
[chris-joness-mac-pro:00350] [11] 13 libopen-rte.0.dylib 0x00010002d0f9 orte_pls_base_close + 249
[chris-joness-mac-pro:00350] [12] 14 libopen-rte.0.dylib 0x000100012027 orte_system_finalize + 119
[chris-joness-mac-pro:00350] [13] 15 libopen-rte.0.dylib 0x0001e968 orte_finalize + 40
[chris-joness-mac-pro:00350] [14] 16 mpirun 0x000111ff orterun + 2042
[chris-joness-mac-pro:00350] [15] 17 mpirun 0x00010a03 main + 27
[chris-joness-mac-pro:00350] [16] 18 mpirun 0x000109e0 start + 52
[chris-joness-mac-pro:00350] [17] 19 ??? 0x0004 0x0 + 4
[chris-joness-mac-pro:00350] *** End of error message ***
Abort trap

I've seen this error in a previous posting to this list, and it seems that the issue has something to do with forcing everything to use Kerberos (SSO).
However, I noticed that on the computer being used as an agent, this option is grayed out in the Xgrid sharing configuration (I have no idea why). I would therefore ask: is it absolutely necessary to use SSO to get Open MPI to run with Xgrid, or am I missing something in the password setup? The Kerberos option seems much more complicated, and I may even want to switch to just using Open MPI over SSH.

Many thanks,
Chris

Chris Jones
Post-doctoral Research Assistant, Department of Microbiology
Swedish University of Agricultural Sciences
Uppsala, Sweden
phone:
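For reference, a test_hello of the kind being run here is only a few lines of C (a sketch; the poster's actual source isn't shown in the thread):

    /* hello.c - minimal MPI hello world; compile with: mpicc hello.c -o test_hello */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);                /* start the MPI runtime */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total process count */
        printf("Hello world from process %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }

Note that the crash above happens inside mpirun's Xgrid launcher (mca_pls_xgrid.so, visible in both stack traces) during shutdown, so essentially any program launched through the Xgrid plugin would likely trigger it; the program itself is not the variable here.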
[OMPI users] Open MPI via SSH noob issue
> Two different NICs are on the same subnet -- that doesn't seem like a good
> idea...? I think this topic has come up on the users list before, and, IIRC,
> the general consensus is "don't do that" because it's not clear which
> NIC Linux will actually send outgoing traffic across bound for the
> 192.168.1.x subnet.
>
> On Aug 4, 2011, at 1:59 PM, Keith Manville wrote:
>
>> I am having trouble running my MPI program on multiple nodes. I can
>> run multiple processes on a single node, and I can spawn processes on
>> remote nodes, but when I call Send from a remote node, the call
>> never returns, even though there is an appropriate Recv waiting. I'm
>> pretty sure this is an issue with my configuration, not my code. I've
>> tried some other sample programs I found and had the same problem of
>> hanging on a send from one host to another.
>>
>> Here's an in-depth description:
>>
>> I wrote a quick test program where each process with rank > 0 sends an
>> int to the master (rank 0), and the master receives until it gets
>> something from every other process.
>>
>> My test program works fine when I run multiple processes on a single
>> machine, either the local node:
>>
>> $ ./mpirun -n 4 ./mpi-test
>> Hi I'm localhost:2
>> Hi I'm localhost:1
>> localhost:1 sending 11...
>> localhost:2 sending 12...
>> localhost:2 sent 12
>> localhost:1 sent 11
>> Hi I'm localhost:0
>> localhost:0 received 11 from 1
>> localhost:0 received 12 from 2
>> Hi I'm localhost:3
>> localhost:3 sending 13...
>> localhost:3 sent 13
>> localhost:0 received 13 from 3
>> all workers checked in!
>>
>> or a remote one:
>>
>> $ ./mpirun -np 2 -host remotehost ./mpi-test
>> Hi I'm remotehost:0
>> remotehost:0 received 11 from 1
>> all workers checked in!
>> Hi I'm remotehost:1
>> remotehost:1 sending 11...
>> remotehost:1 sent 11
>>
>> But when I try to run the master locally and the worker(s) remotely
>> (this is the way I am actually interested in running it), Send never
>> returns and it hangs indefinitely:
>>
>> $ ./mpirun -np 2 -host localhost,remotehost ./mpi-test
>> Hi I'm localhost:0
>> Hi I'm remotehost:1
>> remotehost:1 sending 11...
>>
>> Just to see if it would work, I tried spawning the master on the
>> remote host and the worker on the local one:
>>
>> $ ./mpirun -np 2 -host remotehost,localhost ./mpi-test
>> Hi I'm localhost:1
>> localhost:1 sending 11...
>> localhost:1 sent 11
>> Hi I'm remotehost:0
>> remotehost:0 received 0 from 1
>> all workers checked in!
>>
>> It doesn't hang on Send, but the wrong value is received.
>>
>> Any idea what's going on? I've attached my code, my config.log,
>> ifconfig output, and ompi_info output.
>>
>> Thanks,
>> Keith
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/

--
Message: 4
Date: Thu, 4 Aug 2011 20:48:30 -0400
From: Jeff Squyres
Subject: Re: [OMPI users] openmpi 1.2.8 on Xgrid noob issue
To: Open MPI Users

> I'm afraid our Xgrid support has lagged, and Apple hasn't shown much
> interest in MPI + Xgrid support -- much less HPC. :-\
>
> Have you seen the FAQ items about Xgrid?
>    http://www.open-mpi.org/faq/?category=osx#xgrid-howto
>
> On Aug 4, 2011, at 4:16 AM, Christopher Jones wrote:
>
>> Hi there,
>>
>> I'm currently trying to set up a small Xgrid between two Mac Pros (a single
>> quad-core and one with two dual-core CPUs), where both are directly
>> connected via an ethernet cable. I've set up Xgrid using password
>> authentication (rather than Kerberos), and from what I can tell in the
>> Xgrid Admin tool it seems to be working. However, once I try a simple hello
>> world program, I get this error:
>>
>> chris-joness-mac-pro:~ chrisjones$ mpirun -np 4 ./test_hello
>> mpirun noticed that job rank 0 with PID 381 on node xgrid-node-0 exited on
>> signal 15 (Terminated).
>> 1 additional process aborted (not shown)
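For concreteness, the test Keith describes (every rank above 0 sends one int to rank 0, which receives from all) corresponds roughly to the following C; this is a sketch consistent with the quoted output, not his actual attachment, which isn't reproduced in the archive:

    /* mpi-test.c - sketch of the check-in test described above.
       Compile with: mpicc mpi-test.c -o mpi-test */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, len, i, val;
        char host[MPI_MAX_PROCESSOR_NAME];
        MPI_Status st;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(host, &len);
        printf("Hi I'm %s:%d\n", host, rank);

        if (rank == 0) {
            /* master: collect one int from every other rank, in any order */
            for (i = 1; i < size; i++) {
                MPI_Recv(&val, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
                printf("%s:%d received %d from %d\n", host, rank, val, st.MPI_SOURCE);
            }
            printf("all workers checked in!\n");
        } else {
            val = 10 + rank;  /* matches the 11, 12, 13 in the output above */
            printf("%s:%d sending %d...\n", host, rank, val);
            MPI_Send(&val, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            printf("%s:%d sent %d\n", host, rank, val);
        }

        MPI_Finalize();
        return 0;
    }

That a program this small hangs only when sender and receiver sit on different hosts supports Keith's own suspicion that the problem is in the network configuration (e.g., the duplicate-subnet NICs) rather than in the MPI code.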
Re: [OMPI users] Open MPI via SSH noob issue
Hi,

Thanks for the quick response. I managed to compile 1.5.3 on both computers using gcc-4.2, with the proper flags set (this took a bit of playing with, but I did eventually get it to compile). Once that was done, I installed it to a different directory from 1.2.8 (/opt/local/openmpi/), specified the PATH and LD_LIBRARY_PATH for the new version on each computer, and then got the hello_world program to run again so it could call each process, like before. However, I'm still in the same place - ring_c freezes up. I tried changing the hostname in the host file (just for poops and giggles - I see the response stating it doesn't matter), but to no avail. I made sure the firewall is off on both computers. I'm hoping I'm not doing something overly dumb here, but I'm still a bit stuck... I see in the FAQ that there were some issues with Nehalem processors - I have two Xeons in one box and a Nehalem in the other. Could this make any difference?

Thanks again,
Chris

On Aug 9, 2011, at 6:50 PM, Jeff Squyres wrote:

> No, Open MPI doesn't use the names in the hostfile to figure out which
> TCP/IP addresses to use (for example). Each process ends up publishing a
> list of IP addresses at which it can be contacted, and OMPI does
> routability computations to figure out the "best" address at which to
> contact a given peer.
>
> If you're just starting with Open MPI, can you upgrade? 1.2.8 is pretty
> ancient. Open MPI 1.4.3 is the most recent stable release; 1.5.3 is our
> "feature" series, but it's also relatively stable (new releases are coming
> in both the 1.4.x and 1.5.x series soon, FWIW).
>
> On Aug 9, 2011, at 12:14 PM, David Warren wrote:
>
>> I don't know if this is it, but if you use the name localhost, won't
>> processes on both machines try to talk to 127.0.0.1? I believe you need to
>> use the real hostname in your host file. I think that your two tests work
>> because there is no interprocess communication, just stdout.
>>
>> On 08/08/11 23:46, Christopher Jones wrote:
>>
>>> Hi again,
>>>
>>> I changed the subject of my previous posting to reflect a new problem
>>> encountered when I changed my strategy to using SSH instead of Xgrid on
>>> two Mac Pros. I've set up password-less SSH communication between the two
>>> Macs (connected via direct ethernet, both running Open MPI 1.2.8 on OS X
>>> 10.6.8) per the instructions in the FAQ. I can type
>>> 'ssh computer-name.local' on either computer and connect without a
>>> password prompt. From what I can see, the ssh-agent is up and running -
>>> the following is listed in my ENV:
>>>
>>> SSH_AUTH_SOCK=/tmp/launch-5FoCc1/Listeners
>>> SSH_AGENT_PID=61058
>>>
>>> My host file simply lists 'localhost' and
>>> 'chrisjones2@allana-welshs-mac-pro.local'.
>>> When I run a simple hello_world test, I get what seems like a reasonable
>>> output:
>>>
>>> chris-joness-mac-pro:~ chrisjones$ mpirun -np 8 -hostfile hostfile ./test_hello
>>> Hello world from process 0 of 8
>>> Hello world from process 1 of 8
>>> Hello world from process 2 of 8
>>> Hello world from process 3 of 8
>>> Hello world from process 4 of 8
>>> Hello world from process 7 of 8
>>> Hello world from process 5 of 8
>>> Hello world from process 6 of 8
>>>
>>> I can also run hostname and get what seems to be an OK response (unless
>>> I'm wrong about this):
>>>
>>> chris-joness-mac-pro:~ chrisjones$ mpirun -np 8 -hostfile hostfile hostname
>>> allana-welshs-mac-pro.local
>>> allana-welshs-mac-pro.local
>>> allana-welshs-mac-pro.local
>>> allana-welshs-mac-pro.local
>>> quadcore.mikrob.slu.se
>>> quadcore.mikrob.slu.se
>>> quadcore.mikrob.slu.se
>>> quadcore.mikrob.slu.se
>>>
>>> However, when I run the ring_c test, it freezes:
>>>
>>> chris-joness-mac-pro:~ chrisjones$ mpirun -np 8 -hostfile hostfile ./ring_c
>>> Process 0 sending 10 to 1, tag 201 (8 processes in ring)
>>> Process 0 sent to 1
>>> Process 0 decremented value: 9
>>>
>>> (I noted that processors on both computers were active.)
>>>
>>> ring_c was compiled separately on each computer; however, both have the
>>> same version of Open MPI and OS X. I've gone through the FAQ and searched
>>> the user forum, but I can't quite seem to get this problem unstuck.
>>>
>>> Many thanks for your time,
>>> Chris
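For reference, the ring_c example that freezes here passes a counter around the ring: rank 0 seeds it with 10, each rank forwards it to its neighbour, and rank 0 decrements it once per lap until it reaches zero. A sketch of that logic, consistent with the output above (not the verbatim source that ships in Open MPI's examples/ directory):

    /* ring-sketch.c - the logic of the ring_c example (a sketch).
       Compile with: mpicc ring-sketch.c -o ring_c */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, next, prev, message, tag = 201;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        next = (rank + 1) % size;          /* neighbour we send to */
        prev = (rank + size - 1) % size;   /* neighbour we receive from */

        if (rank == 0) {
            message = 10;
            printf("Process 0 sending %d to %d, tag %d (%d processes in ring)\n",
                   message, next, tag, size);
            MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
            printf("Process 0 sent to %d\n", next);
        }

        /* pass the counter around; rank 0 decrements it once per lap */
        while (1) {
            MPI_Recv(&message, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (rank == 0) {
                message--;
                printf("Process 0 decremented value: %d\n", message);
            }
            MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
            if (message == 0)
                break;
        }

        /* rank 0 absorbs the final zero still travelling around the ring */
        if (rank == 0)
            MPI_Recv(&message, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        MPI_Finalize();
        return 0;
    }

With 8 ranks split across two machines, each lap crosses the machine boundary twice, and Open MPI's TCP transport opens connections between rank pairs lazily on first use - which is why a ring like this can get partway through before freezing on a connection that cannot be established.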
Re: [OMPI users] Open MPI via SSH noob issue
Hi again,

Ok - I see what I missed in the FAQ, sorry about that... my understanding of the shell is a bit minimal to say the least. I now have my .bashrc files configured as such on both computers:

export LD_LIBRARY_PATH=/opt/local/openmpi/lib:${LD_LIBRARY_PATH}
export PATH=/opt/local/openmpi/bin:${PATH}

However, I am now running into a new issue that is still cryptic to me:

quadcore:~ chrisjones$ /opt/local/openmpi/bin/mpirun -np 8 -hostfile hostfile ./ring_c
Process 0 sending 10 to 1, tag 201 (8 processes in ring)
[quadcore.mikrob.slu.se][[53435,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 127.0.0.1 failed: Connection refused (61)

This may be superfluous, but I can connect to the localhost (ssh localhost) with no password prompt... Is there an SSH port I need to change somewhere?

Again, thanks for your patience and help.
Chris

> Have you set up your shell startup files such that they point to the new
> OMPI installation (/opt/local/openmpi/) even for non-interactive logins?

>> Hi,
>>
>> Thanks for the quick response. I managed to compile 1.5.3 on both computers
>> using gcc-4.2, with the proper flags set (this took a bit of playing with,
>> but I did eventually get it to compile). Once that was done, I installed it
>> to a different directory from 1.2.8 (/opt/local/openmpi/), specified the
>> PATH and LD_LIBRARY_PATH for the new version on each computer, and then got
>> the hello_world program to run again so it could call each process, like
>> before. However, I'm still in the same place - ring_c freezes up. I tried
>> changing the hostname in the host file (just for poops and giggles - I see
>> the response stating it doesn't matter), but to no avail. I made sure the
>> firewall is off on both computers. I'm hoping I'm not doing something
>> overly dumb here, but I'm still a bit stuck... I see in the FAQ that there
>> were some issues with Nehalem processors - I have two Xeons in one box and
>> a Nehalem in the other. Could this make any difference?
>>
>> Thanks again,
>> Chris
>>
>> On Aug 9, 2011, at 6:50 PM, Jeff Squyres wrote:
>>
>>> No, Open MPI doesn't use the names in the hostfile to figure out which
>>> TCP/IP addresses to use (for example). Each process ends up publishing a
>>> list of IP addresses at which it can be contacted, and OMPI does
>>> routability computations to figure out the "best" address at which to
>>> contact a given peer.
>>>
>>> If you're just starting with Open MPI, can you upgrade? 1.2.8 is pretty
>>> ancient. Open MPI 1.4.3 is the most recent stable release; 1.5.3 is our
>>> "feature" series, but it's also relatively stable (new releases are
>>> coming in both the 1.4.x and 1.5.x series soon, FWIW).
>>>
>>> On Aug 9, 2011, at 12:14 PM, David Warren wrote:
>>>
>>>> I don't know if this is it, but if you use the name localhost, won't
>>>> processes on both machines try to talk to 127.0.0.1? I believe you need
>>>> to use the real hostname in your host file. I think that your two tests
>>>> work because there is no interprocess communication, just stdout.
>>>>
>>>> On 08/08/11 23:46, Christopher Jones wrote:
>>>>
>>>>> Hi again,
>>>>>
>>>>> I changed the subject of my previous posting to reflect a new problem
>>>>> encountered when I changed my strategy to using SSH instead of Xgrid on
>>>>> two Mac Pros. I've set up password-less SSH communication between the
>>>>> two Macs (connected via direct ethernet, both running Open MPI 1.2.8 on
>>>>> OS X 10.6.8) per the instructions in the FAQ. I can type
>>>>> 'ssh computer-name.local' on either computer and connect without a
>>>>> password prompt.
>>>>> From what I can see, the ssh-agent is up and running - the following is
>>>>> listed in my ENV:
>>>>>
>>>>> SSH_AUTH_SOCK=/tmp/launch-5FoCc1/Listeners
>>>>> SSH_AGENT_PID=61058
>>>>>
>>>>> My host file simply lists 'localhost' and
>>>>> 'chrisjones2@allana-welshs-mac-pro.local'. When I run a simple
>>>>> hello_world test, I get what seems like a reasonable output:
>>>>>
>>>>> chris-joness-mac-pro:~ chrisjones$ mpirun -np 8 -hostfile hostfile ./test_hello
>>>>> Hello world from process 0 of 8
>>>>> Hello world from process 1 of 8
>>>>> Hello world from process 2 of 8
>>>>> Hello world from process 3 of 8
>>>>> Hello world from process 4 of 8
>>>>> Hello world from process 7 of 8
>>>>> Hello world from process 5 of 8
>>>>> Hello world from process 6 of 8
>>>>>
>>>>> I can also run hostname and get what seems to be an OK response (unless
>>>>> I'm wrong about this):
>>>>>
>>>>> chris-joness-mac-pro:~ chrisjones$ mpirun -np 8 -hostfile hostfile hostname
>>>>> allana-welshs-mac-pro.local
>>>>> allana-welshs-mac-pro.local
>>>>> allana-welshs-mac-pro.local
>>>>> allana-welshs-mac-pro.local
>>>>> quadcore.mikrob.slu.se
>>>>> quadcore.mikrob.slu.se
>>>>> quadcore.mikrob.slu.se
>>>>> quadcore.mikrob.slu.se
>>>>>
>>>>> However, when I run the ring_c test, it freezes:
>>>>>
>>>>> chris-joness-mac-pro:~ chrisjones$ mpirun -np 8 -hostfile hostfile ./ring_c
>>>>> Process 0 sending 10 to 1, tag 201 (8 processes in ring)
>>>>> Process 0 sent to 1
>>>>> Process 0 decremented val
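The "connect() to 127.0.0.1 failed" error above fits David Warren's earlier point: with 'localhost' in the hostfile, the ranks on one machine can end up publishing the loopback address as a contact address, which ranks on the other machine cannot reach. A hostfile naming only the real hosts would look like the following (hostnames taken from this thread; the slot counts are illustrative):

    quadcore.mikrob.slu.se slots=4
    allana-welshs-mac-pro.local slots=4

If several NICs are configured, the TCP traffic can also be pinned to the direct link with the btl_tcp_if_include MCA parameter, along these lines (en0 is an assumption here; substitute whichever interface carries the direct ethernet connection):

    mpirun -np 8 -hostfile hostfile -mca btl_tcp_if_include en0 ./ring_c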