Re: [OMPI users] Running application with MPI_Comm_spawn() in multithreaded environment
Hi Roberto

My time is somewhat limited, so I couldn't review the code in detail. However, I think I got the gist of it. A few observations:

1. The code is rather inefficient if all you want to do is spawn a pattern of slave processes based on a file. Unless there is some overriding reason for doing this one comm_spawn at a time, it would be far faster to issue a single comm_spawn and just provide the hostfile to us. You could use either the seq or rank_file mapper - both would take the file and provide the outcome you seek. The only difference would be that the child procs would all be in the same comm_world - don't know if that is an issue or not.

2. OMPI definitely cannot handle the threaded version of this code at this time - not sure when we will get to it.

3. If you serialize the code, we -should- be able to handle it. However, I'm not entirely sure your current method actually does that. It looks like you call comm_spawn, and then create a new thread which then calls comm_spawn. I'm afraid I can't quite figure out how the thread locking would occur to prevent multiple threads continuing to call comm_spawn - you might want to check it again and ensure it is correct. Frankly, I'm not entirely sure what the thread creation is gaining you - as I said, we can only call comm_spawn serially, so having multiple threads would seem to be unnecessary...unless this code is incomplete and you need the threads for some other purpose.

Again, you might look at that loop_spawn code I mentioned before to see a working example. Alternatively, if your code works under HP MPI, you might want to stick with it for now until we get the threading support up to your required level.

Hope that helps
Ralph

On Oct 3, 2008, at 10:36 AM, Roberto Fichera wrote:

Ralph Castain wrote: Interesting. I ran a loop calling comm_spawn 1000 times without a problem. I suspect it is the threading that is causing the trouble here.

I think so! My guess is that at a low level there is some trouble when handling *concurrent* orted spawning. Maybe

You are welcome to send me the code. You can find my loop code in your code distribution under orte/test/mpi - look for loop_spawn and loop_child.

In the attached code the spawning logic is currently under a loop in the main of the testmaster, so it's completely unthreaded, at least until the MPI_Comm_spawn() terminates its work. If you would like to test multithreaded spawning you can comment out the NodeThread_spawnSlave() in the main loop and uncomment the same function in the NodeThread_threadMain(). Finally, if you want multithreaded spawning but serialized against a mutex, then uncomment the pthread_mutex_lock/unlock() in the NodeThread_threadMain().

This code runs *without* any trouble in the HP MPI implementation. It does not work so well in the mpich2 trunk version due to two problems: the ~24.4K context id limit and/or a race in poll() while waiting for termination in MPI_Comm_disconnect() concurrently with an MPI_Comm_spawn().

Ralph On Oct 3, 2008, at 9:11 AM, Roberto Fichera wrote: Ralph Castain wrote: On Oct 3, 2008, at 7:14 AM, Roberto Fichera wrote: Ralph Castain wrote: I committed something to the trunk yesterday. Given the complexity of the fix, I don't plan to bring it over to the 1.3 branch until sometime mid-to-end next week so it can be adequately tested.

Ok! So it means that I can check out from the SVN/trunk to get your fix, right?

Yes, though note that I don't claim it is fully correct yet. Still needs testing. However, I have tested it a fair amount and it seems okay.
If you do test it, please let me know how it goes.

I ran my test on the svn/trunk version below:

Open MPI: 1.4a1r19677
Open MPI SVN revision: r19677
Open MPI release date: Unreleased developer copy
Open RTE: 1.4a1r19677
Open RTE SVN revision: r19677
Open RTE release date: Unreleased developer copy
OPAL: 1.4a1r19677
OPAL SVN revision: r19677
OPAL release date: Unreleased developer copy
Ident string: 1.4a1r19677

Below is the output, which seems to freeze just after the second spawn.

[roberto@master TestOpenMPI]$ mpirun --verbose --debug-daemons --hostfile $PBS_NODEFILE -wdir "`pwd`" -np 1 testmaster 10 $PBS_NODEFILE
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received add_local_procs
[master.tekno-soft.it:30063] [[19516,0],0] node[0].name master daemon 0 arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[1].name cluster4 daemon INVALID arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[2].name cluster3 daemon INVALID arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[3].name cluster2 daemon INVALID arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[4].name cluster1 daemon INVALID arch ffc91200
Initializing MPI ...
[master.tekno-soft.it:30063] [[19516,0],0] orted_recv: received sync+nidmap fro
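[Editor's note: Ralph's first observation - one collective spawn driven by a host list instead of a loop of single spawns - might look roughly like the sketch below. This is not Roberto's testmaster code; the "slave" binary, the "slave_hosts" file name, and the slave count are placeholders, and it assumes the "hostfile" MPI_Info key documented for Open MPI's MPI_Comm_spawn (support and mapper behavior may vary by version).]

// Sketch only: a single MPI_Comm_spawn for all slaves, letting Open MPI
// map them from a hostfile. All names here are hypothetical.
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "hostfile", "slave_hosts");   // placeholder file name

    char cmd[] = "slave";                            // placeholder slave binary
    int nslaves = 10;                                // e.g. number of entries in the file
    MPI_Comm children;                               // one intercomm; all slaves share a comm_world
    MPI_Comm_spawn(cmd, MPI_ARGV_NULL, nslaves, info,
                   0 /* root */, MPI_COMM_SELF, &children,
                   MPI_ERRCODES_IGNORE);

    /* ... talk to the slaves over 'children' ... */

    MPI_Comm_disconnect(&children);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}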
Re: [OMPI users] Running application with MPI_Comm_spawn() in multithreaded environment
Ralph Castain wrote:
> Hi Roberto
>
> My time is somewhat limited, so I couldn't review the code in detail. However, I think I got the gist of it.
>
> A few observations:
>
> 1. the code is rather inefficient, if all you want to do is spawn a pattern of slave processes based on a file. Unless there is some overriding reason for doing this one comm_spawn at a time, it would be far faster to issue a single comm_spawn and just provide the hostfile to us. You could use either the seq or rank_file mapper - both would take the file and provide the outcome you seek. The only difference would be that the child procs would all be in the same comm_world - don't know if that is an issue or not.

I agree with you if all the spawned slaves have to communicate in the same comm_world, but since that's not the case, as I already told you, each slave will crunch different things, maybe also completely different data from the other slaves. The job distribution will look like a tree or multi-tree. The main problem is that we don't know in advance how to associate the slaves to the nodes, so we need a very dynamic distribution of the jobs while "unrolling" the algorithm; basically the application needs to decide which is the best slave to run so that it can locally converge on the solution as best it can. That's why our distribution is quite *unusual* ... but, I would say, legal in MPI-2 terms.

> 2. OMPI definitely cannot handle the threaded version of this code at this time - not sure when we will get to it.

Are we talking about only MPI_Comm_spawn(), or is the whole OpenMPI not thread safe at the moment?

> 3. if you serialize the code, we -should- be able to handle it. However, I'm not entirely sure your current method actually does that. It looks like you call comm_spawn, and then create a new thread which then calls comm_spawn. I'm afraid I can't quite figure out how the thread locking would occur to prevent multiple threads continuing to call comm_spawn - you might want to check it again and ensure it is correct. Frankly, I'm not entirely sure what the thread creation is gaining you - as I said, we can only call comm_spawn serially, so having multiple threads would seem to be unnecessary...unless this code is incomplete and you need the threads for some other purpose.

Could you explain which part I have to serialize in order to meet OpenMPI's expectations? Can I send/receive in a multithreaded fashion, for example? I need threading because each thread will handle one communication with one slave. In my code, comm_spawn() is called in a thread, and in the same thread I'll drive the MPI communication; when the slave computation is terminated the thread will terminate accordingly.

> Again, you might look at that loop_spawn code I mentioned before to see a working example. Alternatively, if your code works under HP MPI, you might want to stick with it for now until we get the threading support up to your required level.

About your example, I see that it does permit merging slaves into one intercommunicator, but what happens in the intercommunicator if one or more slaves complete their work and I would like to reuse it for doing other things?
Basically I need to pair each slave with the data to send it for its related computation; that's why we create different intercommunicators, one for each slave, because they aren't related. Or at least we locally decide if we need more than one slave for crunching data - in that case the spawn will be instrumented to spawn, say, 10 nodes for a single computation. So only in that case do we "fall back" to the "standard usage" ;-)!

> Hope that helps
> Ralph
>
> On Oct 3, 2008, at 10:36 AM, Roberto Fichera wrote:
>
>> Ralph Castain wrote:
>>> Interesting. I ran a loop calling comm_spawn 1000 times without a problem. I suspect it is the threading that is causing the trouble here.
>> I think so! My guess is that at a low level there is some trouble when handling *concurrent* orted spawning. Maybe
>>> You are welcome to send me the code. You can find my loop code in your code distribution under orte/test/mpi - look for loop_spawn and loop_child.
>> In the attached code the spawning logic is currently under a loop in the main of the testmaster, so it's completely unthreaded, at least until the MPI_Comm_spawn() terminates its work. If you would like to test multithreaded spawning you can comment out the NodeThread_spawnSlave() in the main loop and uncomment the same function in the NodeThread_threadMain(). Finally, if you want multithreaded spawning but serialized against a mutex, then uncomment the pthread_mutex_lock/unlock() in the NodeThread_threadMain().
>>
>> This code runs *without* any trouble in the HP MPI implementation. It does not work so well in the mpich2 trunk version due to two problems: the ~24.4K context id limit and/or a race in poll() while waiting for termination in MPI_Comm_disconnect() co
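[Editor's note: the serialization Ralph describes in point 3 - one thread per slave, but only one MPI_Comm_spawn() in flight at a time - could look roughly like the thread body below. This is an illustrative sketch, not Roberto's testmaster: the names (spawn_lock, slave_thread, the "slave" binary) are hypothetical, the "host" MPI_Info key is Open MPI's documented per-spawn placement key, and it further assumes MPI_Init_thread() granted at least MPI_THREAD_SERIALIZED.]

// Illustrative thread body: serialize the spawns with a global mutex.
#include <mpi.h>
#include <pthread.h>

static pthread_mutex_t spawn_lock = PTHREAD_MUTEX_INITIALIZER;

static void *slave_thread(void *arg)
{
    const char *host = (const char *)arg;      // node this slave should run on
    char cmd[] = "slave";                       // placeholder slave binary

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", (char *)host);   // Open MPI "host" info key

    MPI_Comm slave_comm;
    pthread_mutex_lock(&spawn_lock);            // only one comm_spawn at a time
    MPI_Comm_spawn(cmd, MPI_ARGV_NULL, 1, info, 0,
                   MPI_COMM_SELF, &slave_comm, MPI_ERRCODES_IGNORE);
    pthread_mutex_unlock(&spawn_lock);

    /* Per-slave send/recv over slave_comm would go here; whether several
       threads may do that concurrently is exactly the thread-support
       question being discussed in this thread. */

    MPI_Comm_disconnect(&slave_comm);
    MPI_Info_free(&info);
    return NULL;
}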
Re: [OMPI users] Problem building OpenMPi with SunStudio compilers
On Sat, Oct/04/2008 11:21:27AM, Raymond Muno wrote:
> Raymond Muno wrote:
>> Raymond Muno wrote:
>>> We are implementing a new cluster that is InfiniBand based. I am working on getting OpenMPI built for our various compile environments. So far it is working for PGI 7.2 and PathScale 3.1. I found some workarounds for issues with the Pathscale compilers (seg faults) in the OpenMPI FAQ.
>>>
>>> When I try to build with SunStudio, I cannot even get past the configure stage. It dies in the stage that checks for C++.
>>
>> It looks like the problem is with SunStudio itself. Even a simple CC program fails to compile.
>>>
>>> /usr/lib64/libm.so: file not recognized: File format not recognized
>
> OK, I took care of the linker issue for C++ as recommended on Sun's support site (replace the Sun-supplied ld with /usr/bin/ld).
>
> Now I get farther along but the build fails at (small excerpt)
>
> mutex.c:(.text+0x30): multiple definition of `opal_atomic_cmpset_32'
> asm/.libs/libasm.a(asm.o):asm.c:(.text+0x30): first defined here
> threads/.libs/mutex.o: In function `opal_atomic_cmpset_64':
> mutex.c:(.text+0x50): multiple definition of `opal_atomic_cmpset_64'
> asm/.libs/libasm.a(asm.o):asm.c:(.text+0x50): first defined here
> make[2]: *** [libopen-pal.la] Error 1
> make[2]: Leaving directory `/home/muno/OpenMPI/SunStudio/openmpi-1.2.7/opal'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory `/home/muno/OpenMPI/SunStudio/openmpi-1.2.7/opal'
> make: *** [all-recursive] Error 1
>
> I based the configure on what was found in the FAQ here.
>
> http://www.open-mpi.org/faq/?category=building#build-sun-compilers
>
> Perhaps this is much more specific to our platform/OS.
>
> The environment is AMD Opteron, Barcelona running Centos 5 (Rocks 5.03) with SunStudio 12 compilers.

Unfortunately I haven't seen the above issue, so I don't have a workaround to propose. There are some issues that have been fixed with GCC-style inline assembly in the latest Sun Studio Express build. Could you try it out?

http://developers.sun.com/sunstudio/downloads/express/index.jsp

-Ethan

> Does anyone have any insight as to how to successfully build OpenMPI for this OS/compiler selection? As I said in the first post, we have it built for Pathscale 3.1 and PGI 7.2.
>
> -Ray Muno
> University of Minnesota, Aerospace Engineering
Re: [OMPI users] does openmpi have C++ bindings?
Yes, OMPI's C++ bindings are built by default if you have a valid C++ compiler. ompi_info should indicate whether you have the C++ bindings built or not.

But the C++ bindings don't allow sending/receiving STL containers via MPI calls. For that, as someone else suggested, have a look at Boost.MPI. The Boost group built a nice C++ class library on top of MPI.

On Oct 2, 2008, at 5:25 PM, Shafagh Jafer wrote:

Does openmpi have C++ bindings? Or do I need to install this package? If yes, from where and how? Thanks.

--
Jeff Squyres
Cisco Systems
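[Editor's note: as a rough illustration of the Boost.MPI approach mentioned above, the sketch below sends a std::vector straight through the communicator and lets Boost serialize it. It is a generic example, not tied to the poster's code, and it assumes Boost.MPI and Boost.Serialization are installed and linked (e.g. -lboost_mpi -lboost_serialization).]

// Minimal Boost.MPI sketch: sending an STL container directly.
#include <boost/mpi.hpp>
#include <boost/serialization/vector.hpp>
#include <iostream>
#include <vector>

namespace mpi = boost::mpi;

int main(int argc, char *argv[])
{
    mpi::environment env(argc, argv);   // wraps MPI_Init/MPI_Finalize
    mpi::communicator world;            // wraps MPI_COMM_WORLD

    if (world.rank() == 0) {
        std::vector<int> data(100, 42);
        world.send(1, 0 /* tag */, data);        // container is serialized for us
    } else if (world.rank() == 1) {
        std::vector<int> data;
        world.recv(0, 0 /* tag */, data);
        std::cout << "got " << data.size() << " ints\n";
    }
    return 0;
}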
Re: [OMPI users] Problem building OpenMPi with SunStudio compilers
Ethan Mallove wrote:
>> Now I get farther along but the build fails at (small excerpt)
>>
>> mutex.c:(.text+0x30): multiple definition of `opal_atomic_cmpset_32'
>> asm/.libs/libasm.a(asm.o):asm.c:(.text+0x30): first defined here
>> threads/.libs/mutex.o: In function `opal_atomic_cmpset_64':
>> mutex.c:(.text+0x50): multiple definition of `opal_atomic_cmpset_64'
>> asm/.libs/libasm.a(asm.o):asm.c:(.text+0x50): first defined here
>> make[2]: *** [libopen-pal.la] Error 1
>> make[2]: Leaving directory `/home/muno/OpenMPI/SunStudio/openmpi-1.2.7/opal'
>> make[1]: *** [all-recursive] Error 1
>> make[1]: Leaving directory `/home/muno/OpenMPI/SunStudio/openmpi-1.2.7/opal'
>> make: *** [all-recursive] Error 1
>>
>> I based the configure on what was found in the FAQ here.
>>
>> http://www.open-mpi.org/faq/?category=building#build-sun-compilers
>>
>> Perhaps this is much more specific to our platform/OS.
>>
>> The environment is AMD Opteron, Barcelona running Centos 5 (Rocks 5.03) with SunStudio 12 compilers.
>
> Unfortunately I haven't seen the above issue, so I don't have a workaround to propose. There are some issues that have been fixed with GCC-style inline assembly in the latest Sun Studio Express build. Could you try it out?
>
> http://developers.sun.com/sunstudio/downloads/express/index.jsp
>
> -Ethan

Looks like it dies at the exact same spot. I have the C++ failure as well (supplied ld does not work).

--
Ray Muno                              http://www.aem.umn.edu/people/staff/muno
University of Minnesota               e-mail: m...@aem.umn.edu
Aerospace Engineering and Mechanics   Phone: (612) 625-9531
110 Union St. S.E.                    FAX: (612) 626-1558
Minneapolis, Mn 55455
Re: [OMPI users] OpenMPI with openib partitions
On Oct 5, 2008, at 1:22 PM, Lenny Verkhovsky wrote:

you should probably use -mca tcp,self -mca btl_openib_if_include ib0.8109

Really? I thought we only took OpenFabrics device names in the openib_if_include MCA param...? It looks like ib0.8109 is an IPoIB device name.

Lenny.

On 10/3/08, Matt Burgess wrote:

Hi, I'm trying to get openmpi working over openib partitions. On this cluster, the partition number is 0x109. The ib interfaces are pingable over the appropriate ib0.8109 interface:

d2:/opt/openmpi-ib # ifconfig ib0.8109
ib0.8109  Link encap:UNSPEC  HWaddr 80-00-00-4A-FE-80-00-00-00-00-00-00-00-00-00-00
          inet addr:10.21.48.2  Bcast:10.21.255.255  Mask:255.255.0.0
          inet6 addr: fe80::202:c902:26:ca01/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
          RX packets:16811 errors:0 dropped:0 overruns:0 frame:0
          TX packets:15848 errors:0 dropped:1 overruns:0 carrier:0
          collisions:0 txqueuelen:256
          RX bytes:102229428 (97.4 Mb)  TX bytes:102324172 (97.5 Mb)

I have tried the following:

/opt/openmpi-ib/1.2.6/bin/mpirun -np 2 -machinefile machinefile -mca btl openib,self -mca btl_openib_max_btls 1 -mca btl_openib_ib_pkey_val 0x8109 -mca btl_openib_ib_pkey_ix 1 /cluster/pallas/x86_64-ib/IMB-MPI1

but I just get a RETRY EXCEEDED ERROR. Is there a MCA parameter I am missing? I was successful using tcp only:

/opt/openmpi-ib/1.2.6/bin/mpirun -np 2 -machinefile machinefile -mca btl tcp,self -mca btl_openib_max_btls 1 -mca btl_openib_ib_pkey_val 0x8109 /cluster/pallas/x86_64-ib/IMB-MPI1

Thanks,
Matt Burgess

--
Jeff Squyres
Cisco Systems
Re: [OMPI users] segfault issue - possible bug in openmpi
Yes, there still could be a dependence on the number of processors and on using threads. But it's not clear from the stack trace whether this is a threaded problem or not (and it is correct that OMPI v1.2's thread support is non-functional). As for more information that would help diagnose the problem, please see http://www.open-mpi.org/community/help/

Thanks.

On Oct 4, 2008, at 9:28 PM, Doug Reeder wrote:

Shafagh, I missed the dependence on the number of processors. Apparently there is some thread support. Doug

On Oct 4, 2008, at 5:29 PM, Shafagh Jafer wrote:

Doug Reeder, Daniel is saying that the problem only occurs in openmpi when running more than 16 processes. So could that still be the cause because openmpi does not support threads??!!

--- On Fri, 10/3/08, Doug Reeder wrote: From: Doug Reeder Subject: Re: [OMPI users] segfault issue - possible bug in openmpi To: "Open MPI Users" Date: Friday, October 3, 2008, 2:40 PM

Daniel, Are you using threads? I don't think openmpi-1.2.x works with threads. Doug Reeder

On Oct 3, 2008, at 2:30 PM, Daniel Hansen wrote:

Oh, by the way, here is the segfault:

[m4b-1-8:11481] *** Process received signal ***
[m4b-1-8:11481] Signal: Segmentation fault (11)
[m4b-1-8:11481] Signal code: Address not mapped (1)
[m4b-1-8:11481] Failing at address: 0x2b91c69eed
[m4b-1-8:11483] [ 0] /lib64/libpthread.so.0 [0x33e8c0de70]
[m4b-1-8:11483] [ 1] /fslhome/dhansen7/openmpi/lib/libmpi.so.0 [0x2abea7c0]
[m4b-1-8:11483] [ 2] /fslhome/dhansen7/openmpi/lib/libmpi.so.0 [0x2abea675]
[m4b-1-8:11483] [ 3] /fslhome/dhansen7/openmpi/lib/libmpi.so.0(mca_pml_ob1_send+0x2da) [0x2abeaf55]
[m4b-1-8:11483] [ 4] /fslhome/dhansen7/openmpi/lib/libmpi.so.0(MPI_Send+0x28e) [0x2ab52c5a]
[m4b-1-8:11483] [ 5] /fslhome/dhansen7/compute/for_DanielHansen/replica_mpi_marylou2/Openmpi_md_twham(twham_init+0x708) [0x42a8a8]
[m4b-1-8:11483] [ 6] /fslhome/dhansen7/compute/for_DanielHansen/replica_mpi_marylou2/Openmpi_md_twham(repexch+0x73c) [0x425d5c]
[m4b-1-8:11483] [ 7] /fslhome/dhansen7/compute/for_DanielHansen/replica_mpi_marylou2/Openmpi_md_twham(main+0x855) [0x4133a5]
[m4b-1-8:11483] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4) [0x33e841d8a4]
[m4b-1-8:11483] [ 9] /fslhome/dhansen7/compute/for_DanielHansen/replica_mpi_marylou2/Openmpi_md_twham [0x4040b9]
[m4b-1-8:11483] *** End of error message ***

On Fri, Oct 3, 2008 at 3:20 PM, Daniel Hansen wrote:

I have been testing some code against openmpi lately that always causes it to crash during certain mpi function calls. The code does not seem to be the problem, as it runs just fine against mpich. I have tested it against openmpi 1.2.5, 1.2.6, and 1.2.7 and they all exhibit the same problem. Also, the problem only occurs in openmpi when running more than 16 processes. I have posted this stack trace to the list before, but I am submitting it now as a potential bug report. I need some help debugging it and finding out exactly what is going on in openmpi when the segfault occurs. Are there any suggestions on how best to do this? Is there an easy way to attach gdb to one of the processes or something? I have already compiled openmpi with debugging, memory profiling, etc. How can I best take advantage of these features?
Thanks,
Daniel Hansen
Systems Administrator
BYU Fulton Supercomputing Lab

--
Jeff Squyres
Cisco Systems
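[Editor's note: one common way to get gdb onto a single MPI process, since Daniel asks about it above, is to park each rank in a loop early in main() and attach from another terminal; a variant of this trick appears in the Open MPI debugging FAQ. The snippet below is a generic sketch, not code from Daniel's application.]

// Generic "wait for debugger" trick: each rank prints its PID and host, then
// spins until a variable is flipped from an attached gdb session
// (e.g. "gdb -p <pid>", then "set var wait_for_debugger = 0", "continue").
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    volatile int wait_for_debugger = 1;          // flip to 0 from gdb
    char host[256];
    gethostname(host, sizeof(host));
    printf("rank ready for attach: PID %d on %s\n", (int)getpid(), host);
    fflush(stdout);
    while (wait_for_debugger)
        sleep(5);

    /* ... rest of the application ... */

    MPI_Finalize();
    return 0;
}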
[OMPI users] ompi-restart issue : ompi-restart doesn't work across nodes - possible installation problem or environment setting problem??
Hi all, This is the procedure I have followed to install openmpi. Is there some installation or environment setting problem in here?

An openmpi program with 4 processes is run across 2 dual-core intel machines, with 2 processes running on each machine. ompi-checkpoint is successful but ompi-restart fails with the following error:

$:> ompi-restart ompi_global_snapshot_6045.ckpt
--
mpirun noticed that process rank 0 with PID 6372 on node acl-cadi-pentd-1.cse.buffalo.edu exited on signal 11 (Segmentation fault).
--

Open-mpi installation steps:

./configure --prefix=/home/csgrad/audhakne/.openmpi --with-ft=cr --with-blcr=/usr/lib64 --enable-debug
make
make install
export LD_LIBRARY_PATH=$HOME/.openmpi/lib/:$HOME/.openmpi/lib/openmpi:/usr/lib64
export PATH=$HOME/.openmpi/bin:$PATH

NOTE: blcr is installed as a module

$:> lsmod | grep blcr
blcr 117892 0
blcr_vmadump 58264 1 blcr
blcr_imports 46080 2 blcr,blcr_vmadump

Please let me know if there is a problem with the above procedure, thanks a lot for your time. Best.

-- Forwarded message --
From: arun dhakne
Date: Tue, Sep 30, 2008 at 12:52 AM
Subject: ompi-restart issue : ompi-restart doesn't work across nodes
To: Open MPI Users

Hi all, I had gone through some previous ompi-restart issues but I couldn't find anything similar to this problem. I have installed blcr, and configured open-mpi 'openmpi-1.3a1r19645'.

i) If the sample mpi program (say np 4 on a single machine, that is, without any hostfile) is run and I try to checkpoint it, it happens successfully and even ompi-restart works in this case.

ii) If the sample mpi program is run across say 2 different nodes, the checkpoint happens successfully BUT ompi-restart throws the following error:

$ ompi-restart ompi_global_snapshot_7604.ckpt
--
mpirun noticed that process rank 3 with PID 9590 on node acl-cadi-pentd-1.cse.buffalo.edu exited on signal 11 (Segmentation fault).
--

Please let me know if more information is needed.

--
Thanks and Regards,
Arun U. Dhakne
[OMPI users] Problem launching onto Bourne shell
Hi, I'm having difficulty launching an Open MPI job onto a machine that is running the Bourne shell. Here's my basic setup. I have two machines: one is an x86-based machine running bash and the other is a Cell-based machine running the Bourne shell. I'm running mpirun from the x86 machine, which launches a C++ MPI application onto the Cell machine. I get the following error:

error while loading shared libraries: libstdc++.so.6: cannot open shared object file: No such file or directory

The basic problem is that LD_LIBRARY_PATH needs to be set to the directory that contains libstdc++.so.6 for the Cell. I set the following line in .profile:

export LD_LIBRARY_PATH=/opt/cell/toolchain/lib/gcc/ppu/4.1.1/32

which is the path to the PPC libraries for the Cell. Now if I log directly into the Cell machine and run the program directly from the command line, I don't get the above error. But mpirun still fails, even after setting LD_LIBRARY_PATH in .profile.

As a sanity check, I did the following. I ran the following command from the x86 machine:

mpirun -np 1 --host cab0 env

which, among other things, shows me the following value:

LD_LIBRARY_PATH=/tools/openmpi-1.2.5/lib:

If I log into the Cell machine and run env directly from the command line, I get the following value:

LD_LIBRARY_PATH=/opt/cell/toolchain/lib/gcc/ppu/4.1.1/32

So it appears that .profile gets sourced when I log in but not when mpirun runs. However, according to the OpenMPI FAQ (http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path), mpirun is supposed to directly call .profile since the Bourne shell doesn't automatically call it for non-interactive shells.

Does anyone have any insight as to why my environment isn't being set properly? Thanks! Hahn

--
Hahn Kim, h...@ll.mit.edu
MIT Lincoln Laboratory
244 Wood St., Lexington, MA 02420
Tel: 781-981-0940, Fax: 781-981-5255
Re: [OMPI users] Problem launching onto Bourne shell
You can forward your local env with mpirun -x LD_LIBRARY_PATH. As an alternative you can set specific values with mpirun -x LD_LIBRARY_PATH=/some/where:/some/where/else. More information with mpirun --help (or man mpirun).

Aurelien

On 6 Oct 2008, at 16:06, Hahn Kim wrote:

Hi, I'm having difficulty launching an Open MPI job onto a machine that is running the Bourne shell. Here's my basic setup. I have two machines: one is an x86-based machine running bash and the other is a Cell-based machine running the Bourne shell. I'm running mpirun from the x86 machine, which launches a C++ MPI application onto the Cell machine. I get the following error: error while loading shared libraries: libstdc++.so.6: cannot open shared object file: No such file or directory The basic problem is that LD_LIBRARY_PATH needs to be set to the directory that contains libstdc++.so.6 for the Cell. I set the following line in .profile: export LD_LIBRARY_PATH=/opt/cell/toolchain/lib/gcc/ppu/4.1.1/32 which is the path to the PPC libraries for the Cell. Now if I log directly into the Cell machine and run the program directly from the command line, I don't get the above error. But mpirun still fails, even after setting LD_LIBRARY_PATH in .profile. As a sanity check, I did the following. I ran the following command from the x86 machine: mpirun -np 1 --host cab0 env which, among other things, shows me the following value: LD_LIBRARY_PATH=/tools/openmpi-1.2.5/lib: If I log into the Cell machine and run env directly from the command line, I get the following value: LD_LIBRARY_PATH=/opt/cell/toolchain/lib/gcc/ppu/4.1.1/32 So it appears that .profile gets sourced when I log in but not when mpirun runs. However, according to the OpenMPI FAQ (http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path), mpirun is supposed to directly call .profile since the Bourne shell doesn't automatically call it for non-interactive shells. Does anyone have any insight as to why my environment isn't being set properly? Thanks! Hahn

-- Hahn Kim, h...@ll.mit.edu MIT Lincoln Laboratory 244 Wood St., Lexington, MA 02420 Tel: 781-981-0940, Fax: 781-981-5255

--
* Dr. Aurélien Bouteiller
* Sr. Research Associate at Innovative Computing Laboratory
* University of Tennessee
* 1122 Volunteer Boulevard, suite 350
* Knoxville, TN 37996
* 865 974 6321
Re: [OMPI users] ompi-restart issue : ompi-restart doesn't work across nodes - possible installation problem or environment setting problem??
The installation looks OK, though I'm not sure what is causing the segfault of the restarted process. Two things to try. First, can you send me a backtrace from the core file that is generated by the segmentation fault? That will provide insight into what is causing it. Second, you may try to enable the C/R thread, which allows a checkpoint to progress when an application is in a computation loop instead of only when it is in the MPI library. To do so, configure with these additional flags:

--enable-ft-thread --enable-mpi-threads

What version of Open MPI are you using? What version of BLCR?

Best,
Josh

On Oct 6, 2008, at 3:55 PM, arun dhakne wrote:

Hi all, This is the procedure I have followed to install openmpi. Is there some installation or environment setting problem in here? An openmpi program with 4 processes is run across 2 dual-core intel machines, with 2 processes running on each machine. ompi-checkpoint is successful but ompi-restart fails with the following error: $:> ompi-restart ompi_global_snapshot_6045.ckpt -- mpirun noticed that process rank 0 with PID 6372 on node acl-cadi-pentd-1.cse.buffalo.edu exited on signal 11 (Segmentation fault). -- Open-mpi installation steps: ./configure --prefix=/home/csgrad/audhakne/.openmpi --with-ft=cr --with-blcr=/usr/lib64 --enable-debug make make install export LD_LIBRARY_PATH=$HOME/.openmpi/lib/:$HOME/.openmpi/lib/openmpi:/usr/lib64 export PATH=$HOME/.openmpi/bin:$PATH NOTE: blcr is installed as a module $:> lsmod | grep blcr blcr 117892 0 blcr_vmadump 58264 1 blcr blcr_imports 46080 2 blcr,blcr_vmadump Please let me know if there is a problem with the above procedure, thanks a lot for your time. Best.

-- Forwarded message -- From: arun dhakne Date: Tue, Sep 30, 2008 at 12:52 AM Subject: ompi-restart issue : ompi-restart doesn't work across nodes To: Open MPI Users

Hi all, I had gone through some previous ompi-restart issues but I couldn't find anything similar to this problem. I have installed blcr, and configured open-mpi 'openmpi-1.3a1r19645'. i) If the sample mpi program (say np 4 on a single machine, that is, without any hostfile) is run and I try to checkpoint it, it happens successfully and even ompi-restart works in this case. ii) If the sample mpi program is run across say 2 different nodes, the checkpoint happens successfully BUT ompi-restart throws the following error: $ ompi-restart ompi_global_snapshot_7604.ckpt -- mpirun noticed that process rank 3 with PID 9590 on node acl-cadi-pentd-1.cse.buffalo.edu exited on signal 11 (Segmentation fault). -- Please let me know if more information is needed.

--
Thanks and Regards,
Arun U. Dhakne
Re: [OMPI users] Problem launching onto Bourne shell
Great, that worked, thanks!

However, it still concerns me that the FAQ page says that mpirun will execute .profile, which doesn't seem to work for me. Are there any configuration issues that could possibly be preventing mpirun from doing this? It would certainly be more convenient if I could maintain my environment in a single .profile file instead of adding what could potentially be a lot of -x arguments to my mpirun command.

Hahn

On Oct 6, 2008, at 5:44 PM, Aurélien Bouteiller wrote:

You can forward your local env with mpirun -x LD_LIBRARY_PATH. As an alternative you can set specific values with mpirun -x LD_LIBRARY_PATH=/some/where:/some/where/else. More information with mpirun --help (or man mpirun). Aurelien

On 6 Oct 2008, at 16:06, Hahn Kim wrote:

Hi, I'm having difficulty launching an Open MPI job onto a machine that is running the Bourne shell. Here's my basic setup. I have two machines: one is an x86-based machine running bash and the other is a Cell-based machine running the Bourne shell. I'm running mpirun from the x86 machine, which launches a C++ MPI application onto the Cell machine. I get the following error: error while loading shared libraries: libstdc++.so.6: cannot open shared object file: No such file or directory The basic problem is that LD_LIBRARY_PATH needs to be set to the directory that contains libstdc++.so.6 for the Cell. I set the following line in .profile: export LD_LIBRARY_PATH=/opt/cell/toolchain/lib/gcc/ppu/4.1.1/32 which is the path to the PPC libraries for the Cell. Now if I log directly into the Cell machine and run the program directly from the command line, I don't get the above error. But mpirun still fails, even after setting LD_LIBRARY_PATH in .profile. As a sanity check, I did the following. I ran the following command from the x86 machine: mpirun -np 1 --host cab0 env which, among other things, shows me the following value: LD_LIBRARY_PATH=/tools/openmpi-1.2.5/lib: If I log into the Cell machine and run env directly from the command line, I get the following value: LD_LIBRARY_PATH=/opt/cell/toolchain/lib/gcc/ppu/4.1.1/32 So it appears that .profile gets sourced when I log in but not when mpirun runs. However, according to the OpenMPI FAQ (http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path), mpirun is supposed to directly call .profile since the Bourne shell doesn't automatically call it for non-interactive shells. Does anyone have any insight as to why my environment isn't being set properly? Thanks! Hahn

-- Hahn Kim, h...@ll.mit.edu MIT Lincoln Laboratory 244 Wood St., Lexington, MA 02420 Tel: 781-981-0940, Fax: 781-981-5255

-- * Dr. Aurélien Bouteiller * Sr. Research Associate at Innovative Computing Laboratory * University of Tennessee * 1122 Volunteer Boulevard, suite 350 * Knoxville, TN 37996 * 865 974 6321

--
Hahn Kim
MIT Lincoln Laboratory       Phone: (781) 981-0940
244 Wood Street, S2-252      Fax: (781) 981-5255
Lexington, MA 02420          E-mail: h...@ll.mit.edu