Re: [OMPI users] Beginner's question: why does this program hang?
Hmm, strange. It doesn't hang for me, and as far as I can see it shouldn't hang at all. I'm using 1.2.5. Which version of Open MPI are you using?

Hanging with 100% CPU utilization often means that your processes are caught in a busy wait. You could try setting mpi_yield_when_idle:

    gentryx@hex ~ $ cat .openmpi/mca-params.conf
    mpi_yield_when_idle=1

But I don't think this should be necessary.

HTH
-Andreas

On 21:35 Mon 17 Mar, Giovani Faccin wrote:
> Hi there!
>
> I'm learning MPI, and got really puzzled... Please take a look at this very short code:
>
>     #include <iostream>
>     #include "mpicxx.h"
>
>     using namespace std;
>
>     int main(int argc, char *argv[])
>     {
>         MPI::Init();
>
>         for (unsigned long t = 0; t < 1000; t++)
>         {
>             // If we are process 0:
>             if ( MPI::COMM_WORLD.Get_rank() == 0 )
>             {
>                 MPI::Status mpi_status;
>                 unsigned long d  = 0;
>                 unsigned long d2 = 0;
>                 MPI::COMM_WORLD.Recv(&d,  1, MPI::UNSIGNED_LONG, MPI::ANY_SOURCE, MPI::ANY_TAG, mpi_status);
>                 MPI::COMM_WORLD.Recv(&d2, 1, MPI::UNSIGNED_LONG, MPI::ANY_SOURCE, MPI::ANY_TAG, mpi_status);
>                 cout << "Time = " << t << "; Node 0 received: " << d << " and " << d2 << endl;
>             }
>             // Else:
>             else
>             {
>                 unsigned long d = MPI::COMM_WORLD.Get_rank();
>                 MPI::COMM_WORLD.Send(&d, 1, MPI::UNSIGNED_LONG, 0, 0);
>             };
>         };
>
>         MPI::Finalize();
>     }
>
> What I'm trying to do is a gather operation using point-to-point communication. In my real application, instead of sending an unsigned long, I'd be calling an object's send and receive methods, which in turn would call the same methods on their inner objects, and so on, until all data is synchronized. I'm using this loop because the number of objects to be sent to process rank 0 varies depending on the sender.
>
> When running this test with 3 processes on a dual-core, oversubscribed node, I get this output:
>
>     (previous output skipped)
>     Time = 5873; Node 0 received: 1 and 2
>     Time = 5874; Node 0 received: 1 and 2
>     Time = 5875; Node 0 received: 1 and 2
>     Time = 5876; Node 0 received: 1 and 2
>
> and then the application hangs, with processor usage at 100%. The exact time at which this condition occurs varies from run to run, but it usually happens quite fast.
>
> What would I have to modify, in this simple example, so that the application works as expected? Must I always use Gather, instead of point-to-point, to make a synchronization like this?
>
> Thank you very much!
>
> Giovani

--
Andreas Schäfer
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany
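For reference, the collective alternative the question asks about ("Must I always use Gather?") could look roughly like the sketch below. It uses the same MPI C++ bindings as the program above; the buffer names and the fixed receive count of one element per rank are illustrative, not taken from the original code. Since the real application sends a varying number of objects per rank, Gatherv with per-rank counts would be the closer fit.

    #include <iostream>
    #include "mpicxx.h"      // as in the original program
    using namespace std;

    int main(int argc, char *argv[])
    {
        MPI::Init();

        const int rank = MPI::COMM_WORLD.Get_rank();
        const int size = MPI::COMM_WORLD.Get_size();

        // Rank 0 needs one slot per process; the other ranks pass no receive buffer.
        unsigned long *recvbuf = (rank == 0) ? new unsigned long[size] : 0;

        for (unsigned long t = 0; t < 1000; t++)
        {
            unsigned long d = rank;   // every rank contributes one value

            // Collective replacement for the two explicit Recv calls:
            MPI::COMM_WORLD.Gather(&d, 1, MPI::UNSIGNED_LONG,
                                   recvbuf, 1, MPI::UNSIGNED_LONG, 0);

            if (rank == 0)
            {
                cout << "Time = " << t << "; Node 0 received:";
                for (int i = 1; i < size; i++)   // recvbuf[0] is rank 0's own value
                    cout << " " << recvbuf[i];
                cout << endl;
            }
        }

        delete[] recvbuf;
        MPI::Finalize();
    }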
Re: [OMPI users] Beginner's question: why does this program hang?
Hi Andreas, thanks for the reply!

I'm using openmpi-1.2.5. It was installed using my distro's (Gentoo) default package:

    sys-cluster/openmpi-1.2.5 USE="fortran ipv6 -debug -heterogeneous -nocxx -pbs -romio -smp -threads"

I've tried setting the mpi_yield_when_idle parameter as you suggested. However, the program still hangs. Just in case, the command line I'm using to call it is this:

    /usr/bin/mpirun --hostfile mpi-config.txt --mca mpi_yield_when_idle 1 -np 3 /home/gfaccin/desenvolvimento/Eclipse/mpiplay/Debug/mpiplay

where mpi-config.txt contains the following line:

    localhost slots=1

Anything else I could try?

Thank you!

Giovani

Andreas Schäfer wrote:
> [earlier message quoted in full; trimmed]
Re: [OMPI users] SIGSEGV error.
On Mar 17, 2008, at 10:16 PM, balaji srinivas wrote:

> I am new to MPI. The outline of my code is
>
>     if (r == 0)
>         function1()
>     else if (r == 1)
>         function2()
>
> where r is the rank and the functions are included in the .h files. There are no compilation errors. I get the SIGSEGV error while running. Please help -- how do I solve this?

From your description, it is impossible to tell whether this is an MPI issue or not. You should probably use standard debugging techniques, such as running under a debugger, examining core files, etc. See http://www.open-mpi.org/faq/?category=debugging if you need some suggestions for debugging in parallel.

> 2) How do I find the execution time of an MPI program? In C we have clock_t start = clock() at the beginning and ((double)clock() - start) / CLOCKS_PER_SEC at the end.

I don't quite understand your question -- is your use of clock() reporting incorrect wall-clock times?

--
Jeff Squyres
Cisco Systems
Re: [OMPI users] SIGSEGV error.
Hey Balaji,

I'm new at this too, but might be able to help you a bit.

A SIGSEGV error usually occurs when you try to access something in memory that's not actually there, like using a pointer that points to nothing. In my short experience with MPI so far, I got this kind of message when I did something wrong with the MPI calls -- for example, sending a buffer and giving a wrong value when telling MPI its size. Make sure there's nothing like that in your code.

For the timing question, I think what you want is the MPI_Wtime() function. Check this out: https://computing.llnl.gov/tutorials/mpi/man/MPI_Wtime.txt

Best,
Giovani

balaji srinivas wrote:
> [original question quoted in full; trimmed]
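For completeness, a minimal sketch of the usual wall-clock timing pattern, written with the MPI C++ bindings used elsewhere in this thread; the bracketed work region and the variable names are illustrative, not from the original question:

    #include <iostream>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        MPI::Init();

        MPI::COMM_WORLD.Barrier();       // optional: line the ranks up before timing
        double start = MPI::Wtime();     // wall-clock seconds since an arbitrary fixed point

        // ... the work you want to time goes here ...

        double elapsed = MPI::Wtime() - start;
        std::cout << "Rank " << MPI::COMM_WORLD.Get_rank()
                  << " took " << elapsed << " s" << std::endl;

        MPI::Finalize();
    }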
Re: [OMPI users] Beginner's question: why does this program hang?
Two notes for you:

1. Your program does not necessarily guarantee what you might expect: since you use ANY_SOURCE/ANY_TAG in both receives, you might actually get two receives from the same sender in a given iteration. The fact that you're effectively using yield_when_idle (which OMPI will automatically enable when you tell it "slots=1" but you run with -np 3) means that you probably *won't* have this happen (because every MPI process will yield on every iteration, effectively keeping all 3 in lock step), but it still *can* happen (and did frequently in my tests).

2. The problem you're seeing is an optimization called "early completion" where, for latency ping-pong optimizations, Open MPI may indicate that a send has "completed" before the message is actually placed on the network (shared memory, in your case). This can be a nice performance boost for applications that both a) dip into the MPI layer frequently and b) synchronize at some point. Your application is not necessarily doing this in the final iterations; it may reach MPI_FINALIZE while there's still a pile of messages that have been queued for delivery but not yet progressed out onto the network to the receiver.

In our upcoming 1.2.6 release, there is a run-time parameter to disable this early-completion behavior (i.e., never signal completion of a send before the data is actually transmitted out on the network). You can try the 1.2.6rc2 tarball:

    http://www.open-mpi.org/software/ompi/v1.2/

and use the following MCA parameter:

    mpirun --mca pml_ob1_use_early_completion 0 ...

See if that works for you.

On Mar 18, 2008, at 7:11 AM, Giovani Faccin wrote:
> [earlier messages quoted in full; trimmed]

--
Jeff Squyres
Cisco Systems
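Point 1 above can be sidestepped by naming the senders explicitly instead of using ANY_SOURCE, so that each iteration pairs exactly one message from rank 1 with one from rank 2. A minimal sketch of that variant of the test program, assuming it is run with exactly three ranks as in the original report:

    #include <iostream>
    #include "mpicxx.h"
    using namespace std;

    int main(int argc, char *argv[])
    {
        MPI::Init();

        for (unsigned long t = 0; t < 1000; t++)
        {
            if (MPI::COMM_WORLD.Get_rank() == 0)
            {
                MPI::Status mpi_status;
                unsigned long d = 0, d2 = 0;
                // Pin each receive to a specific sender so exactly one message
                // from rank 1 and one from rank 2 are consumed per iteration.
                MPI::COMM_WORLD.Recv(&d,  1, MPI::UNSIGNED_LONG, 1, MPI::ANY_TAG, mpi_status);
                MPI::COMM_WORLD.Recv(&d2, 1, MPI::UNSIGNED_LONG, 2, MPI::ANY_TAG, mpi_status);
                cout << "Time = " << t << "; Node 0 received: " << d << " and " << d2 << endl;
            }
            else
            {
                unsigned long d = MPI::COMM_WORLD.Get_rank();
                MPI::COMM_WORLD.Send(&d, 1, MPI::UNSIGNED_LONG, 0, 0);
            }
        }

        MPI::Finalize();
    }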
Re: [OMPI users] Beginner's question: why does this program hang?
OK, this is strange. I've rerun the test and got it to block, too, although repeated tests show that those cases are rare (sometimes the program runs smoothly without blocking, but in about 30% of the cases it hangs just like you said).

On 08:11 Tue 18 Mar, Giovani Faccin wrote:
> I'm using openmpi-1.2.5. It was installed using my distro's (Gentoo) default package:
>
> sys-cluster/openmpi-1.2.5 USE="fortran ipv6 -debug -heterogeneous -nocxx -pbs -romio -smp -threads"

Just like me.

I've attached gdb to all three processes. On rank 0 I get the following backtrace:

    (gdb) bt
    #0  0x2ada849b3f16 in mca_btl_sm_component_progress () from /usr/lib64/openmpi/mca_btl_sm.so
    #1  0x2ada845a71da in mca_bml_r2_progress () from /usr/lib64/openmpi/mca_bml_r2.so
    #2  0x2ada7e6fbbea in opal_progress () from /usr/lib64/libopen-pal.so.0
    #3  0x2ada8439a9a5 in mca_pml_ob1_recv () from /usr/lib64/openmpi/mca_pml_ob1.so
    #4  0x2ada7e2570a8 in PMPI_Recv () from /usr/lib64/libmpi.so.0
    #5  0x0040c9ae in MPI::Comm::Recv ()
    #6  0x00409607 in main ()

On rank 1:

    (gdb) bt
    #0  0x2baa6869bcc0 in mca_btl_sm_send () from /usr/lib64/openmpi/mca_btl_sm.so
    #1  0x2baa6808a93d in mca_pml_ob1_send_request_start_copy () from /usr/lib64/openmpi/mca_pml_ob1.so
    #2  0x2baa680855f6 in mca_pml_ob1_send () from /usr/lib64/openmpi/mca_pml_ob1.so
    #3  0x2baa61f43182 in PMPI_Send () from /usr/lib64/libmpi.so.0
    #4  0x0040ca04 in MPI::Comm::Send ()
    #5  0x00409700 in main ()

On rank 2:

    (gdb) bt
    #0  0x2b933d555ac7 in sched_yield () from /lib/libc.so.6
    #1  0x2b9341efe775 in mca_pml_ob1_send () from /usr/lib64/openmpi/mca_pml_ob1.so
    #2  0x2b933bdbc182 in PMPI_Send () from /usr/lib64/libmpi.so.0
    #3  0x0040ca04 in MPI::Comm::Send ()
    #4  0x00409700 in main ()

Anyone got a clue?

--
Andreas Schäfer
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany
Re: [OMPI users] Beginner's question: why does this program
Giovani: Which compiler are you using? Also, you didn't mention this, but does "mpirun hostname" give the expected response? I (also new) had a hang similar to what you are describing due to ompi getting confused as to which of two network interfaces to use - "mpirun hostname" would hang when started on certain nodes. This problem was resolved by telling ompi which network interface to use (I forget the option needed to do this off the top of my head, but it is in the FAQ somewhere). Good luck, Mark
Re: [OMPI users] Beginner's question: why does this program
Hi Mark,

Compiler and flags:

    sys-devel/gcc-4.1.2 USE="doc* fortran gtk mudflap nls (-altivec) -bootstrap -build -d -gcj (-hardened) -ip28 -ip32r10k -libffi% (-multilib) -multislot (-n32) (-n64) -nocxx -objc -objc++ -objc-gc -test -vanilla"

Network stuff:

    sonja gfaccin # ifconfig
    lo        Link encap:Local Loopback
              inet addr:127.0.0.1  Mask:255.0.0.0
              inet6 addr: ::1/128 Scope:Host
              UP LOOPBACK RUNNING  MTU:16436  Metric:1
              RX packets:33166 errors:0 dropped:0 overruns:0 frame:0
              TX packets:33166 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:0
              RX bytes:9846970 (9.3 Mb)  TX bytes:9846970 (9.3 Mb)

    wlan0     Link encap:Ethernet  HWaddr 00:1C:BF:24:24:91
              inet addr:192.168.1.50  Bcast:192.168.0.255  Mask:255.255.255.0
              inet6 addr: fe80::21c:bfff:fe24:2491/64 Scope:Link
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:5944 errors:0 dropped:0 overruns:0 frame:0
              TX packets:6343 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:3058968 (2.9 Mb)  TX bytes:1713598 (1.6 Mb)

    wmaster0  Link encap:UNSPEC  HWaddr 00-1C-BF-24-24-91-60-00-00-00-00-00-00-00-00-00
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:0 errors:0 dropped:0 overruns:0 frame:0
              TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

I have two cards in my laptop. One is an Ethernet card that's not enabled (no kernel modules loaded); the other is the wireless card, which is enabled. Two interfaces appear for it because the driver creates both -- the real one is wlan0.

I'll try to find the FAQ entry with the flag to specify the interface, just in case MPI might be trying to use wmaster0. Let's see if it works.

Thanks!

Giovani

Mark Kosmowski wrote:
> [earlier message quoted in full; trimmed]
Re: [OMPI users] RPM build errors when creating multiple rpms
On Mar 17, 2008, at 2:34 PM, Christopher Irving wrote:

> Well, that fixed the errors for the case prefix=/usr, but after looking at the spec file I suspected it would cause a problem if you used the install_in_opt option. So I tried it and got the following errors:
>
>     RPM build errors:
>     Installed (but unpackaged) file(s) found:
>     /opt/openmpi/1.2.5/etc/openmpi-default-hostfile
>     /opt/openmpi/1.2.5/etc/openmpi-mca-params.conf
>     /opt/openmpi/1.2.5/etc/openmpi-totalview.tcl
>
> I just don't think the inclusion of _sysconfdir needs to be wrapped in a conditional statement. It needs to be included in either case, installing to /opt or to /usr, and will already be correctly defined for both. So in the new spec file, if you get rid of line 651 -- %if ! %{sysconfdir_in_prefix} -- and the closing endif on 653, it will work for both cases.

Hmm. I'm having problems getting that to fail (even with a 1.2.5 tarball and install_in_opt=1). That %if is there because I was running into errors where rpm would complain that some files were listed twice (e.g., under %{prefix} and %{sysconfdir}).

I must not be understanding exactly what you're running if I can't reproduce the problem. Can you specify your exact rpmbuild command?

--
Jeff Squyres
Cisco Systems
Re: [OMPI users] Beginner's question: why does this program
Yep, setting the card manually did not solve it. I'm compiling the pre-release version now. Let's see if it works.

Giovani

Giovani Faccin wrote:
> [earlier messages quoted in full; trimmed]
Re: [OMPI users] Beginner's question: why does this program
On Mar 18, 2008, at 8:38 AM, Giovani Faccin wrote:

> Yep, setting the card manually did not solve it.

I would not think that it would. Generally, if OMPI can't figure out your network configuration, it'll be an "all or nothing" kind of failure. The fact that your program runs for a long while and then eventually stalls indicates that OMPI was likely able to figure out your network config ok.

> I'm compiling the pre-release version now. Let's see if it works.

Good.

--
Jeff Squyres
Cisco Systems
Re: [OMPI users] Beginner's question: why does this program
Ok, I uninstalled the previous version, then downloaded the pre-release version, unpacked it, and ran configure, make, make install.

When running mpiCC I get this:

    mpiCC: error while loading shared libraries: libopen-pal.so.0: cannot open shared object file: No such file or directory

    $ whereis libopen-pal
    libopen-pal: /usr/local/lib/libopen-pal.so /usr/local/lib/libopen-pal.la

So the library exists. How can I make mpiCC know its location?

Thanks!

Giovani

Giovani Faccin wrote:
> [earlier messages quoted in full; trimmed]
Re: [OMPI users] Beginner's question: why does this program hang?
Jeff hinted at the real problem in his email. Even if the program uses the correct MPI functions, it is not 100% correct. It might pass in some situations, but can lead to fake "deadlocks" in others.

The problem comes from flow control. If the messages are small (which is the case in the test example), Open MPI will send them eagerly. Without flow control, these messages will be buffered by the receiver, which will exhaust the memory on the receiver. Once this happens, some of the messages may get dropped, but the most visible result is that progress will happen very (VERY) slowly.

Adding an MPI_Barrier every 100 iterations will solve the problem.

  george.

PS: A very similar problem was discussed on the mailing list a few days ago. Please read that thread for a more detailed explanation, as well as another solution.

On Mar 18, 2008, at 7:48 AM, Andreas Schäfer wrote:
> [backtraces quoted in full; trimmed]
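A minimal sketch of the periodic synchronization George suggests, applied to the original test program; the interval of 100 iterations comes from his suggestion, everything else mirrors the code posted earlier in the thread:

    #include <iostream>
    #include "mpicxx.h"
    using namespace std;

    int main(int argc, char *argv[])
    {
        MPI::Init();

        for (unsigned long t = 0; t < 1000; t++)
        {
            if (MPI::COMM_WORLD.Get_rank() == 0)
            {
                MPI::Status mpi_status;
                unsigned long d = 0, d2 = 0;
                MPI::COMM_WORLD.Recv(&d,  1, MPI::UNSIGNED_LONG, MPI::ANY_SOURCE, MPI::ANY_TAG, mpi_status);
                MPI::COMM_WORLD.Recv(&d2, 1, MPI::UNSIGNED_LONG, MPI::ANY_SOURCE, MPI::ANY_TAG, mpi_status);
                cout << "Time = " << t << "; Node 0 received: " << d << " and " << d2 << endl;
            }
            else
            {
                unsigned long d = MPI::COMM_WORLD.Get_rank();
                MPI::COMM_WORLD.Send(&d, 1, MPI::UNSIGNED_LONG, 0, 0);
            }

            // Throttle the eager sends: resynchronize all ranks periodically so
            // unconsumed messages cannot pile up at rank 0 without bound.
            if (t % 100 == 99)
                MPI::COMM_WORLD.Barrier();
        }

        MPI::Finalize();
    }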
Re: [OMPI users] Beginner's question: why does this program
As indicated in the FAQ, you should add the directory where Open MPI was installed to your LD_LIBRARY_PATH.

  george.

On Mar 18, 2008, at 8:57 AM, Giovani Faccin wrote:
> [earlier messages quoted in full; trimmed]
Re: [OMPI users] Beginner's question: why does this program hang?
On Mar 18, 2008, at 10:32 AM, George Bosilca wrote:

> Jeff hinted at the real problem in his email. Even if the program uses the correct MPI functions, it is not 100% correct.

I think we disagree here -- the sample program is correct according to the MPI spec. It's an implementation artifact that makes it deadlock. The upcoming v1.3 series doesn't suffer from this issue; we revamped our transport system to distinguish between early and normal completions. The pml_ob1_use_early_completion MCA param was added to v1.2.6 to allow correct MPI apps to avoid this optimization -- a proper fix is coming in the v1.3 series.

> It might pass in some situations, but can lead to fake "deadlocks" in others. The problem comes from flow control. If the messages are small (which is the case in the test example), Open MPI will send them eagerly. Without flow control, these messages will be buffered by the receiver, which will exhaust the memory on the receiver. Once this happens, some of the messages may get dropped, but the most visible result is that progress will happen very (VERY) slowly.

Your text implies that we can actually *drop* (and retransmit) messages in the sm btl. That doesn't sound right to me -- is that what you meant?

--
Jeff Squyres
Cisco Systems
Re: [OMPI users] Beginner's question: why does this program hang?
On 10:51 Tue 18 Mar, Jeff Squyres wrote:
> The upcoming v1.3 series doesn't suffer from this issue; we revamped our transport system to distinguish between early and normal completions. The pml_ob1_use_early_completion MCA param was added to v1.2.6 to allow correct MPI apps to avoid this optimization -- a proper fix is coming in the v1.3 series.

Yo, I've just tried it with the current SVN and couldn't reproduce the deadlock. Nice!

Cheers
-Andreas

--
Andreas Schäfer
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany
Re: [OMPI users] RPM build errors when creating multiple rpms
On Tue, 2008-03-18 at 08:32 -0400, Jeff Squyres wrote:
> [earlier message quoted in full; trimmed]

Okay, I'm no longer sure which spec file you're referring to. For clarity, I'm now using the spec file you pointed me to in your original reply, from revision 17372. With this file I no longer get any errors when I run:

    rpmbuild -bb --define 'build_all_in_one_rpm 0' \
        --define 'configure_options --with-mip-f90-size=medium --with-tm=/usr/local/lib64' \
        openmpi.spec

This is great for me, since this is how I want to build my RPMs. However, if I use the following command line with the new spec file, I get the "installed (but unpackaged) file(s)" errors I quoted earlier -- which is fine for me, but bad for anyone who wants to install in /opt:

    rpmbuild -bb --define 'install_in_opt 1' --define 'build_all_in_one_rpm 0' \
        --define 'configure_options --with-mip-f90-size=medium --with-tm=/usr/local/lib64' \
        openmpi.spec

Now, if you remove lines 651 and 653 from the new spec file, it works for both cases. You won't get the "files listed twice" error because, although you have the statement %dir %{_prefix} on line 649, you never have a line with just %{_prefix}. So the _prefix directory itself gets included, but not all the files underneath it. You've handled those by explicitly including all files and subdirectories on lines 672-681 and in the runtime file.

Going back to the original spec file, the one that came with the source RPM, the problems were kind of reversed. Building with the 'install_in_opt 1' option worked just fine, but when it wasn't set you got the "files listed twice" error, as I described in my original message.

-Christopher
Re: [OMPI users] RPM build errors when creating multiple rpms
On Tuesday, 18 March 2008, at 12:15:34 (-0700), Christopher Irving wrote:

> Now, if you remove lines 651 and 653 from the new spec file, it works for both cases. You won't get the "files listed twice" error because, although you have the statement %dir %{_prefix} on line 649, you never have a line with just %{_prefix}. [...]

The only package which should own %{_prefix} is something like setup or filesystem in the core OS package set. No openmpi RPM should ever own %{_prefix}, so it should never appear in %files, either by itself or with %dir.

> Going back to the original spec file, the one that came with the source RPM, the problems were kind of reversed. Building with the 'install_in_opt 1' option worked just fine, but when it wasn't set you got the "files listed twice" error [...]

"Files listed twice" messages are not errors, per se, and can usually be safely ignored. Those who are truly bothered by them can always add %exclude directives if they so choose.

Michael

--
Michael Jennings
Linux Systems and Cluster Admin
UNIX and Cluster Computing Group
Re: [OMPI users] equivalent to mpichgm --gm-recv blocking?
Hi Greg,

Siekas, Greg wrote:
> Is it possible to get the same blocking behavior with Open MPI? I'm having a difficult time getting this to work properly. The application is spinning on sched_yield, which takes up a CPU core.

By design, Open MPI does not block; calling sched_yield is all it can do to improve fairness.

Patrick
[OMPI users] parallel molecular dynamics simulations: all-to-all communication
Dear All,

I was parallelising the serial molecular dynamics simulation code given below. I have only two processors; my system is a dual-core system.

    c----------------------------------------------------------------------
    c     SERIAL CODE
    c----------------------------------------------------------------------
          DO m=1,nmol                  !! nmol is the total number of molecules
             DO i=1,2
                ax(i,m)=0.0d0          ! acceleration
                ay(i,m)=0.0d0          ! acceleration
                az(i,m)=0.0d0          ! acceleration
             ENDDO
             DO j=1,nmol
                ngmol(j,m)=0
             ENDDO
          ENDDO

    c----------------------------------------------------------------------
    c     force calculations
    c----------------------------------------------------------------------
          DO m=1,nmol-1
             DO i=1,2
                natom = natom + 1
                ibeg  = inbl(natom)
                iend  = inbl(natom+1)-1
                DO ilist=ibeg,iend      !! no. of neighbors
                   j=inblst1(ilist)     !! neighbor molecular label
                   k=inblst2(ilist)     !! neighbor atomic label
    c              j,k are the molecular and atomic labels of the neighbour
    c              list of each molecule on each processor.
    C
    C              Interatomic distance
    C
                   xij = x1(i,m) - x1(k,j)
                   yij = y1(i,m) - y1(k,j)
                   zij = z1(i,m) - z1(k,j)
    C
    C              Apply periodic boundary conditions
    C
                   dpbcx = - boxx*dnint(xij/boxx)
                   dpbcy = - boxy*dnint(yij/boxy)
                   dpbcz = - boxz*dnint(zij/boxz)
                   xij = xij + dpbcx
                   yij = yij + dpbcy
                   zij = zij + dpbcz
                   rij2 = xij*xij + yij*yij + zij*zij
    C
    C              Calculate forces
    C
                   IF (rij2.le.rcutsq) then
                      rij    = dsqrt(rij2)
                      r_2    = sig1sq/rij2
                      r_6    = r_2*r_2*r_2
                      r_12   = r_6*r_6
                      pot_lj = pot_lj + ((r_12-r_6) + rij*vfc - vc)   !! need 4*eps1
                      fij    = 24.0d0*eps1*((2*r_12-r_6)/rij2 - fc/rij)
                      fxlj   = fij*xij
                      fylj   = fij*yij
                      fzlj   = fij*zij
                      ax(i,m) = ax(i,m) + fxlj
                      ay(i,m) = ay(i,m) + fylj
                      az(i,m) = az(i,m) + fzlj
                      ax(k,j) = ax(k,j) - fxlj
                      ay(k,j) = ay(k,j) - fylj
                      az(k,j) = az(k,j) - fzlj
                      pconf = pconf + (xij*fxlj + yij*fylj + zij*fzlj)
                      IF (ngmol(j,m).eq.0) then
                         xmolij = xmol(m) - xmol(j) + dpbcx
                         ymolij = ymol(m) - ymol(j) + dpbcy
                         zmolij = zmol(m) - zmol(j) + dpbcz
                         rmolij = dsqrt(xmolij*xmolij + ymolij*ymolij
         &                            + zmolij*zmolij)
                         nr = dnint(rmolij/dgr)
                         ng12(nr)   = ng12(nr) + 2
                         ngmol(j,m) = 1
                      ENDIF
                   ENDIF
                ENDDO
             ENDDO
          ENDDO

          DO m=1,nmol
             DO i=1,2
                write(*,100) ax(i,m),ay(i,m),az(i,m)
             ENDDO
          ENDDO

and below is the parallelised part:

    c----------------------------------------------------------------------
    c     PARALLEL CODE
    c----------------------------------------------------------------------
          DO m=1,nmol
             DO i=1,2
                ax(i,m)=0.0d0
                ay(i,m)=0.0d0
                az(i,m)=0.0d0
             ENDDO
             DO j=1,nmol
                ngmol(j,m)=0
             ENDDO
          ENDDO

          CALL para_range(1, nmol, nprocs, myrank, nmolsta, nmolend)

          DO m=nmolsta,nmolend-1       !! nmol is divided into two parts;
                                       !! nmolsta and nmolend are the starting
                                       !! and ending index for each processor
             DO i=1,2
                ibeg = inbl(natom)
                iend = inbl(natom+1)-1
                DO ilist=ibeg,iend      !! no. of neighbors
                   j=inblst1(ilist)     !! neighbor molecular label
                   k=inblst2(ilist)     !! neighbor atomic label
    C
    C              Interatomic distance
    C
                   xij = x1(i,m) - x1(k,j)
                   yij = y1(i,m) - y1(k,j)
                   zij = z1(i,m) - z1(k,j)
                   dpbcx = - boxx*dnint(xij/boxx)
                   dpbcy = - boxy*dnint(yij/boxy)
                   dpbcz = - boxz*dnint(zij/boxz)
                   xij = xij + dpbcx
                   yij = yij + dpbcy
                   zij = zij + dpbcz
                   rij2 = xij*xij + yij*yij + zij*zij
                   IF (rij2.le.rcutsq) then

    [the message is truncated here in the archive]
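The message is cut off before the communication step, but the decomposition it describes -- each rank computing partial accelerations for its own slice of molecules, while every pair interaction updates both atoms -- typically ends with the partial ax/ay/az arrays being summed across all ranks. A rough sketch of that reduction step is below; it is written in C++ with the MPI bindings used elsewhere in this digest rather than the poster's Fortran, and the array sizes and names are illustrative assumptions, not taken from the post:

    #include <mpi.h>
    #include <vector>

    int main(int argc, char *argv[])
    {
        MPI::Init();

        const int natoms = 2 * 1000;   // illustrative: 2 atoms per molecule x nmol molecules
        std::vector<double> ax(natoms, 0.0), ay(natoms, 0.0), az(natoms, 0.0);
        std::vector<double> ax_tot(natoms), ay_tot(natoms), az_tot(natoms);

        // ... each rank accumulates the contributions for its own slice of pairs
        //     into ax/ay/az, exactly as in the force loop above ...

        // Sum the partial arrays element-wise over all ranks; every process
        // ends up with the total force on every atom.
        MPI::COMM_WORLD.Allreduce(&ax[0], &ax_tot[0], natoms, MPI::DOUBLE, MPI::SUM);
        MPI::COMM_WORLD.Allreduce(&ay[0], &ay_tot[0], natoms, MPI::DOUBLE, MPI::SUM);
        MPI::COMM_WORLD.Allreduce(&az[0], &az_tot[0], natoms, MPI::DOUBLE, MPI::SUM);

        MPI::Finalize();
    }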
Re: [OMPI users] RPM build errors when creating multiple rpms
On Tue, 2008-03-18 at 12:28 -0700, Michael Jennings wrote:
> The only package which should own %{_prefix} is something like setup or filesystem in the core OS package set. No openmpi RPM should ever own %{_prefix}, so it should never appear in %files, either by itself or with %dir.

Well, you're half correct. You're thinking that _prefix is always defined as /usr. But in the case where install_in_opt is defined, the spec redefines _prefix to /opt/%{name}/%{version}, in which case it is fine for one of the openmpi RPMs to claim that directory with a %dir directive.

However, I think you missed the point. I'm not suggesting they need to add a %{_prefix} statement to the %files section; I'm just pointing out what is not the source of the duplicated files. In other words, %dir %{_prefix} is not the same as %{_prefix} and won't cause all the files under _prefix to be included.

> "Files listed twice" messages are not errors, per se, and can usually be safely ignored. Those who are truly bothered by them can always add %exclude directives if they so choose.

It can't be safely ignored when it causes the rpm build to fail. Also, you don't want to use an %exclude, because that would prevent the specified files from ever getting included, which is not the desired result. It's much easier, and makes more sense, to remove the source of the duplicate inclusion -- which is exactly what they did, and why that's no longer a problem with the new spec file.

-C
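The distinction being drawn here can be seen in a small illustrative %files fragment (assumed for illustration only, not taken from the Open MPI spec file): %dir claims just the directory entry itself, so everything underneath it still has to be listed explicitly.

    # Illustrative %files fragment -- not from the Open MPI spec file.
    %files
    # Owns only the directory entry (e.g. /opt/openmpi/1.2.5 when install_in_opt
    # redefines _prefix), not the files inside it:
    %dir %{_prefix}
    # Each file or subtree below must therefore be packaged explicitly:
    %{_bindir}/*
    %{_libdir}/*
    %{_sysconfdir}/*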