Re: [OMPI users] OpenMPI 1.2.5 race condition / core dump with MPI_Reduce and MPI_Gather
On Wed, Feb 27, 2008 at 10:01:06AM -0600, Brian W. Barrett wrote:
> The only solution to this problem is to suck it up and audit all the code
> to eliminate calls to opal_progress() in situations where infinite
> recursion can result. It's going to be long and painful, but there's no
> quick fix (IMHO).

The trick is to call progress only from functions that are called directly by a user process. Never call progress from a callback function. The main offenders of this rule are calls to OMPI_FREE_LIST_WAIT(). They should be changed to OMPI_FREE_LIST_GET(), and the caller should deal with a NULL return value.

--
Gleb.
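[Editorial note: to make the distinction Gleb describes concrete, here is a minimal, generic C sketch of the two patterns. It does not use Open MPI's real free-list macros or their exact arguments (those are internal and version-specific); the names freelist_try_get, freelist_wait_get and progress_engine are hypothetical stand-ins, chosen only to show why a blocking wait that drives progress can recurse, while a non-blocking get that may return NULL cannot.]

-- snip --
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for illustration only -- not Open MPI's real API. */
#define POOL_SIZE 4
static void *pool[POOL_SIZE];
static int   pool_count = 0;            /* fragments currently available    */

static void *freelist_try_get(void)     /* non-blocking: may return NULL    */
{
    return (pool_count > 0) ? pool[--pool_count] : NULL;
}

static void progress_engine(void);      /* may fire callbacks that themselves
                                           need a fragment                   */

/* Pattern to avoid inside a callback: spin on progress until a fragment
 * appears.  If the callback was itself invoked from progress_engine(),
 * this re-enters progress and can recurse without bound once the pool
 * is exhausted. */
static void *freelist_wait_get(void)
{
    void *frag;
    while ((frag = freelist_try_get()) == NULL)
        progress_engine();              /* <-- re-entrant progress */
    return frag;
}

/* Preferred pattern inside a callback: try once and cope with NULL
 * (defer the work, apply back-pressure, ...) instead of recursing. */
static int callback_handle_fragment(void)
{
    void *frag = freelist_try_get();
    if (frag == NULL)
        return -1;                      /* caller must defer / flow-control */
    /* ... fill in and use the fragment here ... */
    return 0;
}

static void progress_engine(void)
{
    /* A real progress engine would poll the network and deliver incoming
     * fragments to callbacks such as callback_handle_fragment(). */
    (void)callback_handle_fragment();
}

int main(void)
{
    (void)freelist_wait_get;            /* shown for contrast only */
    printf("callback result with empty pool: %d\n", callback_handle_fragment());
    return 0;
}
-- snip --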
Re: [OMPI users] Cannot build 1.2.5
To clean this up for the web archives: we were able to get it to work by using '--disable-dlopen'.

Tim

Tim Prins wrote:
> Scott, I can replicate this on Big Red. Seems to be a libtool problem. I'll investigate...
> Thanks, Tim

Teige, Scott W wrote:
> Hi all,
> Attempting a build of 1.2.5 on a ppc machine, particulars:
>
> uname -a
> Linux s10c2b2 2.6.5-7.286-pseries64-lustre-1.4.10.1 #2 SMP Tue Jun 26 11:36:04 EDT 2007 ppc64 ppc64 ppc64 GNU/Linux
>
> Error message (many times):
>
> ../../../opal/.libs/libopen-pal.a(dlopen.o)(.opd+0x0): In function `__argz_next':
> : multiple definition of `__argz_next'
> ../../../opal/.libs/libopen-pal.a(libltdlc_la-ltdl.o)(.opd+0x0): first defined here
>
> Output from ./configure and make all is attached.
> Thanks for the help,
> S.
Re: [OMPI users] OpenMPI 1.2.5 race condition / core dump with MPI_Reduce and MPI_Gather
Hi, and thanks for the feedback everyone.

George Bosilca wrote:
> Brian is completely right. Here is a more detailed description of this problem.
> [...]
> On the other side, I hope that not many users write such applications. This is the best way to completely kill the performance of any MPI implementation: overloading one process with messages. This is exactly what MPI_Reduce and MPI_Gather do -- one process gets the final result and all other processes only have to send some data. This behavior only arises when the gather or the reduce uses a very flat tree, and only for short messages. Because of the short messages there is no handshake between the sender and the receiver, which makes all messages unexpected, and the flat tree guarantees that there will be a lot of small messages. If you add a barrier every now and then (every 100 iterations) this problem will never happen.

I have done some more testing. Of the tested parameters, I'm observing this behaviour with group sizes from 16-44, and from 1 to 32768 integers in MPI_Reduce. For MPI_Gather, I'm observing crashes with group sizes 16-44 and from 1 to 4096 integers (per node). In other words, it actually happens with other tree configurations and larger packet sizes :-/

By the way, I'm also observing crashes with MPI_Bcast (groups of size 4-44 with the root process (rank 0) broadcasting integer arrays of size 16384 and 32768). It looks like the root process is crashing. Can a sender crash because it runs out of buffer space as well?

-- snip --
/home/johnm/local/ompi/bin/mpirun -hostfile lamhosts.all.r360 -np 4 ./ompi-crash 16384 1 3000
{ 'groupsize' : 4, 'count' : 16384, 'bytes' : 65536, 'bufbytes' : 262144, 'iters' : 3000, 'bmno' : 1
[compute-0-0][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=104
mpirun noticed that job rank 0 with PID 16366 on node compute-0-0 exited on signal 15 (Terminated).
3 additional processes aborted (not shown)
-- snip --

George Bosilca wrote:
> One more thing: doing a lot of collectives in a loop and computing the total time is not the correct way to evaluate the cost of any collective communication, simply because you will favor all algorithms based on pipelining. There is plenty of literature about this topic.
> george.

As I said in the original e-mail, I had only thrown them in for a bit of sanity checking. I expected funny numbers, but not that OpenMPI would crash. The original idea was just to make a quick comparison of Allreduce, Allgather and Alltoall in LAM and OpenMPI. The opportunity for pipelining the operations there is rather small since they can't get much out of phase with each other.

Regards,
--
// John Markus Bjørndalen
// http://www.cs.uit.no/~johnm/
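[Editorial note: for readers following along, here is a minimal C sketch of the communication pattern under discussion -- a tight loop of rooted collectives that floods the root with small, unexpected messages -- together with the periodic-barrier mitigation George mentions. The file name, message size and iteration count are illustrative; this is not John's actual ompi-crash benchmark.]

-- snip --
/* flood.c -- illustrative repro sketch, not the original ompi-crash benchmark.
 * Build:  mpicc -O2 flood.c -o flood
 * Run:    mpirun -np 16 ./flood
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int count = 1024;   /* small payload: stays under the eager limit */
    const int iters = 3000;
    int rank, size, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *sendbuf = malloc(count * sizeof(int));
    int *recvbuf = malloc(count * sizeof(int));
    for (i = 0; i < count; i++) sendbuf[i] = rank;

    for (i = 0; i < iters; i++) {
        /* Every non-root rank sends eagerly toward rank 0; with a flat tree
         * and no handshake these arrive as unexpected messages at the root. */
        MPI_Reduce(sendbuf, recvbuf, count, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        /* Mitigation suggested in the thread: re-synchronize now and then so
         * the root can drain its queue of unexpected messages. */
        if (i % 100 == 0)
            MPI_Barrier(MPI_COMM_WORLD);
    }

    if (rank == 0)
        printf("done: %d iterations of MPI_Reduce with count=%d\n", iters, count);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
-- snip --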
Re: [OMPI users] OpenMPI 1.2.5 race condition / core dump with MPI_Reduce and MPI_Gather
On Thu, 28 Feb 2008, Gleb Natapov wrote:
> The trick is to call progress only from functions that are called
> directly by a user process. Never call progress from a callback function.
> The main offenders of this rule are calls to OMPI_FREE_LIST_WAIT(). They
> should be changed to OMPI_FREE_LIST_GET() and deal with a NULL return value.

Right -- and it should be easy to find more offenders by having an assert statement soak in the builds for a while (or by default in debug mode).

Was it ever part of the (or a) design to allow re-entrant calls to progress from the same calling thread? It can be done, but callers have to have a holistic view of how other components require and make the progress happen -- this isn't compatible with the Open MPI model of independent dynamically loadable components.

--
christian.b...@qlogic.com
(QLogic Host Solutions Group, formerly Pathscale)
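[Editorial note: the "assert statement soaking in the builds" Christian suggests could look roughly like the following generic C sketch -- a per-thread recursion-depth counter around the progress entry point. This is not Open MPI code; progress_depth, engine_poll and guarded_progress are hypothetical names used only to illustrate the idea, and a real multi-threaded build would make the counter thread-local.]

-- snip --
#include <assert.h>

/* Hypothetical sketch of a re-entrancy guard around a progress loop.
 * Shown single-threaded for brevity; with threads the counter would be
 * thread-local (e.g. __thread). */
static int progress_depth = 0;

static void engine_poll(void)
{
    /* poll network devices, fire completion callbacks, ... */
}

void guarded_progress(void)
{
    /* Trip immediately (in debug builds) if a callback invoked from
     * engine_poll() winds up calling progress again on the same thread. */
    assert(progress_depth == 0 && "re-entrant call to progress detected");

    ++progress_depth;
    engine_poll();
    --progress_depth;
}

int main(void)
{
    guarded_progress();   /* legal: depth goes 0 -> 1 -> 0 */
    return 0;
}
-- snip --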
Re: [OMPI users] OpenMPI 1.2.5 race condition / core dump with MPI_Reduce and MPI_Gather
In this particular case, I don't think the solution is that obvious. If you look at the stack in the original email, you will notice how we get into this. The problem here is that the FREE_LIST_WAIT is used to get a fragment to store an unexpected message. If this macro returns NULL (in other words, the PML is unable to store the unexpected message), what do you expect to happen? Drop the message? Ask the BTL to hold it for a while? How about ordering? It is unfortunate to say it only a few days after we had the discussion about flow control, but the only correct solution here is to add PML-level flow control ...

  george.

On Feb 28, 2008, at 2:55 PM, Christian Bell wrote:
> On Thu, 28 Feb 2008, Gleb Natapov wrote:
>> The trick is to call progress only from functions that are called
>> directly by a user process. Never call progress from a callback function.
>> The main offenders of this rule are calls to OMPI_FREE_LIST_WAIT(). They
>> should be changed to OMPI_FREE_LIST_GET() and deal with a NULL return value.
>
> Right -- and it should be easy to find more offenders by having an assert statement soak in the builds for a while (or by default in debug mode).
>
> Was it ever part of the (or a) design to allow re-entrant calls to progress from the same calling thread? It can be done, but callers have to have a holistic view of how other components require and make the progress happen -- this isn't compatible with the Open MPI model of independent dynamically loadable components.
>
> --
> christian.b...@qlogic.com (QLogic Host Solutions Group, formerly Pathscale)
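[Editorial note: the email only names "PML-level flow control" without specifying a design. As a purely conceptual illustration (not a proposal for Open MPI's actual PML), a credit-based scheme is one common way to do sender-side flow control: the receiver grants a fixed number of send credits per peer, and a sender that runs out must wait for credits to be returned, which bounds how many unexpected messages the receiver ever has to buffer. A rough sketch in C, with all names hypothetical:]

-- snip --
#include <stdio.h>

/* Conceptual credit-based flow-control sketch (all names hypothetical).
 * The receiver advertises CREDITS outstanding eager messages per peer; a
 * sender consumes one credit per eager send and stalls (or falls back to a
 * rendezvous protocol) when credits reach zero.  Credits come back once the
 * receiver has matched/consumed the corresponding fragments, so the number
 * of unexpected messages it must buffer is bounded by CREDITS. */
enum { CREDITS = 8 };

struct peer {
    int send_credits;                 /* eager sends we may still issue */
};

static int try_eager_send(struct peer *p)
{
    if (p->send_credits == 0)
        return 0;                     /* out of credits: caller must wait/queue */
    --p->send_credits;
    /* ... hand the message to the transport here ... */
    return 1;
}

static void credits_returned(struct peer *p, int n)
{
    p->send_credits += n;             /* piggybacked on traffic from the peer */
}

int main(void)
{
    struct peer p = { CREDITS };
    int i, sent = 0;

    for (i = 0; i < 20; i++)          /* try to blast 20 messages */
        sent += try_eager_send(&p);
    printf("sent %d of 20 before stalling\n", sent);          /* prints 8 */

    credits_returned(&p, 4);          /* receiver drained some fragments */
    printf("after credit return: %d\n", try_eager_send(&p));  /* prints 1 */
    return 0;
}
-- snip --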
Re: [OMPI users] OpenMPI 1.2.5 race condition / core dump with MPI_Reduce and MPI_Gather
On Feb 28, 2008, at 2:45 PM, John Markus Bjørndalen wrote:
> Hi, and thanks for the feedback everyone.
> [...]
> I have done some more testing. Of the tested parameters, I'm observing this behaviour with group sizes from 16-44, and from 1 to 32768 integers in MPI_Reduce. For MPI_Gather, I'm observing crashes with group sizes 16-44 and from 1 to 4096 integers (per node). In other words, it actually happens with other tree configurations and larger packet sizes :-/

This is the limit for the rendez-vous protocol over TCP, and it is the upper limit where this problem will arise. I strongly doubt that it is possible to create the same problem with messages larger than the eager size of your BTL ...

> By the way, I'm also observing crashes with MPI_Bcast (groups of size 4-44 with the root process (rank 0) broadcasting integer arrays of size 16384 and 32768). It looks like the root process is crashing. Can a sender crash because it runs out of buffer space as well?

I don't think the root crashed. I guess that one of the other nodes crashed, the root got a bad socket (which is what the first error message seems to indicate), and got terminated. As the output is not synchronized between the nodes, one cannot rely on its order nor its contents. Moreover, mpirun reports that the root was killed with signal 15, which is how we clean up the remaining processes when we detect that something really bad (like a seg fault) happened in the parallel application.

> -- snip --
> /home/johnm/local/ompi/bin/mpirun -hostfile lamhosts.all.r360 -np 4 ./ompi-crash 16384 1 3000
> { 'groupsize' : 4, 'count' : 16384, 'bytes' : 65536, 'bufbytes' : 262144, 'iters' : 3000, 'bmno' : 1
> [compute-0-0][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=104
> mpirun noticed that job rank 0 with PID 16366 on node compute-0-0 exited on signal 15 (Terminated).
> 3 additional processes aborted (not shown)
> -- snip --
>
> [...]
> As I said in the original e-mail, I had only thrown them in for a bit of sanity checking. I expected funny numbers, but not that OpenMPI would crash. The original idea was just to make a quick comparison of Allreduce, Allgather and Alltoall in LAM and OpenMPI. The opportunity for pipelining the operations there is rather small since they can't get much out of phase with each other.

There are many differences between the rooted and non-rooted collectives. All errors that you reported so far are related to rooted collectives, which makes sense. I didn't state that it is normal for Open MPI to misbehave. I wonder if you can get such errors with non-rooted collectives (such as allreduce, allgather and alltoall), or with messages larger than the eager size?

If you type "ompi_info --param btl tcp", you will see what the eager size is for the TCP BTL. Everything smaller than this size will be sent eagerly, has the opportunity to become unexpected on the receiver side, and can lead to this problem. As a quick test, you can add "--mca btl_tcp_eager_limit 2048" to your mpirun command line, and this problem will not happen for sizes over 2K. This was the original solution for the flow control problem. If you know your application will generate thousands of unexpected messages, then you should set the eager limit to zero.

  Thanks,
    george.

> Regards,
> --
> // John Markus Bjørndalen
> // http://www.cs.uit.no/~johnm/
[OMPI users] OpenMPI on intel core 2 duo machine for parallel computation.
Dear All,

I am a graduate student working on molecular dynamics simulation. My professor/adviser is planning to buy Linux-based clusters. But before that he wanted me to parallelize a serial code on molecular dynamics simulations and test it on an Intel Core 2 Duo machine with Fedora 8 on it. I have parallelised my code in Fortran 77 using MPI. I have installed OpenMPI, and I compile the code using "mpif77 -g -o code code.f" and run it using "mpirun -np 2 ./code". I have a couple of questions to ask you:

1. Is it possible to use a dual-core or any multi-core machine for parallel computations?
2. Is that the right procedure to run a parallel job as explained above (compiling with "mpif77 -g -o code code.f" and running with "mpirun -np 2 ./code")?
3. How do I know my code is being run on both the processors? (I am a chemical engineering student and new to computational aspects.)
4. If what I have done is wrong, can anyone please explain to me how to do it?

Here are my CPU details:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 Duo CPU E6750 @ 2.66GHz
stepping        : 11
cpu MHz         : 2000.000
cache size      : 4096 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr lahf_lm
bogomips        : 5322.87
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 1
(identical to processor 0 except core id : 1 and bogomips : 5319.97)

Thank you
Ramesh
[OMPI users] ScaLapack and BLACS on Leopard
Hey Folks,

Has anyone got ScaLAPACK and BLACS working, and not just compiled, under OS X 10.5 in 64-bit mode? The FAQ site directions were followed and everything compiles just fine, but ALL of the single precision routines and many of the double precision routines in the TESTING directory fail with system lib errors. I've gotten some interesting errors and am wondering what the magic touch is.

Regards,
Greg
Re: [OMPI users] OpenMPI on intel core 2 duo machine for parallel computation.
On Feb 28, 2008, at 5:32 PM, Chembeti, Ramesh (S&T-Student) wrote:
> Dear All,
> I am a graduate student working on molecular dynamics simulation. My professor/adviser is planning to buy Linux-based clusters. But before that he wanted me to parallelize a serial code on molecular dynamics simulations and test it on an Intel Core 2 Duo machine with Fedora 8 on it. I have parallelised my code in Fortran 77 using MPI. I have installed OpenMPI, and I compile the code using "mpif77 -g -o code code.f"

I would make sure to always use some sort of optimizer:

   mpif77 -O2 -o code code.f

at least, or higher (-O3, -fastsse) if it still gives the right results; look up your compiler docs.

> and run it using "mpirun -np 2 ./code". I have a couple of questions to ask you:
> 1. Is it possible to use a dual-core or any multi-core machine for parallel computations?

Yes. A core is really another CPU; dual core is just two CPUs packed (with some changes) into a single socket, so to MPI it is the same as a dual-CPU machine. We use dual-socket dual-core machines all the time (mpirun -np 4 app).

> 2. Is that the right procedure to run a parallel job as explained above (compiling with "mpif77 -g -o code code.f" and running with "mpirun -np 2 ./code")?

Yes, this is correct. Once you have more than one node you will need to somehow tell mpirun to use host x and host y, but right now it just assumes 'localhost', which is correct. Check out: http://www.open-mpi.org/faq/?category=running

> 3. How do I know my code is being run on both the processors? (I am a chemical engineering student and new to computational aspects.)

Run 'top'. You should see two processes, one for each CPU, at 100%; there should be a system summary at the top that gives you a percent for the entire machine -- make sure idle is 0%.

> 4. If what I have done is wrong, can anyone please explain to me how to do it?

Nope, looks like a good start. Always check out the man pages:

   man mpirun

If you have cluster people on campus, it is best not to spend your time being admins: have some Unix SAs run the cluster while you focus on your science. But that's my opinion (and observations).
> [CPU details quoted from the original message trimmed]

Brock Palen
Center for Advanced Computing
bro...@umich.edu
(734)936-1985
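[Editorial note: besides watching 'top', a tiny MPI test program makes it easy to confirm that two ranks really are running. The sketch below is a generic example in C rather than the poster's Fortran 77, purely for illustration; build it with mpicc and run it with mpirun -np 2.]

-- snip --
/* ranks.c -- print one line per MPI process to confirm both cores are used.
 * Build:  mpicc -o ranks ranks.c
 * Run:    mpirun -np 2 ./ranks
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id: 0 or 1   */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total processes started     */
    MPI_Get_processor_name(name, &namelen); /* host the process landed on  */

    printf("rank %d of %d running on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}
-- snip --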
Re: [OMPI users] OpenMPI on intel core 2 duo machine for parallel computation.
Dear Mr. Palen,

Thank you very much for your instant reply. I will let you know if I face any problems in the future.

Ramesh

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Brock Palen
Sent: Thursday, February 28, 2008 4:51 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI on intel core 2 duo machine for parallel computation.

> [Brock Palen's reply of Feb 28, 2008, quoted in full above, trimmed here]
Re: [OMPI users] OpenMPI on intel core 2 duo machine for parallel computation.
On Thu, 2008-02-28 at 16:32 -0600, Chembeti, Ramesh (S&T-Student) wrote:
> Dear All,
> I am a graduate student working on molecular dynamics simulation. [...] I have parallelised my code in Fortran 77 using MPI. I have installed OpenMPI, and I compile the code using "mpif77 -g -o code code.f" and run it using "mpirun -np 2 ./code". I have a couple of questions to ask you:

You have actually parallelised it, right? As in, built parallelisation with MPI_ calls?
Re: [OMPI users] OpenMPI on intel core 2 duo machine for parallel computation.
Yes, I have used MPI subroutines to parallelize it. If you want me to send my code I can do that, because this is my first effort towards parallel computing, so your suggestions and ideas are valuable to me.

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Terry Frankcombe
Sent: Thursday, February 28, 2008 6:03 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI on intel core 2 duo machine for parallel computation.

On Thu, 2008-02-28 at 16:32 -0600, Chembeti, Ramesh (S&T-Student) wrote:
> [original question quoted above, trimmed here]

You have actually parallelised it, right? As in, built parallelisation with MPI_ calls?