[OMPI users] performance of MPI_Iallgatherv
Hi,

I'm testing the non-blocking collectives of OpenMPI-1.8. I have two nodes connected with Infiniband, and I perform an allgather on 128MB of data in total.

I split the 128MB of data into eight pieces, and perform computation and MPI_Iallgatherv() on one piece of data in each iteration, hoping that the MPI_Iallgatherv() of the previous iteration can be overlapped with the computation of the current iteration. MPI_Wait() is called for each outstanding request after the last iteration.

However, the total communication time (including the final wait time) is similar to that of the traditional blocking MPI_Allgatherv, even slightly higher.

The test pseudo-code follows; the source code is attached.

===

Using MPI_Allgatherv:

for( i=0; i<8; i++ )
{
  // computation
  mytime( t_begin );
  computation;
  mytime( t_end );
  comp_time += (t_end - t_begin);

  // communication
  t_begin = t_end;
  MPI_Allgatherv();
  mytime( t_end );
  comm_time += (t_end - t_begin);
}

Using MPI_Iallgatherv:

for( i=0; i<8; i++ )
{
  // computation
  mytime( t_begin );
  computation;
  mytime( t_end );
  comp_time += (t_end - t_begin);

  // communication
  t_begin = t_end;
  MPI_Iallgatherv();
  mytime( t_end );
  comm_time += (t_end - t_begin);
}

// wait for the non-blocking allgathers to complete
mytime( t_begin );
for( i=0; i<8; i++ )
  MPI_Wait;
mytime( t_end );
wait_time = t_end - t_begin;

==

The results for Allgatherv are:

[cmy@gnode102 test_nbc]$ /home3/cmy/czh/opt/ompi-1.8/bin/mpirun -n 2 --host gnode102,gnode103 ./Allgatherv 128 2 | grep time
Computation time : 8481279 us
Communication time: 319803 us

The results for Iallgatherv are:

[cmy@gnode102 test_nbc]$ /home3/cmy/czh/opt/ompi-1.8/bin/mpirun -n 2 --host gnode102,gnode103 ./Iallgatherv 128 2 | grep time
Computation time : 8479177 us
Communication time: 199046 us
Wait time: 139841 us

So, does this mean that the current OpenMPI implementation of MPI_Iallgatherv does not support offloading collective communication to dedicated cores or to the network interface?
Best regards,
Zehan

/* ---- first attached source file ---- */

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define NS 8    // number of segments

struct timeval tv;
#define mytime(time) do{ \
        gettimeofday(&tv,NULL); \
        time=(unsigned long)(tv.tv_sec*1000000+tv.tv_usec); /* microseconds */ \
    }while(0)

int main(int argc, char** argv)
{
    MPI_Init(&argc,&argv);

    int size;
    int rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if(argc<2) {
        printf("Usage: ./allgather m [n]\n");
        printf("n=1, represent m KB;");
        printf("n=2, represent m MB;");
        exit(-1);
    }

    int global_size;   // the amount of data to allgather
    int local_size;    // the amount of data that each process holds
    if(argc >= 2)
        global_size = atoi(argv[1]);
    if(argc >= 3) {
        if(atoi(argv[2])==2)
            global_size = global_size*1024*1024;  // n=2, xx MB
        if(atoi(argv[2])==1)
            global_size = global_size*1024;       // n=1, xx KB
    }
    local_size = global_size/size;  // each process holds 1/size of the data

    int * global_buf;  // recvbuf
    int * local_buf;   // sendbuf
    global_buf = (int *) malloc(global_size*sizeof(int));
    local_buf  = (int *) malloc(local_size*sizeof(int));
    memset(global_buf,0,global_size*sizeof(int));
    memset(local_buf,0,local_size*sizeof(int));

    int i,j,k;
    int *recvcnts;  // recvcnts of MPI_Allgatherv
    int *displs;    // displs of MPI_Allgatherv
    recvcnts = (int *) malloc(size*sizeof(int));
    displs   = (int*) malloc(size*sizeof(int));
    for(i=0; i
        /* [remainder of this listing missing] */

/* ---- second attached source file ---- */

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define NS 8    // number of segments

struct timeval tv;
#define mytime(time) do{ \
        gettimeofday(&tv,NULL); \
        time=(unsigned long)(tv.tv_sec*1000000+tv.tv_usec); /* microseconds */ \
    }while(0)

int main(int argc, char** argv)
{
    MPI_Init(&argc,&argv);

    int size;
    int rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if(argc<2) {
        printf("Usage: ./allgather m [n]\n");
        printf("n=1, represent m KB;");
        printf("n=2, represent m MB;");
        exit(-1);
    }

    int global_size;   // the amount of data to allgather
    int local_size;    // the amount of data that each process holds
    if(argc >= 2)
        global_size = atoi(argv[1]);
    if(argc >= 3) {
        if(atoi(argv[2])==2)
            global_size = global_size*1024*1024;  // n=2, xx MB
        if(atoi(argv[2])==1)
            global_size = global_size*1024;       // n=1, xx KB
    }
    local_size = global_size/size;

    int * global_buf;  // recvbuf
    int * local_buf;   // sendbuf
    global_buf = (int *) malloc(global_size*sizeof(int));
    local_buf  = (int *) malloc(local_size*sizeof(int));
    memset(global_buf,0,global_size*sizeof(int));
    memset(local_buf,0,lo
        /* [listing truncated] */
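[Editor's note: the following is a minimal, self-contained sketch of the segmented compute/Iallgatherv overlap that the pseudo-code above describes. The segment count, the buffer sizes, and the compute() placeholder are illustrative assumptions, not the attached benchmark; all outstanding requests are completed with a single MPI_Waitall(), as suggested later in this thread.]

#include <mpi.h>
#include <stdlib.h>

#define NSEG 8                         /* number of segments (illustrative) */

static void compute(int *buf, int n)   /* stand-in for the real computation */
{
    for (int i = 0; i < n; i++)
        buf[i] += i;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* 128 MB of ints in total; assumes size divides the total and
       NSEG divides the per-rank share. */
    int global_count = 32 * 1024 * 1024;
    int local_count  = global_count / size;
    int seg_count    = local_count / NSEG;

    int *sendbuf = calloc(local_count, sizeof(int));
    int *recvbuf = calloc(global_count, sizeof(int));
    /* one recvcounts/displs array per segment: the arrays handed to a
       non-blocking collective must stay untouched until it completes */
    int *recvcounts = malloc((size_t)NSEG * size * sizeof(int));
    int *displs     = malloc((size_t)NSEG * size * sizeof(int));
    MPI_Request reqs[NSEG];

    for (int s = 0; s < NSEG; s++) {
        /* computation on this rank's segment s; the Iallgatherv posted
           for segment s-1 is (hopefully) still in flight meanwhile */
        compute(sendbuf + s * seg_count, seg_count);

        /* start gathering segment s from every rank */
        int *rc = recvcounts + s * size;
        int *dp = displs + s * size;
        for (int p = 0; p < size; p++) {
            rc[p] = seg_count;
            dp[p] = p * local_count + s * seg_count;
        }
        MPI_Iallgatherv(sendbuf + s * seg_count, seg_count, MPI_INT,
                        recvbuf, rc, dp, MPI_INT,
                        MPI_COMM_WORLD, &reqs[s]);
    }

    /* complete all outstanding segments at once */
    MPI_Waitall(NSEG, reqs, MPI_STATUSES_IGNORE);

    free(sendbuf); free(recvbuf); free(recvcounts); free(displs);
    MPI_Finalize();
    return 0;
}

Note that Open MPI's non-blocking collectives in this version generally make progress only while the process is inside an MPI call, so without calls such as MPI_Test() during the compute phase little of the transfer actually overlaps, which is consistent with the timings reported above.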
Re: [OMPI users] performance of MPI_Iallgatherv
Hi Matthieu,

Thanks for your suggestion. I tried MPI_Waitall(), but the results are the
same. It seems the communication didn't overlap with computation.

Regards,
Zehan

On 4/5/14, Matthieu Brucher wrote:
> Hi,
>
> Try waiting on all gathers at the same time, not one by one (this is
> what non blocking collectives are made for!)
>
> Cheers,
>
> Matthieu
>
> 2014-04-05 10:35 GMT+01:00 Zehan Cui :
>> [snip: original message quoted above]
>
> --
> Information System Engineer, Ph.D.
> Blog: http://matt.eifelle.com
> LinkedIn: http://www.linkedin.com/in/matthieubrucher
> Music band: http://liliejay.com/

--
Best Regards
Zehan Cui (崔泽汉)
---
Institute of Computing Technology, Chinese Academy of Sciences.
No.6 Kexueyuan South Road Zhongguancun, Haidian District, Beijing, China
Re: [OMPI users] performance of MPI_Iallgatherv
Thanks, it looks like I have to do the overlapping myself.

On Tue, Apr 8, 2014 at 5:40 PM, Matthieu Brucher wrote:
> Yes, usually the MPI libraries don't allow that. You can launch
> another thread for the computation, make calls to MPI_Test during that
> time and join at the end.
>
> Cheers,
>
> 2014-04-07 4:12 GMT+01:00 Zehan Cui :
> > [snip: earlier messages quoted above]
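[Editor's note: a minimal sketch of the workaround Matthieu describes: run the computation in a separate thread while the thread that owns MPI keeps calling MPI_Test() so the collective can make progress. The compute() body, the buffer sizes and the thread setup are illustrative assumptions, not code from this thread.]

#include <mpi.h>
#include <pthread.h>
#include <stdlib.h>

/* stand-in for the real computation; makes no MPI calls */
static void *compute(void *arg)
{
    double *acc = arg;
    for (long i = 0; i < 200000000L; i++)
        *acc += 1e-9;
    return NULL;
}

int main(int argc, char **argv)
{
    int provided, size;
    /* FUNNELED is enough here: only the main thread makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int count = 1 << 20;                           /* illustrative size */
    int *sendbuf = calloc(count, sizeof(int));
    int *recvbuf = calloc((size_t)count * size, sizeof(int));
    int *recvcounts = malloc(size * sizeof(int));
    int *displs     = malloc(size * sizeof(int));
    for (int p = 0; p < size; p++) {
        recvcounts[p] = count;
        displs[p]     = p * count;
    }

    MPI_Request req;
    MPI_Iallgatherv(sendbuf, count, MPI_INT,
                    recvbuf, recvcounts, displs, MPI_INT,
                    MPI_COMM_WORLD, &req);

    /* the computation runs in a worker thread ... */
    pthread_t worker;
    double acc = 0.0;
    pthread_create(&worker, NULL, compute, &acc);

    /* ... while the main thread repeatedly tests the request, which
       drives the library's progress engine */
    int done = 0;
    while (!done)
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);

    pthread_join(worker, NULL);

    free(sendbuf); free(recvbuf); free(recvcounts); free(displs);
    MPI_Finalize();
    return 0;
}

Compile with something like mpicc -pthread; how much overlap this actually buys still depends on the implementation's progress engine.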
Re: [OMPI users] how to get mpirun to scale from 16 to 64 cores
Hi Yuping,

Maybe using multiple threads inside a socket and MPI between sockets is a
better choice for such a NUMA platform. The threads can exploit the shared
memory, and MPI can alleviate the cost of non-uniform memory access; a rough
sketch of such a hybrid setup follows after the quoted message.

regards,
Zehan

On Tue, Jun 17, 2014 at 6:19 AM, Yuping Sun wrote:
> Dear All:
>
> I bought a 64-core workstation and installed NASA fun3d with open mpi
> 1.6.5. Then I started to test run fun3d using 16, 32, 48 cores. However,
> the performance of the fun3d runs is bad. I got the data below.
>
> The run command is (for 32 cores as an example):
>
>   mpiexec -np 32 --bysocket --bind-to-socket \
>     ~ysun/Codes/NASA/fun3d-12.3-66687/Mpi/FUN3D_90/nodet_mpi \
>     --time_timestep_loop --animation_freq -1 > screen.dump_bs30
>
>   CPUs   time    iterations   time/it
>   60     678 s   30 it        22.61 s
>   48     702 s   30 it        23.40 s
>   32     734 s   30 it        24.50 s
>   16     894 s   30 it        29.80 s
>
> You can see that using 60 cores, fun3d completes 30 iterations in 678
> seconds, roughly 22.61 seconds per iteration. Using 16 cores, it completes
> 30 iterations in 894 seconds, roughly 29.8 seconds per iteration.
>
> The data above show that the fun3d run using mpirun does not scale at all!
> I used to run fun3d with mpirun on an 8-core workstation, and it scales
> well. The same job run on a Linux cluster scales well.
>
> Would you all give me some advice on how to fix the performance loss when
> I use more cores, or on how to run mpirun with the proper options to get
> linear scaling when going from 16 to 32 to 48 cores?
>
> Thank you.
>
> Yuping
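[Editor's note: a rough sketch of the hybrid layout suggested at the top of this reply, with one MPI rank per socket and threads inside the socket. The OpenMP region, thread counts and rank placement are illustrative assumptions, not fun3d-specific advice.]

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    /* FUNNELED: only the master thread makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* shared-memory work inside the socket */
    #pragma omp parallel
    {
        printf("rank %d: thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    /* MPI communication between sockets happens here, on the master
       thread only, outside the parallel region */
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}

One would then launch one rank per socket and give each rank that socket's cores, e.g. with something like "mpiexec -np <sockets> --bysocket --bind-to-socket" (the flags already used above) and OMP_NUM_THREADS set to the number of cores per socket; the exact options depend on the Open MPI version.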
[OMPI users] unknown option "--tree-spawn" with OpenMPI-1.7.1
Hi,

I have just installed OpenMPI-1.7.1 and cannot get it running. Here are the
error messages:

[cmy@gLoginNode1 test_nbc]$ mpirun -n 4 -host gnode100 ./hello
[gnode100:31789] Error: unknown option "--tree-spawn"
input in flex scanner failed
[gLoginNode1:14920] [[62542,0],0] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 362
[gLoginNode1:14920] [[62542,0],0] attempted to send to [[62542,0],1]: tag 15
[gLoginNode1:14920] [[62542,0],0] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file base/grpcomm_base_xcast.c at line 166

I have run it on several nodes and got the same messages.

-
Zehan Cui
Re: [OMPI users] unknown option "--tree-spawn" with OpenMPI-1.7.1
I think the PATH setting is OK. I forgot to mention that it runs well on the
local machine.

The PATH setting on the local machine is:

[cmy@gLoginNode1 ~]$ echo $PATH
/home/cmy/clc/benchmarks/nasm-2.09.10:*/home3/cmy/czh/opt/ompi-1.7.1/bin/*:/home3/cmy/czh/opt/autoconf-2.69/bin/:/home3/cmy/czh/opt/mvapich2-1.9/bin/:/home/cmy/wr/local/ft-mvapich2-1.8a2/bin:/home/cmy/wr/local/mvapich2-1.8a2/bin:/usr/mpi/gcc/mvapich2-1.4.1/bin:/home3/cmy/czh/ompi/bin/:/home/cmy/huangyb/gem5/gcc/gcc-4.3/bin:/home/cmy/huangyb/gem5/swig/bin/:/home/cmy/huangyb/gem5/scons/bin::/home/cmy/huangyb/local/mercurial/bin:/home/cmy/huangyb/local/python-2.7.3/bin/:/home/SOFT/intel/Compiler/11.0/083/bin/intel64:/usr/mpi/gcc/openmpi-1.4.2/bin/:/home/SOFT/intel/Compiler/11.0/083/bin/intel64:/home/cmy/tgm/cmake/bin:/usr/local/mvapich2/bin:/usr/local/mpich-pgi/bin:/opt/pgi/linux86-64/7.0-2/bin:/usr/bin:/usr/lib64/qt-3.3/bin:/usr/kerberos/bin:/opt/gridviewnew/pbs//dispatcher-sched//bin:/opt/gridviewnew/pbs//dispatcher-sched//sbin:/opt/gridviewnew/pbs//dispatcher//bin:/opt/gridviewnew/pbs//dispatcher//sbin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/home/cmy/zxx/work_spring_2011/iaca-lin32/bin:/home/cmy/bin:/home/tgm/ljj/software/dmidecode-2.11/:/usr/local/oski_2007/include

[cmy@gLoginNode1 ~]$ echo $LD_LIBRARY_PATH
*/home3/cmy/czh/opt/ompi-1.7.1/lib/*:/home3/cmy/czh/opt/mvapich2-1.9/lib/:/home/cmy/wr/local/ft-mvapich2-1.8a2/lib:/home/cmy/wr/local/mvapich2-1.8a2/lib:/usr/mpi/gcc/mvapich2-1.4.1/lib:/home3/cmy/czh/ompi/lib/:/home/cmy/huangyb/gem5/gcc/gcc-4.3/lib64:/home/cmy/huangyb/gem5/gcc/gcc-4.3/lib/:/home/cmy/huangyb/local/python-2.7.3/lib/:/usr/local/lib64:/usr/local/lib:/home/cmy/clc/DRAMSim2:/home/SOFT/intel/Compiler/11.0/083/lib/intel64:/home/cmy/zxx/oski-icc/lib/oski:/usr/mpi/gcc/openmpi-1.4.2/lib/:/usr/lib/python2.4/config:/home/SOFT/intel/Compiler/11.0/083/mkl/lib/em64t:/home/cmy/tgm/hpx/build/linux/lib:/home/cmy/yanjie/boost/lib:/usr/local/mvapich2/lib:/home/cmy/yanjie/qthread/lib:/opt/gridviewnew/pbs//dispatcher//lib::/usr/local/lib64:/usr/local/lib:/home/cmy/zxx/work_spring_2011/iaca-lin32/lib

The PATH setting on gnode100 is the same:

[cmy@gnode100 ~]$ echo $PATH
/home/cmy/clc/benchmarks/nasm-2.09.10:*/home3/cmy/czh/opt/ompi-1.7.1/bin/*:/home3/cmy/czh/opt/autoconf-2.69/bin/:/home3/cmy/czh/opt/mvapich2-1.9/bin/:/home/cmy/wr/local/ft-mvapich2-1.8a2/bin:/home/cmy/wr/local/mvapich2-1.8a2/bin:/usr/mpi/gcc/mvapich2-1.4.1/bin:/home3/cmy/czh/ompi/bin/:/home/cmy/huangyb/gem5/gcc/gcc-4.3/bin:/home/cmy/huangyb/gem5/swig/bin/:/home/cmy/huangyb/gem5/scons/bin::/home/cmy/huangyb/local/mercurial/bin:/home/cmy/huangyb/local/python-2.7.3/bin/:/home/SOFT/intel/Compiler/11.0/083/bin/intel64:/usr/mpi/gcc/openmpi-1.4.2/bin/:/home/SOFT/intel/Compiler/11.0/083/bin/intel64:/home/cmy/tgm/cmake/bin:/usr/local/mvapich2/bin:/usr/local/mpich-pgi/bin:/opt/pgi/linux86-64/7.0-2/bin:/usr/bin:/usr/lib64/qt-3.3/bin:/usr/kerberos/bin:/opt/gridviewnew/pbs//dispatcher-sched//bin:/opt/gridviewnew/pbs//dispatcher-sched//sbin:/opt/gridviewnew/pbs//dispatcher//bin:/opt/gridviewnew/pbs//dispatcher//sbin:/usr/local/bin:/bin:/usr/bin:/home/cmy/zxx/work_spring_2011/iaca-lin32/bin:/home/cmy/bin:/home/tgm/ljj/software/dmidecode-2.11/:/usr/local/oski_2007/include

[cmy@gnode100 ~]$ echo $LD_LIBRARY_PATH
*/home3/cmy/czh/opt/ompi-1.7.1/lib/*:/home3/cmy/czh/opt/mvapich2-1.9/lib/:/home/cmy/wr/local/ft-mvapich2-1.8a2/lib:/home/cmy/wr/local/mvapich2-1.8a2/lib:/usr/mpi/gcc/mvapich2-1.4.1/lib:/home3/cmy/czh/ompi/lib/:/home/cmy/huangyb/gem5/gcc/gcc-4.3/lib64:/home/cmy/huangyb/gem5/gcc/gcc-4.3/lib/:/home/cmy/huangyb/local/python-2.7.3/lib/:/usr/local/lib64:/usr/local/lib:/home/cmy/clc/DRAMSim2:/home/SOFT/intel/Compiler/11.0/083/lib/intel64:/home/cmy/zxx/oski-icc/lib/oski:/usr/mpi/gcc/openmpi-1.4.2/lib/:/usr/lib/python2.4/config:/home/SOFT/intel/Compiler/11.0/083/mkl/lib/em64t:/home/cmy/tgm/hpx/build/linux/lib:/home/cmy/yanjie/boost/lib:/usr/local/mvapich2/lib:/home/cmy/yanjie/qthread/lib:/opt/gridviewnew/pbs//dispatcher//lib::/usr/local/lib64:/usr/local/lib:/home/cmy/zxx/work_spring_2011/iaca-lin32/lib

Best Regards
Zehan Cui (崔泽汉)
---
Institute of Computing Technology, Chinese Academy of Sciences.
No.6 Kexueyuan South Road Zhongguancun, Haidian District, Beijing, China

On Fri, Jun 14, 2013 at 9:32 PM, Ralph Castain wrote:
> You aren't setting the path correctly on your backend machines, and so
> they are picking up an older version of OMPI.
>
> On Jun 14, 2013, at 2:08 AM, Zehan Cui wrote:
> > [snip: original message quoted above]
Re: [OMPI users] unknown option "--tree-spawn" with OpenMPI-1.7.1
Thanks. That's exactly the problem. When I add the prefix to the mpirun
command, everything works fine.

- Zehan Cui

On Fri, Jun 14, 2013 at 10:25 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> Check the PATH you get when you run non-interactively on the remote
> machine:
>
>     ssh gnode100 env | grep PATH
>
> On Jun 14, 2013, at 10:09 AM, Zehan Cui wrote:
> > [snip: PATH and LD_LIBRARY_PATH listings quoted in the previous message]
[OMPI users] MPI_Iallgatherv performance
Hi,

OpenMPI-1.7.1 is announced to support MPI-3 functionality such as
non-blocking collectives. I have tested MPI_Iallgatherv on an 8-node
cluster; however, I got bad performance. MPI_Iallgatherv blocks the program
for even longer than the traditional MPI_Allgatherv. The test pseudo-code
and results follow.

===

Using MPI_Allgatherv:

for( i=0; i<8; i++ )
{
  // computation
  mytime( t_begin );
  computation;
  mytime( t_end );
  comp_time += (t_end - t_begin);

  // communication
  t_begin = t_end;
  MPI_Allgatherv();
  mytime( t_end );
  comm_time += (t_end - t_begin);
}

result:
comp_time = 811,630 us
comm_time = 342,284 us

Using MPI_Iallgatherv:

for( i=0; i<8; i++ )
{
  // computation
  mytime( t_begin );
  computation;
  mytime( t_end );
  comp_time += (t_end - t_begin);

  // communication
  t_begin = t_end;
  MPI_Iallgatherv();
  mytime( t_end );
  comm_time += (t_end - t_begin);
}

// wait for the non-blocking allgathers to complete
mytime( t_begin );
for( i=0; i<8; i++ )
  MPI_Wait;
mytime( t_end );
wait_time = t_end - t_begin;

result:
comp_time = 817,397 us
comm_time = 1,183,511 us
wait_time = 1,294,330 us

==

From the results, we can tell that MPI_Iallgatherv blocks the program for
1,183,511 us, much longer than MPI_Allgatherv's 342,284 us. Even worse, it
still takes another 1,294,330 us to wait for the non-blocking
MPI_Iallgatherv operations to finish.

-
Zehan Cui
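[Editor's note: the timings above are taken with a gettimeofday()-based mytime() macro in the attached code; an equivalent, portable way to take the same wall-clock measurements is MPI_Wtime(). A minimal helper sketch follows; the names mirror the pseudo-code and are illustrative only.]

#include <mpi.h>

/* wall-clock time in microseconds, to match the figures reported above */
static double now_us(void)
{
    return MPI_Wtime() * 1.0e6;   /* MPI_Wtime() returns seconds */
}

/* usage inside the measurement loop of the pseudo-code:
 *
 *     double t_begin = now_us();
 *     // computation
 *     double t_end = now_us();
 *     comp_time += t_end - t_begin;
 *
 *     t_begin = t_end;
 *     MPI_Iallgatherv(...);          // or MPI_Allgatherv(...)
 *     t_end = now_us();
 *     comm_time += t_end - t_begin;
 */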