Re: [OMPI users] openib RETRY EXCEEDED ERROR
Brett Pemberton wrote: [[1176,1],0][btl_openib_component.c:2905:handle_wc] from tango092.vpac.org to: tango090 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 38996224 opcode 0 qp_idx 0 I've seen this error with Mellanox ConnectX cards and OFED 1.2.x with all versions of OpenMPI that I have tried (1.2.x and pre-1.3) and some MVAPICH versions, from which I have concluded that the problem lies in the lower levels (OFED or IB card firmware). Indeed after the installation of OFED 1.3.x and a possible firmware update (not sure about the firmware as I don't admin that cluster), these errors have disappeared. -- Bogdan Costescu IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany Phone: +49 6221 54 8240, Fax: +49 6221 54 8850 E-mail: bogdan.coste...@iwr.uni-heidelberg.de
Re: [OMPI users] openib RETRY EXCEEDED ERROR
Bogdan Costescu wrote: Brett Pemberton wrote: [[1176,1],0][btl_openib_component.c:2905:handle_wc] from tango092.vpac.org to: tango090 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 38996224 opcode 0 qp_idx 0 I've seen this error with Mellanox ConnectX cards and OFED 1.2.x with all versions of OpenMPI that I have tried (1.2.x and pre-1.3) and some MVAPICH versions, from which I have concluded that the problem lies in the lower levels (OFED or IB card firmware). Indeed after the installation of OFED 1.3.x and a possible firmware update (not sure about the firmware as I don't admin that cluster), these errors have disappeared. I can confirm this: I had a similar problem over Christmas, for which I asked for help in this list. In fact the problem was not with OpenMPI, but with the OFED stack: an upgrade of the latter (and an upgrade of the firmware, although once again the OFED drivers were complaining about the firmware being too old) fixed the problem. We did both upgrades at once, so as in Brett's case I am not sure which one played the major role. Biagio -- = Dr. Biagio Lucini Department of Physics, Swansea University Singleton Park, SA2 8PP Swansea (UK) Tel. +44 (0)1792 602284 =
[OMPI users] libmpi_f90.so not being built
Hi, I am trying to build openmpi 1.3 on CentOS with gcc and the Lahey f95 compiler, with the following configuration:

./configure F77=/share/apps/lf6481/bin/lfc FC=/share/apps/lf6481/bin/lfc --prefix=/opt/openmpi-1.3_lfc

When I run "make install all", the process fails to build libmpi_f90.la because libmpi_f90.so.0 isn't found (see output at the end of the post). I can't grep any other mention of libmpi_f90.so being built in config.log or in the output from make, and indeed it is not in the build directory with the other shared libraries:

[root@server lfc]# find . -name "libmpi*.so*"
./ompi/.libs/libmpi.so
./ompi/.libs/libmpi.so.0
./ompi/.libs/libmpi.so.0.0.0
./ompi/.libs/libmpi.so.0.0.0T
./ompi/mpi/cxx/.libs/libmpi_cxx.so.0.0.0
./ompi/mpi/cxx/.libs/libmpi_cxx.so.0.0.0T
./ompi/mpi/cxx/.libs/libmpi_cxx.so.0
./ompi/mpi/cxx/.libs/libmpi_cxx.so
./ompi/mpi/f77/.libs/libmpi_f77.so.0
./ompi/mpi/f77/.libs/libmpi_f77.so.0.0.0
./ompi/mpi/f77/.libs/libmpi_f77.so
./ompi/mpi/f77/.libs/libmpi_f77.so.0.0.0T

I believe that shared libraries for the f90 bindings should be built by default, but even trying to force the f90 bindings with shared libraries didn't do the trick:

./configure F77=/share/apps/lf6481/bin/lfc FC=/share/apps/lf6481/bin/lfc F90=/share/apps/lf6481/bin/lfc --prefix=/opt/openmpi-1.3_lfc --enable-shared --with-mpi_f90_size=medium --enable-mpi-f90

Any suggestions as to what might be going wrong are most welcome. Thanks, TS

[root@server lfc]# tail install.out
make[4]: Entering directory `/root/builds/openmpi-1.3/lfc/ompi/mpi/f90'
make[5]: Entering directory `/root/builds/openmpi-1.3/lfc'
make[5]: Leaving directory `/root/builds/openmpi-1.3/lfc'
/bin/sh ../../../libtool --mode=link /share/apps/lf6481/bin/lfc -I../../../ompi/include -I../../../ompi/include -I. -I. -I../../../ompi/mpi/f90 -export-dynamic -o libmpi_f90.la -rpath /opt/openmpi-1.3_lfc/lib mpi.lo mpi_sizeof.lo mpi_comm_spawn_multiple_f90.lo mpi_testall_f90.lo mpi_testsome_f90.lo mpi_waitall_f90.lo mpi_waitsome_f90.lo mpi_wtick_f90.lo mpi_wtime_f90.lo ../../../ompi/libmpi.la -lnsl -lutil -lm
libtool: link: /share/apps/lf6481/bin/lfc -shared .libs/mpi.o .libs/mpi_sizeof.o .libs/mpi_comm_spawn_multiple_f90.o .libs/mpi_testall_f90.o .libs/mpi_testsome_f90.o .libs/mpi_waitall_f90.o .libs/mpi_waitsome_f90.o .libs/mpi_wtick_f90.o .libs/mpi_wtime_f90.o -rpath /root/builds/openmpi-1.3/lfc/ompi/.libs -rpath /root/builds/openmpi-1.3/lfc/orte/.libs -rpath /root/builds/openmpi-1.3/lfc/opal/.libs -rpath /opt/openmpi-1.3_lfc/lib -L/root/builds/openmpi-1.3/lfc/orte/.libs -L/root/builds/openmpi-1.3/lfc/opal/.libs ../../../ompi/.libs/libmpi.so /root/builds/openmpi-1.3/lfc/orte/.libs/libopen-rte.so /root/builds/openmpi-1.3/lfc/opal/.libs/libopen-pal.so -ldl -lnsl -lutil -lm -pthread -soname libmpi_f90.so.0 -o .libs/libmpi_f90.so.0.0.0
ERROR -- Could not find specified object file libmpi_f90.so.0.
make[4]: Leaving directory `/root/builds/openmpi-1.3/lfc/ompi/mpi/f90'
make[3]: Leaving directory `/root/builds/openmpi-1.3/lfc/ompi/mpi/f90'
make[2]: Leaving directory `/root/builds/openmpi-1.3/lfc/ompi/mpi/f90'
make[1]: Leaving directory `/root/builds/openmpi-1.3/lfc/ompi'
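One thing that stands out in the failing link line above is that libtool passes lfc the bare arguments "-soname libmpi_f90.so.0", and the error text ("Could not find specified object file libmpi_f90.so.0") reads as if the compiler driver treated that file name as an input object. A quick way to test that theory outside the Open MPI tree is to repeat the same link styles on a throwaway Fortran object. This is an untested sketch: the compiler path and the flag spellings are taken verbatim from the log above, not from Lahey's documentation, and libconftest is just an example name.

printf 'subroutine t\nend\n' > conftest.f90
/share/apps/lf6481/bin/lfc -c conftest.f90
# the style libtool used (soname handed straight to the compiler driver):
/share/apps/lf6481/bin/lfc -shared conftest.o -soname libconftest.so.0 -o libconftest.so.0.0.0
# GNU-ld style for comparison:
/share/apps/lf6481/bin/lfc -shared conftest.o -Wl,-soname,libconftest.so.0 -o libconftest.so.0.0.0

If the first command reproduces the "Could not find specified object file" error and the second does not, the problem is in how libtool drives this particular compiler rather than in Open MPI's f90 code, which would be consistent with libmpi_f90 (presumably the only library here whose objects are linked with the Fortran compiler) being the one that fails.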
[OMPI users] Problem with cascading derived data types
Hi, In one of my applications I am using cascaded derived MPI datatypes created with MPI_Type_struct. One of these types is used to send just a part (one MPI_Char) of a struct consisting of an int followed by two chars, i.e. the int at the beginning is/should be ignored. This works fine if I use this data type on its own. Unfortunately I need to send another struct that contains an int and the int-char-char struct from above. Again I construct a custom MPI data type for this. When sending this cascaded data type, it seems that the offset of the char in the inner custom type is disregarded on the receiving end, and the received data ('1') is stored in the first int instead of the following char. I have tested this code with both lam and mpich. There it worked as expected (saving the '1' in the first char). The last two lines of the output of the attached test case read

received global=10 attribute=0 (local=1 public=0)
received attribute=1 (local=100 public=0)

for openmpi, instead of

received global=10 attribute=1 (local=100 public=0)
received attribute=1 (local=100 public=0)

for lam and mpich. The same problem is experienced when using version 1.3-2 of openmpi. Am I doing something completely wrong or have I accidentally found a bug? Cheers, Markus

#include "mpi.h"
#include <iostream>

struct LocalIndex
{
  int local_;
  char attribute_;
  char public_;
};

struct IndexPair
{
  int global_;
  LocalIndex local_;
};

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  if(size<2) {
    std::cerr<<"no procs has to be >2"<
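For reference, here is a small self-contained C sketch of one way to build the same two-level datatype with the MPI-2 calls (MPI_Type_create_struct / MPI_Type_create_resized) instead of MPI_Type_struct, making both the inner displacement and the inner type's extent explicit. It is not Markus's attached program (which is truncated above), and it is not a statement about where the discrepancy lies; it just removes any ambiguity about the intended layout. The struct names mirror the ones in the post.

#include <mpi.h>
#include <stddef.h>
#include <stdio.h>

struct LocalIndex { int local_; char attribute_; char public_; };
struct IndexPair  { int global_; struct LocalIndex local_; };

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) MPI_Abort(MPI_COMM_WORLD, 1);

    /* Inner type: a single MPI_CHAR at the true offset of attribute_,
       so the leading int of LocalIndex is skipped. */
    MPI_Datatype inner, inner_resized;
    int blen[1] = { 1 };
    MPI_Aint disp[1] = { offsetof(struct LocalIndex, attribute_) };
    MPI_Datatype types[1] = { MPI_CHAR };
    MPI_Type_create_struct(1, blen, disp, types, &inner);
    /* Pin the extent to sizeof(struct LocalIndex) so the type can be
       embedded in another struct type without its extent collapsing
       to just the char. */
    MPI_Type_create_resized(inner, 0, sizeof(struct LocalIndex), &inner_resized);
    MPI_Type_commit(&inner_resized);

    /* Outer type: the int plus the resized inner type, each at its
       real offset inside IndexPair. */
    MPI_Datatype pair;
    int blen2[2] = { 1, 1 };
    MPI_Aint disp2[2] = { offsetof(struct IndexPair, global_),
                          offsetof(struct IndexPair, local_) };
    MPI_Datatype types2[2] = { MPI_INT, inner_resized };
    MPI_Type_create_struct(2, blen2, disp2, types2, &pair);
    MPI_Type_commit(&pair);

    struct IndexPair p = { 0, { 0, 0, 0 } };
    if (rank == 0) {
        p.global_ = 10; p.local_.local_ = 100;
        p.local_.attribute_ = 1; p.local_.public_ = 0;
        MPI_Send(&p, 1, pair, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&p, 1, pair, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* Expected: global=10 attribute=1, local_ left untouched (0). */
        printf("received global=%d attribute=%d (local=%d)\n",
               p.global_, (int)p.local_.attribute_, p.local_.local_);
    }

    MPI_Type_free(&pair);
    MPI_Type_free(&inner_resized);
    MPI_Type_free(&inner);
    MPI_Finalize();
    return 0;
}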
Re: [OMPI users] libmpi_f90.so not being built
Can you please send all the information listed here: http://www.open-mpi.org/community/help/ On Feb 27, 2009, at 6:38 AM, Tiago Silva wrote: Hi, I am trying to build openmpi 1.3 on Cent_OS with gcc and the lahey f95 compiler with the following configuration: ./configure F77=/share/apps/lf6481/bin/lfc FC=/share/apps/lf6481/bin/ lfc --prefix=/opt/openmpi-1.3_lfc When I "make install all" the process fails to build libmpi_f90.la because libmpi_f90.so.0 isn't found (see output at the end of the post). I can't grep any other mention to libmpi_f90.so being built in config.log or on the output from the make and indeed it is not on the build directory with the other shared libraries: [root@server lfc]# find . -name "libmpi*.so*" ./ompi/.libs/libmpi.so ./ompi/.libs/libmpi.so.0 ./ompi/.libs/libmpi.so.0.0.0 ./ompi/.libs/libmpi.so.0.0.0T ./ompi/mpi/cxx/.libs/libmpi_cxx.so.0.0.0 ./ompi/mpi/cxx/.libs/libmpi_cxx.so.0.0.0T ./ompi/mpi/cxx/.libs/libmpi_cxx.so.0 ./ompi/mpi/cxx/.libs/libmpi_cxx.so ./ompi/mpi/f77/.libs/libmpi_f77.so.0 ./ompi/mpi/f77/.libs/libmpi_f77.so.0.0.0 ./ompi/mpi/f77/.libs/libmpi_f77.so ./ompi/mpi/f77/.libs/libmpi_f77.so.0.0.0T I believe that shared libraries for f90 bindings should be built by default but even trying to force the f90 bindings with shared libraries didn't do the trick: ./configure F77=/share/apps/lf6481/bin/lfc FC=/share/apps/lf6481/bin/ lfc F90=/share/apps/lf6481/bin/lfc --prefix=/opt/openmpi-1.3_lfc -- enable-shared --with-mpi_f90_size=medium --enable-mpi-f90 Any sugestions of what might be going wrong are most welcome. Thanks, TS [root@server lfc]# tail install.out make[4]: Entering directory `/root/builds/openmpi-1.3/lfc/ompi/mpi/ f90' make[5]: Entering directory `/root/builds/openmpi-1.3/lfc' make[5]: Leaving directory `/root/builds/openmpi-1.3/lfc' /bin/sh ../../../libtool --mode=link /share/apps/lf6481/bin/lfc - I../../../omp i/include -I../../../ompi/include -I. -I. -I../../../ompi/mpi/f90 - export-dyn amic -o libmpi_f90.la -rpath /opt/openmpi-1.3_lfc/lib mpi.lo mpi_sizeof.lo mpi _comm_spawn_multiple_f90.lo mpi_testall_f90.lo mpi_testsome_f90.lo mpi_waitall_f 90.lo mpi_waitsome_f90.lo mpi_wtick_f90.lo mpi_wtime_f90.lo ../../../ ompi/libm pi.la -lnsl -lutil -lm libtool: link: /share/apps/lf6481/bin/lfc -shared .libs/mpi.o .libs/ mpi_sizeof. o .libs/mpi_comm_spawn_multiple_f90.o .libs/mpi_testall_f90.o .libs/ mpi_testsome _f90.o .libs/mpi_waitall_f90.o .libs/mpi_waitsome_f90.o .libs/ mpi_wtick_f90.o .l ibs/mpi_wtime_f90.o-rpath /root/builds/openmpi-1.3/lfc/ ompi/.libs -rpath /ro ot/builds/openmpi-1.3/lfc/orte/.libs -rpath /root/builds/openmpi-1.3/ lfc/opal/.l ibs -rpath /opt/openmpi-1.3_lfc/lib -L/root/builds/openmpi-1.3/lfc/ orte/.libs -L /root/builds/openmpi-1.3/lfc/opal/.libs ../../../ompi/.libs/ libmpi.so /root/buil ds/openmpi-1.3/lfc/orte/.libs/libopen-rte.so /root/builds/ openmpi-1.3/lfc/opal/. libs/libopen-pal.so -ldl -lnsl -lutil -lm-pthread -soname libmpi_f90.so.0 -o .libs/libmpi_f90.so.0.0.0 ERROR -- Could not find specified object file libmpi_f90.so.0. make[4]: Leaving directory `/root/builds/openmpi-1.3/lfc/ompi/mpi/f90' make[3]: Leaving directory `/root/builds/openmpi-1.3/lfc/ompi/mpi/f90' make[2]: Leaving directory `/root/builds/openmpi-1.3/lfc/ompi/mpi/f90' make[1]: Leaving directory `/root/builds/openmpi-1.3/lfc/ompi' ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres Cisco Systems
Re: [OMPI users] libmpi_f90.so not being built
ok, here is the complete output in the tgz file attached. The output is slightly different as I am now only using "make all" and not installing. I did a full "make clean" and an "rm -fr" of the directory's contents; the directory already exists but is empty. Thanks

(Attachment: ts-output.tgz)
[OMPI users] more XGrid Problems with openmpi1.2.9
Hi, It seems to me more like timing issues. All the runs end with something similar to:

Exception Type:  EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: KERN_INVALID_ADDRESS at 0x45485308
Crashed Thread:  0
Thread 0 Crashed:
0  libSystem.B.dylib    0x95208f04 strcmp + 84
1  libopen-rte.0.dylib  0x000786fd orte_pls_base_get_active_daemons + 45
2  mca_pls_xgrid.so     0x00271725 orte_pls_xgrid_terminate_orteds + 117 (pls_xgrid_module.m:133)
3  mpirun               0x20ec orterun + 1896 (orterun.c:468)
4  mpirun               0x1982 main + 24 (main.c:14)
5  mpirun               0x193e start + 54
Thread 1:

A simple "mpirun -n 4 ring" gives the following results:

Process 0 sending 10 to 1 tag 201 (4 processes in ring)
Process 1 exiting
Process 2 exiting
Process 3 exiting
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
[nexus11:38502] *** Process received signal ***
[nexus11:38502] Signal: Segmentation fault (11)
[nexus11:38502] Signal code: Address not mapped (1)
[nexus11:38502] Failing at address: 0x45485308
[nexus11:38502] [ 0] 2 libSystem.B.dylib 0x9526c2bb _sigtramp + 43
[nexus11:38502] [ 1] 3 ??? 0x 0x0 + 4294967295
[nexus11:38502] [ 2] 4 libopen-rte.0.dylib 0x000786fd orte_pls_base_get_active_daemons + 45
[nexus11:38502] [ 3] 5 mca_pls_xgrid.so 0x00271725 orte_pls_xgrid_terminate_orteds + 117
[nexus11:38502] [ 4] 6 mpirun 0x20ec orterun + 1896
[nexus11:38502] [ 5] 7 mpirun 0x1982 main + 24
[nexus11:38502] [ 6] 8 mpirun 0x193e start + 54
[nexus11:38502] [ 7] 9 ??? 0x0004 0x0 + 4
[nexus11:38502] *** End of error message ***
Segmentation fault

Any idea of what I can do?
Ricardo my ompi_info is Open MPI: 1.2.9 Open MPI SVN revision: r20259 Open RTE: 1.2.9 Open RTE SVN revision: r20259 OPAL: 1.2.9 OPAL SVN revision: r20259 Prefix: /opt/openmpi Configured architecture: i386-apple-darwin9.6.0 Configured by: sofhtest Configured on: Fri Feb 27 11:02:30 CET 2009 Configure host: nexus10.nlroc Built by: sofhtest Built on: Fri Feb 27 12:00:08 CET 2009 Built host: nexus10.nlroc C bindings: yes C++ bindings: yes Fortran77 bindings: yes (single underscore) Fortran90 bindings: yes Fortran90 bindings size: small C compiler: gcc-4.2 C compiler absolute: /usr/bin/gcc-4.2 C++ compiler: g++-4.2 C++ compiler absolute: /usr/bin/g++-4.2 Fortran77 compiler: gfortran-4.2 Fortran77 compiler abs: /usr/bin/gfortran-4.2 Fortran90 compiler: gfortran-4.2 Fortran90 compiler abs: /usr/bin/gfortran-4.2 C profiling: yes C++ profiling: yes Fortran77 profiling: yes Fortran90 profiling: yes C++ exceptions: no Thread support: posix (mpi: no, progress: no) Internal debug support: no MPI parameter check: runtime Memory profiling support: no Memory debugging support: no libltdl support: yes Heterogeneous support: yes mpirun default --prefix: no MCA backtrace: execinfo (MCA v1.0, API v1.0, Component v1.2.9) MCA memory: darwin (MCA v1.0, API v1.0, Component v1.2.9) MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2.9) MCA timer: darwin (MCA v1.0, API v1.0, Component v1.2.9) MCA installdirs: env (MCA v1.0, API v1.0, Component v1.2.9) MCA installdirs: config (MCA v1.0, API v1.0, Component v1.2.9) MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0) MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0) MCA coll: basic (MCA v1.0, API v1.0, Component v1.2.9) MCA coll: self (MCA v1.0, API v1.0, Component v1.2.9) MCA coll: sm (MCA v1.0, API v1.0, Component v1.2.9) MCA coll: tuned (MCA v1.0, API v1.0, Component v1.2.9) MCA io: romio (MCA v1.0, API v1.0, Component v1.2.9) MCA mpool: rdma (MCA v1.0, API v1.0, Component v1.2.9) MCA mpool: sm (MCA v1.0, API v1.0, Component v1.2.9) MCA p
[OMPI users] Fwd: more XGrid Problems with openmpi1.2.9 (error find)
Find the problem in orte_pls_xgrid_terminate_orteds orte_pls_base_get_active_daemons is been call as orte_pls_base_get_active_daemons(&daemons, jobid) when the correct way of doing it is orte_pls_base_get_active_daemons(&daemons, jobid, attrs) yours. Ricardo Hi It seems to me more like time issues. All the runs end with something similar to Exception Type: EXC_BAD_ACCESS (SIGSEGV) Exception Codes: KERN_INVALID_ADDRESS at 0x45485308 Crashed Thread: 0 Thread 0 Crashed: 0 libSystem.B.dylib 0x95208f04 strcmp + 84 1 libopen-rte.0.dylib 0x000786fd orte_pls_base_get_active_daemons + 45 2 mca_pls_xgrid.so 0x00271725 orte_pls_xgrid_terminate_orteds + 117 (pls_xgrid_module.m:133) 3 mpirun 0x20ec orterun + 1896 (orterun.c:468) 4 mpirun 0x1982 main + 24 (main.c:14) 5 mpirun 0x193e start + 54 Thread 1: A simple mpirun -n 4 ring give the following results Process 0 sending 10 to1 tag 201 ( 4 processes in ring) Process1 exiting Process2 exiting Process3 exiting Process 0 sent to1 Process 0 decremented value: 9 Process 0 decremented value: 8 Process 0 decremented value: 7 Process 0 decremented value: 6 Process 0 decremented value: 5 Process 0 decremented value: 4 Process 0 decremented value: 3 Process 0 decremented value: 2 Process 0 decremented value: 1 Process 0 decremented value: 0 Process0 exiting [nexus11:38502] *** Process received signal *** [nexus11:38502] Signal: Segmentation fault (11) [nexus11:38502] Signal code: Address not mapped (1) [nexus11:38502] Failing at address: 0x45485308 [nexus11:38502] [ 0] 2 libSystem.B.dylib 0x9526c2bb _sigtramp + 43 [nexus11:38502] [ 1] 3 ??? 0x 0x0 + 4294967295 [nexus11:38502] [ 2] 4 libopen-rte.0.dylib 0x000786fd orte_pls_base_get_active_daemons + 45 [nexus11:38502] [ 3] 5 mca_pls_xgrid.so0x00271725 orte_pls_xgrid_terminate_orteds + 117 [nexus11:38502] [ 4] 6 mpirun 0x20ec orterun + 1896 [nexus11:38502] [ 5] 7 mpirun 0x1982 main + 24 [nexus11:38502] [ 6] 8 mpirun 0x193e start + 54 [nexus11:38502] [ 7] 9 ??? 0x0004 0x0 + 4 [nexus11:38502] *** End of error message *** Segmentation fault Any idea of what I can do? 
Ricardo my ompi_info is Open MPI: 1.2.9 Open MPI SVN revision: r20259 Open RTE: 1.2.9 Open RTE SVN revision: r20259 OPAL: 1.2.9 OPAL SVN revision: r20259 Prefix: /opt/openmpi Configured architecture: i386-apple-darwin9.6.0 Configured by: sofhtest Configured on: Fri Feb 27 11:02:30 CET 2009 Configure host: nexus10.nlroc Built by: sofhtest Built on: Fri Feb 27 12:00:08 CET 2009 Built host: nexus10.nlroc C bindings: yes C++ bindings: yes Fortran77 bindings: yes (single underscore) Fortran90 bindings: yes Fortran90 bindings size: small C compiler: gcc-4.2 C compiler absolute: /usr/bin/gcc-4.2 C++ compiler: g++-4.2 C++ compiler absolute: /usr/bin/g++-4.2 Fortran77 compiler: gfortran-4.2 Fortran77 compiler abs: /usr/bin/gfortran-4.2 Fortran90 compiler: gfortran-4.2 Fortran90 compiler abs: /usr/bin/gfortran-4.2 C profiling: yes C++ profiling: yes Fortran77 profiling: yes Fortran90 profiling: yes C++ exceptions: no Thread support: posix (mpi: no, progress: no) Internal debug support: no MPI parameter check: runtime Memory profiling support: no Memory debugging support: no libltdl support: yes Heterogeneous support: yes mpirun default --prefix: no MCA backtrace: execinfo (MCA v1.0, API v1.0, Component v1.2.9) MCA memory: darwin (MCA v1.0, API v1.0, Component v1.2.9) MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2.9) MCA timer: darwin (MCA v1.0, API v1.0, Component v1.2.9) MCA installdirs: env (MCA v1.0, API v1.0, Component v1.2.9) MCA installdirs: config (MCA v1.0, API v1.0, Component v1.2.9) MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0) MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0) MCA coll: basic (MCA v1.0, API v1.0, Component v1.2.9) MCA coll: self (MCA v1.0, API v1.0, Component v1.2.9) MCA coll: sm (MCA v1.0, API v1.0, Component v1.2.9) MCA coll: tuned (MCA
Re: [OMPI users] Latest SVN failures
I just tried trunk-1.4a1r20458 and I did not see this error, although my configuration was rather different. I ran across 100 2-CPU sparc nodes, np=256, connected with TCP. Hopefully George's comment helps out with this issue. One other thought to see whether SGE has anything to do with this is create a hostfile and run it outside of SGE. Rolf On 02/26/09 22:10, Ralph Castain wrote: FWIW: I tested the trunk tonight using both SLURM and rsh launchers, and everything checks out fine. However, this is running under SGE and thus using qrsh, so it is possible the SGE support is having a problem. Perhaps one of the Sun OMPI developers can help here? Ralph On Feb 26, 2009, at 7:21 PM, Ralph Castain wrote: It looks like the system doesn't know what nodes the procs are to be placed upon. Can you run this with --display-devel-map? That will tell us where the system thinks it is placing things. Thanks Ralph On Feb 26, 2009, at 3:41 PM, Mostyn Lewis wrote: Maybe it's my pine mailer. This is a NAMD run on 256 procs across 32 dual-socket quad-core AMD shangai nodes running a standard benchmark called stmv. The basic error message, which occurs 31 times is like: [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595 The mpirun command has long paths in it, sorry. It's invoking a special binding script which in turn lauches the NAMD run. This works on an older SVN at level 1.4a1r20123 (for 16,32,64,128 and 512 procs)but not for this 256 proc run where the older SVN hangs indefinitely polling some completion (sm or openib). So, I was trying later SVNs with this 256 proc run, hoping the error would go away. Here's some of the invocation again. Hope you can read it: EAGER_SIZE=32767 export OMPI_MCA_btl_openib_use_eager_rdma=0 export OMPI_MCA_btl_openib_eager_limit=$EAGER_SIZE export OMPI_MCA_btl_self_eager_limit=$EAGER_SIZE export OMPI_MCA_btl_sm_eager_limit=$EAGER_SIZE and, unexpanded mpirun --prefix $PREFIX -np %PE% $MCA -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit -machinefile $HOSTS $MPI_BINDER $NAMD2 stmv.namd and, expanded mpirun --prefix /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron -np 256 --mca btl sm,openib,self -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit -machinefile /tmp/48292.1.all.q/newhosts /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2 stmv.namd This is all via Sun Grid Engine. The OS as indicated above is SuSE SLES 10 SP2. DM On Thu, 26 Feb 2009, Ralph Castain wrote: I'm sorry, but I can't make any sense of this message. Could you provide a little explanation of what you are doing, what the system looks like, what is supposed to happen, etc? I can barely parse your cmd line... Thanks Ralph On Feb 26, 2009, at 1:03 PM, Mostyn Lewis wrote: Today's and yesterdays. 
1.4a1r20643_svn + mpirun --prefix /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_6 4/opteron -np 256 --mca btl sm,openib,self -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_ope nib_eager_limit -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit -machinefile /tmp/48 269.1.all.q/newhosts /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL /ctmp8/mostyn/IM SC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Li nux-amd64-MPI/namd2 stmv.namd [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls _base_default_fns.c at line 595 [s0128:24439] [[64102,0],4] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_ base_default_fns.c at line 595 [s0156:29300] [[64102,0],12] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls _base_default_fns.c at line 595 [s0168:20585] [[64102,0],20] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls _base_default_fns.c at line 595 [s0181:19554] [[64102,0],28] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls _base_default_fns.c at line 595 Made with INTEL compilers 10.1.015. Regards, Mostyn ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list
Re: [OMPI users] 3.5 seconds before application launches
Hello, and thanks for both replies, I've tried to run non-mpi program but i still measured some latency time before starting, something around 2 seconds this time. SSH should be properly configured, in fact i can login to both machines without password; openmpi and mvapich use ssh as default. i've tried these commands mpirun --mca btl ^sm -np 2 -host node0 -host node1 ./graph mpirun --mca btl openib,self -np 2 -host node0 -host node1 ./graph and, apart a slight performance increase in the ^sm benchmark, the latency time is the same this is really strange, but i can't figure out the source! do you have any other ideas? thanks Vittorio List-Post: users@lists.open-mpi.org Date: Wed, 25 Feb 2009 20:20:51 -0500 From: Jeff Squyres Subject: Re: [OMPI users] 3.5 seconds before application launches To: Open MPI Users Message-ID: <86d3b246-1866-4b84-b05c-4d13659f8...@cisco.com> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes Dorian raises a good point. You might want to try some simple tests of launching non-MPI codes (e.g., hostname, uptime, etc.) and see how they fare. Those will more accurately depict OMPI's launching speeds. Getting through MPI_INIT is another matter (although on 2 nodes, the startup should be pretty darn fast). Two other things that *may* impact you: 1. Is your ssh speed between the machines slow? OMPI uses ssh by default, but will fall back to rsh (or you can force rsh if you want). MVAPICH may use rsh by default...? (I don't actually know) 2. OMPI may be spending time creating shared memory files. You can disable OMPI's use of shared memory by running with: mpirun --mca btl ^sm ... Meaning "use anything except the 'sm' (shared memory) transport for MPI messages". On Feb 25, 2009, at 4:01 PM, doriankrause wrote: > Vittorio wrote: >> Hi! >> I'm using OpenMPI 1.3 on two nodes connected with Infiniband; i'm >> using >> Gentoo Linux x86_64. >> >> I've noticed that before any application starts there is a variable >> amount >> of time (around 3.5 seconds) in which the terminal just hangs with >> no output >> and then the application starts and works well. >> >> I imagined that there might have been some initialization routine >> somewhere >> in the Infiniband layer or in the software stack, but as i >> continued my >> tests i observed that this "latency" time is not present in other MPI >> implementations (like mvapich2) where my application starts >> immediately (but >> performs worse). >> >> Is my MPI configuration/installation broken or is this expected >> behaviour? >> > > Hi, > > I'm not really qualified to answer this question, but I know that in > contrast > to other MPI implementations (MPICH) the modular structure of Open > MPI is based > on shared libs that are dlopened at the startup. As symbol > relocation can be > costly this might be a reason why the startup time is higher. > > Have you checked wether this is an mpiexec start issue or the > MPI_Init call? > > Regards, > Dorian > >> thanks a lot! >> Vittorio >> >> >> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres Cisco Systems
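Following Dorian's question about whether the delay is in the launch or in MPI_Init, a tiny timer program along these lines (a sketch, not from the thread) separates the two: compare the wall-clock time reported by "time mpirun ..." with the MPI_Init time each rank prints; the difference is launch, ssh, and daemon start-up cost rather than MPI initialization.

#include <mpi.h>
#include <stdio.h>
#include <sys/time.h>

/* Wall-clock helper; gettimeofday is plenty at this resolution. */
static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(int argc, char **argv)
{
    double t0 = now();              /* the process has already been launched here */
    MPI_Init(&argc, &argv);
    double t1 = now();              /* so t1 - t0 is the cost of MPI_Init alone */

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: MPI_Init took %.3f s\n", rank, t1 - t0);

    MPI_Finalize();
    return 0;
}

Running it as "time mpirun --mca btl ^sm -np 2 -host node0,node1 ./init_timer" (the binary name is just an example) shows whether the roughly two seconds are spent before the processes even start or inside MPI_Init.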
[OMPI users] TCP instead of openIB doesn't work
Hello, i'm posting here another problem of my installation I wanted to benchmark the differences between tcp and openib transport if i run a simple non mpi application i get randori ~ # mpirun --mca btl tcp,self -np 2 -host randori -host tatami hostname randori tatami but as soon as i switch to my benchmark program i have mpirun --mca btl tcp,self -np 2 -host randori -host tatami graph Master thread reporting matrix size 33554432 kB, time is in [us] and instead of starting the send/receive functions it just hangs there; i also checked the transmitted packets with wireshark but after the handshake no more packets are exchanged I read in the archives that there were some problems in this area and so i tried what was suggested in previous emails mpirun --mca btl ^openib -np 2 -host randori -host tatami graph mpirun --mca pml ob1 --mca btl tcp,self -np 2 -host randori -host tatami graph gives exactly the same output as before (no mpisend/receive) while the next commands gives something more interesting mpirun --mca pml cm --mca btl tcp,self -np 2 -host randori -host tatami graph -- No available pml components were found! This means that there are no components of this type installed on your system or all the components reported that they could not be used. This is a fatal error; your MPI process is likely to abort. Check the output of the "ompi_info" command and ensure that components of this type are available on your system. You may also wish to check the value of the "component_path" MCA parameter and ensure that it has at least one directory that contains valid MCA components. -- [tatami:06619] PML cm cannot be selected mpirun noticed that job rank 0 with PID 6710 on node randori exited on signal 15 (Terminated). which is not possible as if i do ompi_info --param all there is the CM pml component MCA pml: cm (MCA v1.0, API v1.0, Component v1.2.8) MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.2.8) my test program is quite simple, just a couple of MPI_Send and MPI_Recv (just after the signature) do you have any ideas that might help me? 
thanks a lot Vittorio #include "mpi.h" #include #include #include #include #define M_COL 4096 #define M_ROW 524288 #define NUM_MSG 25 unsigned long int gigamatrix[M_ROW][M_COL]; int main (int argc, char *argv[]) { int numtasks, rank, dest, source, rc, tmp, count, tag=1; unsigned long int exp, exchanged; unsigned long int i, j, e; unsigned long matsize; MPI_Status Stat; struct timeval timing_start, timing_end; double inittime = 0; long int totaltime = 0; MPI_Init (&argc, &argv); MPI_Comm_size (MPI_COMM_WORLD, &numtasks); MPI_Comm_rank (MPI_COMM_WORLD, &rank); if (rank == 0) { fprintf (stderr, "Master thread reporting\n", numtasks - 1); matsize = (long) M_COL * M_ROW / 64; fprintf (stderr, "matrix size %d kB, time is in [us]\n", matsize); source = 1; dest = 1; /*warm up phase*/ rc = MPI_Send (&tmp, 1, MPI_INT, dest, tag, MPI_COMM_WORLD); rc = MPI_Recv (&tmp, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &Stat); rc = MPI_Send (&tmp, 1, MPI_INT, dest, tag, MPI_COMM_WORLD); rc = MPI_Send (&tmp, 1, MPI_INT, dest, tag, MPI_COMM_WORLD); rc = MPI_Recv (&tmp, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &Stat); rc = MPI_Send (&tmp, 1, MPI_INT, dest, tag, MPI_COMM_WORLD); for (e = 0; e < NUM_MSG; e++) { exp = pow (2, e); exchanged = 64 * exp; /*timing of ops*/ gettimeofday (&timing_start, NULL); rc = MPI_Send (&gigamatrix[0], exchanged, MPI_UNSIGNED_LONG, dest, tag, MPI_COMM_WORLD); rc = MPI_Recv (&gigamatrix[0], exchanged, MPI_UNSIGNED_LONG, source, tag, MPI_COMM_WORLD, &Stat); gettimeofday (&timing_end, NULL); totaltime = (timing_end.tv_sec - timing_start.tv_sec) * 100 + (timing_end.tv_usec - timing_start.tv_usec); memset (&timing_start, 0, sizeof(struct timeval)); memset (&timing_end, 0, sizeof(struct timeval)); fprintf (stdout, "%d kB\t%d\n", exp, totaltime); } fprintf(stderr, "task complete\n"); } else { if (rank >= 1) { dest = 0; source = 0; rc = MPI_Recv (&tmp, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &Stat); rc = MPI_Send (&tmp, 1, MPI_INT, dest, tag, MPI_COMM_WORLD); rc = MPI_Recv (&tmp, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &Stat); rc = MPI_Recv (&tmp, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &Stat); rc = MPI_Send (&tmp, 1, MPI_INT, dest, tag, MPI_COMM_WORLD); rc = MPI_Recv (&tmp, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &Stat);
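Since the hang appears only once the real exchanges start, a stripped-down variant that prints after every message size can show exactly where the tcp run stalls. This is a sketch under the same two-rank setup, not Vittorio's code:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "need at least 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Ping-pong with doubling message sizes: 1 byte up to 16 MB. */
    for (int n = 1; n <= (1 << 24); n *= 2) {
        char *buf = malloc(n);
        if (rank == 0) {
            MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("ok at %d bytes\n", n);
            fflush(stdout);
        } else if (rank == 1) {
            MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
        free(buf);
    }
    MPI_Finalize();
    return 0;
}

If this stops printing at the first size above the tcp eager limit, that suggests the large-message protocol or a second network interface the tcp BTL is trying to use; the btl_tcp_if_include MCA parameter can restrict it to a known-good interface. If it runs to completion, the problem is more likely specific to the original program's very large buffers.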
Re: [OMPI users] TCP instead of openIB doesn't work
I'm not entirely sure what is causing the problem here, but one thing does stand out. You have specified two -host options for the same application - this is not our normal syntax. The usual way of specifying this would be: mpirun --mca btl tcp,self -np 2 -host randori,tatami hostname I'm not entirely sure what OMPI does when it gets two separate -host arguments - could be equivalent to the above syntax, but could also cause some unusual behavior. Could you retry your job with the revised syntax? Also, could you add --display-map to your mpirun cmd line? This will tell us where OMPI thinks the procs are going, and a little info about how it interpreted your cmd line. Thanks Ralph On Feb 27, 2009, at 8:00 AM, Vittorio Giovara wrote: Hello, i'm posting here another problem of my installation I wanted to benchmark the differences between tcp and openib transport if i run a simple non mpi application i get randori ~ # mpirun --mca btl tcp,self -np 2 -host randori -host tatami hostname randori tatami but as soon as i switch to my benchmark program i have mpirun --mca btl tcp,self -np 2 -host randori -host tatami graph Master thread reporting matrix size 33554432 kB, time is in [us] and instead of starting the send/receive functions it just hangs there; i also checked the transmitted packets with wireshark but after the handshake no more packets are exchanged I read in the archives that there were some problems in this area and so i tried what was suggested in previous emails mpirun --mca btl ^openib -np 2 -host randori -host tatami graph mpirun --mca pml ob1 --mca btl tcp,self -np 2 -host randori -host tatami graph gives exactly the same output as before (no mpisend/receive) while the next commands gives something more interesting mpirun --mca pml cm --mca btl tcp,self -np 2 -host randori -host tatami graph -- No available pml components were found! This means that there are no components of this type installed on your system or all the components reported that they could not be used. This is a fatal error; your MPI process is likely to abort. Check the output of the "ompi_info" command and ensure that components of this type are available on your system. You may also wish to check the value of the "component_path" MCA parameter and ensure that it has at least one directory that contains valid MCA components. -- [tatami:06619] PML cm cannot be selected mpirun noticed that job rank 0 with PID 6710 on node randori exited on signal 15 (Terminated). which is not possible as if i do ompi_info --param all there is the CM pml component MCA pml: cm (MCA v1.0, API v1.0, Component v1.2.8) MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.2.8) my test program is quite simple, just a couple of MPI_Send and MPI_Recv (just after the signature) do you have any ideas that might help me? 
thanks a lot Vittorio #include "mpi.h" #include #include #include #include #define M_COL 4096 #define M_ROW 524288 #define NUM_MSG 25 unsigned long int gigamatrix[M_ROW][M_COL]; int main (int argc, char *argv[]) { int numtasks, rank, dest, source, rc, tmp, count, tag=1; unsigned long int exp, exchanged; unsigned long int i, j, e; unsigned long matsize; MPI_Status Stat; struct timeval timing_start, timing_end; double inittime = 0; long int totaltime = 0; MPI_Init (&argc, &argv); MPI_Comm_size (MPI_COMM_WORLD, &numtasks); MPI_Comm_rank (MPI_COMM_WORLD, &rank); if (rank == 0) { fprintf (stderr, "Master thread reporting\n", numtasks - 1); matsize = (long) M_COL * M_ROW / 64; fprintf (stderr, "matrix size %d kB, time is in [us]\n", matsize); source = 1; dest = 1; /*warm up phase*/ rc = MPI_Send (&tmp, 1, MPI_INT, dest, tag, MPI_COMM_WORLD); rc = MPI_Recv (&tmp, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &Stat); rc = MPI_Send (&tmp, 1, MPI_INT, dest, tag, MPI_COMM_WORLD); rc = MPI_Send (&tmp, 1, MPI_INT, dest, tag, MPI_COMM_WORLD); rc = MPI_Recv (&tmp, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &Stat); rc = MPI_Send (&tmp, 1, MPI_INT, dest, tag, MPI_COMM_WORLD); for (e = 0; e < NUM_MSG; e++) { exp = pow (2, e); exchanged = 64 * exp; /*timing of ops*/ gettimeofday (&timing_start, NULL); rc = MPI_Send (&gigamatrix[0], exchanged, MPI_UNSIGNED_LONG, dest, tag, MPI_COMM_WORLD); rc = MPI_Recv (&gigamatrix[0], exchanged, MPI_UNSIGNED_LONG, source, tag, MPI_COMM_WORLD, &Stat); gettimeofday (&timing_end, NULL); totaltime = (timing_end.tv_sec - timing_start.tv_sec) * 100 + (timing_end.tv_usec - timing_start.t
Re: [OMPI users] openib RETRY EXCEEDED ERROR
2009/2/26 Brett Pemberton : > [[1176,1],0][btl_openib_component.c:2905:handle_wc] from tango092.vpac.org > to: tango090 error polling LP CQ with status RETRY EXCEEDED ERROR status > number 12 for wr_id 38996224 opcode 0 qp_idx 0 What OS are you using? I've seen this error and many other Infiniband related errors on RedHat enterprise linux 4 update 4, with ConnectX cards and various versions of OFED, up to version 1.3. Depending on the MCA parameters, I also see hangs often enough to make native Infiniband unusable on this OS. However, the openib btl works just fine on the same hardware and the same OFED/OpenMPI stack when used with Centos 4.6. I suspect there may be something about the kernel that is contributing to these problems, but I haven't had a chance to test the kernel from 4.6 on 4.4. mch
Re: [OMPI users] openib RETRY EXCEEDED ERROR
On Fri, 2009-02-27 at 09:54 -0700, Matt Hughes wrote: > 2009/2/26 Brett Pemberton : > > [[1176,1],0][btl_openib_component.c:2905:handle_wc] from tango092.vpac.org > > to: tango090 error polling LP CQ with status RETRY EXCEEDED ERROR status > > number 12 for wr_id 38996224 opcode 0 qp_idx 0 > > What OS are you using? I've seen this error and many other Infiniband > related errors on RedHat enterprise linux 4 update 4, with ConnectX > cards and various versions of OFED, up to version 1.3. Depending on > the MCA parameters, I also see hangs often enough to make native > Infiniband unusable on this OS. > > However, the openib btl works just fine on the same hardware and the > same OFED/OpenMPI stack when used with Centos 4.6. I suspect there > may be something about the kernel that is contributing to these > problems, but I haven't had a chance to test the kernel from 4.6 on > 4.4. We see these errors fairly frequently on our CentOS 5.2 system with Mellanox InfiniHost III cards. The OFED stack is whatever the CentOS5.2 uses. Has anyone tested that with the 1.4 OFED stack? -- Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden Internet: a...@hpc2n.umu.se Phone: +46 90 7866134 Fax: +46 90 7866126 Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
Re: [OMPI users] openib RETRY EXCEEDED ERROR
Usually "retry exceeded error" points to some network issues, like bad cable or some bad connector. You may use ibdiagnet tool for the network debug - *http://linux.die.net/man/1/ibdiagnet. *This tool is part of OFED. Pasha Brett Pemberton wrote: Hey, I've had a couple of errors recently, of the form: [[1176,1],0][btl_openib_component.c:2905:handle_wc] from tango092.vpac.org to: tango090 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 38996224 opcode 0 qp_idx 0 -- The InfiniBand retry count between two MPI processes has been exceeded. "Retry count" is defined in the InfiniBand spec 1.2 (section 12.7.38): My first thought was to increase the retry count, but it is already at maximum. I've checked connections between the two nodes, and they seem ok [root@tango090 ~]# ibv_rc_pingpong local address: LID 0x005f, QPN 0xe4045d, PSN 0xdd13f0 remote address: LID 0x005d, QPN 0xfe0425, PSN 0xc43fe2 8192000 bytes in 0.07 seconds = 996.93 Mbit/sec 1000 iters in 0.07 seconds = 65.74 usec/iter How can I stop this happening in the future, without increasing the retry count? cheers, / Brett ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
[OMPI users] defining different values for same environment variable
Hello, I am looking for a way to set an environment variable with a different value on each node before running an MPI executable (not only export the environment variable!). Let's consider that I have a cluster with two nodes (n001 and n002) and I want to set the environment variable GMON_OUT_PREFIX with a different value on each node. I would like to set this variable with the following form:

nicolas@n001 % setenv GMON_OUT_PREFIX 'gmon.out_'`/bin/uname -n`
nicolas@n001 % echo $GMON_OUT_PREFIX
gmon.out_n001

My mpirun command looks like:

nicolas@n001 % cat CLUSTER_NODES
n001 slots=1
n002 slots=1
nicolas@n001 % mpirun -np 2 --bynode --hostfile CLUSTER_NODES myexe

I would like to export the GMON_OUT_PREFIX environment variable in order to have « gmon.out_n001 » on the first node and « gmon.out_n002 » on the second one. I cannot use the '-x' option of mpirun since it only exports (does not set) the variable. The MPI executable runs on cluster nodes where the HOME directory is not mounted, so I cannot use a ~/.cshrc file. Is there another way to do it? Regards, Nicolas
Re: [OMPI users] defining different values for same environment variable
2009/2/27 Nicolas Deladerriere :
> I am looking for a way to set environment variable with different value on
> each node before running MPI executable. (not only export the environment
> variable !)

I typically use a script for things like this. So instead of specifying your executable directly on the mpirun command line, specify the script instead. The script can set the environment variable, then launch your executable.

#!/bin/csh
setenv GMON_OUT_PREFIX 'gmon.out_'`/bin/uname -n`
myexe

mpirun -np 2 --bynode --hostfile CLUSTER_NODES myscript

I'm not sure if that csh syntax is right, but you get the idea. mch
[OMPI users] Threading fault
Dear All, I am using intel lc_prof-11 (and its own mkl) and have built openmpi-1.3.1 with configure options "FC=ifort F77=ifort CC=icc CXX=icpc". Then I have built my application. The Linux box is 2x amd64 quad. In the middle of a run of my application (after some 15 iterations), I receive the message below and the run stops. I tried to configure openmpi using "--disable-mpi-threads" but it automatically assumes "posix". This problem does not happen in openmpi-1.2.9. Any comment is highly appreciated. Best regards, mahmoud payami

[hpc1:25353] *** Process received signal ***
[hpc1:25353] Signal: Segmentation fault (11)
[hpc1:25353] Signal code: Address not mapped (1)
[hpc1:25353] Failing at address: 0x51
[hpc1:25353] [ 0] /lib64/libpthread.so.0 [0x303be0dd40]
[hpc1:25353] [ 1] /opt/openmpi131_cc/lib/openmpi/mca_pml_ob1.so [0x2e350d96]
[hpc1:25353] [ 2] /opt/openmpi131_cc/lib/openmpi/mca_pml_ob1.so [0x2e3514a8]
[hpc1:25353] [ 3] /opt/openmpi131_cc/lib/openmpi/mca_btl_sm.so [0x2eb7c72a]
[hpc1:25353] [ 4] /opt/openmpi131_cc/lib/libopen-pal.so.0(opal_progress+0x89) [0x2b42b7d9]
[hpc1:25353] [ 5] /opt/openmpi131_cc/lib/openmpi/mca_pml_ob1.so [0x2e34d27c]
[hpc1:25353] [ 6] /opt/openmpi131_cc/lib/libmpi.so.0(PMPI_Recv+0x210) [0x2af46010]
[hpc1:25353] [ 7] /opt/openmpi131_cc/lib/libmpi_f77.so.0(mpi_recv+0xa4) [0x2acd6af4]
[hpc1:25353] [ 8] /opt/QE131_cc/bin/pw.x(parallel_toolkit_mp_zsqmred_+0x13da) [0x513d8a]
[hpc1:25353] [ 9] /opt/QE131_cc/bin/pw.x(pcegterg_+0x6c3f) [0x6667ff]
[hpc1:25353] [10] /opt/QE131_cc/bin/pw.x(diag_bands_+0xb9e) [0x65654e]
[hpc1:25353] [11] /opt/QE131_cc/bin/pw.x(c_bands_+0x277) [0x6575a7]
[hpc1:25353] [12] /opt/QE131_cc/bin/pw.x(electrons_+0x53f) [0x58a54f]
[hpc1:25353] [13] /opt/QE131_cc/bin/pw.x(MAIN__+0x1fb) [0x458acb]
[hpc1:25353] [14] /opt/QE131_cc/bin/pw.x(main+0x3c) [0x4588bc]
[hpc1:25353] [15] /lib64/libc.so.6(__libc_start_main+0xf4) [0x303b21d8a4]
[hpc1:25353] [16] /opt/QE131_cc/bin/pw.x(realloc+0x1b9) [0x4587e9]
[hpc1:25353] *** End of error message ***
--
mpirun noticed that process rank 6 with PID 25353 on node hpc1 exited on signal 11 (Segmentation fault).
--
Re: [OMPI users] valgrind problems
On Thu, Feb 26, 2009 at 08:27:15PM -0700, Justin wrote: > Also the stable version of openmpi on Debian is 1.2.7rc2. Are there any > known issues with this version and valgrid? For a now-forgotten reason, I ditched the openmpi that comes on Debian etch, and installed 1.2.8 in /usr/local. HTH, Douglas.
Re: [OMPI users] defining different values for same environment variable
Matt, Thanks for your solution, but I thought about that and it is not really convenient in my configuration to change the executable on each node. I would like to change only mpirun command. 2009/2/27 Matt Hughes > > 2009/2/27 Nicolas Deladerriere : > > I am looking for a way to set environment variable with different value > on > > each node before running MPI executable. (not only export the environment > > variable !) > > I typically use a script for things like this. So instead of > specifying your executable directly on the mpirun command line, > instead specify the script. The script can set the environment > variable, then launch your executable. > > #!/bin/csh > setenv GMON_OUT_PREFIX 'gmon.out_'`/bin/uname -n` > myexe > > mpirun -np 2 --bynode --hostfile CLUSTER_NODES myscript > > I'm not sure if that csh syntax is right, but you get the idea. > > mch > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
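If changing only the mpirun command line is the constraint, a variation on Matt's idea that keeps the wrapper inline might work. This is an untested sketch: the quoting has to survive mpirun's argument forwarding, and myexe stands for the real executable as elsewhere in the thread.

mpirun -np 2 --bynode --hostfile CLUSTER_NODES /bin/sh -c 'GMON_OUT_PREFIX=gmon.out_`uname -n`; export GMON_OUT_PREFIX; exec myexe'

Each rank starts /bin/sh on its own node, so `uname -n` expands there, and exec replaces the shell with the application, leaving the MPI process tree unchanged.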
Re: [OMPI users] OMPI, and HPUX
I don't know if anyone has tried OMPI on HP-UX, sorry. On Feb 26, 2009, at 9:14 AM, Nader wrote: Hello, Does anyone has installed OMPI on a HPUX system? I do apprciate any info. Best Regards. Nader ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres Cisco Systems
Re: [OMPI users] Latest SVN failures
With further investigation, I have reproduced this problem. I think I was originally testing against a version that was not recent enough. I do not see it with r20594 which is from February 19. So, something must have happened over the last 8 days. I will try and narrow down the issue. Rolf On 02/27/09 09:34, Rolf Vandevaart wrote: I just tried trunk-1.4a1r20458 and I did not see this error, although my configuration was rather different. I ran across 100 2-CPU sparc nodes, np=256, connected with TCP. Hopefully George's comment helps out with this issue. One other thought to see whether SGE has anything to do with this is create a hostfile and run it outside of SGE. Rolf On 02/26/09 22:10, Ralph Castain wrote: FWIW: I tested the trunk tonight using both SLURM and rsh launchers, and everything checks out fine. However, this is running under SGE and thus using qrsh, so it is possible the SGE support is having a problem. Perhaps one of the Sun OMPI developers can help here? Ralph On Feb 26, 2009, at 7:21 PM, Ralph Castain wrote: It looks like the system doesn't know what nodes the procs are to be placed upon. Can you run this with --display-devel-map? That will tell us where the system thinks it is placing things. Thanks Ralph On Feb 26, 2009, at 3:41 PM, Mostyn Lewis wrote: Maybe it's my pine mailer. This is a NAMD run on 256 procs across 32 dual-socket quad-core AMD shangai nodes running a standard benchmark called stmv. The basic error message, which occurs 31 times is like: [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595 The mpirun command has long paths in it, sorry. It's invoking a special binding script which in turn lauches the NAMD run. This works on an older SVN at level 1.4a1r20123 (for 16,32,64,128 and 512 procs)but not for this 256 proc run where the older SVN hangs indefinitely polling some completion (sm or openib). So, I was trying later SVNs with this 256 proc run, hoping the error would go away. Here's some of the invocation again. Hope you can read it: EAGER_SIZE=32767 export OMPI_MCA_btl_openib_use_eager_rdma=0 export OMPI_MCA_btl_openib_eager_limit=$EAGER_SIZE export OMPI_MCA_btl_self_eager_limit=$EAGER_SIZE export OMPI_MCA_btl_sm_eager_limit=$EAGER_SIZE and, unexpanded mpirun --prefix $PREFIX -np %PE% $MCA -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit -machinefile $HOSTS $MPI_BINDER $NAMD2 stmv.namd and, expanded mpirun --prefix /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron -np 256 --mca btl sm,openib,self -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit -machinefile /tmp/48292.1.all.q/newhosts /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2 stmv.namd This is all via Sun Grid Engine. The OS as indicated above is SuSE SLES 10 SP2. DM On Thu, 26 Feb 2009, Ralph Castain wrote: I'm sorry, but I can't make any sense of this message. Could you provide a little explanation of what you are doing, what the system looks like, what is supposed to happen, etc? I can barely parse your cmd line... Thanks Ralph On Feb 26, 2009, at 1:03 PM, Mostyn Lewis wrote: Today's and yesterdays. 
1.4a1r20643_svn + mpirun --prefix /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_6 4/opteron -np 256 --mca btl sm,openib,self -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_ope nib_eager_limit -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit -machinefile /tmp/48 269.1.all.q/newhosts /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL /ctmp8/mostyn/IM SC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Li nux-amd64-MPI/namd2 stmv.namd [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls _base_default_fns.c at line 595 [s0128:24439] [[64102,0],4] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_ base_default_fns.c at line 595 [s0156:29300] [[64102,0],12] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls _base_default_fns.c at line 595 [s0168:20585] [[64102,0],20] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls _base_default_fns.c at line 595 [s0181:19554] [[64102,0],28] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls _base_default_fns.c at line 595 Made with INTEL compilers 10.1.015. Regards, Mostyn ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users __
Re: [OMPI users] openib RETRY EXCEEDED ERROR
On Feb 27, 2009, at 12:09 PM, Åke Sandgren wrote: We see these errors fairly frequently on our CentOS 5.2 system with Mellanox InfiniHost III cards. The OFED stack is whatever the CentOS5.2 uses. Has anyone tested that with the 1.4 OFED stack? FWIW, I have tested OMPI's openib BTL with several different versions of the OFED stack: 1.3, 1.3.1, 1.4, etc. I used various flavors of RHEL4 and RHEL5. For a variety of uninteresting reasons, I usually uninstall the OS- installed verbs stack/drivers and install OFED. So I don't have much experience with the verbs stacks/drivers that ship with the various distros. -- Jeff Squyres Cisco Systems
Re: [OMPI users] TCP instead of openIB doesn't work
Hello, i've corrected the syntax and added the flag you suggested, but unfortunately the result doesn't change.

randori ~ # mpirun --display-map --mca btl tcp,self -np 2 -host randori,tatami graph
[randori:22322] Map for job: 1  Generated by mapping mode: byslot
Starting vpid: 0  Vpid range: 2  Num app_contexts: 1
Data for app_context: index 0  app: graph
Num procs: 2
Argv[0]: graph
Env[0]: OMPI_MCA_btl=tcp,self
Env[1]: OMPI_MCA_rmaps_base_display_map=1
Env[2]: OMPI_MCA_orte_precondition_transports=d45d47f6e1ed0e0b-691fd7f24609dec3
Env[3]: OMPI_MCA_rds=proxy
Env[4]: OMPI_MCA_ras=proxy
Env[5]: OMPI_MCA_rmaps=proxy
Env[6]: OMPI_MCA_pls=proxy
Env[7]: OMPI_MCA_rmgr=proxy
Working dir: /root (user: 0)
Num maps: 1
Data for app_context_map: Type: 1  Data: randori,tatami
Num elements in nodes list: 2
Mapped node:
Cell: 0  Nodename: randori  Launch id: -1  Username: NULL
Daemon name: Data type: ORTE_PROCESS_NAME  Data Value: NULL
Oversubscribed: False  Num elements in procs list: 1
Mapped proc:
Proc Name: Data type: ORTE_PROCESS_NAME  Data Value: [0,1,0]
Proc Rank: 0  Proc PID: 0  App_context index: 0
Mapped node:
Cell: 0  Nodename: tatami  Launch id: -1  Username: NULL
Daemon name: Data type: ORTE_PROCESS_NAME  Data Value: NULL
Oversubscribed: False  Num elements in procs list: 1
Mapped proc:
Proc Name: Data type: ORTE_PROCESS_NAME  Data Value: [0,1,1]
Proc Rank: 1  Proc PID: 0  App_context index: 0

Master thread reporting
matrix size 33554432 kB, time is in [us]

(and then it just hangs)

Vittorio

On Fri, Feb 27, 2009 at 6:00 PM, wrote:
>
> Date: Fri, 27 Feb 2009 08:22:17 -0700
> From: Ralph Castain
> Subject: Re: [OMPI users] TCP instead of openIB doesn't work
> To: Open MPI Users
> Message-ID:
> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
>
> I'm not entirely sure what is causing the problem here, but one thing
> does stand out. You have specified two -host options for the same
> application - this is not our normal syntax. The usual way of
> specifying this would be:
>
> mpirun --mca btl tcp,self -np 2 -host randori,tatami hostname
>
> I'm not entirely sure what OMPI does when it gets two separate -host
> arguments - could be equivalent to the above syntax, but could also
> cause some unusual behavior.
>
> Could you retry your job with the revised syntax? Also, could you add
> --display-map to your mpirun cmd line? This will tell us where OMPI
> thinks the procs are going, and a little info about how it interpreted
> your cmd line.
> > Thanks > Ralph > > > On Feb 27, 2009, at 8:00 AM, Vittorio Giovara wrote: > > > Hello, i'm posting here another problem of my installation > > I wanted to benchmark the differences between tcp and openib transport > > > > if i run a simple non mpi application i get > > randori ~ # mpirun --mca btl tcp,self -np 2 -host randori -host > > tatami hostname > > randori > > tatami > > > > but as soon as i switch to my benchmark program i have > > mpirun --mca btl tcp,self -np 2 -host randori -host tatami graph > > Master thread reporting > > matrix size 33554432 kB, time is in [us] > > > > and instead of starting the send/receive functions it just hangs > > there; i also checked the transmitted packets with wireshark but > > after the handshake no more packets are exchanged > > > > I read in the archives that there were some problems in this area > > and so i tried what was suggested in previous emails > > > > mpirun --mca btl ^openib -np 2 -host randori -host tatami graph > > mpirun --mca pml ob1 --mca btl tcp,self -np 2 -host randori -host > > tatami graph > > > > gives exactly the same output as before (no mpisend/receive) > > while the next commands gives something more interesting > > > > mpirun --mca pml cm --mca btl tcp,self -np 2 -host randori -host > > tatami graph > > > -- > > No available pml components were found! > > > > This means that there are no components of this type installed on your > > system or all the components reported that they could not be used. > > > > This is a fatal error; your MPI process is likely to abort. Check the > > output of the "ompi_info" command and ensure that components of this > > type are available on your system. You may also wish to check the > > value of the "component_path" MCA parameter and ensure that it has at > > least one directory that contains valid MCA components. > > > > > -- > > [tatami:06619] PML cm cannot be selected > > mpirun noticed that job rank 0 with PID 6710 on node randori exit
Re: [OMPI users] Latest SVN failures
Unfortunately, I think I have reproduced the problem as well -- with SVN trunk HEAD (r20655):

[15:12] svbu-mpi:~/mpi % mpirun --mca bogus foo --bynode -np 2 uptime
[svbu-mpi.cisco.com:24112] [[62779,0],0] ORTE_ERROR_LOG: Data unpack failed in file base/odls_base_default_fns.c at line 566
--
mpirun noticed that the job aborted, but has no info as to the process that caused that situation.
--

Notice that I'm not trying to run an MPI app -- it's just "uptime". The following things seem to be necessary to make this error occur for me:

1. --bynode
2. set some mca parameter (any mca parameter)
3. -np value less than the size of my slurm allocation

If I remove any of those, it seems to run fine.

On Feb 27, 2009, at 5:05 PM, Rolf Vandevaart wrote:

With further investigation, I have reproduced this problem. I think I was originally testing against a version that was not recent enough. I do not see it with r20594, which is from February 19. So, something must have happened over the last 8 days. I will try and narrow down the issue.

Rolf

On 02/27/09 09:34, Rolf Vandevaart wrote:

I just tried trunk-1.4a1r20458 and I did not see this error, although my configuration was rather different. I ran across 100 2-CPU sparc nodes, np=256, connected with TCP. Hopefully George's comment helps out with this issue. One other thought to see whether SGE has anything to do with this is to create a hostfile and run it outside of SGE.

Rolf

On 02/26/09 22:10, Ralph Castain wrote:

FWIW: I tested the trunk tonight using both SLURM and rsh launchers, and everything checks out fine. However, this is running under SGE and thus using qrsh, so it is possible the SGE support is having a problem. Perhaps one of the Sun OMPI developers can help here?

Ralph

On Feb 26, 2009, at 7:21 PM, Ralph Castain wrote:

It looks like the system doesn't know what nodes the procs are to be placed upon. Can you run this with --display-devel-map? That will tell us where the system thinks it is placing things.

Thanks
Ralph

On Feb 26, 2009, at 3:41 PM, Mostyn Lewis wrote:

Maybe it's my pine mailer.

This is a NAMD run on 256 procs across 32 dual-socket quad-core AMD Shanghai nodes running a standard benchmark called stmv.

The basic error message, which occurs 31 times, is like:
[s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595

The mpirun command has long paths in it, sorry. It's invoking a special binding script which in turn launches the NAMD run. This works on an older SVN at level 1.4a1r20123 (for 16, 32, 64, 128 and 512 procs) but not for this 256 proc run, where the older SVN hangs indefinitely polling some completion (sm or openib). So, I was trying later SVNs with this 256 proc run, hoping the error would go away.

Here's some of the invocation again.
Hope you can read it:

EAGER_SIZE=32767
export OMPI_MCA_btl_openib_use_eager_rdma=0
export OMPI_MCA_btl_openib_eager_limit=$EAGER_SIZE
export OMPI_MCA_btl_self_eager_limit=$EAGER_SIZE
export OMPI_MCA_btl_sm_eager_limit=$EAGER_SIZE

and, unexpanded

mpirun --prefix $PREFIX -np %PE% $MCA -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit -machinefile $HOSTS $MPI_BINDER $NAMD2 stmv.namd

and, expanded

mpirun --prefix /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron -np 256 --mca btl sm,openib,self -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit -machinefile /tmp/48292.1.all.q/newhosts /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2 stmv.namd

This is all via Sun Grid Engine. The OS as indicated above is SuSE SLES 10 SP2.

DM

On Thu, 26 Feb 2009, Ralph Castain wrote:

I'm sorry, but I can't make any sense of this message. Could you provide a little explanation of what you are doing, what the system looks like, what is supposed to happen, etc? I can barely parse your cmd line...

Thanks
Ralph

On Feb 26, 2009, at 1:03 PM, Mostyn Lewis wrote:

Today's and yesterday's.

1.4a1r20643_svn +

mpirun --prefix /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron -np 256 --mca btl sm,openib,self -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit -machinefile /tmp/48269.1.all.q/newhosts /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed
Re: [OMPI users] TCP instead of openIB doesn't work
I notice the following:

- you're creating an *enormous* array on the stack. You might be better off allocating it on the heap.

- the value of "exchanged" will quickly grow beyond 2^31 (i.e., INT_MAX), which is the max that the MPI API can handle. Bad Things can/will happen beyond that value (i.e., you're keeping the value of "exchanged" in a long unsigned int, but MPI_Send and MPI_Recv only take an int count).

(An illustrative sketch of both changes follows the quoted program below.)

On Feb 27, 2009, at 10:00 AM, Vittorio Giovara wrote:

Hello, i'm posting here another problem of my installation
I wanted to benchmark the differences between tcp and openib transport

if i run a simple non mpi application i get
randori ~ # mpirun --mca btl tcp,self -np 2 -host randori -host tatami hostname
randori
tatami

but as soon as i switch to my benchmark program i have
mpirun --mca btl tcp,self -np 2 -host randori -host tatami graph
Master thread reporting
matrix size 33554432 kB, time is in [us]

and instead of starting the send/receive functions it just hangs there; i also checked the transmitted packets with wireshark but after the handshake no more packets are exchanged

I read in the archives that there were some problems in this area and so i tried what was suggested in previous emails

mpirun --mca btl ^openib -np 2 -host randori -host tatami graph
mpirun --mca pml ob1 --mca btl tcp,self -np 2 -host randori -host tatami graph

gives exactly the same output as before (no mpisend/receive) while the next commands gives something more interesting

mpirun --mca pml cm --mca btl tcp,self -np 2 -host randori -host tatami graph
--
No available pml components were found!

This means that there are no components of this type installed on your system or all the components reported that they could not be used.

This is a fatal error; your MPI process is likely to abort. Check the output of the "ompi_info" command and ensure that components of this type are available on your system. You may also wish to check the value of the "component_path" MCA parameter and ensure that it has at least one directory that contains valid MCA components.
--
[tatami:06619] PML cm cannot be selected
mpirun noticed that job rank 0 with PID 6710 on node randori exited on signal 15 (Terminated).

which is not possible as if i do ompi_info --param all there is the CM pml component
MCA pml: cm (MCA v1.0, API v1.0, Component v1.2.8)
MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.2.8)

my test program is quite simple, just a couple of MPI_Send and MPI_Recv (just after the signature)
do you have any ideas that might help me?
thanks a lot
Vittorio

#include "mpi.h"
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <sys/time.h>

#define M_COL 4096
#define M_ROW 524288
#define NUM_MSG 25

unsigned long int gigamatrix[M_ROW][M_COL];

int main (int argc, char *argv[])
{
    int numtasks, rank, dest, source, rc, tmp, count, tag = 1;
    unsigned long int exp, exchanged;
    unsigned long int i, j, e;
    unsigned long matsize;
    MPI_Status Stat;
    struct timeval timing_start, timing_end;
    double inittime = 0;
    long int totaltime = 0;

    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        fprintf (stderr, "Master thread reporting\n");
        matsize = (long) M_COL * M_ROW / 64;
        fprintf (stderr, "matrix size %lu kB, time is in [us]\n", matsize);

        source = 1;
        dest = 1;

        /* warm up phase: a few small round trips before timing */
        rc = MPI_Send (&tmp, 1, MPI_INT, dest, tag, MPI_COMM_WORLD);
        rc = MPI_Recv (&tmp, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &Stat);
        rc = MPI_Send (&tmp, 1, MPI_INT, dest, tag, MPI_COMM_WORLD);
        rc = MPI_Send (&tmp, 1, MPI_INT, dest, tag, MPI_COMM_WORLD);
        rc = MPI_Recv (&tmp, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &Stat);
        rc = MPI_Send (&tmp, 1, MPI_INT, dest, tag, MPI_COMM_WORLD);

        for (e = 0; e < NUM_MSG; e++) {
            exp = pow (2, e);
            exchanged = 64 * exp;   /* element count doubles each iteration */

            /* timing of ops: one send and one receive of 'exchanged' elements */
            gettimeofday (&timing_start, NULL);
            rc = MPI_Send (&gigamatrix[0], exchanged, MPI_UNSIGNED_LONG, dest, tag, MPI_COMM_WORLD);
            rc = MPI_Recv (&gigamatrix[0], exchanged, MPI_UNSIGNED_LONG, source, tag, MPI_COMM_WORLD, &Stat);
            gettimeofday (&timing_end, NULL);

            /* elapsed time in microseconds */
            totaltime = (timing_end.tv_sec - timing_start.tv_sec) * 1000000 +
                        (timing_end.tv_usec - timing_start.tv_usec);

            memset (&timing_start, 0, sizeof (struct timeval));
            memset (&timing_end, 0, sizeof (struct timeval));

            fprintf (stdout, "%lu kB\t%ld\n", exp, totaltime);
        }
        fprintf (stderr, "task complete\n");
    } else {
        if (rank >= 1) {
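The two points raised above -- keeping the huge buffer off the stack/static storage and keeping every MPI count argument within int range -- can be illustrated with a small self-contained sketch. This is not code from the thread: TOTAL_ELEMS and CHUNK_ELEMS are arbitrary illustrative values, and the helper names are made up.

/* Illustrative sketch: heap-allocated buffer and chunked transfers so that
 * no single MPI_Send/MPI_Recv is asked to move more than CHUNK_ELEMS elements
 * (the count parameter of MPI_Send/MPI_Recv is an int). */
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

#define TOTAL_ELEMS ((size_t) 1 << 27)   /* 2^27 unsigned longs = 1 GiB; arbitrary */
#define CHUNK_ELEMS (1 << 24)            /* arbitrary chunk size, well below INT_MAX */

/* send 'count' unsigned longs in chunks of at most CHUNK_ELEMS elements */
static void send_in_chunks (unsigned long *buf, size_t count, int dest, int tag)
{
    size_t offset = 0;
    while (offset < count) {
        size_t n = count - offset;
        if (n > CHUNK_ELEMS)
            n = CHUNK_ELEMS;
        MPI_Send (buf + offset, (int) n, MPI_UNSIGNED_LONG, dest, tag, MPI_COMM_WORLD);
        offset += n;
    }
}

/* matching receive, same chunking */
static void recv_in_chunks (unsigned long *buf, size_t count, int source, int tag)
{
    size_t offset = 0;
    MPI_Status stat;
    while (offset < count) {
        size_t n = count - offset;
        if (n > CHUNK_ELEMS)
            n = CHUNK_ELEMS;
        MPI_Recv (buf + offset, (int) n, MPI_UNSIGNED_LONG, source, tag, MPI_COMM_WORLD, &stat);
        offset += n;
    }
}

int main (int argc, char *argv[])
{
    int rank, size;
    unsigned long *buf;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    MPI_Comm_size (MPI_COMM_WORLD, &size);

    /* heap allocation instead of a huge global/stack array;
     * the buffer contents are irrelevant for this bandwidth-style check */
    buf = malloc (TOTAL_ELEMS * sizeof *buf);
    if (buf == NULL || size < 2) {
        fprintf (stderr, "rank %d: allocation failed or fewer than 2 ranks\n", rank);
        MPI_Abort (MPI_COMM_WORLD, 1);
    }

    if (rank == 0) {
        send_in_chunks (buf, TOTAL_ELEMS, 1, 1);
        recv_in_chunks (buf, TOTAL_ELEMS, 1, 1);
        printf ("rank 0: chunked round trip of %zu elements done\n", TOTAL_ELEMS);
    } else if (rank == 1) {
        recv_in_chunks (buf, TOTAL_ELEMS, 0, 1);
        send_in_chunks (buf, TOTAL_ELEMS, 0, 1);
    }

    free (buf);
    MPI_Finalize ();
    return 0;
}

The same idea could be folded into the benchmark's timed loop by replacing each single MPI_Send/MPI_Recv pair with these chunked helpers once the element count gets large, keeping the timing bracket around the whole chunked exchange.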