Dear Open MPI developers,
I'm using Open MPI 1.2.2 over OFED 1.1 on a 680-node dual-Opteron, dual-core Linux cluster, with an InfiniBand interconnect of course. During the execution of large jobs (more than 128 processes) I've experienced performance slowdowns and deadlocks in collective MPI operations. The job processes often terminate with "RETRY EXCEEDED ERROR" (provided btl_openib_ib_timeout is set appropriately). Yes, this kind of error seems to point at the fabric, but roughly half of the MPI processes are hitting the timeout.

To investigate this behaviour more closely I tried some "constrained" tests using SKaMPI, but it is quite difficult to isolate a single collective operation with it: even if the SKaMPI script requests only (say) a Reduce over many communicator sizes, the SKaMPI code itself still performs a lot of bcast, alltoall, etc. on its own. So I wrote a small hand-made program that performs "only" one repeated collective operation at a time. The code is attached to this message as collect_noparms.c. This is what happened when I ran it:

......
011 - 011 - 039 NOOT START
000 - 000 of 38 - 655360 0.000000
[node1049:11804] *** Process received signal ***
[node1049:11804] Signal: Segmentation fault (11)
[node1049:11804] Signal code: Address not mapped (1)
[node1049:11804] Failing at address: 0x18
035 - 035 - 039 NOOT START
000 - 000 of 38 - 786432 0.000000
[node1049:11804] [ 0] /lib64/tls/libpthread.so.0 [0x2a964db420]
000 - 000 of 38 - 917504 0.000000
[node1049:11804] [ 1] /cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0 [0x2a9573fa18]
[node1049:11804] [ 2] /cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0 [0x2a9573f639]
[node1049:11804] [ 3] /cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0(mca_btl_sm_send+0x122) [0x2a9573f5e1]
[node1049:11804] [ 4] /cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0 [0x2a957acac6]
[node1049:11804] [ 5] /cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0(mca_pml_ob1_send_request_start_copy+0x303) [0x2a957ace52]
[node1049:11804] [ 6] /cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0 [0x2a957a2788]
[node1049:11804] [ 7] /cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0 [0x2a957a251c]
[node1049:11804] [ 8] /cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0(mca_pml_ob1_send+0x2e2) [0x2a957a2d9e]
[node1049:11804] [ 9] /cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0(ompi_coll_tuned_reduce_generic+0x651) [0x2a95751621]
[node1049:11804] [10] /cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0(ompi_coll_tuned_reduce_intra_pipeline+0x176) [0x2a95751bff]
[node1049:11804] [11] /cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0(ompi_coll_tuned_reduce_intra_dec_fixed+0x3f4) [0x2a957475f6]
[node1049:11804] [12] /cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0(PMPI_Reduce+0x3a6) [0x2a9570a076]
[node1049:11804] [13] /bcx/usercin/asm0/mpptools/mpitools/debug/src/collect_noparms_bc.x(reduce+0x3e) [0x404e64]
[node1049:11804] [14] /bcx/usercin/asm0/mpptools/mpitools/debug/src/collect_noparms_bc.x(main+0x620) [0x404c8e]
[node1049:11804] [15] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x2a966004bb]
[node1049:11804] [16] /bcx/usercin/asm0/mpptools/mpitools/debug/src/collect_noparms_bc.x [0x40448a]
[node1049:11804] *** End of error message ***
.......

The behaviour is more or less identical whether I use the InfiniBand or the Gigabit Ethernet interconnect. If I use another MPI implementation (say MVAPICH), everything works correctly. I then compiled both my code and Open MPI with gcc 3.4.4, with bounds checking and compiler debugging flags and without the OMPI memory manager; the behaviour is identical, but now I can see the line where the SIGSEGV is trapped:

----------------------------------------------------------------------------------------------------------------
gdb collect_noparms_bc.x core.11580
GNU gdb Red Hat Linux (6.3.0.0-1.96rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db library "/lib64/tls/libthread_db.so.1".

warning: core file may not match specified executable file.
Core was generated by `/bcx/usercin/asm0/mpptools/mpitools/debug/src/collect_noparms_bc.x'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0...done.
Loaded symbols for /cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0
Reading symbols from /prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libopen-rte.so.0...done.
Loaded symbols for /cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libopen-rte.so.0
Reading symbols from /prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libopen-pal.so.0...done.
Loaded symbols for /cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libopen-pal.so.0
Reading symbols from /usr/local/ofed/lib64/libibverbs.so.1...done.
Loaded symbols for /usr/local/ofed/lib64/libibverbs.so.1
Reading symbols from /lib64/tls/librt.so.1...done.
Loaded symbols for /lib64/tls/librt.so.1
Reading symbols from /usr/lib64/libnuma.so.1...done.
Loaded symbols for /usr/lib64/libnuma.so.1
Reading symbols from /lib64/libnsl.so.1...done.
Loaded symbols for /lib64/libnsl.so.1
Reading symbols from /lib64/libutil.so.1...done.
Loaded symbols for /lib64/libutil.so.1
Reading symbols from /lib64/tls/libm.so.6...done.
Loaded symbols for /lib64/tls/libm.so.6
Reading symbols from /lib64/libdl.so.2...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/tls/libpthread.so.0...done.
Loaded symbols for /lib64/tls/libpthread.so.0
Reading symbols from /lib64/tls/libc.so.6...done.
Loaded symbols for /lib64/tls/libc.so.6
Reading symbols from /usr/lib64/libsysfs.so.1...done.
Loaded symbols for /usr/lib64/libsysfs.so.1
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /usr/local/ofed/lib64/infiniband/ipathverbs.so...done.
Loaded symbols for /usr/local/ofed/lib64/infiniband/ipathverbs.so
Reading symbols from /usr/local/ofed/lib64/infiniband/mthca.so...done.
Loaded symbols for /usr/local/ofed/lib64/infiniband/mthca.so
Reading symbols from /lib64/libgcc_s.so.1...done.
Loaded symbols for /lib64/libgcc_s.so.1
#0  0x0000002a9573fa18 in ompi_cb_fifo_write_to_head_same_base_addr (data=0x2a96f7df80, fifo=0x0)
    at /cineca/prod/build/mpich/openmpi-1.2.2/ompi/class/ompi_circular_buffer_fifo.h:370
370         h_ptr=fifo->head;
(gdb) bt
#0  0x0000002a9573fa18 in ompi_cb_fifo_write_to_head_same_base_addr (data=0x2a96f7df80, fifo=0x0)
    at /cineca/prod/build/mpich/openmpi-1.2.2/ompi/class/ompi_circular_buffer_fifo.h:370
#1  0x0000002a9573f639 in ompi_fifo_write_to_head_same_base_addr (data=0x2a96f7df80, fifo=0x2a96e476a0, fifo_allocator=0x674100)
    at /cineca/prod/build/mpich/openmpi-1.2.2/ompi/class/ompi_fifo.h:312
#2  0x0000002a9573f5e1 in mca_btl_sm_send (btl=0x2a95923440, endpoint=0x6e9670, descriptor=0x2a96f7df80, tag=1 '\001')
    at /cineca/prod/build/mpich/openmpi-1.2.2/ompi/mca/btl/sm/btl_sm.c:894
#3  0x0000002a957acac6 in mca_bml_base_send (bml_btl=0x67fc00, des=0x2a96f7df80, tag=1 '\001')
    at /cineca/prod/build/mpich/openmpi-1.2.2/ompi/mca/bml/bml.h:283
#4  0x0000002a957ace52 in mca_pml_ob1_send_request_start_copy (sendreq=0x594080, bml_btl=0x67fc00, size=1024)
    at /cineca/prod/build/mpich/openmpi-1.2.2/ompi/mca/pml/ob1/pml_ob1_sendreq.c:565
#5  0x0000002a957a2788 in mca_pml_ob1_send_request_start_btl (sendreq=0x594080, bml_btl=0x67fc00)
    at /cineca/prod/build/mpich/openmpi-1.2.2/ompi/mca/pml/ob1/pml_ob1_sendreq.h:278
#6  0x0000002a957a251c in mca_pml_ob1_send_request_start (sendreq=0x594080)
    at /cineca/prod/build/mpich/openmpi-1.2.2/ompi/mca/pml/ob1/pml_ob1_sendreq.h:345
#7  0x0000002a957a2d9e in mca_pml_ob1_send (buf=0x7b8400, count=256, datatype=0x51b8b0, dst=37, tag=-21, sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x521c00)
    at /cineca/prod/build/mpich/openmpi-1.2.2/ompi/mca/pml/ob1/pml_ob1_isend.c:103
#8  0x0000002a95751621 in ompi_coll_tuned_reduce_generic (sendbuf=0x7b8000, recvbuf=0x8b9000, original_count=32512, datatype=0x51b8b0, op=0x51ba40, root=0, comm=0x521c00, tree=0x520b00, count_by_segment=256)
    at /cineca/prod/build/mpich/openmpi-1.2.2/ompi/mca/coll/tuned/coll_tuned_reduce.c:187
#9  0x0000002a95751bff in ompi_coll_tuned_reduce_intra_pipeline (sendbuf=0x7b8000, recvbuf=0x8b9000, count=32768, datatype=0x51b8b0, op=0x51ba40, root=0, comm=0x521c00, segsize=1024)
    at /cineca/prod/build/mpich/openmpi-1.2.2/ompi/mca/coll/tuned/coll_tuned_reduce.c:255
#10 0x0000002a957475f6 in ompi_coll_tuned_reduce_intra_dec_fixed (sendbuf=0x7b8000, recvbuf=0x8b9000, count=32768, datatype=0x51b8b0, op=0x51ba40, root=0, comm=0x521c00)
    at /cineca/prod/build/mpich/openmpi-1.2.2/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:353
#11 0x0000002a9570a076 in PMPI_Reduce (sendbuf=0x7b8000, recvbuf=0x8b9000, count=32768, datatype=0x51b8b0, op=0x51ba40, root=0, comm=0x521c00) at preduce.c:96
#12 0x0000000000404e64 in reduce (comm=0x521c00, count=32768) at collect_noparms.c:248
#13 0x0000000000404c8e in main (argc=1, argv=0x7fbffff308) at collect_noparms.c:187
(gdb)
-----------------------------------------

I think this bug is not related to my performance slowdown in collective operations, but something seems to be wrong at a higher level in the MCA framework. Is anyone able to reproduce a similar bug? Is anyone else seeing performance slowdowns in collective operations for big jobs using OFED 1.1 over an InfiniBand interconnect? Do I need some further btl or coll tuning? (I've tried SRQ, but that doesn't resolve my problems.)
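One more observation on the backtrace itself: in frame #0 the fifo argument is NULL, so the dereference of fifo->head faults at a small address. My guess is that "Failing at address: 0x18" in the signal message is simply the offset of the head member inside the structure; I have not checked the real ompi_cb_fifo_t layout, so the struct below is made up, it only illustrates the failure mode I suspect:

/* Stand-alone sketch of the suspected crash: reading a struct member
 * through a NULL pointer.  fake_fifo is NOT the real ompi_cb_fifo_t;
 * the 0x18 padding is only an assumption to match the signal message. */
#include <stddef.h>
#include <stdio.h>

struct fake_fifo {
    char  pad[0x18];   /* assume 'head' happens to live at offset 0x18 */
    void *head;
};

int main(void)
{
    struct fake_fifo *fifo = NULL;

    printf("offsetof(head) = %#zx\n", offsetof(struct fake_fifo, head));

    /* Equivalent of "h_ptr = fifo->head;" with fifo == NULL: the load is
     * from address 0x18, i.e. "Address not mapped" at 0x18, SIGSEGV. */
    void *h_ptr = fifo->head;

    return h_ptr != NULL;
}

If that reading is correct, the question is why a NULL fifo reaches mca_btl_sm_send at all, which is why I suspect something higher up in the sm BTL / MCA setup rather than my test code.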
Marco

-- 
-----------------------------------------------------------------
Marco Sbrighi                                  m.sbri...@cineca.it
HPC Group
CINECA Interuniversity Computing Centre
via Magnanelli, 6/3
40033 Casalecchio di Reno (Bo) ITALY
tel. 051 6171516
/**** (c) Marco Sbrighi - CINECA ****/
/* Repeatedly runs a single MPI collective (reduce by default) over
 * growing communicator sizes and buffer sizes. */

#include "mpi.h"
#include <stdlib.h>
#include <limits.h>
#include <stdio.h>

#ifndef HOST_NAME_MAX
#pragma warn self defined HOST_NAME_MAX
#define HOST_NAME_MAX 255
#endif

#ifndef _POSIX_PATH_MAX
#pragma warn self defined _POSIX_PATH_MAX
#define _POSIX_PATH_MAX 2048
#endif

int ReduceExitStatus(int rank, int exitstat, FILE* out);
int exitall(int rank, int exitstat, FILE* out);
void checkAbort(MPI_Comm comm, int err);
int checkFail(MPI_Comm comm, int err);

int (*op)(MPI_Comm, int);      /* collective under test, selected in main() */
int bcast(MPI_Comm, int);
int reduce(MPI_Comm, int);
int allreduce(MPI_Comm, int);

char myname[LINE_MAX];
char *wbuf, *rbuf;             /* send/receive buffers, sized for the largest message */

int ReduceExitStatus(int rank, int exitstat, FILE* out)
{
  int commstat, retc;

  commstat = 0;
  retc = MPI_Allreduce(&exitstat, &commstat, 1, MPI_INT, MPI_BOR, MPI_COMM_WORLD);
  fprintf(stdout, "Reducing %d. Allreduce is exiting with status %d reporting %d to communicator.\n",
          exitstat, retc, commstat);
  return (commstat);
}

int exitall(int rank, int exitstat, FILE* out)
{
  int commstat;

  commstat = 0;
  commstat = ReduceExitStatus(rank, exitstat, out);
  MPI_Finalize();
  return (commstat);
}

void checkAbort(MPI_Comm comm, int err)
{
  if (err != MPI_SUCCESS) MPI_Abort(comm, err);
}

int checkFail(MPI_Comm comm, int err)
{
  return err == MPI_SUCCESS ? 1 : 0;
}

int myid, n_myid;
char processor_name[MPI_MAX_PROCESSOR_NAME];

int main(int argc, char *argv[])
{
  int i, namelen;
  int last_opt, j;
  size_t count;
  // size_t bsize;
  size_t minbuf, maxbuf, stepbuf;
  int minc, maxc, stepc;
  int err, color, key;
  MPI_Comm n_comm;
  double stime, etime, ttime;
  double timeout;
  double status;
  int numprocs;
  char* opname;
  // double sbuf[4];
  // usec_timer_t t;
  void *attr_value;
  int flag, commsize;
  size_t bufsize;
  int rep, maxrep;
  long long deltat;

  //mpirun.lsf ./collect.sh -d 1 -minc 35 -minbuf 0 -maxbuf 1048576 -stepbuf 131072 -op reduce

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  MPI_Get_processor_name(processor_name, &namelen);
  processor_name[namelen] = (char)0;

  if (numprocs < 2) {
    if (myid == 0) {
      fprintf(stderr, "--> please launch with at least 2 MPI processes\n");
    }
    return exitall(myid, 0, stderr);
  }

  /* test parameters (hard-coded, no command-line parsing) */
  minc = 35;
  maxc = numprocs;
  stepc = 1;
  maxbuf = 1048576;
  minbuf = 0;
  stepbuf = 131072;
  maxrep = 20;
  op = reduce;
  timeout = 30000.0 / 1000.0;

  if (myid == 0) {
    /* sync Wtime? */
    err = MPI_Attr_get(MPI_COMM_WORLD, MPI_WTIME_IS_GLOBAL, &attr_value, &flag);
    checkAbort(MPI_COMM_WORLD, err);
    if (flag) {
      if (*(int*)attr_value < 0 || *(int*)attr_value > 1)
        fprintf(stdout, "The value of WTIME_IS_GLOBAL %d is not valid.\n", *(int*)attr_value);
      else
        fprintf(stdout, "This implementation supports MPI_Wtime sync across processes! Enjoy.\n");
    }
  }
  fflush(stdout);

  if ((wbuf = (char*) malloc(maxbuf * sizeof(char))) == NULL) {
    fprintf(stderr, "%d - Unable to allocate %lld bytes of memory.\n", myid, (long long int) maxbuf);
    MPI_Abort(MPI_COMM_WORLD, 2);
  }
  if ((rbuf = (char*) malloc(maxbuf * sizeof(char))) == NULL) {
    fprintf(stderr, "%d - Unable to allocate %lld bytes of memory.\n", myid, (long long int) maxbuf);
    MPI_Abort(MPI_COMM_WORLD, 2);
  }

  /* root processor = 0, for easy */
  // RENAME_UTIMER(&t, "Collective");
  // RESET_UTIMER(&t,NULL);

  if (myid == 0) fprintf(stdout, "RA - RB np size msec\n");

  /* For each communicator size, split MPI_COMM_WORLD so that the first
   * commsize ranks (color 1) run the selected collective. */
  for (commsize = minc; commsize <= maxc; commsize += stepc) {
    /* err=MPI_Barrier(MPI_COMM_WORLD); */
    /* checkAbort ( MPI_COMM_WORLD, err ) ; */
    color = (myid < commsize ? 1 : 2);
    key = 0;
    err = MPI_Comm_split(MPI_COMM_WORLD, color, key, &n_comm);
    checkAbort(MPI_COMM_WORLD, err);
    err = MPI_Comm_rank(n_comm, &n_myid);
    checkAbort(MPI_COMM_WORLD, err);

    if (color == 1) {
      if (n_myid == 0) {
        fprintf(stdout, "%03d - %03d - %03d ROOT START\n", myid, n_myid, commsize);
        /********************************* the ROOT BEGINS HERE *******************************************/
      } else {
        /**************************** the NON ROOT BEGINS HERE ***********************************************/
        fprintf(stdout, "%03d - %03d - %03d NOOT START\n", myid, n_myid, commsize);
      }
      fflush(stdout);

      /* for each buffer size, repeat the selected collective maxrep times */
      for (bufsize = minbuf; bufsize <= maxbuf; bufsize += stepbuf) {
        count = bufsize / sizeof(int);
        // if ( n_myid == 0) START_UTIMER(&t);
        for (rep = 0; rep < maxrep; rep++) {
          err = op(n_comm, count);
          checkAbort(MPI_COMM_WORLD, err);
          // err=MPI_Barrier(n_comm);
          // checkAbort ( MPI_COMM_WORLD, err ) ;
        }
        if (n_myid == 0) {
          /* STOP_UTIMER(&t);
             deltat=get_lastelap_utimer(&t);
             fprintf(stdout, "%03d - %03d of %d - %lld %f\n", n_myid, myid, commsize,
                     (long long) count*sizeof(int), ((double) deltat)/(USEC_PER_SEC*maxrep) ); */
          fprintf(stdout, "%03d - %03d of %d - %lld %f\n", n_myid, myid, commsize,
                  (long long int)(count * sizeof(int)), 0.0);
          fflush(stdout);
        }
      }

      if (n_myid == 0) {
        fprintf(stdout, "%03d - %03d - %03d ROOT STOP\n", myid, n_myid, commsize);
        /********************************* the ROOT STOP HERE *******************************************/
      } else {
        /**************************** the NON ROOT STOP HERE **********************************************/
        fprintf(stdout, "%03d - %03d - %03d NOOT STOP\n", myid, n_myid, commsize);
      }
      fflush(stdout);
    }

    // err=MPI_Barrier(MPI_COMM_WORLD);
    // checkAbort ( MPI_COMM_WORLD, err ) ;
    err = MPI_Comm_free(&n_comm);
    checkAbort(MPI_COMM_WORLD, err);

    /* if (n_myid==0) { */
    /*   fprintf(stdout, "%03d - %03d - %03d ROOT STOP\n", myid, n_myid, commsize); */
    /*   /\********************************* the ROOT STOP HERE *******************************************\/ */
    /* } else { */
    /*   /\**************************** the NON ROOT STOP HERE ***********************************************\/ */
    /*   fprintf(stdout, "%03d - %03d - %03d NOOT STOP\n", myid, n_myid, commsize); */
    /* } */
    fflush(stdout);
  }

  MPI_Finalize();
  return (0);
}

/* wrappers so the collective under test can be selected through the 'op' pointer */
int bcast(MPI_Comm comm, int count)
{
  return MPI_Bcast(wbuf, count, MPI_INT, 0, comm);
}

int reduce(MPI_Comm comm, int count)
{
  return MPI_Reduce(wbuf, rbuf, count, MPI_INT, MPI_SUM, 0, comm);
}

int allreduce(MPI_Comm comm, int count)
{
  return MPI_Allreduce(wbuf, rbuf, count, MPI_INT, MPI_SUM, comm);
}
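A note on the timings: in the output above the time column is always 0.000000 because my usec_timer based measurement is commented out (its header is not part of this attachment). If someone wants actual numbers while reproducing, the inner loop can be timed with plain MPI_Wtime instead; this is only a sketch of how I would re-enable it (it reuses the variables of the code above, but it is not what I actually ran):

        /* Sketch: time maxrep repetitions of the collective with MPI_Wtime,
         * replacing the commented-out usec_timer calls in the bufsize loop. */
        double t_start = 0.0, t_elapsed = 0.0;

        if (n_myid == 0) t_start = MPI_Wtime();

        for (rep = 0; rep < maxrep; rep++) {
          err = op(n_comm, count);
          checkAbort(MPI_COMM_WORLD, err);
        }

        if (n_myid == 0) {
          t_elapsed = MPI_Wtime() - t_start;          /* seconds */
          fprintf(stdout, "%03d - %03d of %d - %lld %f\n",
                  n_myid, myid, commsize,
                  (long long int)(count * sizeof(int)),
                  1000.0 * t_elapsed / maxrep);       /* average msec per call */
        }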