Dear Open MPI developers,

I'm using Open MPI 1.2.2 over OFED 1.1 on a 680-node dual-Opteron, dual-core
Linux cluster, with InfiniBand interconnect of course.
During the execution of big jobs (more than 128 processes) I've experienced
performance slowdowns and deadlocks in collective MPI operations. The job
processes often terminate with a "RETRY EXCEEDED ERROR" (provided, of course,
that btl_openib_ib_timeout is set appropriately).
This kind of error would seem to point at the fabric, but roughly half of the
MPI processes run into the timeout...
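For reference, the timeout is raised on the mpirun command line roughly as
follows (the value 20, the process count and the program name are just
examples):

  mpirun --mca btl_openib_ib_timeout 20 -np 512 ./my_program
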
To investigate this behaviour more closely, I tried to run some "constrained"
tests using SKaMPI, but it is quite difficult to isolate a single collective
operation with SKaMPI: even if the SKaMPI script requests only (say) a Reduce,
over many communicator sizes, the SKaMPI code itself will also perform a lot
of Bcast, Alltoall, etc. on its own.
So I tried a small hand-written piece of code that performs "only" one
repeated collective operation at a time. The code is attached to this message;
the file is named collect_noparms.c.
This is what happened when I tried to run it:

......

011 - 011 - 039 NOOT START
000 - 000 of 38 - 655360  0.000000
[node1049:11804] *** Process received signal ***
[node1049:11804] Signal: Segmentation fault (11)
[node1049:11804] Signal code: Address not mapped (1)
[node1049:11804] Failing at address: 0x18
035 - 035 - 039 NOOT START
000 - 000 of 38 - 786432  0.000000
[node1049:11804] [ 0] /lib64/tls/libpthread.so.0 [0x2a964db420]
000 - 000 of 38 - 917504  0.000000
[node1049:11804] [ 1] 
/cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0 
[0x2a9573fa18]
[node1049:11804] [ 2] 
/cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0 
[0x2a9573f639]
[node1049:11804] [ 3] 
/cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0(mca_btl_sm_send+0x122)
 [0x2a9573f5e1]
[node1049:11804] [ 4] 
/cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0 
[0x2a957acac6]
[node1049:11804] [ 5] 
/cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0(mca_pml_ob1_send_request_start_copy+0x303)
 [0x2a957ace52]
[node1049:11804] [ 6] 
/cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0 
[0x2a957a2788]
[node1049:11804] [ 7] 
/cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0 
[0x2a957a251c]
[node1049:11804] [ 8] 
/cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0(mca_pml_ob1_send+0x2e2)
 [0x2a957a2d9e]
[node1049:11804] [ 9] 
/cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0(ompi_coll_tuned_reduce_generic+0x651)
 [0x2a95751621]
[node1049:11804] [10] 
/cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0(ompi_coll_tuned_reduce_intra_pipeline+0x176)
 [0x2a95751bff]
[node1049:11804] [11] 
/cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0(ompi_coll_tuned_reduce_intra_dec_fixed+0x3f4)
 [0x2a957475f6]
[node1049:11804] [12] 
/cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0(PMPI_Reduce+0x3a6)
 [0x2a9570a076]
[node1049:11804] [13] 
/bcx/usercin/asm0/mpptools/mpitools/debug/src/collect_noparms_bc.x(reduce+0x3e) 
[0x404e64]
[node1049:11804] [14] 
/bcx/usercin/asm0/mpptools/mpitools/debug/src/collect_noparms_bc.x(main+0x620) 
[0x404c8e]
[node1049:11804] [15] /lib64/tls/libc.so.6(__libc_start_main+0xdb) 
[0x2a966004bb]
[node1049:11804] [16] 
/bcx/usercin/asm0/mpptools/mpitools/debug/src/collect_noparms_bc.x [0x40448a]
[node1049:11804] *** End of error message ***

.......

The behaviour is more or less identical whether I use the InfiniBand or the
Gigabit interconnect. If I use another MPI implementation (say MVAPICH),
everything works fine.
I then compiled both my code and Open MPI using gcc 3.4.4 with bounds
checking, compiler debugging flags, and without the OMPI memory manager: the
behaviour is still the same, but now I can see the line where the SIGSEGV is
trapped.
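Roughly, the configure line for that build was along these lines (the
bounds-checking flags come from the patched gcc and are not shown here; the
other options should be the standard ones, as far as I know):

  ./configure --prefix=/cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg \
              --enable-debug --without-memory-manager CC=gcc CFLAGS="-g -O0"

Here is the gdb session on the resulting core file: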


----------------------------------------------------------------------------------------------------------------
gdb collect_noparms_bc.x core.11580
GNU gdb Red Hat Linux (6.3.0.0-1.96rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db 
library "/lib64/tls/libthread_db.so.1".


warning: core file may not match specified executable file.
Core was generated by 
`/bcx/usercin/asm0/mpptools/mpitools/debug/src/collect_noparms_bc.x'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from 
/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0...done.
Loaded symbols for 
/cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0
Reading symbols from 
/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libopen-rte.so.0...done.
Loaded symbols for 
/cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libopen-rte.so.0
Reading symbols from 
/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libopen-pal.so.0...done.
Loaded symbols for 
/cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libopen-pal.so.0
Reading symbols from /usr/local/ofed/lib64/libibverbs.so.1...done.
Loaded symbols for /usr/local/ofed/lib64/libibverbs.so.1
Reading symbols from /lib64/tls/librt.so.1...done.
Loaded symbols for /lib64/tls/librt.so.1
Reading symbols from /usr/lib64/libnuma.so.1...done.
Loaded symbols for /usr/lib64/libnuma.so.1
Reading symbols from /lib64/libnsl.so.1...done.
Loaded symbols for /lib64/libnsl.so.1
Reading symbols from /lib64/libutil.so.1...done.
Loaded symbols for /lib64/libutil.so.1
Reading symbols from /lib64/tls/libm.so.6...done.
Loaded symbols for /lib64/tls/libm.so.6
Reading symbols from /lib64/libdl.so.2...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/tls/libpthread.so.0...done.
Loaded symbols for /lib64/tls/libpthread.so.0
Reading symbols from /lib64/tls/libc.so.6...done.
Loaded symbols for /lib64/tls/libc.so.6
Reading symbols from /usr/lib64/libsysfs.so.1...done.
Loaded symbols for /usr/lib64/libsysfs.so.1
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /usr/local/ofed/lib64/infiniband/ipathverbs.so...done.
Loaded symbols for /usr/local/ofed/lib64/infiniband/ipathverbs.so
Reading symbols from /usr/local/ofed/lib64/infiniband/mthca.so...done.
Loaded symbols for /usr/local/ofed/lib64/infiniband/mthca.so
Reading symbols from /lib64/libgcc_s.so.1...done.
Loaded symbols for /lib64/libgcc_s.so.1
#0  0x0000002a9573fa18 in ompi_cb_fifo_write_to_head_same_base_addr 
(data=0x2a96f7df80, fifo=0x0)
    at 
/cineca/prod/build/mpich/openmpi-1.2.2/ompi/class/ompi_circular_buffer_fifo.h:370
370         h_ptr=fifo->head;
(gdb) bt
#0  0x0000002a9573fa18 in ompi_cb_fifo_write_to_head_same_base_addr 
(data=0x2a96f7df80, fifo=0x0)
    at 
/cineca/prod/build/mpich/openmpi-1.2.2/ompi/class/ompi_circular_buffer_fifo.h:370
#1  0x0000002a9573f639 in ompi_fifo_write_to_head_same_base_addr 
(data=0x2a96f7df80, fifo=0x2a96e476a0, fifo_allocator=0x674100)
    at /cineca/prod/build/mpich/openmpi-1.2.2/ompi/class/ompi_fifo.h:312
#2  0x0000002a9573f5e1 in mca_btl_sm_send (btl=0x2a95923440, endpoint=0x6e9670, 
descriptor=0x2a96f7df80, tag=1 '\001')
    at /cineca/prod/build/mpich/openmpi-1.2.2/ompi/mca/btl/sm/btl_sm.c:894
#3  0x0000002a957acac6 in mca_bml_base_send (bml_btl=0x67fc00, 
des=0x2a96f7df80, tag=1 '\001')
    at /cineca/prod/build/mpich/openmpi-1.2.2/ompi/mca/bml/bml.h:283
#4  0x0000002a957ace52 in mca_pml_ob1_send_request_start_copy 
(sendreq=0x594080, bml_btl=0x67fc00, size=1024)
    at 
/cineca/prod/build/mpich/openmpi-1.2.2/ompi/mca/pml/ob1/pml_ob1_sendreq.c:565
#5  0x0000002a957a2788 in mca_pml_ob1_send_request_start_btl (sendreq=0x594080, 
bml_btl=0x67fc00)
    at 
/cineca/prod/build/mpich/openmpi-1.2.2/ompi/mca/pml/ob1/pml_ob1_sendreq.h:278
#6  0x0000002a957a251c in mca_pml_ob1_send_request_start (sendreq=0x594080)
    at 
/cineca/prod/build/mpich/openmpi-1.2.2/ompi/mca/pml/ob1/pml_ob1_sendreq.h:345
#7  0x0000002a957a2d9e in mca_pml_ob1_send (buf=0x7b8400, count=256, 
datatype=0x51b8b0, dst=37, tag=-21,
    sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x521c00) at 
/cineca/prod/build/mpich/openmpi-1.2.2/ompi/mca/pml/ob1/pml_ob1_isend.c:103
#8  0x0000002a95751621 in ompi_coll_tuned_reduce_generic (sendbuf=0x7b8000, 
recvbuf=0x8b9000, original_count=32512,
    datatype=0x51b8b0, op=0x51ba40, root=0, comm=0x521c00, tree=0x520b00, 
count_by_segment=256)
    at 
/cineca/prod/build/mpich/openmpi-1.2.2/ompi/mca/coll/tuned/coll_tuned_reduce.c:187
#9  0x0000002a95751bff in ompi_coll_tuned_reduce_intra_pipeline 
(sendbuf=0x7b8000, recvbuf=0x8b9000, count=32768, datatype=0x51b8b0,
    op=0x51ba40, root=0, comm=0x521c00, segsize=1024)
    at 
/cineca/prod/build/mpich/openmpi-1.2.2/ompi/mca/coll/tuned/coll_tuned_reduce.c:255
#10 0x0000002a957475f6 in ompi_coll_tuned_reduce_intra_dec_fixed 
(sendbuf=0x7b8000, recvbuf=0x8b9000, count=32768, datatype=0x51b8b0,
    op=0x51ba40, root=0, comm=0x521c00) at 
/cineca/prod/build/mpich/openmpi-1.2.2/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:353
#11 0x0000002a9570a076 in PMPI_Reduce (sendbuf=0x7b8000, recvbuf=0x8b9000, 
count=32768, datatype=0x51b8b0, op=0x51ba40, root=0,
    comm=0x521c00) at preduce.c:96
#12 0x0000000000404e64 in reduce (comm=0x521c00, count=32768) at 
collect_noparms.c:248
#13 0x0000000000404c8e in main (argc=1, argv=0x7fbffff308) at 
collect_noparms.c:187
(gdb) 
-----------------------------------------


I think this bug is not related to my performance slowdown in collective
operations, but something seems to be wrong at a higher level in the MCA
framework: note that in frame #0 the fifo argument is NULL (fifo=0x0), so
line 370 (h_ptr=fifo->head) dereferences a null pointer, which presumably
explains the "Address not mapped" failure at address 0x18.
Is anyone able to reproduce a similar bug?
Is anyone else seeing performance slowdowns in collective operations with
big jobs using OFED 1.1 over an InfiniBand interconnect?
Do I need some further btl or coll tuning? (I've tried SRQ, but that doesn't
resolve my problems.)
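For the record, SRQ was enabled through the openib btl MCA parameter, more or
less like this (the parameter name is the one I believe is valid for the 1.2
series):

  mpirun --mca btl_openib_use_srq 1 ...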


Marco 

-- 
-----------------------------------------------------------------
 Marco Sbrighi  m.sbri...@cineca.it

 HPC Group
 CINECA Interuniversity Computing Centre
 via Magnanelli, 6/3
 40033 Casalecchio di Reno (Bo) ITALY
 tel. 051 6171516
/**** (c) Marco Sbrighi - CINECA ****/
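/*
 * collect_noparms.c
 *
 * Repeatedly runs a single MPI collective (bcast, reduce or allreduce,
 * selected through the 'op' function pointer) on communicators of
 * increasing size obtained by splitting MPI_COMM_WORLD, so that one
 * collective operation at a time can be exercised in isolation.
 *
 * The test parameters are hard-coded in main(): communicator sizes from
 * 35 up to the total number of processes, message sizes from 0 to 1 MiB
 * in steps of 128 KiB, 20 repetitions per message size, operation = reduce.
 * (The timing macros are commented out, so the printed time is always 0.0.)
 */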



#include "mpi.h"

#include <stdlib.h>
#include <limits.h>
#include <stdio.h>

#ifndef HOST_NAME_MAX
#pragma warn self defined HOST_NAME_MAX
#define HOST_NAME_MAX 255
#endif

#ifndef _POSIX_PATH_MAX
#pragma warn self defined _POSIX_PATH_MAX
#define _POSIX_PATH_MAX 2048
#endif

int ReduceExitStatus(int rank, int exitstat, FILE* out);
int exitall(int rank, int exitstat, FILE* out);
void checkAbort(MPI_Comm comm, int err);
int checkFail(MPI_Comm comm, int err);
int (*op ) (MPI_Comm,int);
int bcast(MPI_Comm,int);
int reduce(MPI_Comm,int);
int allreduce(MPI_Comm, int);


char myname[LINE_MAX];


char *wbuf, *rbuf;


int ReduceExitStatus(int rank, int exitstat, FILE* out)
{
  int commstat,retc;
  commstat=0;
  retc= MPI_Allreduce (&exitstat, &commstat,1, MPI_INT, MPI_BOR,MPI_COMM_WORLD);
  fprintf(stdout, "Reducing %d. Allreduce is exiting with status %d reporting %d to cummunicator.\n",exitstat,retc,commstat );

  return  (commstat);
}

int exitall(int rank, int exitstat, FILE* out) {
  int commstat;
  commstat=0;
  commstat=ReduceExitStatus(rank,exitstat,out);
  MPI_Finalize();
  return  (commstat);
}

void checkAbort(MPI_Comm comm, int err)
{
  if (err != MPI_SUCCESS) MPI_Abort(comm, err);
}

int checkFail(MPI_Comm comm, int err)
{
  return err == MPI_SUCCESS ? 1:0;
}


int myid, n_myid;


char processor_name[MPI_MAX_PROCESSOR_NAME];




int main(int argc, char *argv[])
{
    int  i, namelen;


    int last_opt,j;
    size_t count;

    //    size_t bsize;

    size_t minbuf, maxbuf, stepbuf;
    int minc, maxc, stepc;
    int err, color,key;
    MPI_Comm n_comm;
    double stime,etime,ttime;
    double timeout;
    double status;
    int numprocs;
    char* opname; 
    //    double sbuf[4];

    //    usec_timer_t t;

    void *attr_value;
    int flag, commsize;
    size_t bufsize;
    int rep, maxrep;

    long long deltat;
    //mpirun.lsf ./collect.sh -d 1 -minc 35 -minbuf 0 -maxbuf 1048576 -stepbuf 131072 -op reduce 

    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);
    MPI_Get_processor_name(processor_name,&namelen);

    processor_name[namelen]=(char)0;

    if ( numprocs < 2 ) {

      if (myid==0) { 
	fprintf(stderr, "--> please launch at least with 2 MPI processes\n");
      }
      return exitall(myid,0,stderr); 
    }

    minc=35;
    maxc=numprocs;
    stepc=1;
    maxbuf=1048576;
    minbuf=0;
    stepbuf=131072;
    maxrep=20;
    op=reduce;
    timeout= 30000.0/1000.0;



    if (myid==0) {
      /* sync Wtime? */
      err=MPI_Attr_get (MPI_COMM_WORLD, MPI_WTIME_IS_GLOBAL, &attr_value, &flag);
      checkAbort(MPI_COMM_WORLD,err);

      if (flag) {
	if ( *(int*)attr_value < 0 || *(int*)attr_value > 1)
	  fprintf(stdout, "The value of WTIME_IS_GLOBAL %d is not valid.\n", *(int*)attr_value );
	else 
	  fprintf(stdout, "This implementation support MPI_Wtime sync across processes! Enjoy.\n");
      }

    }

    fflush(stdout); 

    if ( (wbuf=(char*) malloc ( maxbuf*sizeof(char))) == NULL) {
       fprintf(stderr, "%d - Unable to allocate %lld bytes of memory.\n", myid, (long long int) maxbuf);
       MPI_Abort(MPI_COMM_WORLD,2);
    }

    if ( (rbuf=(char*) malloc ( maxbuf*sizeof(char))) == NULL) {
       fprintf(stderr, "%d - Unable to allocate %lld bytes of memory.\n", myid, (long long int) maxbuf);
       MPI_Abort(MPI_COMM_WORLD,2);
    }
    /* root processor =0, for easy */

    //    RENAME_UTIMER(&t, "Collective");
    // RESET_UTIMER(&t,NULL);      

    if ( myid==0) fprintf(stdout,"RA  -  RB    np   size   msec\n"); 

    for ( commsize = minc; commsize <= maxc; commsize += stepc) {

/*       err=MPI_Barrier(MPI_COMM_WORLD); */
/*       checkAbort ( MPI_COMM_WORLD, err ) ; */
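      /* Split MPI_COMM_WORLD: ranks below 'commsize' (color 1) form the
         sub-communicator that runs the timed collective; the remaining
         ranks (color 2) sit this round out. */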

      color = (myid < commsize ? 1 : 2);  key = 0;
      err=MPI_Comm_split(MPI_COMM_WORLD,color,key,&n_comm);
      checkAbort(MPI_COMM_WORLD,err);
      err=MPI_Comm_rank(n_comm, &n_myid);
      checkAbort(MPI_COMM_WORLD,err);



      if ( color == 1 ) {
	if (n_myid==0) {
	  fprintf(stdout, "%03d - %03d - %03d ROOT START\n", myid, n_myid, commsize);
	  /********************************* the ROOT BEGINS HERE *******************************************/
	} else {
	  /**************************** the NON ROOT BEGINS HERE ***********************************************/
	  fprintf(stdout, "%03d - %03d - %03d NOOT START\n", myid, n_myid, commsize);
	} 
	fflush(stdout);
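	/* Sweep the message size from minbuf to maxbuf and repeat the selected
	   collective maxrep times for each size; the timing macros are commented
	   out, hence the 0.0 printed in the result line. */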
	for ( bufsize = minbuf; bufsize <= maxbuf; bufsize += stepbuf ) {

	  count = bufsize/sizeof(int);

	  //	  if ( n_myid == 0)  START_UTIMER(&t);
	  for ( rep=0; rep< maxrep; rep++ ) {

	    err=op(n_comm, count);
	    checkAbort ( MPI_COMM_WORLD, err ) ;
	    //            err=MPI_Barrier(n_comm);
	    //            checkAbort ( MPI_COMM_WORLD, err ) ;
	  }
	  if ( n_myid == 0) { 
	    /*	    STOP_UTIMER(&t);
	    deltat=get_lastelap_utimer(&t);
	    fprintf(stdout, "%03d - %03d of %d - %lld  %f\n", n_myid, myid, commsize, (long long) count*sizeof(int), ((double) deltat)/(USEC_PER_SEC*maxrep) ); */
	    fprintf(stdout, "%03d - %03d of %d - %lld  %f\n", n_myid, myid, commsize, (long long int) (count*sizeof(int)), 0.0 );
	    fflush(stdout);
	  }

	}

	if ( n_myid == 0) { 

	  fprintf(stdout, "%03d - %03d - %03d ROOT STOP\n", myid, n_myid, commsize);
	  /********************************* the ROOT STOP HERE *******************************************/
	} else {
	  /**************************** the NON ROOT STOP HERE **********************************************/
	  fprintf(stdout, "%03d - %03d - %03d NOOT STOP\n", myid, n_myid, commsize);
	}
	fflush(stdout);
      }

      //      err=MPI_Barrier(MPI_COMM_WORLD);
      //checkAbort ( MPI_COMM_WORLD, err ) ;
      err=MPI_Comm_free(&n_comm);
      checkAbort ( MPI_COMM_WORLD, err ) ;
/*       if (n_myid==0) { */
/* 	fprintf(stdout, "%03d - %03d - %03d ROOT STOP\n", myid, n_myid, commsize); */
/* 	/\********************************* the ROOT STOP HERE *******************************************\/ */
/*       } else { */
/* 	/\**************************** the NON ROOT STOP HERE ***********************************************\/ */
/* 	fprintf(stdout, "%03d - %03d - %03d NOOT STOP\n", myid, n_myid, commsize); */
/*       } */
      fflush(stdout);


    }  




    MPI_Finalize();   
    return (0);

}


int bcast( MPI_Comm comm, int count ) 
{

  return MPI_Bcast ( wbuf, count, MPI_INT, 0, comm );

}

int reduce( MPI_Comm comm, int count ) 
{

  return MPI_Reduce ( wbuf, rbuf, count, MPI_INT, MPI_SUM, 0, comm );

}

int allreduce( MPI_Comm comm, int count ) 
{

  return MPI_Allreduce ( wbuf, rbuf, count, MPI_INT, MPI_SUM, comm );

}
