Thanks for your reply...

My coll_sync_priority is set to 50. See the dump of ompi_info --param coll sync below...

Does inserting barriers hurt anything, or is it just cosmetic? I'm fine with this solution...

Thanks
Ondrej


$ ompi_info --param coll sync
MCA coll: parameter "coll" (current value: <none>, data source: default value)
                          Default selection set of components for the coll framework (<none> means use all components that can be found)
MCA coll: parameter "coll_base_verbose" (current value: "0", data source: default value)
                          Verbosity level for the coll framework (0 = no verbosity)
MCA coll: parameter "coll_sync_priority" (current value: "50", data source: default value)
                          Priority of the sync coll component; only relevant if barrier_before or barrier_after is > 0
MCA coll: parameter "coll_sync_barrier_before" (current value: "1000", data source: default value)
                          Do a synchronization before each Nth collective
MCA coll: parameter "coll_sync_barrier_after" (current value: "0", data source: default value)
                          Do a synchronization after each Nth collective


Quoting "Ralph Castain" <r...@open-mpi.org>:

Yeah, that is "normal". It has to do with unexpected messages.

When you have procs running at significantly different speeds, the various operations get far enough out of sync that the memory consumed by received messages that have not yet been processed grows too large.

Instead of inserting barriers into your code, you can have OMPI do an internal sync after every so many operations to avoid the problem. This is done by enabling the "sync" collective component and then adjusting the number of operations between forced syncs.

Do an "ompi_info --params coll sync" to see the options. Then set the coll_sync_priority to something like 100 and it should work for you.

Ralph

On Nov 10, 2009, at 2:45 PM, Glembek Ondřej wrote:

Hi,

I am using the MPI_Reduce operation on a 122880x400 matrix of doubles. The parallel job runs on 32 machines, each with a processor of a different speed, but the same architecture and OS on all machines (x86_64). The task is a typical map-and-reduce, i.e. each process collects some data, which is then summed (MPI_Reduce with MPI_SUM).
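In code, the call boils down to something like this (a minimal sketch with made-up buffer names; only the dimensions and the summing reduction come from the description above):

  #include <mpi.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      /* 122880 x 400 doubles = 49,152,000 elements, ~375 MB per buffer */
      size_t n = 122880UL * 400UL;
      double *local  = calloc(n, sizeof(double));  /* this rank's partial sums   */
      double *global = calloc(n, sizeof(double));  /* valid on root after reduce */

      /* ... fill 'local' with the data this rank collected ... */

      MPI_Reduce(local, global, (int)n, MPI_DOUBLE, MPI_SUM,
                 0 /* root rank */, MPI_COMM_WORLD);

      free(local);
      free(global);
      MPI_Finalize();
      return 0;
  }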

Because the processors differ, each process reaches the MPI_Reduce at a different time.

The *first problem* came when I called MPI_Reduce on the whole matrix: it failed with an *MPI_ERR_OTHER* error, each time on a different rank. I fixed this by chunking the matrix into 2048 submatrices and calling MPI_Reduce in a loop.
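The chunked version might look like this (a sketch; only the chunk count of 2048 comes from my actual code, the rest is illustrative and reuses the buffers from the sketch above):

  /* reduce the matrix in 2048 pieces instead of one big call */
  const size_t nchunks = 2048;
  const size_t chunk   = n / nchunks;   /* 49,152,000 / 2048 = 24,000 doubles */

  for (size_t i = 0; i < nchunks; i++)
      MPI_Reduce(local + i * chunk, global + i * chunk,
                 (int)chunk, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);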

However, a *second problem* arose: MPI_Reduce hangs, apparently stuck in some kind of deadlock. If the processors are of similar speed, the problem seems to disappear, but I cannot guarantee that condition all the time.

I managed to get rid of the problem (at least for a few iterations) by inserting an MPI_Barrier before the MPI_Reduce call.
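That is, the loop body became something like this (again a sketch, not my exact code):

  MPI_Barrier(MPI_COMM_WORLD);   /* let the slower ranks catch up first */
  MPI_Reduce(local + i * chunk, global + i * chunk,
             (int)chunk, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);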

The questions are:

1) Is this usual behavior?
2) Is there some kind of timeout for MPI_Reduce?
3) Why does MPI_Reduce die on a large amount of data if the system has enough address space (64-bit compilation)?

Thanks
Ondrej Glembek







--
  Ondrej Glembek, PhD student  E-mail: glem...@fit.vutbr.cz
  UPGM FIT VUT Brno, L226      Web:    http://www.fit.vutbr.cz/~glembek
  Bozetechova 2, 612 66        Phone:  +420 54114-1292
  Brno, Czech Republic         Fax:    +420 54114-1290

  ICQ: 93233896
  GPG: C050 A6DC 7291 6776 9B69 BB11 C033 D756 6F33 DE3C

