Re: [OMPI users] Problem with MPI_BARRIER

Eugene Loh Thu, 8 Sep 2011 10:41:51 -0400

I agree sentimentally with Ghislain. The time spent in a barriershould conceptually be some wait time, which can be very long (possiblyon the order of milliseconds or even seconds), and the time to executethe barrier operations, which should essentially be "instantaneous" onsome time scale... in any case, very fast (probably on the order ofmicroseconds). The "laggard" who holds up the barrier operation shouldreport a very fast time. The other processes might show very muchlonger times, depending on how well synchronized they were. At leastone, however, should be "fast".

I agree with others that it'd be nice to know the time units so that wecan judge the speeds.

Anyhow, is this on the first barrier of the program? If one repeats thetiming operation several times in succession, do the barrier timesremain high?


On 09/08/11 10:27, Jai Dayal wrote:

what tick value are you using (i.e., what units are you using?)

On Thu, Sep 8, 2011 at 10:25 AM, Ghislain Lartigue<[email protected] <mailto:[email protected]>> wrote:


    Thanks,

    I understand this but the delays that I measure are huge compared
    to a classical ack procedure... (1000x more)
    And this is repeatable: as far as I understand it, this shows that
    the network is not involved.

    Ghislain.


    Le 8 sept. 2011 à 16:16, Teng Ma a écrit :

    > I guess you forget to count the "leaving time"(fan-out).  When
    everyone
    > hits the barrier, it still needs "ack" to leave.  And remember
    in most
    > cases, leader process will send out "acks" in a sequence way.
     It's very
    > possible:
    >
    > P0 barrier time = 29 + send/recv ack 0
    > P1 barrier time = 14 + send ack 0  + send/recv ack 1
    > P2 barrier time = 0 + send ack 0 + send ack 1 + send/recv ack 2
    >
    > That's your measure time.
    >
    > Teng
    >> This problem as nothing to do with stdout...
    >>
    >> Example with 3 processes:
    >>
    >> P0 hits barrier at t=12
    >> P1 hits barrier at t=27
    >> P2 hits barrier at t=41
    >>
    >> In this situation:
    >> P0 waits 41-12 = 29
    >> P1 waits 41-27 = 14
    >> P2 waits 41-41 = 00
    >
    >
    >
    >> So I should see something  like (no ordering is expected):
    >> barrier_time = 14
    >> barrier_time = 00
    >> barrier_time = 29
    >>
    >> But what I see is much more like
    >> barrier_time = 22
    >> barrier_time = 29
    >> barrier_time = 25
    >>
    >> See? No process has a barrier_time equal to zero !!!
    >>
    >>
    >>
    >> Le 8 sept. 2011 à 14:55, Jeff Squyres a écrit :
    >>
    >>> The order in which you see stdout printed from mpirun is not
    necessarily
    >>> reflective of what order things were actually printers.
     Remember that
    >>> the stdout from each MPI process needs to flow through at least 3
    >>> processes and potentially across the network before it is actually
    >>> displayed on mpirun's stdout.
    >>>
    >>> MPI process -> local Open MPI daemon -> mpirun -> printed to
    mpirun's
    >>> stdout
    >>>
    >>> Hence, the ordering of stdout can get transposed.
    >>>
    >>>
    >>> On Sep 8, 2011, at 8:49 AM, Ghislain Lartigue wrote:
    >>>
    >>>> Thank you for this explanation but indeed this confirms that
    the LAST
    >>>> process that hits the barrier should go through nearly
    instantaneously
    >>>> (except for the broadcast time for the acknowledgment signal).
    >>>> And this is not what happens in my code : EVERY process waits
    for a
    >>>> very long time before going through the barrier (thousands of
    times
    >>>> more than a broadcast)...
    >>>>
    >>>>
    >>>> Le 8 sept. 2011 à 14:26, Jeff Squyres a écrit :
    >>>>
    >>>>> Order in which processes hit the barrier is only one factor
    in the
    >>>>> time it takes for that process to finish the barrier.
    >>>>>
    >>>>> An easy way to think of a barrier implementation is a "fan
    in/fan out"
    >>>>> model.  When each nonzero rank process calls MPI_BARRIER, it
    sends a
    >>>>> message saying "I have hit the barrier!" (it usually sends
    it to its
    >>>>> parent in a tree of all MPI processes in the communicator,
    but you can
    >>>>> simplify this model and consider that it sends it to rank
    0).  Rank 0
    >>>>> collects all of these messages.  When it has messages from all
    >>>>> processes in the communicator, it sends out "ok, you can
    leave the
    >>>>> barrier now" messages (again, it's usually via a tree
    distribution,
    >>>>> but you can pretend that it directly, linearly sends a
    message to each
    >>>>> peer process in the communicator).
    >>>>>
    >>>>> Hence, the time that any individual process spends in the
    communicator
    >>>>> is relative to when every other process enters the
    communicator.  But
    >>>>> it's also dependent upon communication speed, congestion in the
    >>>>> network, etc.
    >>>>>
    >>>>>
    >>>>> On Sep 8, 2011, at 6:20 AM, Ghislain Lartigue wrote:
    >>>>>
    >>>>>> Hello,
    >>>>>>
    >>>>>> at a given point in my (Fortran90) program, I write:
    >>>>>>
    >>>>>> ===================
    >>>>>> start_time = MPI_Wtime()
    >>>>>> call MPI_BARRIER(...)
    >>>>>> new_time = MPI_Wtime() - start_time
    >>>>>> write(*,*) "barrier time =",new_time
    >>>>>> ==================
    >>>>>>
    >>>>>> and then I run my code...
    >>>>>>
    >>>>>> I expected that the values of "new_time" would range from 0
    to Tmax
    >>>>>> (1700 in my case)
    >>>>>> As I understand it, the first process that hits the barrier
    should
    >>>>>> print Tmax and the last process that hits the barrier
    should print 0
    >>>>>> (or a very low value).
    >>>>>>
    >>>>>> But this is not the case: all processes print values in the
    range
    >>>>>> 1400-1700!
    >>>>>>
    >>>>>> Any explanation?
    >>>>>>
    >>>>>> Thanks,
    >>>>>> Ghislain.
    >>>>>>
    >>>>>> PS:
    >>>>>> This small code behaves perfectly in other parts of my code...

Re: [OMPI users] Problem with MPI_BARRIER

Reply via email to