And to clarify things: the numbers I report are not the direct result of MPI_Wtime():

new_time = (MPI_Wtime() - start_time) * 1e9 / (36**3)

This means that a raw MPI_Wtime() difference in seconds is multiplied by roughly 20,000 before being printed.
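As a sanity check, here is a small sketch of that rescaling (plain Python rather than the thread's Fortran, purely to illustrate the arithmetic; the 36**3 grid size is taken from the formula above):

```python
# Sketch of the rescaling applied before printing: a raw MPI_Wtime()
# difference in seconds is multiplied by 1e9 / 36**3 (about 21,000).
GRID = 36 ** 3            # 46656
SCALE = 1e9 / GRID        # ~21433, i.e. roughly 20'000

def reported(elapsed_seconds):
    """Value as printed by the code in this thread."""
    return elapsed_seconds * SCALE

def to_seconds(reported_value):
    """Invert the rescaling to recover raw seconds."""
    return reported_value / SCALE

print(round(SCALE))                  # 21433
print(round(to_seconds(1700.0), 3))  # 0.079 -- a printed "1700" is ~0.08 s
```

So the barrier times discussed below (~1400-1700 in these units) correspond to elapsed times on the order of 0.1 s, far above any single message latency.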

On Sep 8, 2011, at 4:42 PM, Ghislain Lartigue wrote:

> I will check that, but as I said in my first email, this strange behaviour 
> happens in only one place in my code.
> I have the same time/barrier/time procedure in other places (in the same 
> code) and it works perfectly.
> 
> In one place I have the following output (sorted):
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier       1.0
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier       1.0
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier      14.2
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier      16.3
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier      25.1
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier      28.4
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier      32.6
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier      35.3
> .
> .
> .
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier      90.1
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier      96.3
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier      99.5
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier     101.2
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier     119.3
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier     169.3
> 
> but in the place that concerns me I have (sorted):
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier    1386.9
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier    1401.5
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier    1412.9
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier    1414.1
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier    1419.6
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier    1428.1
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier    1430.4
> .
> .
> .
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier    1632.7
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier    1635.7
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier    1660.6
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier    1685.1
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier    1699.2
> 
> 
> These are in the same units...
> You can see that in the first case, the time to "hit/wait/leave" the 
> barrier can be very small, unlike in the second output...
> 
> 
> On Sep 8, 2011, at 4:35 PM, Teng Ma wrote:
> 
>> You'd better check process-to-core binding in your case.  It looks to me
>> as if P0 and P1 are on the same node and P2 is on another node, which makes
>> the ack to P0/P1 go through shared memory and the ack to P2 go through the
>> network.
>> A 1000x difference is very possible: shared-memory latency can be about
>> 0.03 microseconds, while Ethernet latency is about 20-30 microseconds.
>> 
>> Just my guess......
>> 
>> Teng
>>> Thanks,
>>> 
>>> I understand this, but the delays I measure are huge compared to a
>>> classical ack procedure... (1000x more).
>>> And it is repeatable: as far as I understand it, this shows that the
>>> network is not involved.
>>> 
>>> Ghislain.
>>> 
>>> 
>>> On Sep 8, 2011, at 4:16 PM, Teng Ma wrote:
>>> 
>>>> I guess you forgot to count the "leaving time" (fan-out).  When everyone
>>>> has hit the barrier, each process still needs an "ack" to leave.  And
>>>> remember that in most cases the leader process sends out the "acks"
>>>> sequentially.  It's very possible that:
>>>> 
>>>> P0 barrier time = 29 + send/recv ack 0
>>>> P1 barrier time = 14 + send ack 0  + send/recv ack 1
>>>> P2 barrier time = 0 + send ack 0 + send ack 1 + send/recv ack 2
>>>> 
>>>> That's the time you measured.
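To make the sequential-ack effect concrete, here is a toy model in plain Python (a constant per-ack cost is an assumption for illustration; this is not Open MPI's actual barrier algorithm):

```python
# Toy model: every rank waits for the last arrival, then the leader
# hands out "leave" acks one at a time, so even the last rank to
# arrive measures a nonzero barrier time.
def barrier_times(arrivals, ack_cost):
    """arrivals[i] = time rank i hits the barrier; ack_cost = time per ack."""
    last = max(arrivals)
    times = []
    for rank, t in enumerate(arrivals):
        exit_time = last + (rank + 1) * ack_cost  # acks sent in rank order
        times.append(exit_time - t)
    return times

# The 3-process example quoted below: P0 arrives at t=12, P1 at 27, P2 at 41.
print(barrier_times([12, 27, 41], ack_cost=1))  # [30, 16, 3]: nobody sees 0
print(barrier_times([12, 27, 41], ack_cost=0))  # [29, 14, 0]: the naive picture
```

With any nonzero ack cost, the minimum measured barrier time is strictly positive, which is the point being made here.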
>>>> 
>>>> Teng
>>>>> This problem has nothing to do with stdout...
>>>>> 
>>>>> Example with 3 processes:
>>>>> 
>>>>> P0 hits barrier at t=12
>>>>> P1 hits barrier at t=27
>>>>> P2 hits barrier at t=41
>>>>> 
>>>>> In this situation:
>>>>> P0 waits 41-12 = 29
>>>>> P1 waits 41-27 = 14
>>>>> P2 waits 41-41 = 00
>>>>> So I should see something  like (no ordering is expected):
>>>>> barrier_time = 14
>>>>> barrier_time = 00
>>>>> barrier_time = 29
>>>>> 
>>>>> But what I see is much more like
>>>>> barrier_time = 22
>>>>> barrier_time = 29
>>>>> barrier_time = 25
>>>>> 
>>>>> See? No process has a barrier_time equal to zero!
>>>>> 
>>>>> 
>>>>> 
>>>>> On Sep 8, 2011, at 2:55 PM, Jeff Squyres wrote:
>>>>> 
>>>>>> The order in which you see stdout printed from mpirun is not
>>>>>> necessarily reflective of the order in which things were actually
>>>>>> printed.  Remember that the stdout from each MPI process needs to flow
>>>>>> through at least 3 processes, and potentially across the network,
>>>>>> before it is actually displayed on mpirun's stdout.
>>>>>> 
>>>>>> MPI process -> local Open MPI daemon -> mpirun -> printed to mpirun's
>>>>>> stdout
>>>>>> 
>>>>>> Hence, the ordering of stdout can get transposed.
>>>>>> 
>>>>>> 
>>>>>> On Sep 8, 2011, at 8:49 AM, Ghislain Lartigue wrote:
>>>>>> 
>>>>>>> Thank you for this explanation, but indeed it confirms that the LAST
>>>>>>> process to hit the barrier should go through nearly instantaneously
>>>>>>> (except for the broadcast time of the acknowledgment signal).
>>>>>>> And this is not what happens in my code: EVERY process waits for a
>>>>>>> very long time before going through the barrier (thousands of times
>>>>>>> longer than a broadcast)...
>>>>>>> 
>>>>>>> 
>>>>>>> On Sep 8, 2011, at 2:26 PM, Jeff Squyres wrote:
>>>>>>> 
>>>>>>>> The order in which processes hit the barrier is only one factor in
>>>>>>>> the time it takes for that process to finish the barrier.
>>>>>>>> 
>>>>>>>> An easy way to think of a barrier implementation is a "fan in/fan
>>>>>>>> out"
>>>>>>>> model.  When each nonzero rank process calls MPI_BARRIER, it sends a
>>>>>>>> message saying "I have hit the barrier!" (it usually sends it to its
>>>>>>>> parent in a tree of all MPI processes in the communicator, but you
>>>>>>>> can
>>>>>>>> simplify this model and consider that it sends it to rank 0).  Rank
>>>>>>>> 0
>>>>>>>> collects all of these messages.  When it has messages from all
>>>>>>>> processes in the communicator, it sends out "ok, you can leave the
>>>>>>>> barrier now" messages (again, it's usually via a tree distribution,
>>>>>>>> but you can pretend that it directly, linearly sends a message to
>>>>>>>> each
>>>>>>>> peer process in the communicator).
>>>>>>>> 
>>>>>>>> Hence, the time that any individual process spends in the barrier
>>>>>>>> is relative to when every other process enters the barrier.  But
>>>>>>>> it's also dependent upon communication speed, congestion in the
>>>>>>>> network, etc.
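The fan-in/fan-out model described above can be sketched as a toy calculation (plain Python, assuming a binary tree and a uniform per-hop latency; purely illustrative, not Open MPI's actual implementation):

```python
import math

# Toy model of a tree barrier: the fan-in completes only after the last
# arrival plus the hops up the tree; the release then fans back out, so
# every rank, including the last arriver, pays the fan-out latency.
def time_in_barrier(arrivals, hop_latency):
    n = len(arrivals)
    depth = math.ceil(math.log2(n)) if n > 1 else 0
    release = max(arrivals) + depth * hop_latency   # fan-in finishes
    exit_time = release + depth * hop_latency       # release reaches everyone
    return [exit_time - t for t in arrivals]

# Same 3-process arrival pattern as earlier in the thread.
print(time_in_barrier([12, 27, 41], hop_latency=1))  # [33, 18, 4]
```

Even in this idealized model the last process to arrive spends 2 * depth * hop_latency inside the barrier rather than zero; congestion and scheduling noise only add to that.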
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Sep 8, 2011, at 6:20 AM, Ghislain Lartigue wrote:
>>>>>>>> 
>>>>>>>>> Hello,
>>>>>>>>> 
>>>>>>>>> at a given point in my (Fortran90) program, I write:
>>>>>>>>> 
>>>>>>>>> ===================
>>>>>>>>> start_time = MPI_Wtime()
>>>>>>>>> call MPI_BARRIER(...)
>>>>>>>>> new_time = MPI_Wtime() - start_time
>>>>>>>>> write(*,*) "barrier time =",new_time
>>>>>>>>> ==================
>>>>>>>>> 
>>>>>>>>> and then I run my code...
>>>>>>>>> 
>>>>>>>>> I expected the values of "new_time" to range from 0 to Tmax
>>>>>>>>> (1700 in my case).
>>>>>>>>> As I understand it, the first process to hit the barrier should
>>>>>>>>> print Tmax and the last process to hit the barrier should print 0
>>>>>>>>> (or a very low value).
>>>>>>>>> 
>>>>>>>>> But this is not the case: all processes print values in the range
>>>>>>>>> 1400-1700!
>>>>>>>>> 
>>>>>>>>> Any explanation?
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Ghislain.
>>>>>>>>> 
>>>>>>>>> PS:
>>>>>>>>> This small code behaves perfectly in other parts of my code...
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> _______________________________________________
>>>>>>>>> users mailing list
>>>>>>>>> us...@open-mpi.org
>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Jeff Squyres
>>>>>>>> jsquy...@cisco.com
>>>>>>>> For corporate legal information go to:
>>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> | Teng Ma          Univ. of Tennessee |
>>>> | t...@cs.utk.edu        Knoxville, TN |
>>>> | http://web.eecs.utk.edu/~tma/       |
>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> 
> 
> 
> 

