I will check that, but as I said in my first email, this strange behaviour happens only in one place in my code. I have the same time/barrier/time procedure in other places (in the same code) and it works perfectly.
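For reference, the time/barrier/time pattern in question is essentially the following (a minimal, self-contained Fortran sketch based on the snippet quoted at the bottom of this thread; the rank in the output line is added here only so that each result can be attributed to a process):

===================
program barrier_timing
  use mpi
  implicit none
  integer :: ierr, rank
  double precision :: start_time, new_time

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

  ! ... imbalanced work would go here ...

  ! time how long this process sits in the barrier
  start_time = MPI_Wtime()
  call MPI_BARRIER(MPI_COMM_WORLD, ierr)
  new_time = MPI_Wtime() - start_time
  write(*,*) rank, "barrier time =", new_time

  call MPI_FINALIZE(ierr)
end program barrier_timing
===================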
At one place I have the following output (sorted):

<00>(0) CAST GHOST DATA1 LOOP 1 barrier 1.0
<00>(0) CAST GHOST DATA1 LOOP 1 barrier 1.0
<00>(0) CAST GHOST DATA1 LOOP 1 barrier 14.2
<00>(0) CAST GHOST DATA1 LOOP 1 barrier 16.3
<00>(0) CAST GHOST DATA1 LOOP 1 barrier 25.1
<00>(0) CAST GHOST DATA1 LOOP 1 barrier 28.4
<00>(0) CAST GHOST DATA1 LOOP 1 barrier 32.6
<00>(0) CAST GHOST DATA1 LOOP 1 barrier 35.3
.
.
.
<00>(0) CAST GHOST DATA1 LOOP 1 barrier 90.1
<00>(0) CAST GHOST DATA1 LOOP 1 barrier 96.3
<00>(0) CAST GHOST DATA1 LOOP 1 barrier 99.5
<00>(0) CAST GHOST DATA1 LOOP 1 barrier 101.2
<00>(0) CAST GHOST DATA1 LOOP 1 barrier 119.3
<00>(0) CAST GHOST DATA1 LOOP 1 barrier 169.3

but at the place that concerns me I have (sorted):

<00>(0) CAST GHOST DATA2 LOOP 2 barrier 1386.9
<00>(0) CAST GHOST DATA2 LOOP 2 barrier 1401.5
<00>(0) CAST GHOST DATA2 LOOP 2 barrier 1412.9
<00>(0) CAST GHOST DATA2 LOOP 2 barrier 1414.1
<00>(0) CAST GHOST DATA2 LOOP 2 barrier 1419.6
<00>(0) CAST GHOST DATA2 LOOP 2 barrier 1428.1
<00>(0) CAST GHOST DATA2 LOOP 2 barrier 1430.4
.
.
.
<00>(0) CAST GHOST DATA2 LOOP 2 barrier 1632.7
<00>(0) CAST GHOST DATA2 LOOP 2 barrier 1635.7
<00>(0) CAST GHOST DATA2 LOOP 2 barrier 1660.6
<00>(0) CAST GHOST DATA2 LOOP 2 barrier 1685.1
<00>(0) CAST GHOST DATA2 LOOP 2 barrier 1699.2

These are the same units... You can see that at the first location the
"hit/wait/leave" time can be very small, whereas at the second location
even the smallest value is huge.
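A quick way to check whether the ranks involved actually share a node
(cf. the process placement guess in the message quoted below) is a
minimal Fortran sketch along these lines, using only standard MPI calls;
depending on the Open MPI version, mpirun's --report-bindings option
reports the placement and binding more directly:

===================
program where_am_i
  use mpi
  implicit none
  integer :: ierr, rank, namelen
  character(len=MPI_MAX_PROCESSOR_NAME) :: nodename

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  ! report which node (processor) this rank is running on
  call MPI_GET_PROCESSOR_NAME(nodename, namelen, ierr)
  write(*,*) "rank", rank, "runs on ", nodename(1:namelen)
  call MPI_FINALIZE(ierr)
end program where_am_i
===================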
On Sep 8, 2011, at 4:35 PM, Teng Ma wrote:

> You'd better check the process-core binding in your case. It looks to me
> like P0 and P1 are on the same node and P2 is on another node, which makes
> the ack to P0/P1 go through shared memory and the ack to P2 go through the
> network. 1000x is very possible: sm latency can be about 0.03 microsec,
> ethernet latency is about 20-30 microsec.
>
> Just my guess......
>
> Teng
>
>> Thanks,
>>
>> I understand this, but the delays that I measure are huge compared to a
>> classical ack procedure... (1000x more)
>> And this is repeatable: as far as I understand it, this shows that the
>> network is not involved.
>>
>> Ghislain.
>>
>>
>> On Sep 8, 2011, at 4:16 PM, Teng Ma wrote:
>>
>>> I guess you forgot to count the "leaving time" (fan-out). When everyone
>>> has hit the barrier, each process still needs an "ack" to leave. And
>>> remember that in most cases the leader process sends out the "acks"
>>> sequentially. It's very possible that:
>>>
>>> P0 barrier time = 29 + send/recv ack 0
>>> P1 barrier time = 14 + send ack 0 + send/recv ack 1
>>> P2 barrier time = 0 + send ack 0 + send ack 1 + send/recv ack 2
>>>
>>> That's your measured time.
>>>
>>> Teng
>>>
>>>> This problem has nothing to do with stdout...
>>>>
>>>> Example with 3 processes:
>>>>
>>>> P0 hits the barrier at t=12
>>>> P1 hits the barrier at t=27
>>>> P2 hits the barrier at t=41
>>>>
>>>> In this situation:
>>>> P0 waits 41-12 = 29
>>>> P1 waits 41-27 = 14
>>>> P2 waits 41-41 = 00
>>>>
>>>> So I should see something like (no ordering is expected):
>>>> barrier_time = 14
>>>> barrier_time = 00
>>>> barrier_time = 29
>>>>
>>>> But what I see is much more like:
>>>> barrier_time = 22
>>>> barrier_time = 29
>>>> barrier_time = 25
>>>>
>>>> See? No process has a barrier_time equal to zero!!!
>>>>
>>>>
>>>> On Sep 8, 2011, at 2:55 PM, Jeff Squyres wrote:
>>>>
>>>>> The order in which you see stdout printed from mpirun is not
>>>>> necessarily reflective of the order in which things were actually
>>>>> printed. Remember that the stdout from each MPI process needs to
>>>>> flow through at least 3 processes and potentially across the network
>>>>> before it is actually displayed on mpirun's stdout.
>>>>>
>>>>> MPI process -> local Open MPI daemon -> mpirun -> printed to mpirun's stdout
>>>>>
>>>>> Hence, the ordering of stdout can get transposed.
>>>>>
>>>>>
>>>>> On Sep 8, 2011, at 8:49 AM, Ghislain Lartigue wrote:
>>>>>
>>>>>> Thank you for this explanation, but indeed this confirms that the
>>>>>> LAST process that hits the barrier should go through nearly
>>>>>> instantaneously (except for the broadcast time for the
>>>>>> acknowledgment signal).
>>>>>> And this is not what happens in my code: EVERY process waits for a
>>>>>> very long time before going through the barrier (thousands of times
>>>>>> more than a broadcast)...
>>>>>>
>>>>>>
>>>>>> On Sep 8, 2011, at 2:26 PM, Jeff Squyres wrote:
>>>>>>
>>>>>>> The order in which processes hit the barrier is only one factor in
>>>>>>> the time it takes for each process to finish the barrier.
>>>>>>>
>>>>>>> An easy way to think of a barrier implementation is a "fan in/fan
>>>>>>> out" model. When each nonzero rank process calls MPI_BARRIER, it
>>>>>>> sends a message saying "I have hit the barrier!" (it usually sends
>>>>>>> it to its parent in a tree of all MPI processes in the
>>>>>>> communicator, but you can simplify this model and consider that it
>>>>>>> sends it to rank 0). Rank 0 collects all of these messages. When it
>>>>>>> has messages from all processes in the communicator, it sends out
>>>>>>> "ok, you can leave the barrier now" messages (again, it's usually
>>>>>>> via a tree distribution, but you can pretend that it directly,
>>>>>>> linearly sends a message to each peer process in the communicator).
>>>>>>>
>>>>>>> Hence, the time that any individual process spends in the barrier
>>>>>>> depends on when every other process enters the barrier. But it's
>>>>>>> also dependent upon communication speed, congestion in the
>>>>>>> network, etc.
>>>>>>>
>>>>>>>
>>>>>>> On Sep 8, 2011, at 6:20 AM, Ghislain Lartigue wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> at a given point in my (Fortran90) program, I write:
>>>>>>>>
>>>>>>>> ===================
>>>>>>>> start_time = MPI_Wtime()
>>>>>>>> call MPI_BARRIER(...)
>>>>>>>> new_time = MPI_Wtime() - start_time
>>>>>>>> write(*,*) "barrier time =", new_time
>>>>>>>> ===================
>>>>>>>>
>>>>>>>> and then I run my code...
>>>>>>>>
>>>>>>>> I expected that the values of "new_time" would range from 0 to
>>>>>>>> Tmax (1700 in my case).
>>>>>>>> As I understand it, the first process that hits the barrier should
>>>>>>>> print Tmax and the last process that hits the barrier should print
>>>>>>>> 0 (or a very low value).
>>>>>>>>
>>>>>>>> But this is not the case: all processes print values in the range
>>>>>>>> 1400-1700!
>>>>>>>>
>>>>>>>> Any explanation?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ghislain.
>>>>>>>>
>>>>>>>> PS:
>>>>>>>> This small code behaves perfectly in other parts of my code...
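To make the "fan in/fan out" model quoted above concrete, here is a deliberately naive Fortran sketch of such a barrier (a linear illustration only, not Open MPI's actual tree-based implementation):

===================
program test_naive_barrier
  use mpi
  implicit none
  integer :: ierr
  call MPI_INIT(ierr)
  call naive_barrier(MPI_COMM_WORLD, ierr)
  call MPI_FINALIZE(ierr)
end program test_naive_barrier

! Linear "fan in / fan out" barrier: every rank tells rank 0 that it has
! arrived, then rank 0 releases the ranks one after the other.
subroutine naive_barrier(comm, ierr)
  use mpi
  implicit none
  integer, intent(in)  :: comm
  integer, intent(out) :: ierr
  integer :: rank, nprocs, i, dummy
  integer :: status(MPI_STATUS_SIZE)

  dummy = 0
  call MPI_COMM_RANK(comm, rank, ierr)
  call MPI_COMM_SIZE(comm, nprocs, ierr)

  if (rank == 0) then
     ! fan-in: wait for the "I have hit the barrier!" message of every rank
     do i = 1, nprocs - 1
        call MPI_RECV(dummy, 1, MPI_INTEGER, MPI_ANY_SOURCE, 0, comm, status, ierr)
     end do
     ! fan-out: send the "ok, you can leave the barrier now" messages
     do i = 1, nprocs - 1
        call MPI_SEND(dummy, 1, MPI_INTEGER, i, 1, comm, ierr)
     end do
  else
     ! tell rank 0 that we have arrived, then wait for the release
     call MPI_SEND(dummy, 1, MPI_INTEGER, 0, 0, comm, ierr)
     call MPI_RECV(dummy, 1, MPI_INTEGER, 0, 1, comm, status, ierr)
  end if
end subroutine naive_barrier
===================

Seen this way, even the last process to arrive still pays the fan-out cost, which is the point of the replies quoted above; the open question in this thread is why that offset is so large at one particular location.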
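Finally, if it helps to quantify the observation that no rank ever reports a near-zero time, the per-rank barrier times can be reduced to a minimum and a maximum on rank 0 instead of sorting the stdout by hand (a minimal sketch; variable names are illustrative):

===================
program barrier_spread
  use mpi
  implicit none
  integer :: ierr, rank
  double precision :: start_time, new_time, t_min, t_max

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

  ! ... imbalanced work would go here ...

  start_time = MPI_Wtime()
  call MPI_BARRIER(MPI_COMM_WORLD, ierr)
  new_time = MPI_Wtime() - start_time

  ! collect the smallest and the largest barrier time on rank 0
  call MPI_REDUCE(new_time, t_min, 1, MPI_DOUBLE_PRECISION, MPI_MIN, 0, &
                  MPI_COMM_WORLD, ierr)
  call MPI_REDUCE(new_time, t_max, 1, MPI_DOUBLE_PRECISION, MPI_MAX, 0, &
                  MPI_COMM_WORLD, ierr)

  if (rank == 0) write(*,*) "barrier time: min =", t_min, " max =", t_max

  call MPI_FINALIZE(ierr)
end program barrier_spread
===================

With the numbers reported above, the minimum at the second location would come out around 1400 instead of close to zero.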