You'd better check process-core binding in your case. It looks to me like P0 and P1 are on the same node and P2 is on another node, which makes the ack to P0/P1 go through shared memory and the ack to P2 go through the network. A 1000x difference is very possible: shared-memory latency can be about 0.03 microseconds, while Ethernet latency is about 20-30 microseconds.

Just my guess......

Teng
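[Editorial note: to check that guess directly, a minimal ping-pong between two ranks gives the point-to-point latency to compare against the numbers above. The sketch below is an editorial Fortran 90 illustration, not code from the thread; the program name, repetition count, and one-integer message are arbitrary choices. Recent Open MPI versions can also show where each rank is bound via mpirun's --report-bindings option.]

===================
! Ping-pong latency sketch (editorial addition, not from the thread).
! Run it once with ranks 0 and 1 on the same node and once with them on
! different nodes to compare shared-memory vs. network latency.
program pingpong
  use mpi
  implicit none
  integer, parameter :: nreps = 1000
  integer :: ierr, rank, i, token, status(MPI_STATUS_SIZE)
  double precision :: t0, t1

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  token = 0

  call MPI_Barrier(MPI_COMM_WORLD, ierr)
  t0 = MPI_Wtime()
  do i = 1, nreps
     if (rank == 0) then
        call MPI_Send(token, 1, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, ierr)
        call MPI_Recv(token, 1, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, status, ierr)
     else if (rank == 1) then
        call MPI_Recv(token, 1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, status, ierr)
        call MPI_Send(token, 1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, ierr)
     end if
  end do
  t1 = MPI_Wtime()

  ! Each iteration is one round trip, so divide by 2*nreps for one-way latency.
  if (rank == 0) write(*,*) "one-way latency (s) =", (t1 - t0) / (2*nreps)
  call MPI_Finalize(ierr)
end program pingpong
==================

Comparing an on-node pair against an off-node pair should show roughly the 0.03 vs. 20-30 microsecond gap mentioned above, if the binding is what it looks like.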
> Thanks,
>
> I understand this, but the delays that I measure are huge compared to a
> classical ack procedure... (1000x more)
> And this is repeatable: as far as I understand it, this shows that the
> network is not involved.
>
> Ghislain.
>
>
> On Sep 8, 2011, at 16:16, Teng Ma wrote:
>
>> I guess you forgot to count the "leaving time" (fan-out). When everyone
>> hits the barrier, each process still needs an "ack" to leave. And
>> remember that in most cases the leader process sends out the "acks"
>> sequentially. It's very possible that:
>>
>> P0 barrier time = 29 + send/recv ack 0
>> P1 barrier time = 14 + send ack 0 + send/recv ack 1
>> P2 barrier time = 0 + send ack 0 + send ack 1 + send/recv ack 2
>>
>> That's your measured time.
>>
>> Teng
>>
>>> This problem has nothing to do with stdout...
>>>
>>> Example with 3 processes:
>>>
>>> P0 hits barrier at t=12
>>> P1 hits barrier at t=27
>>> P2 hits barrier at t=41
>>>
>>> In this situation:
>>> P0 waits 41-12 = 29
>>> P1 waits 41-27 = 14
>>> P2 waits 41-41 = 00
>>>
>>> So I should see something like (no ordering is expected):
>>> barrier_time = 14
>>> barrier_time = 00
>>> barrier_time = 29
>>>
>>> But what I see is much more like
>>> barrier_time = 22
>>> barrier_time = 29
>>> barrier_time = 25
>>>
>>> See? No process has a barrier_time equal to zero!!!
>>>
>>>
>>> On Sep 8, 2011, at 14:55, Jeff Squyres wrote:
>>>
>>>> The order in which you see stdout printed from mpirun is not
>>>> necessarily reflective of the order in which things were actually
>>>> printed. Remember that the stdout from each MPI process needs to flow
>>>> through at least 3 processes and potentially across the network
>>>> before it is actually displayed on mpirun's stdout.
>>>>
>>>> MPI process -> local Open MPI daemon -> mpirun -> printed to mpirun's
>>>> stdout
>>>>
>>>> Hence, the ordering of stdout can get transposed.
>>>>
>>>>
>>>> On Sep 8, 2011, at 8:49 AM, Ghislain Lartigue wrote:
>>>>
>>>>> Thank you for this explanation, but indeed this confirms that the
>>>>> LAST process that hits the barrier should go through nearly
>>>>> instantaneously (except for the broadcast time for the
>>>>> acknowledgment signal).
>>>>> And this is not what happens in my code: EVERY process waits for a
>>>>> very long time before going through the barrier (thousands of times
>>>>> longer than a broadcast)...
>>>>>
>>>>>
>>>>> On Sep 8, 2011, at 14:26, Jeff Squyres wrote:
>>>>>
>>>>>> The order in which processes hit the barrier is only one factor in
>>>>>> the time it takes for a given process to finish the barrier.
>>>>>>
>>>>>> An easy way to think of a barrier implementation is a "fan in/fan
>>>>>> out" model. When each nonzero-rank process calls MPI_BARRIER, it
>>>>>> sends a message saying "I have hit the barrier!" (it usually sends
>>>>>> it to its parent in a tree of all MPI processes in the
>>>>>> communicator, but you can simplify this model and consider that it
>>>>>> sends it to rank 0). Rank 0 collects all of these messages. When it
>>>>>> has messages from all processes in the communicator, it sends out
>>>>>> "ok, you can leave the barrier now" messages (again, it's usually
>>>>>> via a tree distribution, but you can pretend that it directly,
>>>>>> linearly sends a message to each peer process in the communicator).
>>>>>>
>>>>>> Hence, the time that any individual process spends in the barrier
>>>>>> is relative to when every other process enters the barrier. But
>>>>>> it's also dependent upon communication speed, congestion in the
>>>>>> network, etc.
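[Editorial note: to make this simplified linear model concrete, here is a rough Fortran 90 sketch of a purely linear fan-in/fan-out barrier. It is an illustration of the model described above, not Open MPI's actual implementation, which uses tree-based algorithms; the subroutine name, message tags, and dummy payload are arbitrary choices.]

===================
! Linear fan-in/fan-out barrier sketch (editorial illustration only;
! this is NOT how Open MPI implements MPI_BARRIER).
subroutine linear_barrier(comm, ierr)
  use mpi
  implicit none
  integer, intent(in)  :: comm
  integer, intent(out) :: ierr
  integer :: rank, nprocs, i, dummy, status(MPI_STATUS_SIZE)

  call MPI_Comm_rank(comm, rank, ierr)
  call MPI_Comm_size(comm, nprocs, ierr)
  dummy = 0

  if (rank /= 0) then
     ! Fan-in: tell rank 0 "I have hit the barrier", then wait for the release.
     call MPI_Send(dummy, 1, MPI_INTEGER, 0, 1, comm, ierr)
     call MPI_Recv(dummy, 1, MPI_INTEGER, 0, 2, comm, status, ierr)
  else
     ! Rank 0 collects one notification from every other rank ...
     do i = 1, nprocs - 1
        call MPI_Recv(dummy, 1, MPI_INTEGER, MPI_ANY_SOURCE, 1, comm, status, ierr)
     end do
     ! ... and only then sends the "you can leave now" acks, one by one.
     do i = 1, nprocs - 1
        call MPI_Send(dummy, 1, MPI_INTEGER, i, 2, comm, ierr)
     end do
  end if
end subroutine linear_barrier
==================

Even in this model, the last rank to arrive still has to wait for its release message to travel back, and because the acks go out one by one the later ranks wait a little longer. That is the fan-out effect Teng Ma describes above: no rank ever measures exactly zero.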
>>>>>>
>>>>>> On Sep 8, 2011, at 6:20 AM, Ghislain Lartigue wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> at a given point in my (Fortran90) program, I write:
>>>>>>>
>>>>>>> ===================
>>>>>>> start_time = MPI_Wtime()
>>>>>>> call MPI_BARRIER(...)
>>>>>>> new_time = MPI_Wtime() - start_time
>>>>>>> write(*,*) "barrier time =",new_time
>>>>>>> ==================
>>>>>>>
>>>>>>> and then I run my code...
>>>>>>>
>>>>>>> I expected that the values of "new_time" would range from 0 to
>>>>>>> Tmax (1700 in my case).
>>>>>>> As I understand it, the first process that hits the barrier should
>>>>>>> print Tmax and the last process that hits the barrier should print
>>>>>>> 0 (or a very low value).
>>>>>>>
>>>>>>> But this is not the case: all processes print values in the range
>>>>>>> 1400-1700!
>>>>>>>
>>>>>>> Any explanation?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ghislain.
>>>>>>>
>>>>>>> PS:
>>>>>>> This small code behaves perfectly in other parts of my code...

| Teng Ma                        Univ. of Tennessee |
| t...@cs.utk.edu                Knoxville, TN      |
| http://web.eecs.utk.edu/~tma/                     |
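[Editorial note: for reference, the timing fragment quoted in the original question, expanded into a self-contained program. This is a sketch only: the MPI_COMM_WORLD communicator, the rank printout, and the surrounding scaffolding are assumptions, since the original fragment elides them.]

===================
! Self-contained version of the barrier-timing fragment quoted above
! (editorial sketch; the communicator and scaffolding are assumptions).
program barrier_timing
  use mpi
  implicit none
  integer :: ierr, rank
  double precision :: start_time, new_time

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! ... the rest of the application would run here ...

  start_time = MPI_Wtime()
  call MPI_BARRIER(MPI_COMM_WORLD, ierr)
  new_time = MPI_Wtime() - start_time
  write(*,*) "rank", rank, "barrier time =", new_time

  call MPI_Finalize(ierr)
end program barrier_timing
==================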