Re: [OMPI users] Problem with MPI_Barrier (Inter-communicator)

2012-04-09 Thread Thatyene Louise Alves de Souza Ramos
Edgar, I forgot to answer your previous question. I used MPI 1.5.4 and the C++ API. Thatyene Ramos On Mon, Apr 9, 2012 at 8:00 PM, Thatyene Louise Alves de Souza Ramos < thaty...@gmail.com> wrote: > Hi Edgar, sorry about the late response. I've been travelling without > Internet access. > > Wel

Re: [OMPI users] Problem with MPI_Barrier (Inter-communicator)

2012-04-09 Thread Thatyene Louise Alves de Souza Ramos
Hi Edgar, sorry about the late response. I've been travelling without Internet access. Well, I took the code Rodrigo provided and modified the client to make the dup after the creation of the new inter communicator, without 1 process. That is, I just replaced the lines 54-55 in the *removeRank* me

Re: [OMPI users] Problem with MPI_Barrier (Inter-communicator)

2012-04-05 Thread Edgar Gabriel
so just to confirm, I ran our test suite for inter-communicator collective operations and communicator duplication, and everything still works. Specifically comm_dup on an intercommunicator is not fundamentally broken, but worked for my tests. Having your code to see what your code precisely does

Re: [OMPI users] Problem with MPI_Barrier (Inter-communicator)

2012-04-05 Thread Edgar Gabriel
can you please send me your testcode? Thanks Edgar On 4/4/2012 3:09 PM, Thatyene Louise Alves de Souza Ramos wrote: > Hi Edgar, thank you for the response. > > Unfortunately, I've tried with and without this option. In both the > result was the same... =( > > On Wed, Apr 4, 2012 at 5:04 PM, Edga

Re: [OMPI users] Problem with MPI_Barrier (Inter-communicator)

2012-04-04 Thread Thatyene Louise Alves de Souza Ramos
Hi Edgar, thank you for the response. Unfortunately, I've tried with and without this option. In both the result was the same... =( On Wed, Apr 4, 2012 at 5:04 PM, Edgar Gabriel wrote: > did you try to start the program with the --mca coll ^inter switch that > I mentioned? Collective dup for in

Re: [OMPI users] Problem with MPI_Barrier (Inter-communicator)

2012-04-04 Thread Edgar Gabriel
Did you try to start the program with the --mca coll ^inter switch that I mentioned? Collective dup for intercommunicators should work; it's probably again the bcast over a communicator of size 1 that is causing the hang, and you could avoid it with the flag that I mentioned above. Also, if you cou

Re: [OMPI users] Problem with MPI_Barrier (Inter-communicator)

2012-04-04 Thread Thatyene Louise Alves de Souza Ramos
Hi there. I've run some tests related to the problem reported by Rodrigo, and I think (though I'd rather be wrong) that collective calls like Create and Dup do not work with inter-communicators. I've tried this in the client group: MPI::Intercomm tmp_inter_comm; tmp_inter_comm = server_comm.Crea

Re: [OMPI users] Problem with MPI_Barrier (Inter-communicator)

2012-03-28 Thread Edgar Gabriel
it just uses a different algorithm which avoids the bcast on a communicator of 1 (which is causing the problem here). Thanks Edgar On 3/28/2012 12:08 PM, Rodrigo Oliveira wrote: > Hi Edgar, > > I tested the execution of my code using the option -mca coll ^inter as > you suggested and the program

Re: [OMPI users] Problem with MPI_Barrier (Inter-communicator)

2012-03-28 Thread Rodrigo Oliveira
Hi Edgar, I tested the execution of my code using the option -mca coll ^inter as you suggested and the program worked fine, even when I use 1 server instance. What is the modification caused by this parameter? I did not find an explanation about the utilization of the module coll inter. Thanks a

Re: [OMPI users] Problem with MPI_Barrier (Inter-communicator)

2012-03-27 Thread Rodrigo Oliveira
Hi Edgar. Thanks for the response. I just did not understand why the Barrier works before I remove one of the client processes. I tried it with 1 server and 3 clients and it worked properly. After I removed 1 of the clients, it stopped working. So, the removal is affecting the functionality of Barr

Re: [OMPI users] Problem with MPI_Barrier (Inter-communicator)

2012-03-26 Thread Edgar Gabriel
Yes and no. So first, here is a quick fix for you: if you start the server using mpirun -np 2 -mca coll ^inter ./server your test code finishes (with one minor modification to your code, namely the process that is being excluded on the client side does need a condition to leave the while loop as

Re: [OMPI users] Problem with MPI_Barrier (Inter-communicator)

2012-03-26 Thread Rodrigo Oliveira
Hi Edgar, Did you take a look at my code? Any idea about what is happening? I did a lot of tests and it does not work. Thanks On Tue, Mar 20, 2012 at 3:43 PM, Rodrigo Oliveira wrote: > The command I use to compile and run is: > > mpic++ server.cc -o server && mpic++ client.cc -o client && mpir

Re: [OMPI users] Problem with MPI_Barrier (Inter-communicator)

2012-03-20 Thread Rodrigo Oliveira
The command I use to compile and run is: mpic++ server.cc -o server && mpic++ client.cc -o client && mpirun -np 1 ./server Rodrigo On Tue, Mar 20, 2012 at 3:40 PM, Rodrigo Oliveira wrote: > Hi Edgar. > > Thanks for the response. The simplified code is attached: server, client > and a .h contai

Re: [OMPI users] Problem with MPI_Barrier (Inter-communicator)

2012-03-20 Thread Rodrigo Oliveira
Hi Edgar. Thanks for the response. The simplified code is attached: server, client and a .h containing some constants. I put some "prints" to show the behavior. Regards Rodrigo On Tue, Mar 20, 2012 at 11:47 AM, Edgar Gabriel wrote: > do you have by any chance the actual or a small reproducer

Re: [OMPI users] Problem with MPI_Barrier (Inter-communicator)

2012-03-20 Thread Edgar Gabriel
do you have by any chance the actual or a small reproducer? It might be much easier to hunt the problem down... Thanks Edgar On 3/19/2012 8:12 PM, Rodrigo Oliveira wrote: > Hi there. > > I am facing a very strange problem when using MPI_Barrier over an > inter-communicator after some operations

Re: [OMPI users] Problem with MPI_BARRIER

2011-09-12 Thread Ghislain Lartigue
Thank you: this is very enlightening. I will try this and let you know... Ghislain. On 9 Sept. 2011 at 18:00, Eugene Loh wrote: > > > On 9/8/2011 11:47 AM, Ghislain Lartigue wrote: >> I guess you're perfectly right! >> I will try to test it tomorrow by putting a call system("wait(X)) before t

Re: [OMPI users] Problem with MPI_BARRIER

2011-09-09 Thread Eugene Loh
On 9/8/2011 11:47 AM, Ghislain Lartigue wrote: I guess you're perfectly right! I will try to test it tomorrow by putting a call system("wait(X)) before the barrier! What does "wait(X)" mean? Anyhow, here is how I see your computation: A) The first barrier simply synchronizes the processes.

Re: [OMPI users] Problem with MPI_BARRIER

2011-09-08 Thread Ghislain Lartigue
I guess you're perfectly right! I will try to test it tomorrow by putting a call system("wait(X)) before the barrier! Thanks, Ghislain. PS: if anyone has more information about the implementation of the MPI_IRECV() procedure, I would be glad to learn more about it! On 8 Sept. 2011 at 17:35, Eug

Re: [OMPI users] Problem with MPI_BARRIER

2011-09-08 Thread Eugene Loh
I should know OMPI better than I do, but generally, when you make an MPI call, you could be diving into all kinds of other stuff. E.g., with non-blocking point-to-point operations, a message might make progress during another MPI call. E.g., MPI_Irecv(recv_req) MPI_Isend(send_req) MPI_Wait(s

Re: [OMPI users] Problem with MPI_BARRIER

2011-09-08 Thread Teng Ma
If barrier/time/barrier/time solves your problem for each measurement, it means the computation above/below your barrier is not well "synchronized": its overhead differs from process to process. On the 2nd/3rd/... round, the times at which processes enter the barrier vary too much, perhaps ranging over [1, 1400]. This Barrier bec

Re: [OMPI users] Problem with MPI_BARRIER

2011-09-08 Thread Ghislain Lartigue
This behavior happens at every call (first and following) Here is my code (simplified): start_time = MPI_Wtime() call mpi_ext_barrier() new_time = MPI_Wtime()-start_time write(local_time,'(F9.1)') new_time*1.0e9_WP/(36.0_WP*36.0_WP

Re: [OMPI users] Problem with MPI_BARRIER

2011-09-08 Thread Teng Ma
do barrier/time/barrier/time and run your code again. Teng > I will check that, but as I said in first email, this strange behaviour > happens only in one place in my code. > I have the same time/barrier/time procedure in other places (in the same > code) and it works perfectly. > > At one place

Re: [OMPI users] Problem with MPI_BARRIER

2011-09-08 Thread Eugene Loh
On 9/8/2011 7:42 AM, Ghislain Lartigue wrote: I will check that, but as I said in first email, this strange behaviour happens only in one place in my code. Is the strange behavior on the first time, or much later on? (You seem to imply later on, but I thought I'd ask.) I agree the behavior i

Re: [OMPI users] Problem with MPI_BARRIER

2011-09-08 Thread Ghislain Lartigue
And to pin things down, the units I use are not the direct result of MPI_Wtime(): new_time = (MPI_Wtime()-start_time)*1e9/(36^3) This means that you should multiply these times by ~20'000 to have ticks. On 8 Sept. 2011 at 16:42, Ghislain Lartigue wrote: > I will check that, but as I said in first
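
The scaling in the message above can be checked with a short sketch. This is plain Python arithmetic, not MPI code; the names `SCALE`, `reported` and `to_seconds` are hypothetical, and the snippet does not say exactly what "ticks" refers to, so the sketch only shows how the ~20'000 figure follows from 1e9/(36^3).

```python
# Factor applied to the MPI_Wtime() difference in the formula above.
SCALE = 1e9 / 36**3  # roughly 21433, i.e. the "~20'000" in the message


def reported(elapsed_seconds):
    """Value printed by the formula for a given elapsed time in seconds."""
    return elapsed_seconds * SCALE


def to_seconds(reported_value):
    """Invert the formula to recover the elapsed time in seconds."""
    return reported_value / SCALE


print(round(SCALE))
```

Under this reading, dividing a printed value by `SCALE` gives back the raw MPI_Wtime() difference in seconds.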

Re: [OMPI users] Problem with MPI_BARRIER

2011-09-08 Thread Ghislain Lartigue
I will check that, but as I said in first email, this strange behaviour happens only in one place in my code. I have the same time/barrier/time procedure in other places (in the same code) and it works perfectly. At one place I have the following output (sorted) <00>(0) CAST GHOST DATA1 LOOP 1 b

Re: [OMPI users] Problem with MPI_BARRIER

2011-09-08 Thread Eugene Loh
I agree sentimentally with Ghislain. The time spent in a barrier should conceptually be some wait time, which can be very long (possibly on the order of milliseconds or even seconds), and the time to execute the barrier operations, which should essentially be "instantaneous" on some time scal

Re: [OMPI users] Problem with MPI_BARRIER

2011-09-08 Thread Teng Ma
You'd better check process-core binding in your case. It looks to me like P0 and P1 are on the same node and P2 is on another node, which makes the ack to P0/P1 go through shared memory and the ack to P2 go through the network. 1000x is very possible. sm latency can be about 0.03 microsec. ethernet latency is about 20-30 m

Re: [OMPI users] Problem with MPI_BARRIER

2011-09-08 Thread Jai Dayal
what tick value are you using (i.e., what units are you using?) On Thu, Sep 8, 2011 at 10:25 AM, Ghislain Lartigue < ghislain.larti...@coria.fr> wrote: > Thanks, > > I understand this but the delays that I measure are huge compared to a > classical ack procedure... (1000x more) > And this is repe

Re: [OMPI users] Problem with MPI_BARRIER

2011-09-08 Thread Ghislain Lartigue
Thanks, I understand this but the delays that I measure are huge compared to a classical ack procedure... (1000x more) And this is repeatable: as far as I understand it, this shows that the network is not involved. Ghislain. On 8 Sept. 2011 at 16:16, Teng Ma wrote: > I guess you forgot to

Re: [OMPI users] Problem with MPI_BARRIER

2011-09-08 Thread Teng Ma
I guess you forgot to count the "leaving time" (fan-out). When everyone has hit the barrier, each process still needs an "ack" to leave. And remember that in most cases, the leader process will send out the "acks" sequentially. It's very possible: P0 barrier time = 29 + send/recv ack 0 P1 barrier time = 14 + send ack 0

Re: [OMPI users] Problem with MPI_BARRIER

2011-09-08 Thread Rolf Riesen
On Thu Sep 8, 2011 15:41:57, Ghislain Lartigue wrote: > Ghislain These "times" have no units, it's just an example... > Ghislain Whatever units are used, at least one process should spend a very small amount of time in the barrier (compared to the other processes) and this is not what I see in

Re: [OMPI users] Problem with MPI_BARRIER

2011-09-08 Thread Ghislain Lartigue
These "times" have no units, it's just an example... Whatever units are used, at least one process should spend a very small amount of time in the barrier (compared to the other processes) and this is not what I see in my code. The network is supposed to be excellent: my machine is #9 in the top500 su

Re: [OMPI users] Problem with MPI_BARRIER

2011-09-08 Thread Jeff Squyres
On Sep 8, 2011, at 9:17 AM, Ghislain Lartigue wrote: > Example with 3 processes: > > P0 hits barrier at t=12 > P1 hits barrier at t=27 > P2 hits barrier at t=41 What is the unit of time here, and how well are these times synchronized? > In this situation: > P0 waits 41-12 = 29 > P1 waits 41-27

Re: [OMPI users] Problem with MPI_BARRIER

2011-09-08 Thread Ghislain Lartigue
This problem has nothing to do with stdout... Example with 3 processes: P0 hits barrier at t=12 P1 hits barrier at t=27 P2 hits barrier at t=41 In this situation: P0 waits 41-12 = 29 P1 waits 41-27 = 14 P2 waits 41-41 = 00 So I should see something like (no ordering is expected): barrier_time =
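
The arithmetic in this example can be reproduced with a short sketch. It assumes an idealized barrier in which every process is released at the instant the last one arrives, with no messaging cost; the variable names are illustrative and the times carry no units, as in the message.

```python
# Arrival times from the three-process example (no units given in the thread).
arrivals = {"P0": 12, "P1": 27, "P2": 41}

# Idealized barrier: everyone is released when the last process arrives.
release = max(arrivals.values())

# Ideal wait = release time minus the process's own arrival time.
waits = {rank: release - t for rank, t in arrivals.items()}

for rank, wait in waits.items():
    print(f"{rank} waits {wait}")
```

Under this idealization the last process to arrive waits 0, which is the expectation the poster compares the measurements against.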

Re: [OMPI users] Problem with MPI_BARRIER

2011-09-08 Thread Jeff Squyres
The order in which you see stdout printed from mpirun does not necessarily reflect the order in which things were actually printed. Remember that the stdout from each MPI process needs to flow through at least 3 processes and potentially across the network before it is actually displayed on mpirun

Re: [OMPI users] Problem with MPI_BARRIER

2011-09-08 Thread Ghislain Lartigue
Thank you for this explanation, but indeed this confirms that the LAST process that hits the barrier should go through nearly instantaneously (except for the broadcast time for the acknowledgment signal). And this is not what happens in my code: EVERY process waits for a very long time before go

Re: [OMPI users] Problem with MPI_BARRIER

2011-09-08 Thread Jeff Squyres
Order in which processes hit the barrier is only one factor in the time it takes for that process to finish the barrier. An easy way to think of a barrier implementation is a "fan in/fan out" model. When each nonzero rank process calls MPI_BARRIER, it sends a message saying "I have hit the bar
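
A minimal sketch of the "fan in/fan out" model, written as a plain Python simulation rather than real MPI code. The function name `barrier_times`, the `latency` parameter, and the assumption that rank 0 releases the other ranks one after another are illustrative only; they are not taken from Open MPI's implementation.

```python
def barrier_times(arrivals, latency=1.0):
    """Time each rank spends in a simplified fan-in/fan-out barrier.

    arrivals[i] is the time at which rank i enters the barrier.
    latency is a hypothetical per-message cost, not a measured value.
    """
    # Fan in: every nonzero rank sends one "I have arrived" message to rank 0.
    notified = [t if rank == 0 else t + latency
                for rank, t in enumerate(arrivals)]
    all_in = max(notified)  # moment rank 0 knows everyone has arrived

    # Fan out (assumed): rank 0 releases ranks 1, 2, ... one after another,
    # so rank r receives its release r message-times after all_in.
    leave = [all_in + rank * latency for rank in range(len(arrivals))]

    return [out - t for out, t in zip(leave, arrivals)]


# Arrival times 12, 27 and 41 are the ones used elsewhere in this thread.
print(barrier_times([12, 27, 41]))  # [30.0, 16.0, 3.0]
```

Even in this toy model the last rank to arrive does not leave instantly: it still pays for its own notification and for its place in the release order.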