Re: [OMPI users] CPU burning in Wait state
On Sep 2, 2008, at 7:25 PM, Vincent Rotival wrote:

> I think I already read some comments on this issue, but I'd like to know whether the latest versions of Open MPI have managed to solve it. I am now running 1.2.5. If I run an MPI program with synchronization routines (e.g. MPI_Barrier, MPI_Bcast, ...), all threads waiting for data are still burning CPU. On the other hand, when using non-blocking receives, all threads waiting for data are not consuming any CPU. Would there be a possibility to use MPI_Bcast without burning CPU power?

I'm afraid not at this time. We've talked about adding a blocking mode for progress, but it hasn't happened yet (and is very unlikely to happen for the v1.3 series).

--
Jeff Squyres
Cisco Systems
Re: [OMPI users] CPU burning in Wait state
Jeff Squyres wrote:

> I'm afraid not at this time. We've talked about adding a blocking mode for progress, but it hasn't happened yet (and is very unlikely to happen for the v1.3 series).

I'd like to understand this issue better. What about the variable mpi_yield_when_idle? Is the point that this variable will cause a polling process to yield, but if there is no one to yield to then the process resumes burning CPU? If so, I can imagine this solution being sufficient in some cases but not in others.

Also, Vincent, what do you mean by waiting threads not consuming any CPU for non-blocking receives? In what state are these threads? Are they in an MPI call (like MPI_Wait)? Or have they returned from an MPI call (like MPI_Irecv), so that the user application can park these threads to the side?
Re: [OMPI users] CPU burning in Wait state
Eugene Loh wrote:

> Also, Vincent, what do you mean by waiting threads not consuming any CPU for non-blocking receives? In what state are these threads?

Dear Eugene,

The solution I retained was for the main thread to Isend data separately to each of the other threads, which use an Irecv plus a loop on MPI_Test to check for completion of the Irecv. It might be dirty, but it works much better than using Bcast.

Cheers,
Vincent
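For readers following the thread, here is a minimal sketch of the Isend / Irecv-plus-MPI_Test pattern Vincent describes. The one-second sleep between tests and all names are assumptions for illustration only, not his actual code (sleep is a common compiler extension, as in his later example):

program isend_irecv_sketch
  use mpi
  implicit none

  integer :: rank, wsize, ierr, i, req, data
  integer, allocatable :: reqs(:)
  logical :: done

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, wsize, ierr)

  if (rank == 0) then
     data = 10
     ! One Isend per destination rank, then wait for all of them.
     allocate(reqs(wsize-1))
     do i = 1, wsize - 1
        call MPI_Isend(data, 1, MPI_INTEGER, i, 0, MPI_COMM_WORLD, reqs(i), ierr)
     end do
     call MPI_Waitall(wsize-1, reqs, MPI_STATUSES_IGNORE, ierr)
  else
     ! Post the receive, then test gently instead of blocking:
     ! sleep between tests so the CPU is released while waiting.
     call MPI_Irecv(data, 1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, req, ierr)
     done = .false.
     do while (.not. done)
        call MPI_Test(req, done, MPI_STATUS_IGNORE, ierr)
        if (.not. done) call sleep(1)
     end do
  end if

  print *, "rank", rank, "data", data
  call MPI_Finalize(ierr)
end program isend_irecv_sketch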
Re: [OMPI users] CPU burning in Wait state
Vincent Rotival wrote:

> The solution I retained was for the main thread to Isend data separately to each of the other threads, which use an Irecv plus a loop on MPI_Test to check for completion of the Irecv. It might be dirty, but it works much better than using Bcast.

Thanks for the clarification.

But this strikes me more as a question about the MPI standard than about the Open MPI implementation. That is, what you really want is for the MPI API to support a non-blocking form of collectives. You want control to return to the user program before the barrier/bcast/etc. operation has completed. That's an API change.
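To make concrete what such a non-blocking collective would look like: nothing like this existed in the MPI standard or in Open MPI when this thread was written; the MPI_Ibcast routine below was only standardized years later, in MPI-3, and is shown purely as a sketch of the API shape Eugene is describing (names and the sleep interval are illustrative assumptions):

program ibcast_sketch
  use mpi
  implicit none

  integer :: rank, ierr, req, data
  logical :: done

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  if (rank == 0) data = 10

  ! The broadcast is started, but control returns to the caller immediately.
  call MPI_Ibcast(data, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, req, ierr)

  ! The caller decides how to wait: do other work, test occasionally,
  ! or sleep between tests to release the CPU.
  done = .false.
  do while (.not. done)
     call MPI_Test(req, done, MPI_STATUS_IGNORE, ierr)
     if (.not. done) call sleep(1)
  end do

  print *, "rank", rank, "data", data
  call MPI_Finalize(ierr)
end program ibcast_sketch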
Re: [OMPI users] CPU burning in Wait state
Eugene,

No, what I'd like is that when doing something like

  call mpi_bcast(data, 1, MPI_INTEGER, 0, ...)

the program continues only AFTER the Bcast is completed (so no control is returned to the user), but while the threads with rank > 0 are waiting in the Bcast they are not taking CPU resources.

I hope it is clearer now; I apologize for not being clear in the first place.

Vincent
Re: [OMPI users] CPU burning in Wait state
On Sep 3, 2008, at 6:11 PM, Vincent Rotival wrote:

> ... the program continues only AFTER the Bcast is completed (so no control is returned to the user), but while the threads with rank > 0 are waiting in the Bcast they are not taking CPU resources.

Threads with rank > 0? Now, this scares me!

If all your threads are going into the bcast, then I guess the application is not correct from the MPI standard perspective (i.e., on each communicator there is only one collective at any given moment). In MPI, each process (and not each thread) has a rank, and each process exists in each communicator only once. In other words, as each collective is bound to a specific communicator, on each of your processes only one thread should go into the MPI_Bcast if you want only ONE collective.

george.
Re: [OMPI users] CPU burning in Wait state
OK, let's take the simple example here; I might have used the wrong terms, and I apologize for it. While the rank 0 process is sleeping, the other ones are in the Bcast waiting for data:

program test
  use mpi
  implicit none

  integer :: mpi_wsize, mpi_rank, mpi_err
  integer :: data

  call mpi_init(mpi_err)
  call mpi_comm_size(MPI_COMM_WORLD, mpi_wsize, mpi_err)
  call mpi_comm_rank(MPI_COMM_WORLD, mpi_rank, mpi_err)

  if (mpi_rank.eq.0) then
     call sleep(100)
     data = 10
  end if

  call mpi_bcast(data, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, mpi_err)

  print *, "Done in #", mpi_rank, " => data=", data

end program test
Re: [OMPI users] CPU burning in Wait state
This program is 100% correct from the MPI perspective. However, in Open MPI (and, I think, in most other MPIs) a collective communication is something that will drain most of the resources, similar to all blocking functions.

Now I will answer your original post. Using non-blocking communications in this particular case will give you a benefit, as the data involved in the communications is small enough to achieve a perfect overlap. If you try to do exactly the same thing with larger data, using non-blocking communications will negatively impact performance, as MPI is not supposed to communicate when the user application is not in an MPI call.

george.
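A hedged sketch of what George's caveat means in practice for a large transfer: the receiver has to keep re-entering the library (here via MPI_Test calls sprinkled through its compute loop) so the pending Irecv can make progress. The message size and the per-iteration work are made-up placeholders, not anything from this thread:

program overlap_sketch
  use mpi
  implicit none

  integer, parameter :: n = 10000000        ! placeholder "large" message
  integer :: rank, ierr, req, i
  logical :: done
  real(8), allocatable :: buf(:)
  real(8) :: acc

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  allocate(buf(n))

  if (rank == 0) then
     buf = 1.0d0
     call MPI_Send(buf, n, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, ierr)
  else if (rank == 1) then
     call MPI_Irecv(buf, n, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, req, ierr)
     acc = 0.0d0
     done = .false.
     do i = 1, 1000                         ! placeholder compute loop
        acc = acc + sqrt(dble(i))           ! independent work
        ! Re-enter MPI regularly; without such calls the transfer
        ! makes little progress until the final Wait.
        if (.not. done) call MPI_Test(req, done, MPI_STATUS_IGNORE, ierr)
     end do
     if (.not. done) call MPI_Wait(req, MPI_STATUS_IGNORE, ierr)
     print *, "rank 1: buf(1) =", buf(1), " work =", acc
  end if

  call MPI_Finalize(ierr)
end program overlap_sketch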
Re: [OMPI users] CPU burning in Wait state
Vincent

1) Assume you are running an MPI program which has 16 tasks in MPI_COMM_WORLD, you have 16 dedicated CPUs, and each task is single threaded. (A task is a distinct process; a process can contain one or more threads.) This is the most common traditional model. In this model, when a task makes a blocking call, the CPU is used to poll the communication layer. With only one thread per task, there is no way the CPU can be given other useful work, because the only thread is in the MPI_Bcast and not available to compute. With nothing else for the CPU to do anyway, it may as well poll, because that is likely to complete the blocking operation in the shortest time. Polling is the right choice. You should not worry that the CPU is being "burned". It will not wear out.

2) Now assume you have the same number of tasks and CPUs, but you have provided a compute thread and a communication thread in each task. At the moment you make an MPI_Bcast call on each task's communication thread, you have unfinished computation that the CPUs could process on the compute threads. In this case you want the CPU to be released by the blocked MPI_Bcast so it can be used by the compute thread. The MPI_Bcast may take longer to complete because it is not burning the CPU, but if useful computation is going forward you come out ahead. A non-polling mode for the blocking MPI_Bcast is the better option.

3) Take a third case - the CPUs are not dedicated to your MPI job. You have only one thread per task, but when that thread is blocked in an MPI_Bcast you want other processes to be able to run. This is not a common situation in production environments, but it may be common in learning or development situations. Perhaps your MPI homework problem is running at the same time someone else is trying to compile theirs on the same nodes. In this case you really do not need the MPI_Bcast to finish in the shortest possible time, and you do want the people who share the node with you to quit complaining. Again, a non-polling mode that gives up the CPU and lets your neighbor's compilation run is best.

Which of these is closest to your situation? If it is situation 1, why would you care that the CPU is burning? If it is situation 2 or 3, then you do have reason to care.

Dick

Dick Treumann - MPI Team
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846      Fax (845) 433-8363
Re: [OMPI users] CPU burning in Wait state
I hope the following helps, but maybe I'm just repeating myself and Dick.

Let's say you're stuck in an MPI_Recv, MPI_Bcast, or MPI_Barrier call waiting on someone else. You want to free up the CPU for more productive purposes. There are basically two cases:

1) If you want to free the CPU up for the calling thread, the main trick is returning program control to the caller. This requires a non-blocking MPI call. There is such a thing for MPI_Recv (it's MPI_Irecv, and you know how to use it), but no such thing for MPI_Bcast or MPI_Barrier. Anyhow, given a non-blocking call, you can return control to the caller, who can do productive work while occasionally testing for completion of the original operation.

2) If you want to free the CPU up for anyone else, what you want is for the MPI implementation not to poll hard while it's waiting. You can do that in Open MPI with the "mpi_yield_when_idle=1" variable. E.g.,

  % setenv OMPI_MCA_mpi_yield_when_idle 1
  % mpirun a.out

or

  % mpirun --mca mpi_yield_when_idle 1 a.out

I'm not sure about all systems, but I think yield might sometimes be observable only if there is someone to yield to. It's like driving into a traffic circle: you're supposed to yield to cars already in the circle, but this makes a difference only if there is someone in the circle! Similarly, if you look at whether Open MPI is polling hard, you might see that it is, indeed, polling hard even if you turn yield on. The real test is to have another process compete for the same CPU. You should see the MPI process and the competing process share the CPU in the default case, but the competing process win the CPU when yield is turned on. I tried such a test on my system and confirmed that Open MPI yield does "work".

I hope that helps.
Re: [OMPI users] CPU burning in Wait state
As usual, Dick is much more eloquent than me. :-)

He also correctly pointed out to me in an off-list mail that in my first reply I casually used the internal term "blocking progress" and probably sowed some of the initial seeds of confusion in this thread (because "blocking" has a specific meaning in MPI parlance). Sorry about that.

What I should have said is that we have on our to-do list to effect a non-polling model of making message passing progress. As has been stated several times in this thread, OMPI currently polls for message passing progress. While you're in MPI_BCAST, it's quite possible/likely that OMPI will poll hard until the BCAST is done.

It is possible that a future version of OMPI will use a hybrid polling+non-polling approach for progress, such that if you call MPI_BCAST, we'll poll for a while. And if nothing "interesting" happens after a while (i.e., the BCAST hasn't finished and nothing else seems to be happening), we'll allow OMPI's internal progression engine to block/go to sleep until something interesting happens. We casually refer to this as "blocking progress" in OMPI developer circles, but we mean it in a very different way than the traditional "blocking" meaning for MPI communication.

Again, sorry about the confusion -- hopefully all the followups in this thread cleared up the issue.