Re: [OMPI users] Mixing Linux's CPU-shielding with mpirun's bind-to-core

Jeff Squyres (jsquyres) Tue, 20 Aug 2013 13:18:29 -0400 (EDT)

I know I'm late to this conversation, but I was on vacation last week.  Some 
random points:


1. If you use OMPI's --bind-to-core option and then re-bind yourself to some 
other core, then all the memory affinity that MPI set up during MPI_Init() will 
be "wrong" (possibly on a remote NUMA node).  I would advise against doing this.

2. Instead of #1, as Ralph stated, if you're going to do your own process 
affinity, then don't use OMPI's --bind-to-core (or any --bind-to-* option).  
Then MPI won't set up any affinity stuff, and you're good.

3. Rather than setting up cpu shielding, you can just use simple API calls or 
scripting calls to bind each MPI process to wherever you want.  For example:

$ mpirun --host a,b -np 4 my_binding_script.sh my_mpi_app

Where my_binding_script.sh simply invokes a tool like hwloc-bind to bind 
yourself to whatever socket/core combination you want, and then invokes 
my_mpi_app (i.e., the real MPI application).  For example:

$ cat my_binding_script.sh
#!/bin/sh
# Bind to socket 1, core <local rank>, then run the real MPI app with all its args.
exec hwloc-bind socket:1.core:$OMPI_COMM_WORLD_LOCAL_RANK -- "$@"

Where $OMPI_COMM_WORLD_LOCAL_RANK is an environment variable that mpirun will 
put in the environment of the processes that it launches.  Each process will 
have $OMPI_COMM_WORLD_LOCAL_RANK set to a value in the range of [0,N), where N 
is the number of processes on that server.  In the above example of launching 4 
processes (2 on each server a and b), each of the 4 processes would get an 
$OMPI_COMM_WORLD_LOCAL_RANK value of 0 or 1.
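
The wrapper above can be made a little more defensive.  What follows is only a 
sketch, not something from the Open MPI or hwloc distributions: BIND_SOCKET and 
DRY_RUN are hypothetical knobs invented for this example, and the location 
string uses hwloc's type:index form (e.g., socket:1.core:0).  It falls back to 
running the application unbound when $OMPI_COMM_WORLD_LOCAL_RANK is not set 
(i.e., when it was not launched by mpirun), and can print the command instead 
of executing it:

```shell
#!/bin/sh
# Sketch of a more defensive my_binding_script.sh.
# BIND_SOCKET and DRY_RUN are hypothetical knobs for this example only.

# Print an hwloc location string such as "socket:1.core:0".
binding_location() {
    printf 'socket:%s.core:%s\n' "$1" "$2"
}

# Print the command line that would launch the application.
# With no local rank in the environment, the command is left unbound.
launch_command() {
    if [ -n "${OMPI_COMM_WORLD_LOCAL_RANK:-}" ]; then
        loc=$(binding_location "${BIND_SOCKET:-1}" "$OMPI_COMM_WORLD_LOCAL_RANK")
        echo "hwloc-bind $loc -- $*"
    else
        echo "$*"
    fi
}

# Entry point: print the command when DRY_RUN=1, otherwise exec it.
if [ "$#" -gt 0 ]; then
    if [ "${DRY_RUN:-0}" = 1 ]; then
        launch_command "$@"
    elif [ -n "${OMPI_COMM_WORLD_LOCAL_RANK:-}" ]; then
        loc=$(binding_location "${BIND_SOCKET:-1}" "$OMPI_COMM_WORLD_LOCAL_RANK")
        exec hwloc-bind "$loc" -- "$@"
    else
        exec "$@"
    fi
fi
```

Running it under mpirun with DRY_RUN=1 exported should show, per process, which 
core each local rank would be bound to before you commit to the real launch.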

If you don't know about hwloc, you should -- it's very, very helpful for all 
this kind of process affinity stuff.  See 
http://www.open-mpi.org/projects/hwloc/ (hwloc-bind is one of the tools in the 
hwloc suite).
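
Whichever way you bind (mpirun, a wrapper script, or not at all), it can be 
worth checking where each process actually ended up.  The following is a rough 
sketch that assumes Linux with /proc mounted; allowed_cpus is a hypothetical 
helper name, and it simply reads the Cpus_allowed_list field that the kernel 
reports in /proc/<pid>/status:

```shell
#!/bin/sh
# Sketch: report which CPUs a process may run on (Linux only, via /proc).
# allowed_cpus is a hypothetical helper name used for this example.

# Print the Cpus_allowed_list value for a PID (default: this shell),
# or "unknown" if the information is not available.
allowed_cpus() {
    pid=${1:-$$}
    list=$(awk '/^Cpus_allowed_list:/ { print $2 }' "/proc/$pid/status" 2>/dev/null)
    echo "${list:-unknown}"
}

# OMPI_COMM_WORLD_RANK is set by mpirun; "?" is shown when run standalone.
echo "rank ${OMPI_COMM_WORLD_RANK:-?} on $(uname -n): cpus $(allowed_cpus)"
```

Saved as, say, show_binding.sh (a made-up name) and launched in place of the 
MPI application, once with and once without a --bind-to-* option, it should 
show whether mpirun restricted each process to a single core or left it free 
to run anywhere.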


On Aug 18, 2013, at 7:01 PM, Siddhartha Jana <siddharthajan...@gmail.com> wrote:

> Noted. Thanks again
> -- Sid
> 
> 
> On 18 August 2013 18:40, Ralph Castain <r...@open-mpi.org> wrote:
> It only has to come after MPI_Init *if* you are telling mpirun to bind you as 
> well. Otherwise, you could just not tell mpirun to bind (it doesn't by 
> default) and then bind anywhere, anytime you like
> 
> 
> On Aug 18, 2013, at 3:24 PM, Siddhartha Jana <siddharthajan...@gmail.com> 
> wrote:
> 
>> 
>> A process can always change its binding by "re-binding" to wherever it wants 
>> after MPI_Init completes.
>> Noted. Thanks. I guess the important thing that I wanted to know was that 
>> the binding needs to happen *after* MPI_Init() completes. 
>> 
>> Thanks all
>> 
>> -- Siddhartha
>> 
>>  
>> 
>> 
>> On Aug 18, 2013, at 9:38 AM, Siddhartha Jana <siddharthajan...@gmail.com> 
>> wrote:
>> 
>>> Firstly, I would like my program to dynamically assign itself to whichever 
>>> core it pleases and remain bound to it until it later reschedules 
>>> itself.
>>> 
>>> Ralph Castain wrote:
>>> >> "If you just want mpirun to respect an external cpuset limitation, it 
>>> >> already does so when binding - it will bind within the external 
>>> >> limitation"
>>> 
>>> In my case, the limitation is enforced "internally", by the application 
>>> once it begins execution. I enforce this during program execution, after 
>>> mpirun has finished "binding within the external limitation". 
>>> 
>>> 
>>> Brice Goglin said:
>>> >>  "MPI can bind at two different times: inside mpirun after ssh before 
>>> >> running the actual program (this one would ignore your cpuset), later at 
>>> >> MPI_Init inside your program (this one will ignore your cpuset only if 
>>> >> you call MPI_Init before creating the cpuset)."
>>> 
>>> Noted. In that case, during program execution, whose binding is respected - 
>>> mpirun's or MPI_Init()'s? From the above, is my understanding correct that 
>>> MPI_Init() is responsible for a second round of binding processes to 
>>> cores, and can override what mpirun or the programmer had enforced before 
>>> its call (using hwloc/cpuset/sched_load_balance() and other compatible 
>>> cousins)? 
>>> 
>>> 
>>> --------------------------------------------
>>> If this is so, in my case the flow of events is thus:
>>> 
>>> 1. mpirun binds an MPI process which is yet to begin execution. So mpirun 
>>> says: "Bind to some core - A" (I don't use any hostfile/rankfile. but I do 
>>> use the --bind-to-core flag) 
>>> 
>>> 2. Process begins execution on core A
>>> 
>>> 3. I enforce: "Bind to core B". (We must remember, it is only at runtime 
>>> that I know what core I want to be bound to, and not while launching the 
>>> processes using mpirun.) So my process shifts over to core B
>>> 
>>> 4. MPI_Init() once again honors the rankfile mapping (if any; otherwise 
>>> the default policy) and rebinds my process to core A
>>> 
>>> 5. Process finishes execution and calls MPI_Finalize(), all the while on 
>>> core A
>>> 
>>> 6. mpirun exits
>>> --------------------------------------------
>>> 
>>> So if I place step 3 above after step 4, my request will hold for the rest 
>>> of the execution. Please do let me know if my understanding is correct.
>>> 
>>> Thanks for all the help
>>> 
>>> Sincerely,
>>> Siddhartha Jana
>>> HPCTools
>>> 
>>> On 18 August 2013 10:49, Ralph Castain <r...@open-mpi.org> wrote:
>>> If you require that a specific rank go to a specific core, then use the 
>>> rankfile mapper - you can see explanations on the syntax in "man mpirun"
>>> 
>>> If you just want mpirun to respect an external cpuset limitation, it 
>>> already does so when binding - it will bind within the external limitation
>>> 
>>> 
>>> On Aug 18, 2013, at 6:09 AM, Siddhartha Jana <siddharthajan...@gmail.com> 
>>> wrote:
>>> 
>>>> So my question really boils down to:
>>>> How does one ensure that mpirun launches the processes on the "specific" 
>>>> cores that they are expected to be bound to?
>>>> As I mentioned, if there were a way to specify the cores through the 
>>>> hostfile, this problem should be solved. 
>>>> 
>>>> Thanks for all the quick replies,
>>>> -- Sid
>>>> 
>>>> On 18 August 2013 09:04, Siddhartha Jana <siddharthajan...@gmail.com> 
>>>> wrote:
>>>> Thanks John. But I have an incredibly small system. 2 nodes - 16 cores 
>>>> each.
>>>> 2-4 MPI processes. :-)
>>>> 
>>>> On 18 August 2013 09:03, John Hearns <hear...@googlemail.com> wrote:
>>>> You really should install a job scheduler.
>>>> There are free versions.
>>>> 
>>>> I'm not sure about cpuset support in Gridengine. Anyone?
>>>> 
>>>> 
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
