Okay - thx! I'll install in trunk and schedule for 1.5
On Aug 22, 2011, at 7:20 AM, pascal.dev...@bull.net wrote:

> users-boun...@open-mpi.org wrote on 18/08/2011 14:41:25:
>
>> From: Ralph Castain <r...@open-mpi.org>
>> To: Open MPI Users <us...@open-mpi.org>
>> Date: 18/08/2011 14:45
>> Subject: Re: [OMPI users] Bindings not detected with slurm (srun)
>> Sent by: users-boun...@open-mpi.org
>>
>> Afraid I am confused. I assume this refers to the trunk, yes?
>
> I work with v1.5.
>
>> I also assume you are talking about launching an application
>> directly from srun as opposed to using mpirun - yes?
>
> Yes.
>
>> In that case, I fail to understand what difference it makes
>> regarding this proposed change. The application process is being
>> directly bound by slurm, so what paffinity thinks is irrelevant,
>> except perhaps for some debugging I suppose. Is that what you are
>> concerned about?
>
> I have a framework that has to check whether the processes are bound.
> This framework uses the macro OPAL_PAFFINITY_PROCESS_IS_BOUND and
> really needs all processes to be bound.
>
> That works well except when I use srun with slurm configured to bind
> each single rank to a singleton cpuset.
>
> For example, I use nodes with 8 sockets of 4 cores. The srun command
> generates 32 cpusets (one for each core) and binds the 32 processes,
> one on each cpuset. The macro then returns *bound=false, so my
> framework considers that the processes are not bound and doesn't do
> its job correctly.
>
> The patch modifies the macro to return *bound=true when a single
> process is bound to a cpuset of one core.
>
>> I'd just like to know what problem is actually being solved here. I
>> agree that, if there is only one processor in a system, you are
>> effectively "bound".
>>
>> On Aug 18, 2011, at 2:25 AM, pascal.dev...@bull.net wrote:
>>
>>> Hi all,
>>>
>>> When slurm is configured with the following parameters
>>>     TaskPlugin=task/affinity
>>>     TaskPluginParam=Cpusets
>>> srun binds the processes by placing them into different cpusets,
>>> each containing a single core.
>>>
>>> e.g. "srun -N 2 -n 4" will create 2 cpusets in each of the two
>>> allocated nodes and place the four ranks there, each single rank
>>> with a singleton as a cpu constraint.
>>>
>>> The issue in that case is in the macro OPAL_PAFFINITY_PROCESS_IS_BOUND
>>> (in opal/mca/paffinity/paffinity.h):
>>> . opal_paffinity_base_get_processor_info() fills in num_processors
>>>   with 1 (this is the size of each cpuset)
>>> . num_bound is set to 1 too
>>> and this implies *bound=false.
>>>
>>> So, the binding is correctly done by slurm but not detected by MPI.
>>>
>>> To support the cpuset binding done by slurm, I propose the
>>> following patch:
>>>
>>> hg diff opal/mca/paffinity/paffinity.h
>>> diff -r 4d8c8a39b06f opal/mca/paffinity/paffinity.h
>>> --- a/opal/mca/paffinity/paffinity.h    Thu Apr 21 17:38:00 2011 +0200
>>> +++ b/opal/mca/paffinity/paffinity.h    Tue Jul 12 15:44:59 2011 +0200
>>> @@ -218,7 +218,8 @@
>>>              num_bound++;                                           \
>>>          }                                                          \
>>>      }                                                              \
>>> -    if (0 < num_bound && num_bound < num_processors) {             \
>>> +    if (0 < num_bound && ((num_processors == 1) ||                 \
>>> +                          (num_bound < num_processors))) {         \
>>>          *(bound) = true;                                           \
>>>      }                                                              \
>>>  }                                                                  \
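
For anyone following along, here is a minimal standalone sketch of the test
before and after the patch. It only models the two counts described above
(it is not the real OPAL paffinity code, and the two example scenarios in
main() are just illustrations of the cases discussed in this thread):

/* Sketch of the decision made by OPAL_PAFFINITY_PROCESS_IS_BOUND.
 *   num_processors : processors visible to the process (the cpuset size)
 *   num_bound      : processors present in its binding mask
 */
#include <stdbool.h>
#include <stdio.h>

/* test as it stands today */
static bool is_bound_old(int num_processors, int num_bound)
{
    return 0 < num_bound && num_bound < num_processors;
}

/* test with the proposed patch applied */
static bool is_bound_new(int num_processors, int num_bound)
{
    return 0 < num_bound &&
           (num_processors == 1 || num_bound < num_processors);
}

int main(void)
{
    /* rank bound to 1 core while the whole 32-core node is visible:
     * both versions report "bound" */
    printf("whole node visible: old=%d new=%d\n",
           is_bound_old(32, 1), is_bound_new(32, 1));

    /* srun cpuset case from the report: the rank sees only its own core,
     * so num_processors == num_bound == 1 and the old test says "not bound" */
    printf("one-core cpuset   : old=%d new=%d\n",
           is_bound_old(1, 1), is_bound_new(1, 1));
    return 0;
}

Compiled as plain C, the first line prints old=1 new=1 and the second prints
old=0 new=1, which is exactly the case the patch is meant to cover.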