Okay - thx! I'll install in trunk and schedule for 1.5
On Aug 22, 2011, at 7:20 AM, pascal.dev...@bull.net wrote:

> users-boun...@open-mpi.org wrote on 18/08/2011 14:41:25:
>
>> From: Ralph Castain <r...@open-mpi.org>
>> To: Open MPI Users <us...@open-mpi.org>
>> Date: 18/08/2011 14:45
>> Subject: Re: [OMPI users] Bindings not detected with slurm (srun)
>> Sent by: users-boun...@open-mpi.org
>>
>> Afraid I am confused. I assume this refers to the trunk, yes?
>
> I work with v1.5.
>
>> I also assume you are talking about launching an application
>> directly from srun as opposed to using mpirun - yes?
>
> Yes.
>
>> In that case, I fail to understand what difference it makes
>> regarding this proposed change. The application process is being
>> directly bound by slurm, so what paffinity thinks is irrelevant,
>> except perhaps for some debugging I suppose. Is that what you are
>> concerned about?
>
> I have a framework that has to check whether the processes are bound.
> This framework uses the macro OPAL_PAFFINITY_PROCESS_IS_BOUND and
> really needs all processes to be bound.
>
> That works well except when I use srun with slurm configured to bind
> each single rank to a singleton cpuset.
>
> For example, I use nodes with 8 sockets of 4 cores. The srun command
> generates 32 cpusets (one for each core) and binds the 32 processes,
> one on each cpuset. The macro then returns *bound=false, so my
> framework considers that the processes are not bound and doesn't do
> its job correctly.
>
> The patch modifies the macro to return *bound=true when a single
> process is bound to a cpuset of one core.
>
>> I'd just like to know what problem is actually being solved here. I
>> agree that, if there is only one processor in a system, you are
>> effectively "bound".
>>
>> On Aug 18, 2011, at 2:25 AM, pascal.dev...@bull.net wrote:
>>
>>> Hi all,
>>>
>>> When slurm is configured with the following parameters
>>>     TaskPlugin=task/affinity
>>>     TaskPluginParam=Cpusets
>>> srun binds the processes by placing them into different cpusets,
>>> each containing a single core.
>>>
>>> e.g. "srun -N 2 -n 4" will create 2 cpusets in each of the two
>>> allocated nodes and place the four ranks there, each single rank
>>> with a singleton as a cpu constraint.
>>>
>>> The issue in that case is in the macro OPAL_PAFFINITY_PROCESS_IS_BOUND
>>> (in opal/mca/paffinity/paffinity.h):
>>> . opal_paffinity_base_get_processor_info() fills in num_processors
>>>   with 1 (this is the size of each cpuset)
>>> . num_bound is set to 1 too
>>> and this implies *bound=false.
>>>
>>> So, the binding is correctly done by slurm but not detected by MPI.
>>>
>>> To support the cpuset binding done by slurm, I propose the
>>> following patch:
>>>
>>> hg diff opal/mca/paffinity/paffinity.h
>>> diff -r 4d8c8a39b06f opal/mca/paffinity/paffinity.h
>>> --- a/opal/mca/paffinity/paffinity.h    Thu Apr 21 17:38:00 2011 +0200
>>> +++ b/opal/mca/paffinity/paffinity.h    Tue Jul 12 15:44:59 2011 +0200
>>> @@ -218,7 +218,8 @@
>>>              num_bound++;                                           \
>>>          }                                                          \
>>>      }                                                              \
>>> -    if (0 < num_bound && num_bound < num_processors) {             \
>>> +    if (0 < num_bound && ((num_processors == 1) ||                 \
>>> +                          (num_bound < num_processors))) {         \
>>>          *(bound) = true;                                           \
>>>      }                                                              \
>>>  }                                                                  \
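
For anyone following along, here is a minimal standalone sketch of the test
before and after the patch. It only models the two counts described above
(it is not the real OPAL paffinity code, and the two example scenarios in
main() are just illustrations of the cases discussed in this thread):

/* Sketch of the decision made by OPAL_PAFFINITY_PROCESS_IS_BOUND.
 *   num_processors : processors visible to the process (the cpuset size)
 *   num_bound      : processors present in its binding mask
 */
#include <stdbool.h>
#include <stdio.h>

/* test as it stands today */
static bool is_bound_old(int num_processors, int num_bound)
{
    return 0 < num_bound && num_bound < num_processors;
}

/* test with the proposed patch applied */
static bool is_bound_new(int num_processors, int num_bound)
{
    return 0 < num_bound &&
           (num_processors == 1 || num_bound < num_processors);
}

int main(void)
{
    /* rank bound to 1 core while the whole 32-core node is visible:
     * both versions report "bound" */
    printf("whole node visible: old=%d new=%d\n",
           is_bound_old(32, 1), is_bound_new(32, 1));

    /* srun cpuset case from the report: the rank sees only its own core,
     * so num_processors == num_bound == 1 and the old test says "not bound" */
    printf("one-core cpuset   : old=%d new=%d\n",
           is_bound_old(1, 1), is_bound_new(1, 1));
    return 0;
}

Compiled as plain C, the first line prints old=1 new=1 and the second prints
old=0 new=1, which is exactly the case the patch is meant to cover.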