Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread marcin.krotkiewski

Hi, Ralph,

I submit my slurm job as follows

salloc --ntasks=64 --mem-per-cpu=2G --time=1:0:0

Effectively, the allocated CPU cores are spread among many cluster 
nodes. SLURM uses cgroups to limit the CPU cores available for MPI 
processes running on a given cluster node. Compute nodes are 2-socket, 
8-core E5-2670 systems with HyperThreading on:


node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node distances:
node   0   1
  0:  10  21
  1:  21  10

I run MPI program with command

mpirun  --report-bindings --bind-to core -np 64 ./affinity

The program simply runs sched_getaffinity for each process and prints 
out the result.
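
(For reference, a minimal sketch of what such a test program can look
like - an illustration assuming the Linux sched_getaffinity(2)
interface, and not necessarily identical to the affinity.c I attach
later in this thread:)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, cpu;
    char host[256];
    cpu_set_t mask;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    gethostname(host, sizeof(host));

    /* ask the kernel which logical CPUs this process may run on */
    CPU_ZERO(&mask);
    sched_getaffinity(0, sizeof(mask), &mask);

    printf("rank %d @ %s ", rank, host);
    for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &mask))
            printf(" %d,", cpu);
    printf("\n");

    MPI_Finalize();
    return 0;
}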


---
TEST RUN 1
---
For this particular job the problem is more severe: openmpi fails to run 
at all with the following error:


--
Open MPI tried to bind a new process, but something went wrong.  The
process was killed without launching the target application.  Your job
will now abort.

  Local host:c6-6
  Application name:  ./affinity
  Error message: hwloc_set_cpubind returned "Error" for bitmap "8,24"
  Location:  odls_default_module.c:551
--

These are the SLURM environment variables:

SLURM_JOBID=12712225
SLURM_JOB_CPUS_PER_NODE='3(x2),2,1(x3),2(x2),1,3(x3),5,1,4,1,3,2,3,7,1,5,6,1'
SLURM_JOB_ID=12712225
SLURM_JOB_NODELIST='c6-[3,6-8,12,14,17,22-23],c8-[4,7,9,17,20,28],c15-[5,10,18,20,22-24,28],c16-11'
SLURM_JOB_NUM_NODES=24
SLURM_JOB_PARTITION=normal
SLURM_MEM_PER_CPU=2048
SLURM_NNODES=24
SLURM_NODELIST='c6-[3,6-8,12,14,17,22-23],c8-[4,7,9,17,20,28],c15-[5,10,18,20,22-24,28],c16-11'
SLURM_NODE_ALIASES='(null)'
SLURM_NPROCS=64
SLURM_NTASKS=64
SLURM_SUBMIT_DIR=/cluster/home/marcink
SLURM_SUBMIT_HOST=login-0-2.local
SLURM_TASKS_PER_NODE='3(x2),2,1(x3),2(x2),1,3(x3),5,1,4,1,3,2,3,7,1,5,6,1'

There are also a lot of warnings like

[compute-6-6.local:20158] MCW rank 4 is not bound (or bound to all 
available processors)



---
TEST RUN 2
---

In another allocation I got a different error

--
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to: CORE
   Node:c6-19
   #processes:  2
   #cpus:   1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--

and the allocation was the following

SLURM_JOBID=12712250
SLURM_JOB_CPUS_PER_NODE='3(x2),2,1,15,1,3,16,2,1,3(x2),2,5,4'
SLURM_JOB_ID=12712250
SLURM_JOB_NODELIST='c6-[3,6-8,12,14,17,19,22-23],c8-[4,7,9,17,28]'
SLURM_JOB_NUM_NODES=15
SLURM_JOB_PARTITION=normal
SLURM_MEM_PER_CPU=2048
SLURM_NNODES=15
SLURM_NODELIST='c6-[3,6-8,12,14,17,19,22-23],c8-[4,7,9,17,28]'
SLURM_NODE_ALIASES='(null)'
SLURM_NPROCS=64
SLURM_NTASKS=64
SLURM_SUBMIT_DIR=/cluster/home/marcink
SLURM_SUBMIT_HOST=login-0-2.local
SLURM_TASKS_PER_NODE='3(x2),2,1,15,1,3,16,2,1,3(x2),2,5,4'


If in this case I run on only 32 cores

mpirun  --report-bindings --bind-to core -np 32 ./affinity

the process starts, but I get the original binding problem:

[compute-6-8.local:31414] MCW rank 8 is not bound (or bound to all 
available processors)


Running with --hetero-nodes yields exactly the same results





Hope the above is useful. The binding problem under SLURM, with CPU 
cores spread over many nodes, seems to be very reproducible. OpenMPI 
actually dies very often with errors like the ones above. These tests 
were run with openmpi-1.8.8 and 1.10.0, both giving the same results.


One more suggestion. The warning message (MCW rank 8 is not bound...) is 
ONLY displayed when I use --report-bindings. It is never shown if I 
leave out this option, and although the binding is wrong, the user is not 
notified. I think it would be better to show this warning in all cases 
where binding fails.


Let me know if you need more information. I can help to debug this - it 
is a rather crucial issue.


Thanks!

Marcin






On 10/02/2015 11:49 PM, Ralph Castain wrote:

Can you please send me the allocation request you made (so I can see what you 
specified on the cmd line), and the mpirun cmd line?

Thanks
Ralph


On Oct 2, 2015, at 8:25 AM, Marcin Krotkiewski  
wrote:

Hi,

I fail to make OpenMPI bind to cores correctly when running from within 
SLURM-allocated CPU resources spread over a range of compute nodes in an 
otherwise homogeneous cluster. I have found this thread

http://www.open-mpi.org/community/lists/users/2014/06/24682.php

and did try to use what Ralph suggested there (--hetero-nodes), but it does not 
work (v. 1.10.0). When running with --report-bindings I get messages like

[compute-9-11.local:27571] MCW rank 10 is not bound (or bound to all available 
processors)

for all ranks outside 

[OMPI users] Open MPI v1.10.1rc1 release

2015-10-03 Thread Jeff Squyres (jsquyres)
Open MPI users --

We have just posted the first release candidate for the upcoming v1.10.1 bug fix 
release.  We'd appreciate any testing and/or feedback that you may have on this 
release candidate:

http://www.open-mpi.org/software/ompi/v1.10/

Thank you!

Changes since v1.10.0:

- Fix segv when invoking non-blocking reductions with a user-defined
  operation.  Thanks to Rupert Nash and Georg Geiser for identifying
  the issue.
- No longer probe for PCI topology on Solaris (unless running as root).
- Fix for Intel Parallel Studio 2016 ifort partial support of the
  !GCC$ pragma.  Thanks to Fabrice Roy for reporting the problem.
- Bunches of Coverity / static analysis fixes.
- Fixed ROMIO to look for lstat in .  Thanks to William
  Throwe for submitting the patch both upstream and to Open MPI.
- Fixed minor memory leak when attempting to open plugins.
- Fixed type in MPI_IBARRIER C prototype.  Thanks to Harald Servat for
  reporting the issue.
- Add missing man pages for MPI_WIN_CREATE_DYNAMIC, MPI_WIN_ATTACH,
  MPI_WIN_DETACH, MPI_WIN_ALLOCATE, MPI_WIN_ALLOCATE_SHARED.
- When mpirun-launching new applications, only close file descriptors
  that are actually open (resulting in a faster launch in some
  environments).
- Fix "test ==" issues in Open MPI's configure script.  Thank to Kevin
  Buckley for pointing out the issue.
- Fix performance issue in usnic BTL: ensure progress thread is
  throttled back to not aggressively steal CPU cycles.
- Fix cache line size detection on POWER architectures.
- Add missing #include in a few places.  Thanks to Orion Poplawski for
  supplying the patch.
- When OpenSHMEM building is disabled, no longer install its header
  files, help files, or man pages.
- Fix mpi_f08 implementations of MPI_COMM_SET_INFO, and profiling
  versions of MPI_BUFFER_DETACH, MPI_WIN_ALLOCATE,
  MPI_WIN_ALLOCATE_SHARED, MPI_WTICK, and MPI_WTIME.
- Add orte_rmaps_dist_device MCA param, allowing users to map near a
  specific device.
- Various updates/fixes to the openib BTL.
- Add missing defaults for the Mellanox ConnectX 3 card to the openib BTL.
- Minor bug fixes in the OFI MTL.
- Various updates to Mellanox's hcoll and FCA components.
- Add OpenSHMEM man pages.  Thanks to Tony Curtis for sharing the man
  pages files from openshmem.org.
- Add missing "const" attributes to MPI_COMPARE_AND_SWAP,
  MPI_FETCH_AND_OP, MPI_RACCUMULATE, and MPI_WIN_DETACH prototypes.
  Thanks to Michael Knobloch and Takahiro Kawashima for bringing this
  to our attention.
- Fix linking issues on some platforms (e.g., SLES 12).
- Fix hang on some corner cases when MPI applications abort.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread Ralph Castain
Thanks Marcin. Looking at this, I’m guessing that Slurm may be treating HTs as 
“cores” - i.e., as independent cpus. Any chance that is true?

I’m wondering because bind-to core will attempt to bind your proc to both HTs 
on the core. For some reason, we thought that 8,24 were HTs on the same core, 
which is why we tried to bind to that pair of HTs. We got an error because HT 
#24 was not allocated to us on node c6, but HT #8 was.
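
As a sanity check on one of those nodes, something like the following
standalone hwloc snippet (just an illustration, not Open MPI code) would
tell us whether PUs 8 and 24 really belong to the same core there:

#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* look up the two PUs by their OS indexes, as reported in the bitmap */
    hwloc_obj_t pu_a = hwloc_get_pu_obj_by_os_index(topo, 8);
    hwloc_obj_t pu_b = hwloc_get_pu_obj_by_os_index(topo, 24);
    if (pu_a && pu_b) {
        hwloc_obj_t core_a = hwloc_get_ancestor_obj_by_type(topo, HWLOC_OBJ_CORE, pu_a);
        hwloc_obj_t core_b = hwloc_get_ancestor_obj_by_type(topo, HWLOC_OBJ_CORE, pu_b);
        printf("PU 8 and PU 24 %s the same core\n",
               (core_a && core_a == core_b) ? "share" : "do NOT share");
    }

    hwloc_topology_destroy(topo);
    return 0;
}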


> On Oct 3, 2015, at 2:43 AM, marcin.krotkiewski  
> wrote:
> 
> Hi, Ralph,
> 
> I submit my slurm job as follows
> 
> salloc --ntasks=64 --mem-per-cpu=2G --time=1:0:0
> 
> Effectively, the allocated CPU cores are spread amount many cluster nodes. 
> SLURM uses cgroups to limit the CPU cores available for mpi processes running 
> on a given cluster node. Compute nodes are 2-socket, 8-core E5-2670 systems 
> with HyperThreading on
> 
> node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
> node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
> node distances:
> node   0   1
>  0:  10  21
>  1:  21  10
> 
> I run MPI program with command
> 
> mpirun  --report-bindings --bind-to core -np 64 ./affinity
> 
> The program simply runs sched_getaffinity for each process and prints out the 
> result.
> 
> ---
> TEST RUN 1
> ---
> For this particular job the problem is more severe: openmpi fails to run at 
> all with error
> 
> --
> Open MPI tried to bind a new process, but something went wrong.  The
> process was killed without launching the target application.  Your job
> will now abort.
> 
>  Local host:c6-6
>  Application name:  ./affinity
>  Error message: hwloc_set_cpubind returned "Error" for bitmap "8,24"
>  Location:  odls_default_module.c:551
> --
> 
> This is SLURM environment variables:
> 
> SLURM_JOBID=12712225
> SLURM_JOB_CPUS_PER_NODE='3(x2),2,1(x3),2(x2),1,3(x3),5,1,4,1,3,2,3,7,1,5,6,1'
> SLURM_JOB_ID=12712225
> SLURM_JOB_NODELIST='c6-[3,6-8,12,14,17,22-23],c8-[4,7,9,17,20,28],c15-[5,10,18,20,22-24,28],c16-11'
> SLURM_JOB_NUM_NODES=24
> SLURM_JOB_PARTITION=normal
> SLURM_MEM_PER_CPU=2048
> SLURM_NNODES=24
> SLURM_NODELIST='c6-[3,6-8,12,14,17,22-23],c8-[4,7,9,17,20,28],c15-[5,10,18,20,22-24,28],c16-11'
> SLURM_NODE_ALIASES='(null)'
> SLURM_NPROCS=64
> SLURM_NTASKS=64
> SLURM_SUBMIT_DIR=/cluster/home/marcink
> SLURM_SUBMIT_HOST=login-0-2.local
> SLURM_TASKS_PER_NODE='3(x2),2,1(x3),2(x2),1,3(x3),5,1,4,1,3,2,3,7,1,5,6,1'
> 
> There is also a lot of warnings like
> 
> [compute-6-6.local:20158] MCW rank 4 is not bound (or bound to all available 
> processors)
> 
> 
> ---
> TEST RUN 2
> ---
> 
> In another allocation I got a different error
> 
> --
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
> 
>   Bind to: CORE
>   Node:c6-19
>   #processes:  2
>   #cpus:   1
> 
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
> --
> 
> and the allocation was the following
> 
> SLURM_JOBID=12712250
> SLURM_JOB_CPUS_PER_NODE='3(x2),2,1,15,1,3,16,2,1,3(x2),2,5,4'
> SLURM_JOB_ID=12712250
> SLURM_JOB_NODELIST='c6-[3,6-8,12,14,17,19,22-23],c8-[4,7,9,17,28]'
> SLURM_JOB_NUM_NODES=15
> SLURM_JOB_PARTITION=normal
> SLURM_MEM_PER_CPU=2048
> SLURM_NNODES=15
> SLURM_NODELIST='c6-[3,6-8,12,14,17,19,22-23],c8-[4,7,9,17,28]'
> SLURM_NODE_ALIASES='(null)'
> SLURM_NPROCS=64
> SLURM_NTASKS=64
> SLURM_SUBMIT_DIR=/cluster/home/marcink
> SLURM_SUBMIT_HOST=login-0-2.local
> SLURM_TASKS_PER_NODE='3(x2),2,1,15,1,3,16,2,1,3(x2),2,5,4'
> 
> 
> If in this case I run on only 32 cores
> 
> mpirun  --report-bindings --bind-to core -np 32 ./affinity
> 
> the process starts, but I get the original binding problem:
> 
> [compute-6-8.local:31414] MCW rank 8 is not bound (or bound to all available 
> processors)
> 
> Running with --hetero-nodes yields exactly the same results
> 
> 
> 
> 
> 
> Hope the above is useful. The problem with binding under SLURM with CPU cores 
> spread over nodes seems to be very reproducible. It is actually very often 
> that OpenMPI dies with some error like above. These tests were run with 
> openmpi-1.8.8 and 1.10.0, both giving same results.
> 
> One more suggestion. The warning message (MCW rank 8 is not bound...) is ONLY 
> displayed when I use --report-bindings. It is never shown if I leave out this 
> option, and although the binding is wrong the user is not notified. I think 
> it would be better to show this warning in all cases binding fails.
> 
> Let me know if you need more information. I can help to debug this - it is a 
> rather crucial issue.
> 
> Thanks!
> 
> Marcin
> 
> 
> 
> 
> 
> 
> On 10/02/2015 11:49 PM, Ralph Ca

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread Gilles Gouaillardet
Marcin,

could you give v1.10.1rc1, which was released today, a try?
It fixes a bug where hwloc was trying to bind outside the cpuset.
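
(as an illustration of the general idea only - this is a standalone
sketch, not the actual ompi patch - the point is that a requested
bitmap has to be intersected with the cpuset the process is confined
to before calling hwloc_set_cpubind:)

#include <hwloc.h>

/* illustration only: clamp a desired binding to the currently allowed
 * cpuset (e.g. the slurm cgroup), so we never ask hwloc to bind outside it */
static int bind_within_cpuset(hwloc_topology_t topo, hwloc_const_bitmap_t wanted)
{
    int rc;
    hwloc_bitmap_t allowed = hwloc_bitmap_alloc();
    hwloc_bitmap_t target  = hwloc_bitmap_alloc();

    /* cpuset this process is currently confined to */
    hwloc_get_cpubind(topo, allowed, HWLOC_CPUBIND_PROCESS);

    hwloc_bitmap_and(target, wanted, allowed);
    if (hwloc_bitmap_iszero(target))
        rc = -1;   /* nothing usable: report an error instead of failing inside hwloc */
    else
        rc = hwloc_set_cpubind(topo, target, HWLOC_CPUBIND_PROCESS);

    hwloc_bitmap_free(allowed);
    hwloc_bitmap_free(target);
    return rc;
}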

Ralph and all,

imho, there are several issues here:
- if slurm allocates threads instead of cores, then the --oversubscribe
mpirun option could be mandatory
- with --oversubscribe --hetero-nodes, mpirun should not fail, and if it
still fails with v1.10.1rc1, I will ask for some more details in order to fix
ompi

Cheers,

Gilles

On Saturday, October 3, 2015, Ralph Castain  wrote:

> Thanks Marcin. Looking at this, I’m guessing that Slurm may be treating
> HTs as “cores” - i.e., as independent cpus. Any chance that is true?
>
> I’m wondering because bind-to core will attempt to bind your proc to both
> HTs on the core. For some reason, we thought that 8.24 were HTs on the same
> core, which is why we tried to bind to that pair of HTs. We got an error
> because HT #24 was not allocated to us on node c6, but HT #8 was.
>
>
> > On Oct 3, 2015, at 2:43 AM, marcin.krotkiewski <
> marcin.krotkiew...@gmail.com > wrote:
> >
> > Hi, Ralph,
> >
> > I submit my slurm job as follows
> >
> > salloc --ntasks=64 --mem-per-cpu=2G --time=1:0:0
> >
> > Effectively, the allocated CPU cores are spread amount many cluster
> nodes. SLURM uses cgroups to limit the CPU cores available for mpi
> processes running on a given cluster node. Compute nodes are 2-socket,
> 8-core E5-2670 systems with HyperThreading on
> >
> > node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
> > node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
> > node distances:
> > node   0   1
> >  0:  10  21
> >  1:  21  10
> >
> > I run MPI program with command
> >
> > mpirun  --report-bindings --bind-to core -np 64 ./affinity
> >
> > The program simply runs sched_getaffinity for each process and prints
> out the result.
> >
> > ---
> > TEST RUN 1
> > ---
> > For this particular job the problem is more severe: openmpi fails to run
> at all with error
> >
> >
> --
> > Open MPI tried to bind a new process, but something went wrong.  The
> > process was killed without launching the target application.  Your job
> > will now abort.
> >
> >  Local host:c6-6
> >  Application name:  ./affinity
> >  Error message: hwloc_set_cpubind returned "Error" for bitmap "8,24"
> >  Location:  odls_default_module.c:551
> >
> --
> >
> > This is SLURM environment variables:
> >
> > SLURM_JOBID=12712225
> >
> SLURM_JOB_CPUS_PER_NODE='3(x2),2,1(x3),2(x2),1,3(x3),5,1,4,1,3,2,3,7,1,5,6,1'
> > SLURM_JOB_ID=12712225
> >
> SLURM_JOB_NODELIST='c6-[3,6-8,12,14,17,22-23],c8-[4,7,9,17,20,28],c15-[5,10,18,20,22-24,28],c16-11'
> > SLURM_JOB_NUM_NODES=24
> > SLURM_JOB_PARTITION=normal
> > SLURM_MEM_PER_CPU=2048
> > SLURM_NNODES=24
> >
> SLURM_NODELIST='c6-[3,6-8,12,14,17,22-23],c8-[4,7,9,17,20,28],c15-[5,10,18,20,22-24,28],c16-11'
> > SLURM_NODE_ALIASES='(null)'
> > SLURM_NPROCS=64
> > SLURM_NTASKS=64
> > SLURM_SUBMIT_DIR=/cluster/home/marcink
> > SLURM_SUBMIT_HOST=login-0-2.local
> >
> SLURM_TASKS_PER_NODE='3(x2),2,1(x3),2(x2),1,3(x3),5,1,4,1,3,2,3,7,1,5,6,1'
> >
> > There is also a lot of warnings like
> >
> > [compute-6-6.local:20158] MCW rank 4 is not bound (or bound to all
> available processors)
> >
> >
> > ---
> > TEST RUN 2
> > ---
> >
> > In another allocation I got a different error
> >
> >
> --
> > A request was made to bind to that would result in binding more
> > processes than cpus on a resource:
> >
> >   Bind to: CORE
> >   Node:c6-19
> >   #processes:  2
> >   #cpus:   1
> >
> > You can override this protection by adding the "overload-allowed"
> > option to your binding directive.
> >
> --
> >
> > and the allocation was the following
> >
> > SLURM_JOBID=12712250
> > SLURM_JOB_CPUS_PER_NODE='3(x2),2,1,15,1,3,16,2,1,3(x2),2,5,4'
> > SLURM_JOB_ID=12712250
> > SLURM_JOB_NODELIST='c6-[3,6-8,12,14,17,19,22-23],c8-[4,7,9,17,28]'
> > SLURM_JOB_NUM_NODES=15
> > SLURM_JOB_PARTITION=normal
> > SLURM_MEM_PER_CPU=2048
> > SLURM_NNODES=15
> > SLURM_NODELIST='c6-[3,6-8,12,14,17,19,22-23],c8-[4,7,9,17,28]'
> > SLURM_NODE_ALIASES='(null)'
> > SLURM_NPROCS=64
> > SLURM_NTASKS=64
> > SLURM_SUBMIT_DIR=/cluster/home/marcink
> > SLURM_SUBMIT_HOST=login-0-2.local
> > SLURM_TASKS_PER_NODE='3(x2),2,1,15,1,3,16,2,1,3(x2),2,5,4'
> >
> >
> > If in this case I run on only 32 cores
> >
> > mpirun  --report-bindings --bind-to core -np 32 ./affinity
> >
> > the process starts, but I get the original binding problem:
> >
> > [compute-6-8.local:31414] MCW rank 8 is not bound (or bound to all
> available processors)
> >
> > Running with --hetero-nodes yields exactly the same results
> >
> >
> >
> >
> >
> > Hope the above is usef

Re: [OMPI users] [Open MPI Announce] Open MPI v1.10.1rc1 release

2015-10-03 Thread Dimitar Pashov
Hi, I have a pet bug causing silent data corruption here:
https://github.com/open-mpi/ompi/issues/965 
which seems to have a fix committed some time later. I've tested v1.10.1rc1 
now and it still has the issue. I hope the fix makes it into the release.

Cheers!

On Saturday 03 Oct 2015 10:18:47 Jeff Squyres wrote:
> Open MPI users --
> 
> We have just posted first release candidate for the upcoming v1.10.1 bug fix
> release.  We'd appreciate any testing and/or feedback that you may on this
> release candidate:
> 
> http://www.open-mpi.org/software/ompi/v1.10/
> 
> Thank you!
> 
> Changes since v1.10.0:
> 
> - Fix segv when invoking non-blocking reductions with a user-defined
>   operation.  Thanks to Rupert Nash and Georg Geiser for identifying
>   the issue.
> - No longer probe for PCI topology on Solaris (unless running as root).
> - Fix for Intel Parallel Studio 2016 ifort partial support of the
>   !GCC$ pragma.  Thanks to Fabrice Roy for reporting the problem.
> - Bunches of Coverity / static analysis fixes.
> - Fixed ROMIO to look for lstat in .  Thanks to William
>   Throwe for submitting the patch both upstream and to Open MPI.
> - Fixed minor memory leak when attempting to open plugins.
> - Fixed type in MPI_IBARRIER C prototype.  Thanks to Harald Servat for
>   reporting the issue.
> - Add missing man pages for MPI_WIN_CREATE_DYNAMIC, MPI_WIN_ATTACH,
>   MPI_WIN_DETACH, MPI_WIN_ALLOCATE, MPI_WIN_ALLOCATE_SHARED.
> - When mpirun-launching new applications, only close file descriptors
>   that are actually open (resulting in a faster launch in some
>   environments).
> - Fix "test ==" issues in Open MPI's configure script.  Thank to Kevin
>   Buckley for pointing out the issue.
> - Fix performance issue in usnic BTL: ensure progress thread is
>   throttled back to not aggressively steal CPU cycles.
> - Fix cache line size detection on POWER architectures.
> - Add missing #include in a few places.  Thanks to Orion Poplawski for
>   supplying the patch.
> - When OpenSHMEM building is disabled, no longer install its header
>   files, help files, or man pages.
> - Fix mpi_f08 implementations of MPI_COMM_SET_INFO, and profiling
>   versions of MPI_BUFFER_DETACH, MPI_WIN_ALLOCATE,
>   MPI_WIN_ALLOCATE_SHARED, MPI_WTICK, and MPI_WTIME.
> - Add orte_rmaps_dist_device MCA param, allowing users to map near a
>   specific device.
> - Various updates/fixes to the openib BTL.
> - Add missing defaults for the Mellanox ConnectX 3 card to the openib BTL.
> - Minor bug fixes in the OFI MTL.
> - Various updates to Mellanox's hcoll and FCA components.
> - Add OpenSHMEM man pages.  Thanks to Tony Curtis for sharing the man
>   pages files from openshmem.org.
> - Add missing "const" attributes to MPI_COMPARE_AND_SWAP,
>   MPI_FETCH_AND_OP, MPI_RACCUMULATE, and MPI_WIN_DETACH prototypes.
>   Thanks to Michael Knobloch and Takahiro Kawashima for bringing this
>   to our attention.
> - Fix linking issues on some platforms (e.g., SLES 12).
> - Fix hang on some corner cases when MPI applications abort.

-- 
Dr Dimitar Pashov
Department of Physics, S4.02
King's College London
Strand, London, WC2R 2LS, UK


Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread marcin.krotkiewski


On 10/03/2015 01:06 PM, Ralph Castain wrote:

Thanks Marcin. Looking at this, I’m guessing that Slurm may be treating HTs as 
“cores” - i.e., as independent cpus. Any chance that is true?
Not to the best of my knowledge, and at least not intentionally. SLURM 
starts as many processes as there are physical cores, not threads. To 
verify this, consider this test case:


SLURM_JOB_CPUS_PER_NODE='6,8(x2),10'
SLURM_JOB_NODELIST='c1-[30-31],c2-[32,34]'

If I now execute only one mpi process WITH NO BINDING, it will go onto 
c1-30 and should have a map with 6 CPUs (12 hw threads). I run


mpirun --bind-to none -np 1 ./affinity
rank 0 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,

I have attached the affinity.c program FYI. Clearly, sched_getaffinity 
in my test code returns the correct map.


Now if I try to start all 32 processes in this example (still no binding):

rank 0 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 1 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 10 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
27, 28, 29, 30, 31,
rank 11 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
27, 28, 29, 30, 31,
rank 12 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
27, 28, 29, 30, 31,
rank 13 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
27, 28, 29, 30, 31,
rank 6 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
27, 28, 29, 30, 31,

rank 2 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 7 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
27, 28, 29, 30, 31,
rank 8 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
27, 28, 29, 30, 31,

rank 3 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 14 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
26, 27, 28, 29, 30,

rank 4 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 15 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
26, 27, 28, 29, 30,
rank 9 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
27, 28, 29, 30, 31,

rank 5 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 16 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
26, 27, 28, 29, 30,
rank 17 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
26, 27, 28, 29, 30,
rank 29 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
18, 19, 20, 21, 22, 23, 30, 31,
rank 30 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
18, 19, 20, 21, 22, 23, 30, 31,
rank 18 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
26, 27, 28, 29, 30,
rank 19 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
26, 27, 28, 29, 30,
rank 31 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
18, 19, 20, 21, 22, 23, 30, 31,
rank 20 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
26, 27, 28, 29, 30,
rank 22 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
18, 19, 20, 21, 22, 23, 30, 31,
rank 21 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
26, 27, 28, 29, 30,
rank 23 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
18, 19, 20, 21, 22, 23, 30, 31,
rank 24 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
18, 19, 20, 21, 22, 23, 30, 31,
rank 25 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
18, 19, 20, 21, 22, 23, 30, 31,
rank 26 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
18, 19, 20, 21, 22, 23, 30, 31,
rank 27 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
18, 19, 20, 21, 22, 23, 30, 31,
rank 28 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
18, 19, 20, 21, 22, 23, 30, 31,



Still looks ok to me. If I now turn the binding on, openmpi fails:


--
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to: CORE
   Node:c1-31
   #processes:  2
   #cpus:   1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--

The above tests were done with 1.10.1rc1, so it does not fix the problem.

Marcin



I’m wondering because bind-to core will attempt to bind your proc to both HTs 
on the core. For some reason, we thought that 8.24 were HTs on the same core, 
which is why we tried to bind to that pair of HTs. We got an error because HT 
#24 was not allocated to us on node c6, but HT #8 was.



On Oct 3, 2015, at 2:43 AM, marcin.krotkiewski  
wrote:

Hi, Ralph,

I submit my slurm job as follows

salloc --ntasks=64 --mem-per-cpu=2G --time=1:0:0

Effectively, the allocated CPU cores are spread amount many cluster nodes. 
SLURM uses cgroups to limit the CPU cores available for mpi proces

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread Ralph Castain
If mpirun isn’t trying to do any binding, then you will of course get the right 
mapping as we’ll just inherit whatever we received. Looking at your output, 
it’s pretty clear that you are getting independent HTs assigned and not full 
cores. My guess is that something in slurm has changed such that it detects 
that HT has been enabled, and then begins treating the HTs as completely 
independent cpus.

Try changing “-bind-to core” to “-bind-to hwthread  -use-hwthread-cpus” and see 
if that works
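
For example, with your earlier command line that would be something like:

mpirun --report-bindings --bind-to hwthread --use-hwthread-cpus -np 64 ./affinity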


> On Oct 3, 2015, at 7:12 AM, marcin.krotkiewski  
> wrote:
> 
> 
> On 10/03/2015 01:06 PM, Ralph Castain wrote:
>> Thanks Marcin. Looking at this, I’m guessing that Slurm may be treating HTs 
>> as “cores” - i.e., as independent cpus. Any chance that is true?
> Not to the best of my knowledge, and at least not intentionally. SLURM starts 
> as many processes as there are physical cores, not threads. To verify this, 
> consider this test case:
> 
> SLURM_JOB_CPUS_PER_NODE='6,8(x2),10'
> SLURM_JOB_NODELIST='c1-[30-31],c2-[32,34]'
> 
> If I now execute only one mpi process WITH NO BINDING, it will go onto c1-30 
> and should have a map with 6 CPUs (12 hw threads). I run
> 
> mpirun --bind-to none -np 1 ./affinity
> rank 0 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
> 
> I have attached the affinity.c program FYI. Clearly, sched_getaffinity in my 
> test code returns the correct map.
> 
> Now if I try to start all 32 processes in this example (still no binding):
> 
> rank 0 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
> rank 1 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
> rank 10 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 27, 
> 28, 29, 30, 31,
> rank 11 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 27, 
> 28, 29, 30, 31,
> rank 12 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 27, 
> 28, 29, 30, 31,
> rank 13 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 27, 
> 28, 29, 30, 31,
> rank 6 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 27, 28, 
> 29, 30, 31,
> rank 2 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
> rank 7 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 27, 28, 
> 29, 30, 31,
> rank 8 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 27, 28, 
> 29, 30, 31,
> rank 3 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
> rank 14 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 26, 
> 27, 28, 29, 30,
> rank 4 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
> rank 15 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 26, 
> 27, 28, 29, 30,
> rank 9 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 27, 28, 
> 29, 30, 31,
> rank 5 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
> rank 16 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 26, 
> 27, 28, 29, 30,
> rank 17 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 26, 
> 27, 28, 29, 30,
> rank 29 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 
> 20, 21, 22, 23, 30, 31,
> rank 30 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 
> 20, 21, 22, 23, 30, 31,
> rank 18 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 26, 
> 27, 28, 29, 30,
> rank 19 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 26, 
> 27, 28, 29, 30,
> rank 31 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 
> 20, 21, 22, 23, 30, 31,
> rank 20 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 26, 
> 27, 28, 29, 30,
> rank 22 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 
> 20, 21, 22, 23, 30, 31,
> rank 21 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 26, 
> 27, 28, 29, 30,
> rank 23 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 
> 20, 21, 22, 23, 30, 31,
> rank 24 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 
> 20, 21, 22, 23, 30, 31,
> rank 25 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 
> 20, 21, 22, 23, 30, 31,
> rank 26 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 
> 20, 21, 22, 23, 30, 31,
> rank 27 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 
> 20, 21, 22, 23, 30, 31,
> rank 28 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 
> 20, 21, 22, 23, 30, 31,
> 
> 
> Still looks ok to me. If I now turn the binding on, openmpi fails:
> 
> 
> --
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
> 
>   Bind to: CORE
>   Node:c1-31
>   #processes:  2
>   #cpus:   1
> 
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
> ---

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread marcin.krotkiewski


On 10/03/2015 04:38 PM, Ralph Castain wrote:
If mpirun isn’t trying to do any binding, then you will of course get 
the right mapping as we’ll just inherit whatever we received. 
Yes. I meant that whatever you received (what SLURM gives) is a correct 
cpu map and assigns _whole_ CPUs, not single HTs, to MPI processes. In 
the case mentioned earlier openmpi should start 6 tasks on c1-30. If HTs 
were treated as separate and independent cores, sched_getaffinity of 
an MPI process started on c1-30 would return a map with 6 entries only. 
In my case it returns a map with 12 entries - 2 for each core. So one 
process is in fact allocated both HTs, not only one. Is what I'm saying 
correct?


Looking at your output, it’s pretty clear that you are getting 
independent HTs assigned and not full cores. 
How do you mean? Is the above understanding wrong? I would expect that 
on c1-30 with --bind-to core openmpi should bind to logical cores 0 and 
16 (rank 0), 1 and 17 (rank 1), and so on. All those logical cores are 
available in the sched_getaffinity map, and there are twice as many logical 
cores as there are MPI processes started on the node.


My guess is that something in slurm has changed such that it detects 
that HT has been enabled, and then begins treating the HTs as 
completely independent cpus.


Try changing “-bind-to core” to “-bind-to hwthread 
 -use-hwthread-cpus” and see if that works



I have, and the binding is wrong. For example, I got this output

rank 0 @ compute-1-30.local  0,
rank 1 @ compute-1-30.local  16,

Which means that two ranks have been bound to the same physical core 
(logical cores 0 and 16 are two HTs of the same core). If I use 
--bind-to core, I get the following correct binding


rank 0 @ compute-1-30.local  0, 16,

The problem is that many other ranks get a bad binding, with the 'rank XXX 
is not bound (or bound to all available processors)' warning.


But I think I was not entirely correct saying that 1.10.1rc1 did not fix 
things. It still might have improved something, but not everything. 
Consider this job:


SLURM_JOB_CPUS_PER_NODE='5,4,6,5(x2),7,5,9,5,7,6'
SLURM_JOB_NODELIST='c8-[31,34],c9-[30-32,35-36],c10-[31-34]'

If I run 32 tasks as follows (with 1.10.1rc1)

mpirun --hetero-nodes --report-bindings --bind-to core -np 32 ./affinity

I get the following error:

--
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to: CORE
   Node:c9-31
   #processes:  2
   #cpus:   1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--


If I now use --bind-to core:overload-allowed, then openmpi starts and 
_most_ of the ranks are bound correctly (i.e., the map contains two 
logical cores in ALL cases), except for this case, which required the 
overload flag:


rank 15 @ compute-9-31.local   1, 17,
rank 16 @ compute-9-31.local  11, 27,
rank 17 @ compute-9-31.local   2, 18,
rank 18 @ compute-9-31.local  12, 28,
rank 19 @ compute-9-31.local   1, 17,

Note that pair (1,17) is used twice. The original SLURM-delivered map (no 
binding) on this node is


rank 15 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
rank 16 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
rank 17 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
rank 18 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
rank 19 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29,

Why does openmpi use cores (1,17) twice instead of using core (13,29)? 
Clearly, the original SLURM-delivered map has 5 CPUs included, enough 
for 5 MPI processes.


Cheers,

Marcin




On Oct 3, 2015, at 7:12 AM, marcin.krotkiewski 
mailto:marcin.krotkiew...@gmail.com>> 
wrote:



On 10/03/2015 01:06 PM, Ralph Castain wrote:
Thanks Marcin. Looking at this, I’m guessing that Slurm may be 
treating HTs as “cores” - i.e., as independent cpus. Any chance that 
is true?
Not to the best of my knowledge, and at least not intentionally. 
SLURM starts as many processes as there are physical cores, not 
threads. To verify this, consider this test case:


SLURM_JOB_CPUS_PER_NODE='6,8(x2),10'
SLURM_JOB_NODELIST='c1-[30-31],c2-[32,34]'

If I now execute only one mpi process WITH NO BINDING, it will go 
onto c1-30 and should have a map with 6 CPUs (12 hw threads). I run


mpirun --bind-to none -np 1 ./affinity
rank 0 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,

I have attached the affinity.c program FYI. Clearly, 
sched_getaffinity in my test code returns the correct map.


Now if I try to start all 32 processes in this example (still no 
binding):


rank 0 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 1 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 10 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 
23, 27

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread Ralph Castain
Maybe I’m just misreading your HT map - that slurm nodelist syntax is a new one 
to me, but they tend to change things around. Could you run lstopo on one of 
those compute nodes and send the output?

I’m just suspicious because I’m not seeing a clear pairing of HT numbers in 
your output, but HT numbering is BIOS-specific and I may just not be 
understanding your particular pattern. Our error message is clearly indicating 
that we are seeing individual HTs (and not complete cores) assigned, and I 
don’t know the source of that confusion.


> On Oct 3, 2015, at 8:28 AM, marcin.krotkiewski  
> wrote:
> 
> 
> On 10/03/2015 04:38 PM, Ralph Castain wrote:
>> If mpirun isn’t trying to do any binding, then you will of course get the 
>> right mapping as we’ll just inherit whatever we received.
> Yes. I meant that whatever you received (what SLURM gives) is a correct cpu 
> map and assigns _whole_ CPUs, not a single HT to MPI processes. In the case 
> mentioned earlier openmpi should start 6 tasks on c1-30. If HT would be 
> treated as separate and independent cores, sched_getaffinity of an MPI 
> process started on c1-30 would return a map with 6 entries only. In my case 
> it returns a map with 12 entries - 2 for each core. So one  process is in 
> fact allocated both HTs, not only one. Is what I'm saying correct?
> 
>> Looking at your output, it’s pretty clear that you are getting independent 
>> HTs assigned and not full cores. 
> How do you mean? Is the above understanding wrong? I would expect that on 
> c1-30 with --bind-to core openmpi should bind to logical cores 0 and 16 (rank 
> 0), 1 and 17 (rank 2) and so on. All those logical cores are available in 
> sched_getaffinity map, and there is twice as many logical cores as there are 
> MPI processes started on the node.
> 
>> My guess is that something in slurm has changed such that it detects that HT 
>> has been enabled, and then begins treating the HTs as completely independent 
>> cpus.
>> 
>> Try changing “-bind-to core” to “-bind-to hwthread  -use-hwthread-cpus” and 
>> see if that works
>> 
> I have and the binding is wrong. For example, I got this output
> 
> rank 0 @ compute-1-30.local  0,
> rank 1 @ compute-1-30.local  16,
> 
> Which means that two ranks have been bound to the same physical core (logical 
> cores 0 and 16 are two HTs of the same core). If I use --bind-to core, I get 
> the following correct binding
> 
> rank 0 @ compute-1-30.local  0, 16,
> 
> The problem is many other ranks get bad binding with 'rank XXX is not bound 
> (or bound to all available processors)' warning.
> 
> But I think I was not entirely correct saying that 1.10.1rc1 did not fix 
> things. It still might have improved something, but not everything. Consider 
> this job:
> 
> SLURM_JOB_CPUS_PER_NODE='5,4,6,5(x2),7,5,9,5,7,6'
> SLURM_JOB_NODELIST='c8-[31,34],c9-[30-32,35-36],c10-[31-34]'
> 
> If I run 32 tasks as follows (with 1.10.1rc1)
> 
> mpirun --hetero-nodes --report-bindings --bind-to core -np 32 ./affinity
> 
> I get the following error:
> 
> --
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
> 
>Bind to: CORE
>Node:c9-31
>#processes:  2
>#cpus:   1
> 
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
> --
> 
> 
> If I now use --bind-to core:overload-allowed, then openmpi starts and _most_ 
> of the threads are bound correctly (i.e., map contains two logical cores in 
> ALL cases), except this case that required the overload flag:
> 
> rank 15 @ compute-9-31.local   1, 17,
> rank 16 @ compute-9-31.local  11, 27,
> rank 17 @ compute-9-31.local   2, 18, 
> rank 18 @ compute-9-31.local  12, 28,
> rank 19 @ compute-9-31.local   1, 17,
> 
> Note pair 1,17 is used twice. The original SLURM delivered map (no binding) 
> on this node is
> 
> rank 15 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29, 
> rank 16 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
> rank 17 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
> rank 18 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
> rank 19 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
> 
> Why does openmpi use cores (1,17) twice instead of using core (13,29)? 
> Clearly, the original SLURM-delivered map has 5 CPUs included, enough for 5 
> MPI processes. 
> 
> Cheers,
> 
> Marcin
> 
> 
>> 
>>> On Oct 3, 2015, at 7:12 AM, marcin.krotkiewski 
>>> mailto:marcin.krotkiew...@gmail.com>> wrote:
>>> 
>>> 
>>> On 10/03/2015 01:06 PM, Ralph Castain wrote:
 Thanks Marcin. Looking at this, I’m guessing that Slurm may be treating 
 HTs as “cores” - i.e., as independent cpus. Any chance that is true?
>>> Not to the best of my knowledge, and at least not intentionally. SLURM 
>>> starts as m

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread marcin.krotkiewski
Here is the output of lstopo. In short, (0,16) is core 0, (1,17) is core 
1, etc.


Machine (64GB)
  NUMANode L#0 (P#0 32GB)
Socket L#0 + L3 L#0 (20MB)
  L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#16)
  L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
PU L#2 (P#1)
PU L#3 (P#17)
  L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
PU L#4 (P#2)
PU L#5 (P#18)
  L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
PU L#6 (P#3)
PU L#7 (P#19)
  L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
PU L#8 (P#4)
PU L#9 (P#20)
  L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
PU L#10 (P#5)
PU L#11 (P#21)
  L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
PU L#12 (P#6)
PU L#13 (P#22)
  L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
PU L#14 (P#7)
PU L#15 (P#23)
HostBridge L#0
  PCIBridge
PCI 8086:1521
  Net L#0 "eth0"
PCI 8086:1521
  Net L#1 "eth1"
  PCIBridge
PCI 15b3:1003
  Net L#2 "ib0"
  OpenFabrics L#3 "mlx4_0"
  PCIBridge
PCI 102b:0532
  PCI 8086:1d02
Block L#4 "sda"
  NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
  PU L#16 (P#8)
  PU L#17 (P#24)
L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
  PU L#18 (P#9)
  PU L#19 (P#25)
L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
  PU L#20 (P#10)
  PU L#21 (P#26)
L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
  PU L#22 (P#11)
  PU L#23 (P#27)
L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
  PU L#24 (P#12)
  PU L#25 (P#28)
L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
  PU L#26 (P#13)
  PU L#27 (P#29)
L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
  PU L#28 (P#14)
  PU L#29 (P#30)
L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
  PU L#30 (P#15)
  PU L#31 (P#31)



On 10/03/2015 05:46 PM, Ralph Castain wrote:
Maybe I’m just misreading your HT map - that slurm nodelist syntax is 
a new one to me, but they tend to change things around. Could you run 
lstopo on one of those compute nodes and send the output?


I’m just suspicious because I’m not seeing a clear pairing of HT 
numbers in your output, but HT numbering is BIOS-specific and I may 
just not be understanding your particular pattern. Our error message 
is clearly indicating that we are seeing individual HTs (and not 
complete cores) assigned, and I don’t know the source of that confusion.



On Oct 3, 2015, at 8:28 AM, marcin.krotkiewski 
mailto:marcin.krotkiew...@gmail.com>> 
wrote:



On 10/03/2015 04:38 PM, Ralph Castain wrote:
If mpirun isn’t trying to do any binding, then you will of course 
get the right mapping as we’ll just inherit whatever we received.
Yes. I meant that whatever you received (what SLURM gives) is a 
correct cpu map and assigns _whole_ CPUs, not a single HT to MPI 
processes. In the case mentioned earlier openmpi should start 6 tasks 
on c1-30. If HT would be treated as separate and independent cores, 
sched_getaffinity of an MPI process started on c1-30 would return a 
map with 6 entries only. In my case it returns a map with 12 entries 
- 2 for each core. So one  process is in fact allocated both HTs, not 
only one. Is what I'm saying correct?


Looking at your output, it’s pretty clear that you are getting 
independent HTs assigned and not full cores.
How do you mean? Is the above understanding wrong? I would expect 
that on c1-30 with --bind-to core openmpi should bind to logical 
cores 0 and 16 (rank 0), 1 and 17 (rank 2) and so on. All those 
logical cores are available in sched_getaffinity map, and there is 
twice as many logical cores as there are MPI processes started on the 
node.


My guess is that something in slurm has changed such that it detects 
that HT has been enabled, and then begins treating the HTs as 
completely independent cpus.


Try changing “-bind-to core” to “-bind-to hwthread 
 -use-hwthread-cpus” and see if that works



I have and the binding is wrong. For example, I got this output

rank 0 @ compute-1-30.local  0,
rank 1 @ compute-1-30.local  16,

Which means that two ranks have been bound to the same physical core 
(logical cores 0 and 16 are two HTs of the same core). If I use 
--bind-to core, I get the following correct binding


rank 0 @ compute-1-30.local  0, 16,

The problem is many other ranks get bad binding with 'rank XXX is not 
bound (or bound to all available processors)' warning.


But I think I was not entirely correct saying that 1.10.1rc1 did not 
fix things. It still might have improved something, but not 
everythi

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread Ralph Castain
What version of slurm is this? I might try to debug it here. I’m not sure where 
the problem lies just yet.


> On Oct 3, 2015, at 8:59 AM, marcin.krotkiewski  
> wrote:
> 
> Here is the output of lstopo. In short, (0,16) are core 0, (1,17) - core 1 
> etc.
> 
> Machine (64GB)
>   NUMANode L#0 (P#0 32GB)
> Socket L#0 + L3 L#0 (20MB)
>   L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
> PU L#0 (P#0)
> PU L#1 (P#16)
>   L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
> PU L#2 (P#1)
> PU L#3 (P#17)
>   L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
> PU L#4 (P#2)
> PU L#5 (P#18)
>   L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
> PU L#6 (P#3)
> PU L#7 (P#19)
>   L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
> PU L#8 (P#4)
> PU L#9 (P#20)
>   L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
> PU L#10 (P#5)
> PU L#11 (P#21)
>   L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
> PU L#12 (P#6)
> PU L#13 (P#22)
>   L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
> PU L#14 (P#7)
> PU L#15 (P#23)
> HostBridge L#0
>   PCIBridge
> PCI 8086:1521
>   Net L#0 "eth0"
> PCI 8086:1521
>   Net L#1 "eth1"
>   PCIBridge
> PCI 15b3:1003
>   Net L#2 "ib0"
>   OpenFabrics L#3 "mlx4_0"
>   PCIBridge
> PCI 102b:0532
>   PCI 8086:1d02
> Block L#4 "sda"
>   NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
> L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
>   PU L#16 (P#8)
>   PU L#17 (P#24)
> L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
>   PU L#18 (P#9)
>   PU L#19 (P#25)
> L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
>   PU L#20 (P#10)
>   PU L#21 (P#26)
> L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
>   PU L#22 (P#11)
>   PU L#23 (P#27)
> L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
>   PU L#24 (P#12)
>   PU L#25 (P#28)
> L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
>   PU L#26 (P#13)
>   PU L#27 (P#29)
> L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
>   PU L#28 (P#14)
>   PU L#29 (P#30)
> L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
>   PU L#30 (P#15)
>   PU L#31 (P#31)
> 
> 
> 
> On 10/03/2015 05:46 PM, Ralph Castain wrote:
>> Maybe I’m just misreading your HT map - that slurm nodelist syntax is a new 
>> one to me, but they tend to change things around. Could you run lstopo on 
>> one of those compute nodes and send the output?
>> 
>> I’m just suspicious because I’m not seeing a clear pairing of HT numbers in 
>> your output, but HT numbering is BIOS-specific and I may just not be 
>> understanding your particular pattern. Our error message is clearly 
>> indicating that we are seeing individual HTs (and not complete cores) 
>> assigned, and I don’t know the source of that confusion.
>> 
>> 
>>> On Oct 3, 2015, at 8:28 AM, marcin.krotkiewski 
>>> mailto:marcin.krotkiew...@gmail.com>> wrote:
>>> 
>>> 
>>> On 10/03/2015 04:38 PM, Ralph Castain wrote:
 If mpirun isn’t trying to do any binding, then you will of course get the 
 right mapping as we’ll just inherit whatever we received.
>>> Yes. I meant that whatever you received (what SLURM gives) is a correct cpu 
>>> map and assigns _whole_ CPUs, not a single HT to MPI processes. In the case 
>>> mentioned earlier openmpi should start 6 tasks on c1-30. If HT would be 
>>> treated as separate and independent cores, sched_getaffinity of an MPI 
>>> process started on c1-30 would return a map with 6 entries only. In my case 
>>> it returns a map with 12 entries - 2 for each core. So one  process is in 
>>> fact allocated both HTs, not only one. Is what I'm saying correct?
>>> 
 Looking at your output, it’s pretty clear that you are getting independent 
 HTs assigned and not full cores. 
>>> How do you mean? Is the above understanding wrong? I would expect that on 
>>> c1-30 with --bind-to core openmpi should bind to logical cores 0 and 16 
>>> (rank 0), 1 and 17 (rank 2) and so on. All those logical cores are 
>>> available in sched_getaffinity map, and there is twice as many logical 
>>> cores as there are MPI processes started on the node.
>>> 
 My guess is that something in slurm has changed such that it detects that 
 HT has been enabled, and then begins treating the HTs as completely 
 independent cpus.
 
 Try changing “-bind-to core” to “-bind-to hwthread  -use-hwthread-cpus” 
 and see if that works
 
>>> I have and the binding is wrong. For example, I got this output
>>> 
>>> rank 0 @ compute-1-30.local  0,
>>> rank 1 @ compute-1-30.local  16,
>>> 
>

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread Ralph Castain
Rats - just realized I have no way to test this, as none of the machines I can 
access are set up for cgroup-based multi-tenancy. Is this a debug version of 
OMPI? If not, can you rebuild OMPI with --enable-debug?

Then please run it with --mca rmaps_base_verbose 10 and pass along the output.

Thanks
Ralph


> On Oct 3, 2015, at 10:09 AM, Ralph Castain  wrote:
> 
> What version of slurm is this? I might try to debug it here. I’m not sure 
> where the problem lies just yet.
> 
> 
>> On Oct 3, 2015, at 8:59 AM, marcin.krotkiewski > > wrote:
>> 
>> Here is the output of lstopo. In short, (0,16) are core 0, (1,17) - core 1 
>> etc.
>> 
>> Machine (64GB)
>>   NUMANode L#0 (P#0 32GB)
>> Socket L#0 + L3 L#0 (20MB)
>>   L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
>> PU L#0 (P#0)
>> PU L#1 (P#16)
>>   L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
>> PU L#2 (P#1)
>> PU L#3 (P#17)
>>   L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
>> PU L#4 (P#2)
>> PU L#5 (P#18)
>>   L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
>> PU L#6 (P#3)
>> PU L#7 (P#19)
>>   L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
>> PU L#8 (P#4)
>> PU L#9 (P#20)
>>   L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
>> PU L#10 (P#5)
>> PU L#11 (P#21)
>>   L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
>> PU L#12 (P#6)
>> PU L#13 (P#22)
>>   L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
>> PU L#14 (P#7)
>> PU L#15 (P#23)
>> HostBridge L#0
>>   PCIBridge
>> PCI 8086:1521
>>   Net L#0 "eth0"
>> PCI 8086:1521
>>   Net L#1 "eth1"
>>   PCIBridge
>> PCI 15b3:1003
>>   Net L#2 "ib0"
>>   OpenFabrics L#3 "mlx4_0"
>>   PCIBridge
>> PCI 102b:0532
>>   PCI 8086:1d02
>> Block L#4 "sda"
>>   NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
>> L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
>>   PU L#16 (P#8)
>>   PU L#17 (P#24)
>> L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
>>   PU L#18 (P#9)
>>   PU L#19 (P#25)
>> L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
>>   PU L#20 (P#10)
>>   PU L#21 (P#26)
>> L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
>>   PU L#22 (P#11)
>>   PU L#23 (P#27)
>> L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
>>   PU L#24 (P#12)
>>   PU L#25 (P#28)
>> L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
>>   PU L#26 (P#13)
>>   PU L#27 (P#29)
>> L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
>>   PU L#28 (P#14)
>>   PU L#29 (P#30)
>> L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
>>   PU L#30 (P#15)
>>   PU L#31 (P#31)
>> 
>> 
>> 
>> On 10/03/2015 05:46 PM, Ralph Castain wrote:
>>> Maybe I’m just misreading your HT map - that slurm nodelist syntax is a new 
>>> one to me, but they tend to change things around. Could you run lstopo on 
>>> one of those compute nodes and send the output?
>>> 
>>> I’m just suspicious because I’m not seeing a clear pairing of HT numbers in 
>>> your output, but HT numbering is BIOS-specific and I may just not be 
>>> understanding your particular pattern. Our error message is clearly 
>>> indicating that we are seeing individual HTs (and not complete cores) 
>>> assigned, and I don’t know the source of that confusion.
>>> 
>>> 
 On Oct 3, 2015, at 8:28 AM, marcin.krotkiewski 
 mailto:marcin.krotkiew...@gmail.com>> wrote:
 
 
 On 10/03/2015 04:38 PM, Ralph Castain wrote:
> If mpirun isn’t trying to do any binding, then you will of course get the 
> right mapping as we’ll just inherit whatever we received.
 Yes. I meant that whatever you received (what SLURM gives) is a correct 
 cpu map and assigns _whole_ CPUs, not a single HT to MPI processes. In the 
 case mentioned earlier openmpi should start 6 tasks on c1-30. If HT would 
 be treated as separate and independent cores, sched_getaffinity of an MPI 
 process started on c1-30 would return a map with 6 entries only. In my 
 case it returns a map with 12 entries - 2 for each core. So one  process 
 is in fact allocated both HTs, not only one. Is what I'm saying correct?
 
> Looking at your output, it’s pretty clear that you are getting 
> independent HTs assigned and not full cores. 
 How do you mean? Is the above understanding wrong? I would expect that on 
 c1-30 with --bind-to core openmpi should bind to logical cores 0 and 16 
 (rank 0), 1 and 17 (rank 2) and so on. All those logical cores are 
 available in sched_getaffinity map, and there is twice as many logic

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread marcin.krotkiewski


Done. I have compiled 1.10.0 and 1.10.1rc1 with --enable-debug and executed

mpirun --mca rmaps_base_verbose 10 --hetero-nodes --report-bindings 
--bind-to core -np 32 ./affinity


In the case of 1.10.1rc1 I have also added :overload-allowed - that output 
is in a separate file. This option did not make much difference for 
1.10.0, so I did not attach it here.


First thing I noted for 1.10.0 are lines like

[login-0-1.local:03399] [[37945,0],0] GOT 1 CPUS
[login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] BITMAP
[login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] ON c1-26 IS 
NOT BOUND


with an empty BITMAP.

The SLURM environment is

set | grep SLURM
SLURM_JOBID=12714491
SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5'
SLURM_JOB_ID=12714491
SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]'
SLURM_JOB_NUM_NODES=7
SLURM_JOB_PARTITION=normal
SLURM_MEM_PER_CPU=2048
SLURM_NNODES=7
SLURM_NODELIST='c1-[2,4,8,13,16,23,26]'
SLURM_NODE_ALIASES='(null)'
SLURM_NPROCS=32
SLURM_NTASKS=32
SLURM_SUBMIT_DIR=/cluster/home/marcink
SLURM_SUBMIT_HOST=login-0-1.local
SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5'

I have now submitted an interactive job in screen for 120 hours, so that I 
can work with one example and not change it for every post :)


If you need anything else, let me know. I could introduce some 
patch/printfs and recompile, if you need it.


Marcin



On 10/03/2015 07:17 PM, Ralph Castain wrote:
Rats - just realized I have no way to test this as none of the 
machines I can access are setup for cgroup-based multi-tenant. Is this 
a debug version of OMPI? If not, can you rebuild OMPI with --enable-debug?


Then please run it with --mca rmaps_base_verbose 10 and pass along the 
output.


Thanks
Ralph


On Oct 3, 2015, at 10:09 AM, Ralph Castain > wrote:


What version of slurm is this? I might try to debug it here. I’m not 
sure where the problem lies just yet.



On Oct 3, 2015, at 8:59 AM, marcin.krotkiewski 
mailto:marcin.krotkiew...@gmail.com>> 
wrote:


Here is the output of lstopo. In short, (0,16) are core 0, (1,17) - 
core 1 etc.


Machine (64GB)
  NUMANode L#0 (P#0 32GB)
Socket L#0 + L3 L#0 (20MB)
  L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#16)
  L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
PU L#2 (P#1)
PU L#3 (P#17)
  L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
PU L#4 (P#2)
PU L#5 (P#18)
  L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
PU L#6 (P#3)
PU L#7 (P#19)
  L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
PU L#8 (P#4)
PU L#9 (P#20)
  L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
PU L#10 (P#5)
PU L#11 (P#21)
  L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
PU L#12 (P#6)
PU L#13 (P#22)
  L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
PU L#14 (P#7)
PU L#15 (P#23)
HostBridge L#0
  PCIBridge
PCI 8086:1521
  Net L#0 "eth0"
PCI 8086:1521
  Net L#1 "eth1"
  PCIBridge
PCI 15b3:1003
  Net L#2 "ib0"
  OpenFabrics L#3 "mlx4_0"
  PCIBridge
PCI 102b:0532
  PCI 8086:1d02
Block L#4 "sda"
  NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
  PU L#16 (P#8)
  PU L#17 (P#24)
L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
  PU L#18 (P#9)
  PU L#19 (P#25)
L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
  PU L#20 (P#10)
  PU L#21 (P#26)
L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
  PU L#22 (P#11)
  PU L#23 (P#27)
L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
  PU L#24 (P#12)
  PU L#25 (P#28)
L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
  PU L#26 (P#13)
  PU L#27 (P#29)
L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
  PU L#28 (P#14)
  PU L#29 (P#30)
L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
  PU L#30 (P#15)
  PU L#31 (P#31)



On 10/03/2015 05:46 PM, Ralph Castain wrote:
Maybe I’m just misreading your HT map - that slurm nodelist syntax 
is a new one to me, but they tend to change things around. Could 
you run lstopo on one of those compute nodes and send the output?


I’m just suspicious because I’m not seeing a clear pairing of HT 
numbers in your output, but HT numbering is BIOS-specific and I may 
just not be understanding your particular pattern. Our error 
message is clearly indicating that we are seeing individual HTs 
(and not complete cores) assigned, and I don’t know the source of 
that confusion.



On Oct 3, 2015, at 8:28 AM, marcin.krotkiewski 
> wrote:



On 10/03/2015 04:38 PM, Ralph Castain wrote:
If mpirun i

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread Ralph Castain
Thanks - please go ahead and release that allocation as I’m not going to get to 
this immediately. I’ve got several hot irons in the fire right now, and I’m not 
sure when I’ll get a chance to track this down.

Gilles or anyone else who might have time - feel free to take a gander and see 
if something pops out at you.

Ralph


> On Oct 3, 2015, at 11:05 AM, marcin.krotkiewski 
>  wrote:
> 
> 
> Done. I have compiled 1.10.0 and 1.10.rc1 with --enable-debug and executed
> 
> mpirun --mca rmaps_base_verbose 10 --hetero-nodes --report-bindings --bind-to 
> core -np 32 ./affinity
> 
> In case of 1.10.rc1 I have also added :overload-allowed - output in a 
> separate file. This option did not make much difference for 1.10.0, so I did 
> not attach it here.
> 
> First thing I noted for 1.10.0 are lines like
> 
> [login-0-1.local:03399] [[37945,0],0] GOT 1 CPUS
> [login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] BITMAP
> [login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] ON c1-26 IS NOT 
> BOUND
> 
> with an empty BITMAP.
> 
> The SLURM environment is
> 
> set | grep SLURM
> SLURM_JOBID=12714491
> SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5'
> SLURM_JOB_ID=12714491
> SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]'
> SLURM_JOB_NUM_NODES=7
> SLURM_JOB_PARTITION=normal
> SLURM_MEM_PER_CPU=2048
> SLURM_NNODES=7
> SLURM_NODELIST='c1-[2,4,8,13,16,23,26]'
> SLURM_NODE_ALIASES='(null)'
> SLURM_NPROCS=32
> SLURM_NTASKS=32
> SLURM_SUBMIT_DIR=/cluster/home/marcink
> SLURM_SUBMIT_HOST=login-0-1.local
> SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5'
> 
> I have submitted an interactive job on screen for 120 hours now to work with 
> one example, and not change it for every post :)
> 
> If you need anything else, let me know. I could introduce some patch/printfs 
> and recompile, if you need it.
> 
> Marcin
> 
> 
> 
> On 10/03/2015 07:17 PM, Ralph Castain wrote:
>> Rats - just realized I have no way to test this as none of the machines I 
>> can access are setup for cgroup-based multi-tenant. Is this a debug version 
>> of OMPI? If not, can you rebuild OMPI with --enable-debug?
>> 
>> Then please run it with --mca rmaps_base_verbose 10 and pass along the output.
>> 
>> Thanks
>> Ralph
>> 
>> 
>>> On Oct 3, 2015, at 10:09 AM, Ralph Castain >> > wrote:
>>> 
>>> What version of slurm is this? I might try to debug it here. I’m not sure 
>>> where the problem lies just yet.
>>> 
>>> 
 On Oct 3, 2015, at 8:59 AM, marcin.krotkiewski 
 mailto:marcin.krotkiew...@gmail.com>> wrote:
 
 Here is the output of lstopo. In short, (0,16) are core 0, (1,17) - core 1 
 etc.
 
 Machine (64GB)
   NUMANode L#0 (P#0 32GB)
 Socket L#0 + L3 L#0 (20MB)
   L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
 PU L#0 (P#0)
 PU L#1 (P#16)
   L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
 PU L#2 (P#1)
 PU L#3 (P#17)
   L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
 PU L#4 (P#2)
 PU L#5 (P#18)
   L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
 PU L#6 (P#3)
 PU L#7 (P#19)
   L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
 PU L#8 (P#4)
 PU L#9 (P#20)
   L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
 PU L#10 (P#5)
 PU L#11 (P#21)
   L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
 PU L#12 (P#6)
 PU L#13 (P#22)
   L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
 PU L#14 (P#7)
 PU L#15 (P#23)
 HostBridge L#0
   PCIBridge
 PCI 8086:1521
   Net L#0 "eth0"
 PCI 8086:1521
   Net L#1 "eth1"
   PCIBridge
 PCI 15b3:1003
   Net L#2 "ib0"
   OpenFabrics L#3 "mlx4_0"
   PCIBridge
 PCI 102b:0532
   PCI 8086:1d02
 Block L#4 "sda"
   NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
 L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
   PU L#16 (P#8)
   PU L#17 (P#24)
 L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
   PU L#18 (P#9)
   PU L#19 (P#25)
 L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
   PU L#20 (P#10)
   PU L#21 (P#26)
 L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
   PU L#22 (P#11)
   PU L#23 (P#27)
 L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
   PU L#24 (P#12)
   PU L#25 (P#28)
 L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
   PU L#26 (P#13)
   PU L#27 (P#29)
 L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
   PU L#28 (P#14)