Hi, all,
I played a bit more, and it seems that the problem results from
trg_obj = opal_hwloc_base_find_min_bound_target_under_obj()
called in rmaps_base_binding.c / bind_downwards returning the wrong
object. I do not know the reason, but I think I know when the problem
happens (at least on 1.10.1rc1). By default Open MPI maps by socket, and
the error happens when, for a given compute node, a different number of
cores is used on each socket. Consider the previously studied case (the
debug outputs I sent in my last post). c1-8, which was the source of the
error, has 5 MPI processes assigned, and its cpuset is the following:
0, 5, 9, 13, 14, 16, 21, 25, 29, 30
Cores 0 and 5 are on socket 0; cores 9, 13, and 14 are on socket 1.
Binding progresses correctly up to and including core 13 (see the end of
file out.1.10.1rc2, before the error), i.e., 2 cores on socket 0 and 2
cores on socket 1. The error is thrown when core 14 should be bound - the
extra core on socket 1 with no corresponding core on socket 0. At that
point the returned trg_obj points to the first core on the node
(os_index 0, socket 0).
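(In case it helps to reproduce the check: the following is just a minimal
sketch of how one could verify the per-socket distribution of the allocated
cores from inside the job; it is not code from Open MPI. It assumes the
hwloc 1.x API, where sockets are HWLOC_OBJ_SOCKET, and that the topology
hwloc loads is restricted to the SLURM cgroup.)

/* Minimal sketch: print which socket each allowed core belongs to.
 * Build with something like: gcc check_sockets.c -lhwloc */
#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    int i, ncores;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* cpuset this process is allowed to use (reflects the SLURM cgroup) */
    hwloc_const_cpuset_t allowed = hwloc_topology_get_allowed_cpuset(topo);

    ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    for (i = 0; i < ncores; i++) {
        hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
        if (!hwloc_bitmap_intersects(core->cpuset, allowed))
            continue;                     /* core not in our cpuset */
        hwloc_obj_t sock =
            hwloc_get_ancestor_obj_by_type(topo, HWLOC_OBJ_SOCKET, core);
        printf("core os_index %u -> socket %u\n",
               core->os_index, sock ? sock->os_index : 0u);
    }

    hwloc_topology_destroy(topo);
    return 0;
}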
I have submitted a few other jobs and always got an error in this
situation. Moreover, if I use --map-by core instead of socket, the
error is gone and I get the expected binding:
rank 0 @ compute-1-2.local 1, 17,
rank 1 @ compute-1-2.local 2, 18,
rank 2 @ compute-1-2.local 3, 19,
rank 3 @ compute-1-2.local 4, 20,
rank 4 @ compute-1-4.local 1, 17,
rank 5 @ compute-1-4.local 15, 31,
rank 6 @ compute-1-8.local 0, 16,
rank 7 @ compute-1-8.local 5, 21,
rank 8 @ compute-1-8.local 9, 25,
rank 9 @ compute-1-8.local 13, 29,
rank 10 @ compute-1-8.local 14, 30,
rank 11 @ compute-1-13.local 3, 19,
rank 12 @ compute-1-13.local 4, 20,
rank 13 @ compute-1-13.local 5, 21,
rank 14 @ compute-1-13.local 6, 22,
rank 15 @ compute-1-13.local 7, 23,
rank 16 @ compute-1-16.local 12, 28,
rank 17 @ compute-1-16.local 13, 29,
rank 18 @ compute-1-16.local 14, 30,
rank 19 @ compute-1-16.local 15, 31,
rank 20 @ compute-1-23.local 2, 18,
rank 21 @ compute-1-23.local 3, 19,
rank 22 @ compute-1-23.local 4, 20,
rank 23 @ compute-1-23.local 8, 24,
rank 24 @ compute-1-23.local 13, 29,
rank 25 @ compute-1-23.local 14, 30,
rank 26 @ compute-1-23.local 15, 31,
rank 27 @ compute-1-26.local 1, 17,
rank 28 @ compute-1-26.local 6, 22,
rank 29 @ compute-1-26.local 11, 27,
rank 30 @ compute-1-26.local 13, 29,
rank 31 @ compute-1-26.local 15, 31,
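(A side note: the output format above comes from the small test program
./affinity, whose source is not included in this thread. A minimal sketch
of such a reporter, assuming MPI plus Linux sched_getaffinity() and built
with mpicc, could look like this; it is illustrative only, not the actual
program.)

/* Hypothetical affinity reporter: each rank prints the hostname and the
 * CPUs in its sched_getaffinity() mask, similar to the output above. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, cpu, off;
    char host[256], buf[8192];
    cpu_set_t mask;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    gethostname(host, sizeof(host));
    sched_getaffinity(0, sizeof(mask), &mask);

    /* build one line per rank so the output does not interleave mid-line */
    off = snprintf(buf, sizeof(buf), "rank %d @ %s ", rank, host);
    for (cpu = 0; cpu < CPU_SETSIZE && off < (int)sizeof(buf) - 16; cpu++)
        if (CPU_ISSET(cpu, &mask))
            off += snprintf(buf + off, sizeof(buf) - off, "%d, ", cpu);
    printf("%s\n", buf);

    MPI_Finalize();
    return 0;
}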
Using --map-by core seems to fix the issue on 1.8.8, 1.10.0 and
1.10.1rc1. However, there is still a difference in behavior between
1.10.1rc1 and the earlier versions. In the SLURM job described in my last
post, 1.10.1rc1 fails to bind in only 1 case, while the earlier versions
fail in 21 out of 32 cases. You mentioned there was a bug in hwloc; I am
not sure whether it explains the difference in behavior.
Hope this helps to nail this down.
Marcin
On 10/04/2015 09:55 AM, Gilles Gouaillardet wrote:
Ralph,
I suspect ompi tries to bind to threads outside the cpuset.
This could be pretty similar to a previous issue, when ompi tried to
bind to cores outside the cpuset.
/* when a core has more than one thread, would ompi assume all the
threads are available if the core is available ? */
I will investigate this starting tomorrow.
Cheers,
Gilles
On Sunday, October 4, 2015, Ralph Castain <r...@open-mpi.org> wrote:
Thanks - please go ahead and release that allocation as I’m not
going to get to this immediately. I’ve got several hot irons in
the fire right now, and I’m not sure when I’ll get a chance to
track this down.
Gilles or anyone else who might have time - feel free to take a
gander and see if something pops out at you.
Ralph
On Oct 3, 2015, at 11:05 AM, marcin.krotkiewski
<marcin.krotkiew...@gmail.com> wrote:
Done. I have compiled 1.10.0 and 1.10.1rc1 with --enable-debug and
executed
mpirun --mca rmaps_base_verbose 10 --hetero-nodes --report-bindings --bind-to core -np 32 ./affinity
In the case of 1.10.1rc1 I have also added :overload-allowed - that
output is in a separate file. This option did not make much difference
for 1.10.0, so I did not attach it here.
The first thing I noted for 1.10.0 is lines like
[login-0-1.local:03399] [[37945,0],0] GOT 1 CPUS
[login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] BITMAP
[login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] ON
c1-26 IS NOT BOUND
with an empty BITMAP.
The SLURM environment is
set | grep SLURM
SLURM_JOBID=12714491
SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5'
SLURM_JOB_ID=12714491
SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]'
SLURM_JOB_NUM_NODES=7
SLURM_JOB_PARTITION=normal
SLURM_MEM_PER_CPU=2048
SLURM_NNODES=7
SLURM_NODELIST='c1-[2,4,8,13,16,23,26]'
SLURM_NODE_ALIASES='(null)'
SLURM_NPROCS=32
SLURM_NTASKS=32
SLURM_SUBMIT_DIR=/cluster/home/marcink
SLURM_SUBMIT_HOST=login-0-1.local
SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5'
I have now submitted an interactive job in screen for 120 hours, so I
can work with one example and not change it for every post :)
If you need anything else, let me know. I can add some patches/printfs
and recompile if needed.
Marcin
On 10/03/2015 07:17 PM, Ralph Castain wrote:
Rats - I just realized I have no way to test this, as none of the
machines I can access are set up for cgroup-based multi-tenancy.
Is this a debug version of OMPI? If not, can you rebuild OMPI
with --enable-debug?
Then please run it with --mca rmaps_base_verbose 10 and pass
along the output.
Thanks
Ralph
On Oct 3, 2015, at 10:09 AM, Ralph Castain <r...@open-mpi.org> wrote:
What version of slurm is this? I might try to debug it here.
I’m not sure where the problem lies just yet.
On Oct 3, 2015, at 8:59 AM, marcin.krotkiewski
<marcin.krotkiew...@gmail.com> wrote:
Here is the output of lstopo. In short, PUs (0,16) are core 0,
(1,17) are core 1, etc.
Machine (64GB)
  NUMANode L#0 (P#0 32GB)
    Socket L#0 + L3 L#0 (20MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#16)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#17)
      L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#18)
      L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#19)
      L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#20)
      L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#21)
      L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#22)
      L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#23)
    HostBridge L#0
      PCIBridge
        PCI 8086:1521
          Net L#0 "eth0"
        PCI 8086:1521
          Net L#1 "eth1"
      PCIBridge
        PCI 15b3:1003
          Net L#2 "ib0"
          OpenFabrics L#3 "mlx4_0"
      PCIBridge
        PCI 102b:0532
      PCI 8086:1d02
        Block L#4 "sda"
  NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
    L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
      PU L#16 (P#8)
      PU L#17 (P#24)
    L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
      PU L#18 (P#9)
      PU L#19 (P#25)
    L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
      PU L#20 (P#10)
      PU L#21 (P#26)
    L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
      PU L#22 (P#11)
      PU L#23 (P#27)
    L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
      PU L#24 (P#12)
      PU L#25 (P#28)
    L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
      PU L#26 (P#13)
      PU L#27 (P#29)
    L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
      PU L#28 (P#14)
      PU L#29 (P#30)
    L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
      PU L#30 (P#15)
      PU L#31 (P#31)
On 10/03/2015 05:46 PM, Ralph Castain wrote:
Maybe I’m just misreading your HT map - that slurm nodelist
syntax is a new one to me, but they tend to change things
around. Could you run lstopo on one of those compute nodes
and send the output?
I’m just suspicious because I’m not seeing a clear pairing of
HT numbers in your output, but HT numbering is BIOS-specific
and I may just not be understanding your particular pattern.
Our error message is clearly indicating that we are seeing
individual HTs (and not complete cores) assigned, and I don’t
know the source of that confusion.
On Oct 3, 2015, at 8:28 AM, marcin.krotkiewski
<marcin.krotkiew...@gmail.com> wrote:
On 10/03/2015 04:38 PM, Ralph Castain wrote:
If mpirun isn’t trying to do any binding, then you will of
course get the right mapping as we’ll just inherit whatever
we received.
Yes. I meant that whatever you received (what SLURM gives)
is a correct cpu map and assigns _whole_ cores, not single
HTs, to MPI processes. In the case mentioned earlier, Open MPI
should start 6 tasks on c1-30. If the HTs were treated as
separate, independent cores, sched_getaffinity of an MPI
process started on c1-30 would return a map with only 6
entries. In my case it returns a map with 12 entries - 2 for
each core. So each process is in fact allocated both HTs, not
only one. Is what I'm saying correct?
Looking at your output, it’s pretty clear that you are
getting independent HTs assigned and not full cores.
How do you mean? Is the above understanding wrong? I would
expect that on c1-30 with --bind-to core Open MPI should bind
to logical cores 0 and 16 (rank 0), 1 and 17 (rank 1), and so
on. All those logical cores are available in the
sched_getaffinity map, and there are twice as many logical
cores as MPI processes started on the node.
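(For completeness, this HT pairing can also be checked from inside a rank
without hwloc. The sketch below is only an illustration, assuming the
usual Linux sysfs layout; it maps every logical CPU in the
sched_getaffinity() mask to its physical_package_id and core_id, so both
HTs of an allocated core should show up with the same (socket, core) pair.)

/* Illustrative check: group the CPUs in the affinity mask by physical core. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int read_id(const char *fmt, int cpu)
{
    /* read one integer from a sysfs topology file such as
     * /sys/devices/system/cpu/cpuN/topology/core_id */
    char path[128];
    int id = -1;
    FILE *f;
    snprintf(path, sizeof(path), fmt, cpu);
    if ((f = fopen(path, "r"))) {
        if (fscanf(f, "%d", &id) != 1)
            id = -1;
        fclose(f);
    }
    return id;
}

int main(void)
{
    cpu_set_t mask;
    int cpu;

    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_getaffinity");
        return 1;
    }
    for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (!CPU_ISSET(cpu, &mask))
            continue;
        printf("logical cpu %d -> socket %d, core %d\n", cpu,
               read_id("/sys/devices/system/cpu/cpu%d/topology/physical_package_id", cpu),
               read_id("/sys/devices/system/cpu/cpu%d/topology/core_id", cpu));
    }
    return 0;
}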
My guess is that something in slurm has changed such that
it detects that HT has been enabled, and then begins
treating the HTs as completely independent cpus.
Try changing “-bind-to core” to “-bind-to hwthread
-use-hwthread-cpus” and see if that works
I have, and the binding is wrong. For example, I got this output:
rank 0 @ compute-1-30.local 0,
rank 1 @ compute-1-30.local 16,
This means that two ranks have been bound to the same
physical core (logical cores 0 and 16 are two HTs of the
same core). If I use --bind-to core, I get the following
correct binding:
rank 0 @ compute-1-30.local 0, 16,
The problem is that many other ranks get a bad binding, with a 'rank
XXX is not bound (or bound to all available processors)'
warning.
But I think I was not entirely correct in saying that 1.10.1rc1
did not fix things. It may still have improved something,
but not everything. Consider this job:
SLURM_JOB_CPUS_PER_NODE='5,4,6,5(x2),7,5,9,5,7,6'
SLURM_JOB_NODELIST='c8-[31,34],c9-[30-32,35-36],c10-[31-34]'
If I run 32 tasks as follows (with 1.10.1rc1)
mpirun --hetero-nodes --report-bindings --bind-to core -np
32 ./affinity
I get the following error:
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:
Bind to: CORE
Node: c9-31
#processes: 2
#cpus: 1
You can override this protection by adding the
"overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
If I now use --bind-to core:overload-allowed, then Open MPI
starts and _most_ of the ranks are bound correctly (i.e., the
map contains two logical cores in ALL cases), except for this
case, which required the overload flag:
rank 15 @ compute-9-31.local 1, 17,
rank 16 @ compute-9-31.local 11, 27,
rank 17 @ compute-9-31.local 2, 18,
rank 18 @ compute-9-31.local 12, 28,
rank 19 @ compute-9-31.local 1, 17,
Note that the pair (1,17) is used twice. The original SLURM-delivered
map (no binding) on this node is:
rank 15 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27,
28, 29,
rank 16 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27,
28, 29,
rank 17 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27,
28, 29,
rank 18 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27,
28, 29,
rank 19 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27,
28, 29,
Why does Open MPI use core (1,17) twice instead of using
core (13,29)? Clearly, the original SLURM-delivered map
includes 5 full cores, enough for 5 MPI processes.
Cheers,
Marcin
On Oct 3, 2015, at 7:12 AM, marcin.krotkiewski
<marcin.krotkiew...@gmail.com> wrote:
On 10/03/2015 01:06 PM, Ralph Castain wrote:
Thanks Marcin. Looking at this, I’m guessing that Slurm
may be treating HTs as “cores” - i.e., as independent
cpus. Any chance that is true?
Not to the best of my knowledge, and at least not
intentionally. SLURM starts as many processes as there are
physical cores, not threads. To verify this, consider this
test case: