Marcin,
here is a patch against master; hopefully it fixes all the issues we
discussed. I will make sure it applies cleanly against the latest 1.10
tarball tomorrow.
Cheers,
Gilles
On 10/6/2015 7:22 PM, marcin.krotkiewski wrote:
Gilles,
Yes, it seems that binding was fine with the patched 1.10.1rc1
- thank you. I am eagerly waiting for the other patches; let me know and I
will test them later this week.
Marcin
On 10/06/2015 12:09 PM, Gilles Gouaillardet wrote:
Marcin,
My understanding is that in this case the patched v1.10.1rc1 is working
just fine. Am I right?
I prepared two patches:
one to remove the warning when binding to one core if only one core
is available,
and another to add a warning if the user requests a binding policy that
makes no sense with the requested mapping policy.
I will hopefully finalize them tomorrow.
Cheers,
Gilles
On Tuesday, October 6, 2015, marcin.krotkiewski
<marcin.krotkiew...@gmail.com> wrote:
Hi, Gilles
You mentioned you had one failure with 1.10.1rc1 and -bind-to core.
Could you please send the full details (script, allocation, and output)?
In your SLURM script, before invoking mpirun, you can do:
srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none -l grep Cpus_allowed_list /proc/self/status
It was an interactive job allocated with
salloc --account=staff --ntasks=32 --mem-per-cpu=2G --time=120:0:0
The slurm environment is the following
SLURM_JOBID=12714491
SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5'
SLURM_JOB_ID=12714491
SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]'
SLURM_JOB_NUM_NODES=7
SLURM_JOB_PARTITION=normal
SLURM_MEM_PER_CPU=2048
SLURM_NNODES=7
SLURM_NODELIST='c1-[2,4,8,13,16,23,26]'
SLURM_NODE_ALIASES='(null)'
SLURM_NPROCS=32
SLURM_NTASKS=32
SLURM_SUBMIT_DIR=/cluster/home/marcink
SLURM_SUBMIT_HOST=login-0-1.local
SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5'
The output of the command you asked for is
0: c1-2.local Cpus_allowed_list: 1-4,17-20
1: c1-4.local Cpus_allowed_list: 1,15,17,31
2: c1-8.local Cpus_allowed_list: 0,5,9,13-14,16,21,25,29-30
3: c1-13.local Cpus_allowed_list: 3-7,19-23
4: c1-16.local Cpus_allowed_list: 12-15,28-31
5: c1-23.local Cpus_allowed_list: 2-4,8,13-15,18-20,24,29-31
6: c1-26.local Cpus_allowed_list: 1,6,11,13,15,17,22,27,29,31
Running with the command
mpirun --mca rmaps_base_verbose 10 --hetero-nodes --bind-to core --report-bindings --map-by socket -np 32 ./affinity
I have attached two output files: one for the original 1.10.1rc1,
one for the patched version.
When I said it 'failed in one case' I was not precise. I got an
error on node c1-8, which was the first one to have a different
number of MPI processes on the two sockets. It would also have failed on
some later nodes, but because of the error we never got there.
Let me know if you need more.
Marcin
Cheers,
Gilles
On 10/4/2015 11:55 PM, marcin.krotkiewski wrote:
Hi, all,
I played a bit more, and it seems that the problem is that
trg_obj = opal_hwloc_base_find_min_bound_target_under_obj(),
called in rmaps_base_binding.c / bind_downwards, returns the wrong object.
I do not know the reason, but I think I know when the problem
happens (at least on 1.10.1rc1). By default, openmpi maps by socket,
and the error happens when a given compute node uses a different
number of cores on each socket. Consider the previously studied case
(the debug outputs I sent in the last post). c1-8, which was the
source of the error, has 5 MPI processes assigned, and the cpuset
is the following:
0, 5, 9, 13, 14, 16, 21, 25, 29, 30
Cores 0 and 5 are on socket 0; cores 9, 13, and 14 are on socket 1.
Binding progresses correctly up to and including core 13 (see the
end of the file out.1.10.1rc2, before the error) - that is, 2 cores
on socket 0 and 2 cores on socket 1. The error is thrown when core
14 should be bound: an extra core on socket 1 with no
corresponding core on socket 0. At that point the returned
trg_obj points to the first core on the node (os_index 0,
socket 0).
I have submitted a few other jobs and I always get an error in
such a situation. Moreover, if I use --map-by core instead of
socket, the error is gone and I get my expected binding:
rank 0 @ compute-1-2.local 1, 17,
rank 1 @ compute-1-2.local 2, 18,
rank 2 @ compute-1-2.local 3, 19,
rank 3 @ compute-1-2.local 4, 20,
rank 4 @ compute-1-4.local 1, 17,
rank 5 @ compute-1-4.local 15, 31,
rank 6 @ compute-1-8.local 0, 16,
rank 7 @ compute-1-8.local 5, 21,
rank 8 @ compute-1-8.local 9, 25,
rank 9 @ compute-1-8.local 13, 29,
rank 10 @ compute-1-8.local 14, 30,
rank 11 @ compute-1-13.local 3, 19,
rank 12 @ compute-1-13.local 4, 20,
rank 13 @ compute-1-13.local 5, 21,
rank 14 @ compute-1-13.local 6, 22,
rank 15 @ compute-1-13.local 7, 23,
rank 16 @ compute-1-16.local 12, 28,
rank 17 @ compute-1-16.local 13, 29,
rank 18 @ compute-1-16.local 14, 30,
rank 19 @ compute-1-16.local 15, 31,
rank 20 @ compute-1-23.local 2, 18,
rank 29 @ compute-1-26.local 11, 27,
rank 21 @ compute-1-23.local 3, 19,
rank 30 @ compute-1-26.local 13, 29,
rank 22 @ compute-1-23.local 4, 20,
rank 31 @ compute-1-26.local 15, 31,
rank 23 @ compute-1-23.local 8, 24,
rank 27 @ compute-1-26.local 1, 17,
rank 24 @ compute-1-23.local 13, 29,
rank 28 @ compute-1-26.local 6, 22,
rank 25 @ compute-1-23.local 14, 30,
rank 26 @ compute-1-23.local 15, 31,
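(The ./affinity program itself is not included in this thread; a minimal sketch of a program that produces this kind of per-rank "rank N @ host cpu-list" output, using sched_getaffinity on Linux, could look like the following - the program name and exact output format are assumptions.)

/* sketch of an affinity-reporting MPI test program (not the actual ./affinity
 * used in this thread): each rank prints its hostname and the logical CPUs
 * present in its sched_getaffinity() mask */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, cpu;
    char host[256];
    cpu_set_t mask;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    gethostname(host, sizeof(host));

    /* query the affinity mask this process ended up with */
    if (0 == sched_getaffinity(0, sizeof(mask), &mask)) {
        printf("rank %d @ %s ", rank, host);
        for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
            if (CPU_ISSET(cpu, &mask)) {
                printf("%d, ", cpu);
            }
        }
        printf("\n");
    }

    MPI_Finalize();
    return 0;
}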
Using --map-by core seems to fix the issue on 1.8.8, 1.10.0, and
1.10.1rc1. However, there is still a difference in behavior
between 1.10.1rc1 and the earlier versions. In the SLURM job
described in the last post, 1.10.1rc1 fails to bind in only 1 case,
while the earlier versions fail in 21 out of 32 cases. You
mentioned there was a bug in hwloc; I am not sure whether it can
explain the difference in behavior.
Hope this helps to nail this down.
Marcin
On 10/04/2015 09:55 AM, Gilles Gouaillardet wrote:
Ralph,
I suspect ompi tries to bind to threads outside the cpuset.
This could be pretty similar to a previous issue where ompi
tried to bind to cores outside the cpuset.
(When a core has more than one thread, does ompi assume all
the threads are available if the core is available?)
I will investigate this starting tomorrow.
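(Outside of Open MPI, a quick way to probe this hypothesis is a small hwloc program that checks, for every core, whether the core's full cpuset is contained in the process's allowed cpuset. The following is only an illustrative sketch against the hwloc 1.x API, not part of any patch.)

/* sketch: does "core is available" really imply "all of its hwthreads are
 * available"?  Uses the hwloc 1.x API; WHOLE_SYSTEM keeps disallowed PUs in
 * the topology so the allowed cpuset can differ from each core's cpuset,
 * which mirrors how a cgroup-restricted job looks. */
#include <stdio.h>
#include <stdlib.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_const_cpuset_t allowed;
    int i, ncores;

    hwloc_topology_init(&topo);
    hwloc_topology_set_flags(topo, HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM);
    hwloc_topology_load(topo);

    allowed = hwloc_topology_get_allowed_cpuset(topo);
    ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);

    for (i = 0; i < ncores; i++) {
        hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
        char *s;
        hwloc_bitmap_asprintf(&s, core->cpuset);
        printf("core L#%u cpuset %s fully allowed: %s\n",
               core->logical_index, s,
               hwloc_bitmap_isincluded(core->cpuset, allowed) ? "yes" : "no");
        free(s);
    }

    hwloc_topology_destroy(topo);
    return 0;
}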
Cheers,
Gilles
On Sunday, October 4, 2015, Ralph Castain <r...@open-mpi.org>
wrote:
Thanks - please go ahead and release that allocation as
I’m not going to get to this immediately. I’ve got several
hot irons in the fire right now, and I’m not sure when
I’ll get a chance to track this down.
Gilles or anyone else who might have time - feel free to
take a gander and see if something pops out at you.
Ralph
On Oct 3, 2015, at 11:05 AM, marcin.krotkiewski
<marcin.krotkiew...@gmail.com> wrote:
Done. I have compiled 1.10.0 and 1.10.1rc1 with
--enable-debug and executed
mpirun --mca rmaps_base_verbose 10 --hetero-nodes
--report-bindings --bind-to core -np 32 ./affinity
In the case of 1.10.1rc1 I also added :overload-allowed; that
output is in a separate file. This option did not make much
difference for 1.10.0, so I did not attach it here.
The first thing I noticed for 1.10.0 are lines like
[login-0-1.local:03399] [[37945,0],0] GOT 1 CPUS
[login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] BITMAP
[login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] ON c1-26 IS NOT BOUND
with an empty BITMAP.
The SLURM environment is
set | grep SLURM
SLURM_JOBID=12714491
SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5'
SLURM_JOB_ID=12714491
SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]'
SLURM_JOB_NUM_NODES=7
SLURM_JOB_PARTITION=normal
SLURM_MEM_PER_CPU=2048
SLURM_NNODES=7
SLURM_NODELIST='c1-[2,4,8,13,16,23,26]'
SLURM_NODE_ALIASES='(null)'
SLURM_NPROCS=32
SLURM_NTASKS=32
SLURM_SUBMIT_DIR=/cluster/home/marcink
SLURM_SUBMIT_HOST=login-0-1.local
SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5'
I have now submitted an interactive job in screen for 120
hours so that I can work with one example and not change it for
every post :)
If you need anything else, let me know. I could add some
patches/printfs and recompile if you need it.
Marcin
On 10/03/2015 07:17 PM, Ralph Castain wrote:
Rats - I just realized I have no way to test this, as none
of the machines I can access are set up for cgroup-based
multi-tenancy. Is this a debug version of OMPI? If not,
can you rebuild OMPI with --enable-debug?
Then please run it with --mca rmaps_base_verbose 10 and
pass along the output.
Thanks
Ralph
On Oct 3, 2015, at 10:09 AM, Ralph Castain
<r...@open-mpi.org> wrote:
What version of slurm is this? I might try to debug it
here. I’m not sure where the problem lies just yet.
On Oct 3, 2015, at 8:59 AM, marcin.krotkiewski
<marcin.krotkiew...@gmail.com> wrote:
Here is the output of lstopo. In short, (0,16) are
core 0, (1,17) are core 1, etc.
Machine (64GB)
  NUMANode L#0 (P#0 32GB)
    Socket L#0 + L3 L#0 (20MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#16)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#17)
      L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#18)
      L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#19)
      L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#20)
      L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#21)
      L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#22)
      L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#23)
    HostBridge L#0
      PCIBridge
        PCI 8086:1521
          Net L#0 "eth0"
        PCI 8086:1521
          Net L#1 "eth1"
      PCIBridge
        PCI 15b3:1003
          Net L#2 "ib0"
          OpenFabrics L#3 "mlx4_0"
      PCIBridge
        PCI 102b:0532
      PCI 8086:1d02
        Block L#4 "sda"
  NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
    L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
      PU L#16 (P#8)
      PU L#17 (P#24)
    L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
      PU L#18 (P#9)
      PU L#19 (P#25)
    L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
      PU L#20 (P#10)
      PU L#21 (P#26)
    L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
      PU L#22 (P#11)
      PU L#23 (P#27)
    L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
      PU L#24 (P#12)
      PU L#25 (P#28)
    L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
      PU L#26 (P#13)
      PU L#27 (P#29)
    L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
      PU L#28 (P#14)
      PU L#29 (P#30)
    L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
      PU L#30 (P#15)
      PU L#31 (P#31)
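(The core-to-hwthread pairing described above - (0,16) is core 0, (1,17) is core 1, and so on - can also be printed directly with hwloc; the following small sketch, not from this thread, lists the physical PU indexes of each core.)

/* sketch: print the physical (P#) indexes of the PUs in each core, to
 * confirm the (0,16), (1,17), ... hyperthread pairing shown by lstopo */
#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    int c, ncores;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    for (c = 0; c < ncores; c++) {
        hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, c);
        int pu = hwloc_bitmap_first(core->cpuset);
        printf("core L#%u:", core->logical_index);
        /* walk the PUs (hwthreads) contained in this core's cpuset */
        while (pu != -1) {
            printf(" P#%d", pu);
            pu = hwloc_bitmap_next(core->cpuset, pu);
        }
        printf("\n");
    }

    hwloc_topology_destroy(topo);
    return 0;
}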
On 10/03/2015 05:46 PM, Ralph Castain wrote:
Maybe I’m just misreading your HT map - that slurm
nodelist syntax is a new one to me, but they tend to
change things around. Could you run lstopo on one of
those compute nodes and send the output?
I’m just suspicious because I’m not seeing a clear
pairing of HT numbers in your output, but HT
numbering is BIOS-specific and I may just not be
understanding your particular pattern. Our error
message is clearly indicating that we are seeing
individual HTs (and not complete cores) assigned, and
I don’t know the source of that confusion.
On Oct 3, 2015, at 8:28 AM, marcin.krotkiewski
<marcin.krotkiew...@gmail.com> wrote:
On 10/03/2015 04:38 PM, Ralph Castain wrote:
If mpirun isn’t trying to do any binding, then you
will of course get the right mapping as we’ll just
inherit whatever we received.
Yes. I meant that whatever you received (what SLURM
gives) is a correct cpu map that assigns _whole_
CPUs, not single HTs, to MPI processes. In the case
mentioned earlier openmpi should start 6 tasks on
c1-30. If HTs were treated as separate and
independent cores, sched_getaffinity of an MPI
process started on c1-30 would return a map with
only 6 entries. In my case it returns a map with 12
entries - 2 for each core. So one process is in fact
allocated both HTs, not only one. Is what I'm saying
correct?
Looking at your output, it’s pretty clear that you
are getting independent HTs assigned and not full
cores.
How do you mean? Is the above understanding wrong? I
would expect that on c1-30 with --bind-to core
openmpi should bind to logical cores 0 and 16 (rank
0), 1 and 17 (rank 1), and so on. All those logical
cores are available in the sched_getaffinity map, and
there are twice as many logical cores as there are
MPI processes started on the node.
My guess is that something in slurm has changed
such that it detects that HT has been enabled, and
then begins treating the HTs as completely
independent cpus.
Try changing “-bind-to core” to “-bind-to hwthread
-use-hwthread-cpus” and see if that works
I have, and the binding is wrong. For example, I got
this output:
rank 0 @ compute-1-30.local 0,
rank 1 @ compute-1-30.local 16,
This means that two ranks have been bound to the
same physical core (logical cores 0 and 16 are two
HTs of the same core). If I use --bind-to core, I
get the following correct binding:
rank 0 @ compute-1-30.local 0, 16,
The problem is that many other ranks get a bad
binding, with a 'rank XXX is not bound (or bound to
all available processors)' warning.
But I think I was not entirely correct in saying that
1.10.1rc1 did not fix things. It might still have
improved something, but not everything. Consider
this job:
SLURM_JOB_CPUS_PER_NODE='5,4,6,5(x2),7,5,9,5,7,6'
SLURM_JOB_NODELIST='c8-[31,34],c9-[30-32,35-36],c10-[31-34]'
If I run 32 tasks as follows (with 1.10.1rc1)
mpirun --hetero-nodes --report-bindings --bind-to
core -np 32 ./affinity
I get the following error:
--------------------------------------------------------------------------
A request was made to bind to that would result in
binding more
processes than cpus on a resource:
Bind to: CORE
Node: c9-31
#processes: 2
#cpus: 1
You can override this protection by adding the
"overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
If I now use --bind-to core:overload-allowed, then
openmpi starts and _most_ of the ranks are bound
correctly (i.e., the map contains two logical cores
in ALL cases), except on the node that required the
overload flag:
rank 15 @ compute-9-31.local 1, 17,
rank 16 @ compute-9-31.local 11, 27,
rank 17 @ compute-9-31.local 2, 18,
rank 18 @ compute-9-31.local 12, 28,
rank 19 @ compute-9-31.local 1, 17,
Note that the pair (1,17) is used twice. The original
SLURM-delivered map (no binding) on this node is:
rank 15 @ compute-9-31.local 1, 2, 11, 12, 13, 17,
18, 27, 28, 29,
rank 16 @ compute-9-31.local 1, 2, 11, 12, 13, 17,
18, 27, 28, 29,
rank 17 @ compute-9-31.local 1, 2, 11, 12, 13, 17,
18, 27, 28, 29,
rank 18 @ compute-9-31.local 1, 2, 11, 12, 13, 17,
18, 27, 28, 29,
rank 19 @ compute-9-31.local 1, 2, 11, 12, 13, 17,
18, 27, 28, 29,
Why does openmpi use core (1,17) twice instead of
using core (13,29)? Clearly, the original
SLURM-delivered map includes 5 physical cores,
enough for 5 MPI processes.
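(For reference, pairing each logical CPU n in that map with n+16 - the two hwthreads of one core on these nodes, as in the lstopo output above - gives the physical cores (1,17), (2,18), (11,27), (12,28), and (13,29): five full cores for the five ranks on this node, so a binding without overload clearly exists.)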
Cheers,
Marcin
On Oct 3, 2015, at 7:12 AM, marcin.krotkiewski
<marcin.krotkiew...@gmail.com> wrote:
On 10/03/2015 01:06 PM, Ralph Castain wrote:
Thanks Marcin. Looking at this, I’m guessing that
Slurm may be treating HTs as “cores” - i.e., as
independent cpus. Any chance that is true?
Not to the best of my knowledge, and at least not
intentionally. SLURM starts as many processes as
there are physical cores, not threads. To verify
this, consider this test case:
diff --git a/ompi/mpiext/affinity/c/mpiext_affinity_str.c b/ompi/mpiext/affinity/c/mpiext_affinity_str.c
index 62fa0cc..a4d98c4 100644
--- a/ompi/mpiext/affinity/c/mpiext_affinity_str.c
+++ b/ompi/mpiext/affinity/c/mpiext_affinity_str.c
@@ -108,7 +108,8 @@ static int get_rsrc_ompi_bound(char str[OMPI_AFFINITY_STRING_MAX])
     } else {
         ret = opal_hwloc_base_cset2str(str, OMPI_AFFINITY_STRING_MAX,
                                        opal_hwloc_topology,
-                                       orte_proc_applied_binding);
+                                       orte_proc_applied_binding,
+                                       OPAL_BIND_TO_NONE);
     }
     if (OPAL_ERR_NOT_BOUND == ret) {
         strncpy(str, not_bound_str, OMPI_AFFINITY_STRING_MAX - 1);
@@ -159,7 +160,8 @@ static int get_rsrc_current_binding(char str[OMPI_AFFINITY_STRING_MAX])
     else {
         ret = opal_hwloc_base_cset2str(str, OMPI_AFFINITY_STRING_MAX,
                                        opal_hwloc_topology,
-                                       boundset);
+                                       boundset,
+                                       OPAL_BIND_TO_NONE);
         if (OPAL_ERR_NOT_BOUND == ret) {
             strncpy(str, not_bound_str, OMPI_AFFINITY_STRING_MAX - 1);
             ret = OMPI_SUCCESS;
diff --git a/opal/mca/hwloc/base/base.h b/opal/mca/hwloc/base/base.h
index 826aeb8..5c83433 100644
--- a/opal/mca/hwloc/base/base.h
+++ b/opal/mca/hwloc/base/base.h
@@ -1,6 +1,8 @@
/*
* Copyright (c) 2011-2012 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2013-2015 Intel, Inc. All rights reserved.
+ * Copyright (c) 2015 Research Organization for Information Science
+ * and Technology (RIST). All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
@@ -246,13 +248,20 @@ OPAL_DECLSPEC int opal_hwloc_print(char **output, char *prefix,
hwloc_topology_t src,
opal_data_type_t type);
+/*
+ * convert an opal_binding_policy_t to an hwloc_obj_t
+ */
+OPAL_DECLSPEC unsigned int opal_hwloc_base_opal_binding_policy2hwloc_obj(
+ opal_binding_policy_t binding);
+
/**
* Make a prettyprint string for a hwloc_cpuset_t (e.g., "socket
* 2[core 3]").
*/
OPAL_DECLSPEC int opal_hwloc_base_cset2str(char *str, int len,
hwloc_topology_t topo,
- hwloc_cpuset_t cpuset);
+ hwloc_cpuset_t cpuset,
+ opal_binding_policy_t binding);
/**
* Make a prettyprint string for a cset in a map format.
diff --git a/opal/mca/hwloc/base/hwloc_base_dt.c b/opal/mca/hwloc/base/hwloc_base_dt.c
index 13763ea..1c061e0 100644
--- a/opal/mca/hwloc/base/hwloc_base_dt.c
+++ b/opal/mca/hwloc/base/hwloc_base_dt.c
@@ -105,7 +105,6 @@ int opal_hwloc_unpack(opal_buffer_t *buffer, void *dest,
      * explicitly set a flag so hwloc sets things up correctly
      */
     if (0 != hwloc_topology_set_flags(t, (HWLOC_TOPOLOGY_FLAG_IS_THISSYSTEM |
-                                          HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM |
                                           HWLOC_TOPOLOGY_FLAG_IO_DEVICES))) {
         rc = OPAL_ERROR;
         hwloc_topology_destroy(t);
diff --git a/opal/mca/hwloc/base/hwloc_base_util.c b/opal/mca/hwloc/base/hwloc_base_util.c
index cd429ee..b844e59 100644
--- a/opal/mca/hwloc/base/hwloc_base_util.c
+++ b/opal/mca/hwloc/base/hwloc_base_util.c
@@ -248,8 +248,7 @@ int opal_hwloc_base_get_topology(void)
if (NULL == opal_hwloc_base_topo_file) {
if (0 != hwloc_topology_init(&opal_hwloc_topology) ||
0 != hwloc_topology_set_flags(opal_hwloc_topology,
- (HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM |
- HWLOC_TOPOLOGY_FLAG_IO_DEVICES)) ||
+ HWLOC_TOPOLOGY_FLAG_IO_DEVICES) ||
0 != hwloc_topology_load(opal_hwloc_topology)) {
return OPAL_ERR_NOT_SUPPORTED;
}
@@ -294,7 +293,6 @@ int opal_hwloc_base_set_topology(char *topofile)
*/
if (0 != hwloc_topology_set_flags(opal_hwloc_topology,
(HWLOC_TOPOLOGY_FLAG_IS_THISSYSTEM |
- HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM |
HWLOC_TOPOLOGY_FLAG_IO_DEVICES))) {
hwloc_topology_destroy(opal_hwloc_topology);
return OPAL_ERR_NOT_SUPPORTED;
@@ -492,10 +490,6 @@ static void df_search_cores(hwloc_obj_t obj, unsigned int *cnt)
obj->userdata = (void*)data;
}
if (NULL == opal_hwloc_base_cpu_set) {
- if (!hwloc_bitmap_isincluded(obj->cpuset, obj->allowed_cpuset)) {
- /* do not count not allowed cores */
- return;
- }
data->npus = 1;
}
*cnt += data->npus;
@@ -1782,11 +1776,34 @@ static int build_map(int *num_sockets_arg, int *num_cores_arg,
}
/*
+ * convert an opal_binding_policy_t to an hwloc_obj_t
+ */
+unsigned int opal_hwloc_base_opal_binding_policy2hwloc_obj(
+ opal_binding_policy_t binding)
+{
+ switch (OPAL_GET_BINDING_POLICY(binding)) {
+ case OPAL_BIND_TO_BOARD:
+ return HWLOC_OBJ_MACHINE;
+ case OPAL_BIND_TO_NUMA:
+ return HWLOC_OBJ_NUMANODE;
+ case OPAL_BIND_TO_SOCKET:
+ return HWLOC_OBJ_PACKAGE;
+ case OPAL_BIND_TO_CORE:
+ return HWLOC_OBJ_CORE;
+ case OPAL_BIND_TO_HWTHREAD:
+ return HWLOC_OBJ_PU;
+ default:
+ return HWLOC_OBJ_TYPE_MAX;
+ }
+}
+
+/*
* Make a prettyprint string for a hwloc_cpuset_t
*/
int opal_hwloc_base_cset2str(char *str, int len,
hwloc_topology_t topo,
- hwloc_cpuset_t cpuset)
+ hwloc_cpuset_t cpuset,
+ opal_binding_policy_t binding)
{
bool first;
int num_sockets, num_cores;
@@ -1804,7 +1821,8 @@ int opal_hwloc_base_cset2str(char *str, int len,
return OPAL_ERR_NOT_BOUND;
}
- /* if the cpuset includes all available cpus, then we are unbound */
+ /* if the cpuset includes all available cpus and unless requested
+ * by the binding policy, then we are unbound, */
root = hwloc_get_root_obj(topo);
if (NULL == root->userdata) {
opal_hwloc_base_filter_cpus(topo);
@@ -1813,7 +1831,9 @@ int opal_hwloc_base_cset2str(char *str, int len,
if (NULL == sum->available) {
return OPAL_ERROR;
}
-    if (0 != hwloc_bitmap_isincluded(sum->available, cpuset)) {
+    if (0 != hwloc_bitmap_isincluded(sum->available, cpuset) &&
+        (!OPAL_BINDING_POLICY_IS_SET(binding) ||
+         1 != opal_hwloc_base_get_nbobjs_by_type(topo, opal_hwloc_base_opal_binding_policy2hwloc_obj(binding), 0, OPAL_HWLOC_LOGICAL))) {
         return OPAL_ERR_NOT_BOUND;
}
}
diff --git a/orte/mca/ess/base/ess_base_fns.c b/orte/mca/ess/base/ess_base_fns.c
index ab12172..43d6c63 100644
--- a/orte/mca/ess/base/ess_base_fns.c
+++ b/orte/mca/ess/base/ess_base_fns.c
@@ -282,7 +282,7 @@ int orte_ess_base_proc_binding(void)
             /* report the binding, if requested */
             if (opal_hwloc_report_bindings || 4 < opal_output_get_verbosity(orte_ess_base_framework.framework_output)) {
                 char tmp1[1024], tmp2[1024];
-                if (OPAL_ERR_NOT_BOUND == opal_hwloc_base_cset2str(tmp1, sizeof(tmp1), opal_hwloc_topology, mycpus)) {
+                if (OPAL_ERR_NOT_BOUND == opal_hwloc_base_cset2str(tmp1, sizeof(tmp1), opal_hwloc_topology, mycpus, OPAL_BIND_TO_NONE)) {
                     opal_output(0, "MCW rank %d is not bound (or bound to all available processors)", ORTE_PROC_MY_NAME->vpid);
                 } else {
                     opal_hwloc_base_cset2mapstr(tmp2, sizeof(tmp2), opal_hwloc_topology, mycpus);
diff --git a/orte/mca/rmaps/base/help-orte-rmaps-base.txt b/orte/mca/rmaps/base/help-orte-rmaps-base.txt
index ac89d7e..770a155 100644
--- a/orte/mca/rmaps/base/help-orte-rmaps-base.txt
+++ b/orte/mca/rmaps/base/help-orte-rmaps-base.txt
@@ -338,3 +338,11 @@ or provide more node locations in the file.
The request to map processes by distance could not be completed
because device to map near by was not specified. Please, use
rmaps_dist_device mca parameter to set it.
+
+[conflicting-policies]
+The requested mapping and binding policies make little sense:
+
+ Mapping policy: %s
+ Binding policy: %s
+
+If this is what you really want to do, you can ignore this message.
diff --git a/orte/mca/rmaps/base/rmaps_base_binding.c b/orte/mca/rmaps/base/rmaps_base_binding.c
index cf35a81..9e67ba6 100644
--- a/orte/mca/rmaps/base/rmaps_base_binding.c
+++ b/orte/mca/rmaps/base/rmaps_base_binding.c
@@ -320,28 +320,50 @@ static int bind_downwards(orte_job_t *jdata,
                 trg_obj->userdata = data;
             }
             data->num_bound++;
-            /* error out if adding a proc would cause overload and that wasn't allowed,
-             * and it wasn't a default binding policy (i.e., the user requested it)
+            /* before thinking of overloading a resource,
+             * try to find some not yet oversubscribed resource
              */
-            if (ncpus < data->num_bound &&
-                !OPAL_BIND_OVERLOAD_ALLOWED(jdata->map->binding)) {
-                if (OPAL_BINDING_POLICY_IS_SET(jdata->map->binding)) {
-                    /* if the user specified a binding policy, then we cannot meet
-                     * it since overload isn't allowed, so error out - have the
-                     * message indicate that setting overload allowed will remove
-                     * this restriction */
-                    orte_show_help("help-orte-rmaps-base.txt", "rmaps:binding-overload", true,
-                                   opal_hwloc_base_print_binding(map->binding), node->name,
-                                   data->num_bound, ncpus);
-                    hwloc_bitmap_free(totalcpuset);
-                    return ORTE_ERR_SILENT;
-                } else {
-                    /* if we have the default binding policy, then just don't bind */
-                    OPAL_SET_BINDING_POLICY(map->binding, OPAL_BIND_TO_NONE);
-                    unbind_procs(jdata);
-                    hwloc_bitmap_zero(totalcpuset);
-                    return ORTE_SUCCESS;
+            if (ncpus < data->num_bound) {
+                hwloc_obj_t alt_obj;
+                unsigned int alt_ncpus = 0;
+                opal_hwloc_obj_data_t *alt_data = NULL;
+                assert (1 == hwloc_get_nbobjs_by_depth(node->topology, 0));
+                alt_obj = opal_hwloc_base_find_min_bound_target_under_obj(node->topology,
+                                                                          hwloc_get_root_obj(node->topology),
+                                                                          target, cache_level);
+                assert (NULL != alt_obj);
+                alt_ncpus = opal_hwloc_base_get_npus(node->topology, alt_obj);
+                if (NULL == (alt_data = (opal_hwloc_obj_data_t*)alt_obj->userdata)) {
+                    alt_data = OBJ_NEW(opal_hwloc_obj_data_t);
+                    alt_obj->userdata = alt_data;
+                }
+                /* error out if adding a proc would cause overload and that wasn't allowed,
+                 * and it wasn't a default binding policy (i.e., the user requested it)
+                 */
+                if (!OPAL_BIND_OVERLOAD_ALLOWED(jdata->map->binding)) {
+                    if (alt_ncpus < alt_data->num_bound) {
+                        if (OPAL_BINDING_POLICY_IS_SET(jdata->map->binding)) {
+                            /* if the user specified a binding policy, then we cannot meet
+                             * it since overload isn't allowed, so error out - have the
+                             * message indicate that setting overload allowed will remove
+                             * this restriction */
+                            orte_show_help("help-orte-rmaps-base.txt", "rmaps:binding-overload", true,
+                                           opal_hwloc_base_print_binding(map->binding), node->name,
+                                           data->num_bound, ncpus);
+                            hwloc_bitmap_free(totalcpuset);
+                            return ORTE_ERR_SILENT;
+                        } else {
+                            /* if we have the default binding policy, then just don't bind */
+                            OPAL_SET_BINDING_POLICY(map->binding, OPAL_BIND_TO_NONE);
+                            unbind_procs(jdata);
+                            hwloc_bitmap_zero(totalcpuset);
+                            return ORTE_SUCCESS;
+                        }
+                    }
                 }
+                alt_data->num_bound++;
+                data->num_bound--;
+                trg_obj = alt_obj;
             }
             /* bind the proc here */
             cpus = opal_hwloc_base_get_available_cpus(node->topology, trg_obj);
@@ -363,7 +385,7 @@ static int bind_downwards(orte_job_t *jdata,
         if (4 < opal_output_get_verbosity(orte_rmaps_base_framework.framework_output)) {
             char tmp1[1024], tmp2[1024];
             if (OPAL_ERR_NOT_BOUND == opal_hwloc_base_cset2str(tmp1, sizeof(tmp1),
-                                                               node->topology, totalcpuset)) {
+                                                               node->topology, totalcpuset, OPAL_BIND_TO_NONE)) {
                 opal_output(orte_rmaps_base_framework.framework_output,
                             "%s PROC %s ON %s IS NOT BOUND",
                             ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
diff --git a/orte/mca/rmaps/base/rmaps_base_frame.c b/orte/mca/rmaps/base/rmaps_base_frame.c
index d11658c..105134c 100644
--- a/orte/mca/rmaps/base/rmaps_base_frame.c
+++ b/orte/mca/rmaps/base/rmaps_base_frame.c
@@ -504,6 +504,15 @@ static int orte_rmaps_base_open(mca_base_open_flag_t flags)
         opal_hwloc_binding_policy |= OPAL_BIND_ALLOW_OVERLOAD;
     }

+    if (ORTE_GET_MAPPING_DIRECTIVE(orte_rmaps_base.mapping) &&
+        OPAL_BINDING_POLICY_IS_SET(opal_hwloc_binding_policy) &&
+        (ORTE_GET_MAPPING_POLICY(orte_rmaps_base.mapping) >
+         OPAL_GET_BINDING_POLICY(opal_hwloc_binding_policy))) {
+        orte_show_help("help-orte-rmaps-base.txt", "conflicting-policies", true,
+                       orte_rmaps_base_print_mapping(orte_rmaps_base.mapping),
+                       opal_hwloc_base_print_binding(opal_hwloc_binding_policy));
+    }
+
     /* should we display a detailed (developer-quality) version of the map after determining it? */
     if (rmaps_base_display_devel_map) {
         orte_rmaps_base.display_map = true;
diff --git a/orte/mca/rtc/hwloc/rtc_hwloc.c b/orte/mca/rtc/hwloc/rtc_hwloc.c
index 91cb183..0d5254f 100644
--- a/orte/mca/rtc/hwloc/rtc_hwloc.c
+++ b/orte/mca/rtc/hwloc/rtc_hwloc.c
@@ -214,7 +214,7 @@ static void set(orte_job_t *jobdat,
opal_output(0, "MCW rank %d is not bound",
child->name.vpid);
} else {
- if (OPAL_ERR_NOT_BOUND == opal_hwloc_base_cset2str(tmp1,
sizeof(tmp1), opal_hwloc_topology, mycpus)) {
+ if (OPAL_ERR_NOT_BOUND == opal_hwloc_base_cset2str(tmp1,
sizeof(tmp1), opal_hwloc_topology, mycpus, jobdat->map->binding)) {
opal_output(0, "MCW rank %d is not bound (or bound to all
available processors)", child->name.vpid);
} else {
opal_hwloc_base_cset2mapstr(tmp2, sizeof(tmp2),
opal_hwloc_topology, mycpus);
diff --git a/orte/runtime/data_type_support/orte_dt_print_fns.c b/orte/runtime/data_type_support/orte_dt_print_fns.c
index 9bf84f4..c77baa7 100644
--- a/orte/runtime/data_type_support/orte_dt_print_fns.c
+++ b/orte/runtime/data_type_support/orte_dt_print_fns.c
@@ -477,7 +477,7 @@ int orte_dt_print_proc(char **output, char *prefix, orte_proc_t *src, opal_data_
         NULL != src->node->topology) {
         mycpus = hwloc_bitmap_alloc();
         hwloc_bitmap_list_sscanf(mycpus, cpu_bitmap);
-        if (OPAL_ERR_NOT_BOUND == opal_hwloc_base_cset2str(tmp1, sizeof(tmp1), src->node->topology, mycpus)) {
+        if (OPAL_ERR_NOT_BOUND == opal_hwloc_base_cset2str(tmp1, sizeof(tmp1), src->node->topology, mycpus, OPAL_BIND_TO_NONE)) {
             str = strdup("UNBOUND");
         } else {
             opal_hwloc_base_cset2mapstr(tmp2, sizeof(tmp2), src->node->topology, mycpus);