Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-05 Thread Gilles Gouaillardet

OK, i'll see what i can do :-)

On 10/5/2015 12:39 PM, Ralph Castain wrote:
I would consider that a bug, myself - if there is some resource 
available, we should use it



On Oct 4, 2015, at 5:42 PM, Gilles Gouaillardet wrote:


Marcin,

i ran a simple test with v1.10.1rc1 under a cpuset with
- one core (two threads 0,16) on socket 0
- two cores (two threads each 8,9,24,25) on socket 1

$ mpirun -np 3 -bind-to core ./hello_c
--
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to: CORE
   Node:rapid
   #processes:  2
   #cpus:   1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--

As you already pointed out, the default mapping is by socket.

So on one hand, we can consider this behavior a feature:
we try to bind two processes to socket 0, so the --oversubscribe
option is required

(and it does what it should:
$ mpirun -np 3 -bind-to core --oversubscribe -report-bindings ./hello_c
[rapid:16278] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: 
[BB/../../../../../../..][../../../../../../../..]
[rapid:16278] MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: 
[../../../../../../../..][BB/../../../../../../..]
[rapid:16278] MCW rank 2 bound to socket 0[core 0[hwt 0-1]]: 
[BB/../../../../../../..][../../../../../../../..]
Hello, world, I am 1 of 3, (Open MPI v1.10.1rc1, package: Open MPI 
gilles@rapid Distribution, ident: 1.10.1rc1, repo rev: 
v1.10.0-84-g15ae63f, Oct 03, 2015, 128)
Hello, world, I am 2 of 3, (Open MPI v1.10.1rc1, package: Open MPI 
gilles@rapid Distribution, ident: 1.10.1rc1, repo rev: 
v1.10.0-84-g15ae63f, Oct 03, 2015, 128)
Hello, world, I am 0 of 3, (Open MPI v1.10.1rc1, package: Open MPI 
gilles@rapid Distribution, ident: 1.10.1rc1, repo rev: 
v1.10.0-84-g15ae63f, Oct 03, 2015, 128)


On the other hand, we could consider that ompi should be a bit
smarter and use socket 1 for task 2, since socket 0 is fully
allocated and there is room on socket 1.


Ralph, any thoughts? Bug or feature?


Marcin,

You mentioned you had one failure with 1.10.1rc1 and -bind-to core.
Could you please send the full details (script, allocation and output)?
In your slurm script, you can do
srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none -l grep 
Cpus_allowed_list /proc/self/status

before invoking mpirun.

Cheers,

Gilles

On 10/4/2015 11:55 PM, marcin.krotkiewski wrote:

Hi, all,

I played a bit more and it seems that the problem results from

trg_obj = opal_hwloc_base_find_min_bound_target_under_obj()

called in rmaps_base_binding.c / bind_downwards returning the wrong 
object. I do not know the reason, but I think I know when the problem 
happens (at least on 1.10.1rc1). It seems that by default openmpi maps 
by socket. The error happens when, for a given compute node, a 
different number of cores is used on each socket. Consider the 
previously studied case (the debug outputs I sent in the last post). 
c1-8, which was the source of the error, has 5 mpi processes assigned, 
and the cpuset is the following:


0, 5, 9, 13, 14, 16, 21, 25, 29, 30

Cores 0, 5 are on socket 0; cores 9, 13, 14 are on socket 1. Binding 
progresses correctly up to and including core 13 (see the end of file 
out.1.10.1rc2, before the error) - that is, 2 cores on socket 0 and 2 
cores on socket 1. The error is thrown when core 14 should be bound: 
the extra core on socket 1 with no corresponding core on socket 0. At 
that point the returned trg_obj points to the first core on the node 
(os_index 0, socket 0).
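
To make the failure mode concrete, here is a tiny standalone sketch
(plain C, not Open MPI code) that mimics a round-robin map-by-socket
pass over this allocation, assuming 2 usable cores on socket 0 and 3 on
socket 1, as in the cpuset above:

/* Toy model of a round-robin "map by socket" pass on c1-8:
 * socket 0 has 2 allowed cores (0, 5), socket 1 has 3 (9, 13, 14).
 * The 5th process is assigned to socket 0, which has no unused core
 * left - the point where the overload error / bogus trg_obj shows up. */
#include <stdio.h>

int main(void)
{
    int free_cores[2] = { 2, 3 };   /* allowed cores per socket */
    int nprocs = 5;

    for (int p = 0, sock = 0; p < nprocs; p++, sock = (sock + 1) % 2) {
        if (free_cores[sock] > 0) {
            free_cores[sock]--;
            printf("proc %d -> socket %d (ok)\n", p, sock);
        } else {
            printf("proc %d -> socket %d, but no free core left there\n",
                   p, sock);
        }
    }
    return 0;
}

A mapper that fell back to the socket with a remaining free core would
place that 5th process on socket 1 and avoid the error.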


I have submitted a few other jobs and I always got an error in such a 
situation. Moreover, if I now use --map-by core instead of socket, 
the error is gone and I get my expected binding:


rank 0 @ compute-1-2.local  1, 17,
rank 1 @ compute-1-2.local  2, 18,
rank 2 @ compute-1-2.local  3, 19,
rank 3 @ compute-1-2.local  4, 20,
rank 4 @ compute-1-4.local  1, 17,
rank 5 @ compute-1-4.local  15, 31,
rank 6 @ compute-1-8.local  0, 16,
rank 7 @ compute-1-8.local  5, 21,
rank 8 @ compute-1-8.local  9, 25,
rank 9 @ compute-1-8.local  13, 29,
rank 10 @ compute-1-8.local  14, 30,
rank 11 @ compute-1-13.local  3, 19,
rank 12 @ compute-1-13.local  4, 20,
rank 13 @ compute-1-13.local  5, 21,
rank 14 @ compute-1-13.local  6, 22,
rank 15 @ compute-1-13.local  7, 23,
rank 16 @ compute-1-16.local  12, 28,
rank 17 @ compute-1-16.local  13, 29,
rank 18 @ compute-1-16.local  14, 30,
rank 19 @ compute-1-16.local  15, 31,
rank 20 @ compute-1-23.local  2, 18,
rank 29 @ compute-1-26.local  11, 27,
rank 21 @ compute-1-23.local  3, 19,
rank 30 @ compute-1-26.local  13, 29,
rank 22 @ compute-1-23.local  4, 20,
rank 31 @ compute-1-26.local  15, 31,
rank 23 @ compute-1-23.local  8, 24,
rank 27 @ compute-1-26.local  1, 17,
rank 24 @ compute-1-23.local  13, 29,
rank 28 @ compute-1-26.local  6, 22,

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-05 Thread Gilles Gouaillardet

Ralph and Marcin,

here is a proof of concept for a fix (the assert should be replaced 
with proper error handling) for the v1.10 branch.
If you have a chance to test it, please let me know the results.

Cheers,

Gilles

On 10/5/2015 1:08 PM, Gilles Gouaillardet wrote:

OK, i'll see what i can do :-)

On 10/5/2015 12:39 PM, Ralph Castain wrote:
I would consider that a bug, myself - if there is some resource 
available, we should use it



On Oct 4, 2015, at 5:42 PM, Gilles Gouaillardet > wrote:


Marcin,

i ran a simple test with v1.10.1rc1 under a cpuset with
- one core (two threads 0,16) on socket 0
- two cores (two threads each 8,9,24,25) on socket 1

$ mpirun -np 3 -bind-to core ./hello_c
--
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to: CORE
   Node:rapid
   #processes:  2
   #cpus:   1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--

as you already pointed, default mapping is by socket.

so on one hand, we can consider this behavior is a feature :
we try to bind two processes to socket 0, so the --oversubscribe 
option is required

(and it does what it should :
$ mpirun -np 3 -bind-to core --oversubscribe -report-bindings ./hello_c
[rapid:16278] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: 
[BB/../../../../../../..][../../../../../../../..]
[rapid:16278] MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: 
[../../../../../../../..][BB/../../../../../../..]
[rapid:16278] MCW rank 2 bound to socket 0[core 0[hwt 0-1]]: 
[BB/../../../../../../..][../../../../../../../..]
Hello, world, I am 1 of 3, (Open MPI v1.10.1rc1, package: Open MPI 
gilles@rapid Distribution, ident: 1.10.1rc1, repo rev: 
v1.10.0-84-g15ae63f, Oct 03, 2015, 128)
Hello, world, I am 2 of 3, (Open MPI v1.10.1rc1, package: Open MPI 
gilles@rapid Distribution, ident: 1.10.1rc1, repo rev: 
v1.10.0-84-g15ae63f, Oct 03, 2015, 128)
Hello, world, I am 0 of 3, (Open MPI v1.10.1rc1, package: Open MPI 
gilles@rapid Distribution, ident: 1.10.1rc1, repo rev: 
v1.10.0-84-g15ae63f, Oct 03, 2015, 128)


and on the other hand, we could consider ompi should be a bit 
smarter, and uses socket 1 for task 2 since socket 0 is fully 
allocated and there is room on socket 1.


Ralph, any thoughts ? bug or feature ?


Marcin,

you mentionned you had one failure with 1.10.1rc1 and -bind-to core
could you please send the full details (script, allocation and output)
in your slurm script, you can do
srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none -l grep 
Cpus_allowed_list /proc/self/status

before invoking mpirun

Cheers,

Gilles

On 10/4/2015 11:55 PM, marcin.krotkiewski wrote:

Hi, all,

I played a bit more and it seems that the problem results from

trg_obj = opal_hwloc_base_find_min_bound_target_under_obj()

called in rmaps_base_binding.c / bind_downwards being wrong. I do 
not know the reason, but I think I know when the problem happens 
(at least on 1.10.1rc1). It seems that by default openmpi maps by 
socket. The error happens when for a given compute node there is a 
different number of cores used on each socket. Consider previously 
studied case (the debug outputs I sent in last post). c1-8, which 
was source of error, has 5 mpi processes assigned, and the cpuset 
is the following:


0, 5, 9, 13, 14, 16, 21, 25, 29, 30

Cores 0,5 are on socket 0, cores 9, 13, 14 are on socket 1. Binding 
progresses correctly up to and including core 13 (see end of file 
out.1.10.1rc2, before the error). That is 2 cores on socket 0, and 
2 cores on socket 1. Error is thrown when core 14 should be bound - 
extra core on socket 1 with no corresponding core on socket 0. At 
that point the returned trg_obj points to the first core on the 
node (os_index 0, socket 0).


I have submitted a few other jobs and I always had an error in such 
situation. Moreover, if I now use --map-by core instead of socket, 
the error is gone, and I get my expected binding:


rank 0 @ compute-1-2.local  1, 17,
rank 1 @ compute-1-2.local  2, 18,
rank 2 @ compute-1-2.local  3, 19,
rank 3 @ compute-1-2.local  4, 20,
rank 4 @ compute-1-4.local  1, 17,
rank 5 @ compute-1-4.local  15, 31,
rank 6 @ compute-1-8.local  0, 16,
rank 7 @ compute-1-8.local  5, 21,
rank 8 @ compute-1-8.local  9, 25,
rank 9 @ compute-1-8.local  13, 29,
rank 10 @ compute-1-8.local  14, 30,
rank 11 @ compute-1-13.local  3, 19,
rank 12 @ compute-1-13.local  4, 20,
rank 13 @ compute-1-13.local  5, 21,
rank 14 @ compute-1-13.local  6, 22,
rank 15 @ compute-1-13.local  7, 23,
rank 16 @ compute-1-16.local  12, 28,
rank 17 @ compute-1-16.local  13, 29,
rank 18 @ compute-1-16.local  14, 30,
rank 19 @ compute-1-16.local  15, 31,
rank 20 @ compute-1-23.local  2, 18,
rank 29 @ compute-1-26.local  11, 27,
rank 21 @ compute-1-23.local  3, 19,

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-05 Thread Ralph Castain
I think this is okay, in general. I would only make one change: I would only 
search for an alternative site if the binding policy wasn’t set by the user. If 
the user specifies a mapping/binding pattern, then we should error out as we 
cannot meet it.

I did think of one alternative that might be worth considering. If we have a 
hetero topology, then we know that things are going to be a little unusual. In 
that case, we could just default to map-by core (or hwthread if 
--use-hwthread-cpus was given) and then things would be fine even in 
non-symmetric topologies. Likewise, if we have a homogeneous topology, we could 
just quickly check for symmetry on our base topology (the one we will use for 
mapping) and default to map-by core if non-symmetric.
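
For what it is worth, the symmetry check could be fairly cheap. Below is
a rough sketch using plain hwloc calls (hwloc 1.x object names; compile
with -lhwloc). This is not the actual Open MPI mapper code, just an
illustration of the proposed test; it relies on the fact that a default
hwloc topology load does not report cores outside the calling process'
Linux cpuset:

#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    int nsock = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_SOCKET);
    int symmetric = 1, first = -1;

    for (int i = 0; i < nsock; i++) {
        hwloc_obj_t sock = hwloc_get_obj_by_type(topo, HWLOC_OBJ_SOCKET, i);
        /* count the cores of this socket that are inside our cpuset */
        int ncores = hwloc_get_nbobjs_inside_cpuset_by_type(topo, sock->cpuset,
                                                            HWLOC_OBJ_CORE);
        printf("socket %d: %d usable cores\n", i, ncores);
        if (first < 0)
            first = ncores;
        else if (ncores != first)
            symmetric = 0;
    }

    printf("topology is %s -> default to map-by %s\n",
           symmetric ? "symmetric" : "asymmetric",
           symmetric ? "socket" : "core");

    hwloc_topology_destroy(topo);
    return 0;
}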

I suggest it only because we otherwise wind up with some oddball hybrid mapping 
scheme. In the case we have here, procs would be mapped by socket except where 
we have an extra core, where they would look like they were mapped by core. 
Impossible to predict how the app will react to it.

The alternative would be a more predictable pattern - would it make more sense?

Ralph


> On Oct 5, 2015, at 1:13 AM, Gilles Gouaillardet  wrote:
> 
> Ralph and Marcin,
> 
> here is a proof of concept for a fix (assert should be replaced with proper 
> error handling)
> for v1.10 branch.
> if you have any chance to test it, please let me know the results
> 
> Cheers,
> 
> Gilles
> 
> On 10/5/2015 1:08 PM, Gilles Gouaillardet wrote:
>> OK, i'll see what i can do :-)
>> 
>> On 10/5/2015 12:39 PM, Ralph Castain wrote:
>>> I would consider that a bug, myself - if there is some resource available, 
>>> we should use it
>>> 
>>> 
 On Oct 4, 2015, at 5:42 PM, Gilles Gouaillardet >>> > wrote:
 
 Marcin,
 
 i ran a simple test with v1.10.1rc1 under a cpuset with
 - one core (two threads 0,16) on socket 0
 - two cores (two threads each 8,9,24,25) on socket 1
 
 $ mpirun -np 3 -bind-to core ./hello_c 
 --
 A request was made to bind to that would result in binding more
 processes than cpus on a resource:
 
Bind to: CORE
Node:rapid
#processes:  2
#cpus:   1
 
 You can override this protection by adding the "overload-allowed"
 option to your binding directive.
 --
 
 as you already pointed, default mapping is by socket.
 
 so on one hand, we can consider this behavior is a feature :
 we try to bind two processes to socket 0, so the --oversubscribe option is 
 required
 (and it does what it should :
 $ mpirun -np 3 -bind-to core --oversubscribe -report-bindings ./hello_c 
 [rapid:16278] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: 
 [BB/../../../../../../..][../../../../../../../..]
 [rapid:16278] MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: 
 [../../../../../../../..][BB/../../../../../../..]
 [rapid:16278] MCW rank 2 bound to socket 0[core 0[hwt 0-1]]: 
 [BB/../../../../../../..][../../../../../../../..]
 Hello, world, I am 1 of 3, (Open MPI v1.10.1rc1, package: Open MPI 
 gilles@rapid Distribution, ident: 1.10.1rc1, repo rev: 
 v1.10.0-84-g15ae63f, Oct 03, 2015, 128)
 Hello, world, I am 2 of 3, (Open MPI v1.10.1rc1, package: Open MPI 
 gilles@rapid Distribution, ident: 1.10.1rc1, repo rev: 
 v1.10.0-84-g15ae63f, Oct 03, 2015, 128)
 Hello, world, I am 0 of 3, (Open MPI v1.10.1rc1, package: Open MPI 
 gilles@rapid Distribution, ident: 1.10.1rc1, repo rev: 
 v1.10.0-84-g15ae63f, Oct 03, 2015, 128)
 
 and on the other hand, we could consider ompi should be a bit smarter, and 
 uses socket 1 for task 2 since socket 0 is fully allocated and there is 
 room on socket 1.
 
 Ralph, any thoughts ? bug or feature ? 
 
 
 Marcin,
 
 you mentionned you had one failure with 1.10.1rc1 and -bind-to core
 could you please send the full details (script, allocation and output)
 in your slurm script, you can do
 srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none -l grep 
 Cpus_allowed_list /proc/self/status
 before invoking mpirun
 
 Cheers,
 
 Gilles
 
 On 10/4/2015 11:55 PM, marcin.krotkiewski wrote:
> Hi, all,
> 
> I played a bit more and it seems that the problem results from
> 
> trg_obj = opal_hwloc_base_find_min_bound_target_under_obj()
> 
> called in rmaps_base_binding.c / bind_downwards being wrong. I do not 
> know the reason, but I think I know when the problem happens (at least on 
> 1.10.1rc1). It seems that by default openmpi maps by socket. The error 
> happens when for a given compute node there is a different number of 
> cores used on each socket. Consider previously studied case (the

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-05 Thread marcin.krotkiewski


I have applied the patch to both 1.10.0 and 1.10.1rc1. For 1.10.0 it did 
not help - I am not sure how much (if at all) you want to pursue this.


For 1.10.1rc1 I was so far unable to reproduce any binding problems with 
jobs of up to 128 tasks. Some cosmetic suggestions. The warning it all 
started with says


MCW rank X is not bound (or bound to all available processors)

1. One thing I already mentioned is that this warning is only displayed 
when using --report-bindings, and it is printed instead of the actual 
binding. I would suggest moving the warning somewhere else (maybe the 
bind_downwards/upwards functions?) and showing the binding in question 
instead. It might be trivial for homogeneous allocations, but it is 
non-obvious with the type of SLURM jobs discussed here. Also, seeing the 
warning only on the condition that --report-bindings was used, 
especially if the user specified the binding policy manually, is IMHO 
wrong - OpenMPI should notify about the failure somehow instead of 
quietly binding to all cores.


2. Another question altogether is whether the warning should exist at 
all (instead of an error, as proposed by Ralph). I still get that 
warning, even with 1.10.1rc1, in situations in which I think it should 
not be displayed. In the simplest case the warning is printed when only 
1 MPI task is running on a node. Obviously the statement is correct, 
since the task is using all allocated CPUs, but it is not useful. A more 
nontrivial case is using '--bind-to socket' when all MPI ranks are 
allocated on only one socket. Again, effectively all MPI ranks use all 
assigned cores and the warning is technically speaking correct, but 
misleading. Instead, as discussed in 1., it would be good to print the 
actual binding instead of the warning.


3. When I specify '--map-by hwthread --bind-to core', I get multiple 
MPI processes bound to the same physical core without actually 
specifying --oversubscribe. Just a question whether it should be like 
this - maybe yes.



On 10/05/2015 11:00 AM, Ralph Castain wrote:
I think this is okay, in general. I would only make one change: I 
would only search for an alternative site if the binding policy wasn’t 
set by the user. If the user specifies a mapping/binding pattern, then 
we should error out as we cannot meet it.




I think that would result in non-transparent behavior in certain 
cases. By default, mapping is done by socket, and OpenMPI could then 
behave differently if '--map-by socket' is explicitly supplied on the 
command line - i.e., error out in jobs like the ones discussed. Is this 
a good idea?


Introducing an error here is also a bit tricky. Consider allocating 5 
mpi processes to 2 sockets. You would get an error in this type of 
distribution:


socket 0: 2 jobs
socket 1: 3 jobs

but not in this one

socket 0: 3 jobs
socket 1: 2 jobs

just because you start counting from socket 0.
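
A toy illustration of that point (plain C, nothing to do with the real
Open MPI code; I am reading 'N jobs' above as the number of cores SLURM
granted on each socket):

/* Round-robin map-by-socket of 5 processes over 2 sockets, always
 * starting at socket 0: a (2,3) core split errors out, a (3,2) split
 * does not, even though both allocations provide 5 cores in total. */
#include <stdio.h>

static int try_place(int s0_cores, int s1_cores, int nprocs)
{
    int free_cores[2] = { s0_cores, s1_cores };

    for (int p = 0, sock = 0; p < nprocs; p++, sock = (sock + 1) % 2) {
        if (free_cores[sock] == 0)
            return 0;               /* overload: would need --oversubscribe */
        free_cores[sock]--;
    }
    return 1;
}

int main(void)
{
    printf("socket 0: 2, socket 1: 3 -> %s\n", try_place(2, 3, 5) ? "ok" : "error");
    printf("socket 0: 3, socket 1: 2 -> %s\n", try_place(3, 2, 5) ? "ok" : "error");
    return 0;
}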

I did think of one alternative that might be worth considering. If we 
have a hetero topology, then we know that things are going to be a 
little unusual. In that case, we could just default to map-by core (or 
hwthread if —use-hwthread-cpus was given) and then things would be 
fine even in non-symmetric topologies. Likewise, if we have a 
homogeneous topology, we could just quickly check for symmetry on our 
base topology (the one we will use for mapping) and default to map-by 
core if non-symmetric.


Having different default options for different cases becomes difficult 
to manage and understand. If I could vote, I would rather go for an 
informative error. Or switch to '--map-by core' as the default for all 
cases ;) (probably not gonna happen..)


Removing support for '--map-by socket' altogether for this type of job 
is likely OK - I don't know. I personally like the new way it works: if 
there are resources, use them. But if you end up removing this 
possibility, it would probably be good to put it into the SLURM-related 
docs and produce a meaningful error.



Marcin



I suggest it only because we otherwise wind up with some oddball 
hybrid mapping scheme. In the case we have here, procs would be mapped 
by socket except where we have an extra core, where they would look 
like they were mapped by core. Impossible to predict how the app will 
react to it.







The alternative be a more predictable pattern - would it make more sense?

Ralph


On Oct 5, 2015, at 1:13 AM, Gilles Gouaillardet > wrote:


Ralph and Marcin,

here is a proof of concept for a fix (assert should be replaced with 
proper error handling)

for v1.10 branch.
if you have any chance to test it, please let me know the results

Cheers,

Gilles

On 10/5/2015 1:08 PM, Gilles Gouaillardet wrote:

OK, i'll see what i can do :-)

On 10/5/2015 12:39 PM, Ralph Castain wrote:
I would consider that a bug, myself - if there is some resource 
available, we should use it



On Oct 4, 2015, at 5:42 PM, Gilles Gouaillardet > wrote:


Marcin,

i ran a simple test with v1.10

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-05 Thread Gilles Gouaillardet
Marcin,

there is no need to pursue 1.10.0 since it is known to be broken for some
scenarios.

it would really help me if you could provide the logs I requested, so I can
reproduce the issue and make sure we both talk about the same scenario.

imho, there is no legitimate reason to use -map-by hwthread -bind-to core.
We might even want to issue a warning to tell the end user he might not be
doing what he expects.

I will double-check the warning about one task using all the cores when
there is only one core available; there should be no warning at all in
this case.

I will give more thought to the alternative suggested by Ralph.
imho, bad things will happen whatever policy we choose if slurm assigns
more than one job per socket: most real-world applications are memory
bound, and sharing a socket makes performance very unpredictable anyway,
regardless of the ompi binding policy.

Cheers,

Gilles

On Monday, October 5, 2015, marcin.krotkiewski 
wrote:

>
> I have applied the patch to both 1.10.0 and 1.10.1rc1. For 1.10.0 it did
> not help - I am not sure how much (if) you want pursue this.
>
> For 1.10.1rc1 I was so far unable to reproduce any binding problems with
> jobs of up to 128 tasks. Some cosmetic suggestions. The warning it all
> started with says
>
> MCW rank X is not bound (or bound to all available processors)
>
> 1. One thing I already mentioned is that this warning is only displayed
> when using --report-bindings, and instead of printing the actual binding. I
> would suggest moving the warning somewhere else (maybe the
> bind_downwards/upwards functions?), and instead just show the binding in
> question. It might be trivial for homogeneous allocations, but is
> non-obvious with the type of SLURM jobs discussed. Also, seeing the warning
> only on the condition that --report-bindings was used, especially if the
> user specified binding policy manually, is IMHO wrong - OpenMPI should
> notify about failure somehow instead of quietly binding to all cores.
>
> 2. Another question altogether is if the warning should exist at all
> (instead of error, as proposed by Ralph). I still get that warning, even
> with 1.10.1rc1, in situations, in which I think it should not be displayed.
> In the simplest case the warning is printed when only 1 MPI task is running
> on a node. Obviously the statement is correct since the task is using all
> allocated CPUs, but its not useful. A more nontrivial case is when using
> '--bind-to socket', and when all MPI ranks are allocated on only one
> socket. Again, effectively all MPI ranks use all assigned cores and the
> warning is technically speaking correct, but misleading. Instead, as
> discussed in 1., it would be good to see the actual binding printed instead
> of the warning.
>
> 3. When I specify '--map-by hwthread --bind-to core', then I get multiple
> MPI processes bound to the same physical core without actually specifying
> --oversubscribe. Just a question whether it should be like this, but maybe
> yes.
>
>
> On 10/05/2015 11:00 AM, Ralph Castain wrote:
>
> I think this is okay, in general. I would only make one change: I would
> only search for an alternative site if the binding policy wasn’t set by the
> user. If the user specifies a mapping/binding pattern, then we should error
> out as we cannot meet it.
>
>
> I think that would result in a non-transparent behavior in certain cases.
> By default mapping is done by socket, and OpenMPI could behave differently
> if '--map-by socket' is explicitly supplied on the command line - i.e.,
> error out in jobs like discussed. Is this a good idea?
>
> Introducing an error here is also a bit tricky. Consider allocating 5 mpi
> processes to 2 sockets. You would get an error in this type of distribution:
>
> socket 0: 2 jobs
> socket 1: 3 jobs
>
> but not in this one
>
> socket 0: 3 jobs
> socket 1: 2 jobs
>
> just because you start counting from socket 0.
>
> I did think of one alternative that might be worth considering. If we have
> a hetero topology, then we know that things are going to be a little
> unusual. In that case, we could just default to map-by core (or hwthread if
> —use-hwthread-cpus was given) and then things would be fine even in
> non-symmetric topologies. Likewise, if we have a homogeneous topology, we
> could just quickly check for symmetry on our base topology (the one we will
> use for mapping) and default to map-by core if non-symmetric.
>
>
> Having different default options for different cases becomes difficult to
> manage and understand. If I could vote, I would rather go for an
> informative error. Or to switch to '--map-by core' as default for all cases
> ;) (probably not gonna happen..)
>
> Removing support of '--map-by socket' altogether for this type of jobs is
> likely OK - don't know. I personally like the new way it works - if there
> are resources, use them. But if you end up removing this possibility it
> would probably be good to put it into SLURM related doc and produce some
> mean

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-05 Thread marcin.krotkiewski

Hi, Gilles

you mentionned you had one failure with 1.10.1rc1 and -bind-to core
could you please send the full details (script, allocation and output)
in your slurm script, you can do
srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none -l grep 
Cpus_allowed_list /proc/self/status

before invoking mpirun


It was an interactive job allocated with

salloc --account=staff --ntasks=32 --mem-per-cpu=2G --time=120:0:0

The slurm environment is the following

SLURM_JOBID=12714491
SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5'
SLURM_JOB_ID=12714491
SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]'
SLURM_JOB_NUM_NODES=7
SLURM_JOB_PARTITION=normal
SLURM_MEM_PER_CPU=2048
SLURM_NNODES=7
SLURM_NODELIST='c1-[2,4,8,13,16,23,26]'
SLURM_NODE_ALIASES='(null)'
SLURM_NPROCS=32
SLURM_NTASKS=32
SLURM_SUBMIT_DIR=/cluster/home/marcink
SLURM_SUBMIT_HOST=login-0-1.local
SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5'

The output of the command you asked for is

0: c1-2.local  Cpus_allowed_list:1-4,17-20
1: c1-4.local  Cpus_allowed_list:1,15,17,31
2: c1-8.local  Cpus_allowed_list:0,5,9,13-14,16,21,25,29-30
3: c1-13.local  Cpus_allowed_list:   3-7,19-23
4: c1-16.local  Cpus_allowed_list:   12-15,28-31
5: c1-23.local  Cpus_allowed_list:   2-4,8,13-15,18-20,24,29-31
6: c1-26.local  Cpus_allowed_list:   1,6,11,13,15,17,22,27,29,31

Running with command

mpirun --mca rmaps_base_verbose 10 --hetero-nodes --bind-to core 
--report-bindings --map-by socket -np 32 ./affinity


I have attached two output files: one for the original 1.10.1rc1, one 
for the patched version.


When I said 'failed in one case' I was not precise. I got an error on 
node c1-8, which was the first one to have a different number of MPI 
processes on the two sockets. It would also have failed on some later 
nodes; we just never got there because of the error.


Let me know if you need more.

Marcin








Cheers,

Gilles

On 10/4/2015 11:55 PM, marcin.krotkiewski wrote:

Hi, all,

I played a bit more and it seems that the problem results from

trg_obj = opal_hwloc_base_find_min_bound_target_under_obj()

called in rmaps_base_binding.c / bind_downwards being wrong. I do not 
know the reason, but I think I know when the problem happens (at 
least on 1.10.1rc1). It seems that by default openmpi maps by socket. 
The error happens when for a given compute node there is a different 
number of cores used on each socket. Consider previously studied case 
(the debug outputs I sent in last post). c1-8, which was source of 
error, has 5 mpi processes assigned, and the cpuset is the following:


0, 5, 9, 13, 14, 16, 21, 25, 29, 30

Cores 0,5 are on socket 0, cores 9, 13, 14 are on socket 1. Binding 
progresses correctly up to and including core 13 (see end of file 
out.1.10.1rc2, before the error). That is 2 cores on socket 0, and 2 
cores on socket 1. Error is thrown when core 14 should be bound - 
extra core on socket 1 with no corresponding core on socket 0. At 
that point the returned trg_obj points to the first core on the node 
(os_index 0, socket 0).


I have submitted a few other jobs and I always had an error in such 
situation. Moreover, if I now use --map-by core instead of socket, 
the error is gone, and I get my expected binding:


rank 0 @ compute-1-2.local  1, 17,
rank 1 @ compute-1-2.local  2, 18,
rank 2 @ compute-1-2.local  3, 19,
rank 3 @ compute-1-2.local  4, 20,
rank 4 @ compute-1-4.local  1, 17,
rank 5 @ compute-1-4.local  15, 31,
rank 6 @ compute-1-8.local  0, 16,
rank 7 @ compute-1-8.local  5, 21,
rank 8 @ compute-1-8.local  9, 25,
rank 9 @ compute-1-8.local  13, 29,
rank 10 @ compute-1-8.local  14, 30,
rank 11 @ compute-1-13.local  3, 19,
rank 12 @ compute-1-13.local  4, 20,
rank 13 @ compute-1-13.local  5, 21,
rank 14 @ compute-1-13.local  6, 22,
rank 15 @ compute-1-13.local  7, 23,
rank 16 @ compute-1-16.local  12, 28,
rank 17 @ compute-1-16.local  13, 29,
rank 18 @ compute-1-16.local  14, 30,
rank 19 @ compute-1-16.local  15, 31,
rank 20 @ compute-1-23.local  2, 18,
rank 29 @ compute-1-26.local  11, 27,
rank 21 @ compute-1-23.local  3, 19,
rank 30 @ compute-1-26.local  13, 29,
rank 22 @ compute-1-23.local  4, 20,
rank 31 @ compute-1-26.local  15, 31,
rank 23 @ compute-1-23.local  8, 24,
rank 27 @ compute-1-26.local  1, 17,
rank 24 @ compute-1-23.local  13, 29,
rank 28 @ compute-1-26.local  6, 22,
rank 25 @ compute-1-23.local  14, 30,
rank 26 @ compute-1-23.local  15, 31,

Using --map-by core seems to fix the issue on 1.8.8, 1.10.0 and 
1.10.1rc1. However, there is still a difference in behavior between 
1.10.1rc1 and the earlier versions. In the SLURM job described in the 
last post, 1.10.1rc1 fails to bind in only 1 case, while the earlier 
versions fail in 21 out of 32 cases. You mentioned there was a bug in 
hwloc; not sure if it can explain the difference in behavior.


Hope this helps to nail this down.

Marcin




On 10/04/2015 09:55 AM, Gilles Gouaillardet wrote:

Ralph,

I suspect ompi tries to bind to threads outside the cpuset.
this

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-05 Thread Ralph Castain
Thanks Marcin. I think we have three things we need to address:

1. the warning needs to be emitted regardless of whether or not 
--report-bindings was given. Not sure how that warning got “covered” by the 
option, but it is clearly a bug

2. improve the warning to include binding info - relatively easy to do

3. fix the mapping/binding under asymmetric topologies. Given further info and 
consideration, I’m increasingly pushed towards the “fallback to the map-by core 
default” solution. It provides a predictable and consistent pattern. The other 
solution is technically viable, but leads to an unpredictable “opportunistic” 
result that might cause odd application behavior. If the user specifies a 
mapping option and we can’t do it because of asymmetry, then error out.

HTH
Ralph


> On Oct 5, 2015, at 9:36 AM, marcin.krotkiewski  
> wrote:
> 
> Hi, Gilles
>> 
>> you mentionned you had one failure with 1.10.1rc1 and -bind-to core
>> could you please send the full details (script, allocation and output)
>> in your slurm script, you can do
>> srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none -l grep 
>> Cpus_allowed_list /proc/self/status
>> before invoking mpirun
>> 
> It was an interactive job allocated with
> 
> salloc --account=staff --ntasks=32 --mem-per-cpu=2G --time=120:0:0
> 
> The slurm environment is the following
> 
> SLURM_JOBID=12714491
> SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5'
> SLURM_JOB_ID=12714491
> SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]'
> SLURM_JOB_NUM_NODES=7
> SLURM_JOB_PARTITION=normal
> SLURM_MEM_PER_CPU=2048
> SLURM_NNODES=7
> SLURM_NODELIST='c1-[2,4,8,13,16,23,26]'
> SLURM_NODE_ALIASES='(null)'
> SLURM_NPROCS=32
> SLURM_NTASKS=32
> SLURM_SUBMIT_DIR=/cluster/home/marcink
> SLURM_SUBMIT_HOST=login-0-1.local
> SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5'
> 
> The output of the command you asked for is
> 
> 0: c1-2.local  Cpus_allowed_list:1-4,17-20
> 1: c1-4.local  Cpus_allowed_list:1,15,17,31
> 2: c1-8.local  Cpus_allowed_list:0,5,9,13-14,16,21,25,29-30
> 3: c1-13.local  Cpus_allowed_list:   3-7,19-23
> 4: c1-16.local  Cpus_allowed_list:   12-15,28-31
> 5: c1-23.local  Cpus_allowed_list:   2-4,8,13-15,18-20,24,29-31
> 6: c1-26.local  Cpus_allowed_list:   1,6,11,13,15,17,22,27,29,31
> 
> Running with command
> 
> mpirun --mca rmaps_base_verbose 10 --hetero-nodes --bind-to core 
> --report-bindings --map-by socket -np 32 ./affinity
> 
> I have attached two output files: one for the original 1.10.1rc1, one for the 
> patched version.
> 
> When I said 'failed in one case' I was not precise. I got an error on node 
> c1-8, which was the first one to have different number of MPI processes on 
> the two sockets. It would also fail on some later nodes, just that because of 
> the error we never got there.
> 
> Let me know if you need more.
> 
> Marcin
> 
> 
> 
> 
> 
> 
> 
>> Cheers,
>> 
>> Gilles
>> 
>> On 10/4/2015 11:55 PM, marcin.krotkiewski wrote:
>>> Hi, all,
>>> 
>>> I played a bit more and it seems that the problem results from
>>> 
>>> trg_obj = opal_hwloc_base_find_min_bound_target_under_obj()
>>> 
>>> called in rmaps_base_binding.c / bind_downwards being wrong. I do not know 
>>> the reason, but I think I know when the problem happens (at least on 
>>> 1.10.1rc1). It seems that by default openmpi maps by socket. The error 
>>> happens when for a given compute node there is a different number of cores 
>>> used on each socket. Consider previously studied case (the debug outputs I 
>>> sent in last post). c1-8, which was source of error, has 5 mpi processes 
>>> assigned, and the cpuset is the following:
>>> 
>>> 0, 5, 9, 13, 14, 16, 21, 25, 29, 30
>>> 
>>> Cores 0,5 are on socket 0, cores 9, 13, 14 are on socket 1. Binding 
>>> progresses correctly up to and including core 13 (see end of file 
>>> out.1.10.1rc2, before the error). That is 2 cores on socket 0, and 2 cores 
>>> on socket 1. Error is thrown when core 14 should be bound - extra core on 
>>> socket 1 with no corresponding core on socket 0. At that point the returned 
>>> trg_obj points to the first core on the node (os_index 0, socket 0).
>>> 
>>> I have submitted a few other jobs and I always had an error in such 
>>> situation. Moreover, if I now use --map-by core instead of socket, the 
>>> error is gone, and I get my expected binding:
>>> 
>>> rank 0 @ compute-1-2.local  1, 17,
>>> rank 1 @ compute-1-2.local  2, 18,
>>> rank 2 @ compute-1-2.local  3, 19,
>>> rank 3 @ compute-1-2.local  4, 20,
>>> rank 4 @ compute-1-4.local  1, 17,
>>> rank 5 @ compute-1-4.local  15, 31,
>>> rank 6 @ compute-1-8.local  0, 16,
>>> rank 7 @ compute-1-8.local  5, 21,
>>> rank 8 @ compute-1-8.local  9, 25,
>>> rank 9 @ compute-1-8.local  13, 29,
>>> rank 10 @ compute-1-8.local  14, 30,
>>> rank 11 @ compute-1-13.local  3, 19,
>>> rank 12 @ compute-1-13.local  4, 20,
>>> rank 13 @ compute-1-13.local  5, 21,
>>> rank 14 @ compute-1-13.local  6, 22,
>>> rank 15 @ compute-1-13.local  7, 23,
>>> rank 16

Re: [OMPI users] worse latency in 1.8 c.f. 1.6

2015-10-05 Thread Dave Love
Mike Dubman  writes:

> what is your command line and setup? (ofed version, distro)
>
> This is what was just measured w/ fdr on haswell with v1.8.8 and mxm and UD
>
> + mpirun -np 2 -bind-to core -display-map -mca rmaps_base_mapping_policy
> dist:span -x MXM_RDMA_PORTS=mlx5_3:1 -mca rmaps_dist_device mlx5_3:1  -x
> MXM_TLS=self,shm,ud osu_latency

Revisiting this, I'm confused, because rmaps_dist_device isn't in my
build and I don't know what it is.  (I tried the binary hpcx stuff, but
it failed to run -- I've forgotten how -- and the build instructions for
ompi under it correspond to what I've used anyway.)  The obvious
difference between the above and what I have is mlx5 v. mlx4; is that
likely to account for it?



Re: [OMPI users] Using OpenMPI (1.8, 1.10) with Mellanox MXM, ulimits ?

2015-10-05 Thread Dave Love
Mike Dubman  writes:

> right, it is not attribute of mxm, but general effect.

Thanks.  That's the sort of thing we can investigate, but then the
messages from MXM are very misleading.

> and you are right again - performance engineering will always be needed for
> best performance in some cases.
>
> OMPI, mxm trying to address out of the box performance for any workload,
> but OS tuning, hw tuning, OMPI or mxm tuning may be needed as well. (there
> is a reason that any MPI have hundreds of knobs)

Sure, but I don't expect to see things like significant increases in p2p
latency from something meant to improve p2p, and I've no obvious way of
debugging with the proprietary library, especially without knowing what
the knobs do.

Are any other users prepared to share experience with MXM on similar
systems?



[OMPI users] Hybrid OpenMPI+OpenMP tasks using SLURM

2015-10-05 Thread marcin.krotkiewski

Yet another question about cpu binding under SLURM environment..

Short version: will OpenMPI support SLURM_CPUS_PER_TASK for the purpose 
of cpu binding?



Full version: When you allocate a job like, e.g., this

salloc --ntasks=2 --cpus-per-task=4

SLURM will allocate 8 cores in total, 4 for each 'assumed' MPI task. 
This is useful for hybrid jobs, where each MPI process spawns some 
internal worker threads (e.g., OpenMP). The intention is that there are 
2 MPI procs started, each of them 'bound' to 4 cores. SLURM will also 
set an environment variable


SLURM_CPUS_PER_TASK=4

which should (probably?) be taken into account by the method that 
launches the MPI processes to figure out the cpuset. In case of OpenMPI 
+ mpirun I think something should happen in 
orte/mca/ras/slurm/ras_slurm_module.c, where the variable _is_ actually 
parsed. Unfortunately, it is never really used...


As a result, the cpuset of each task started on a given compute node 
includes all CPU cores of all MPI tasks on that node, just as provided 
by SLURM (in the above example - 8). In general, there is no simple way 
for the user code in the MPI procs to 'split' the cores between 
themselves. I imagine the original intention to support this in OpenMPI 
was something like


mpirun --bind-to subtask_cpuset

with an artificial bind target that would cause OpenMPI to divide the 
allocated cores between the MPI tasks. Is this right? If so, it seems 
that at this point this is not implemented. Are there plans to do this? 
If not, does anyone know another way to achieve that?
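
As a possible user-level workaround (just a sketch, assuming Linux
sched_getaffinity/sched_setaffinity and an MPI-3 library for
MPI_Comm_split_type; it splits logical CPUs naively and ignores hwthread
pairing and NUMA), each rank could carve its own share out of the
node-wide cpuset right after MPI_Init:

/* build: mpicc -o split_affinity split_affinity.c */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* the ranks that share this node */
    MPI_Comm node;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node);
    int lrank, lsize;
    MPI_Comm_rank(node, &lrank);
    MPI_Comm_size(node, &lsize);

    /* the cpuset SLURM granted to the tasks on this node */
    cpu_set_t allowed, mine;
    CPU_ZERO(&allowed);
    sched_getaffinity(0, sizeof(allowed), &allowed);

    /* give every lsize-th allowed logical CPU to this local rank */
    CPU_ZERO(&mine);
    int idx = 0;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (!CPU_ISSET(cpu, &allowed))
            continue;
        if (idx++ % lsize == lrank)
            CPU_SET(cpu, &mine);
    }
    sched_setaffinity(0, sizeof(mine), &mine);

    printf("local rank %d/%d restricted to %d logical CPUs\n",
           lrank, lsize, CPU_COUNT(&mine));

    MPI_Comm_free(&node);
    MPI_Finalize();
    return 0;
}

Threads created later by each rank (e.g. OpenMP workers) inherit this
mask unless the OpenMP runtime re-binds them, so the ranks end up with
disjoint CPU shares even when mpirun itself did not bind them.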


Thanks a lot!

Marcin





Re: [OMPI users] Hybrid OpenMPI+OpenMP tasks using SLURM

2015-10-05 Thread Ralph Castain
You would presently do:

mpirun --map-by core:pe=4

to get what you are seeking. If we don’t already set that qualifier when we see 
“cpus_per_task”, then we probably should do so as there isn’t any reason to 
make you set it twice (well, other than trying to track which envar slurm is 
using now).


> On Oct 5, 2015, at 12:38 PM, marcin.krotkiewski 
>  wrote:
> 
> Yet another question about cpu binding under SLURM environment..
> 
> Short version: will OpenMPI support SLURM_CPUS_PER_TASK for the purpose of 
> cpu binding?
> 
> 
> Full version: When you allocate a job like, e.g., this
> 
> salloc --ntasks=2 --cpus-per-task=4
> 
> SLURM will allocate 8 cores in total, 4 for each 'assumed' MPI tasks. This is 
> useful for hybrid jobs, where each MPI process spawns some internal worker 
> threads (e.g., OpenMP). The intention is that there are 2 MPI procs started, 
> each of them 'bound' to 4 cores. SLURM will also set an environment variable
> 
> SLURM_CPUS_PER_TASK=4
> 
> which should (probably?) be taken into account by the method that launches 
> the MPI processes to figure out the cpuset. In case of OpenMPI + mpirun I 
> think something should happen in orte/mca/ras/slurm/ras_slurm_module.c, where 
> the variable _is_ actually parsed. Unfortunately, it is never really used...
> 
> As a result, cpuset of all tasks started on a given compute node includes all 
> CPU cores of all MPI tasks on that node, just as provided by SLURM (in the 
> above example - 8). In general, there is no simple way for the user code in 
> the MPI procs to 'split' the cores between themselves. I imagine the original 
> intention to support this in OpenMPI was something like
> 
> mpirun --bind-to subtask_cpuset
> 
> with an artificial bind target that would cause OpenMPI to divide the 
> allocated cores between the mpi tasks. Is this right? If so, it seems that at 
> this point this is not implemented. Is there plans to do this? If no, does 
> anyone know another way to achieve that?
> 
> Thanks a lot!
> 
> Marcin
> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/10/27803.php



Re: [OMPI users] Hybrid OpenMPI+OpenMP tasks using SLURM

2015-10-05 Thread marcin.krotkiewski

Ralph,

Thank you for a fast response! Sounds very good, unfortunately I get an 
error:


$ mpirun --map-by core:pe=4 ./affinity
--
A request for multiple cpus-per-proc was given, but a directive
was also give to map to an object level that cannot support that
directive.

Please specify a mapping level that has more than one cpu, or
else let us define a default mapping that will allow multiple
cpus-per-proc.
--

I have allocated my slurm job as

salloc --ntasks=2 --cpus-per-task=4

I have checked in 1.10.0 and 1.10.1rc1.




On 10/05/2015 09:58 PM, Ralph Castain wrote:

You would presently do:

mpirun —map-by core:pe=4

to get what you are seeking. If we don’t already set that qualifier when we see 
“cpus_per_task”, then we probably should do so as there isn’t any reason to 
make you set it twice (well, other than trying to track which envar slurm is 
using now).



On Oct 5, 2015, at 12:38 PM, marcin.krotkiewski  
wrote:

Yet another question about cpu binding under SLURM environment..

Short version: will OpenMPI support SLURM_CPUS_PER_TASK for the purpose of cpu 
binding?


Full version: When you allocate a job like, e.g., this

salloc --ntasks=2 --cpus-per-task=4

SLURM will allocate 8 cores in total, 4 for each 'assumed' MPI tasks. This is 
useful for hybrid jobs, where each MPI process spawns some internal worker 
threads (e.g., OpenMP). The intention is that there are 2 MPI procs started, 
each of them 'bound' to 4 cores. SLURM will also set an environment variable

SLURM_CPUS_PER_TASK=4

which should (probably?) be taken into account by the method that launches the 
MPI processes to figure out the cpuset. In case of OpenMPI + mpirun I think 
something should happen in orte/mca/ras/slurm/ras_slurm_module.c, where the 
variable _is_ actually parsed. Unfortunately, it is never really used...

As a result, cpuset of all tasks started on a given compute node includes all 
CPU cores of all MPI tasks on that node, just as provided by SLURM (in the 
above example - 8). In general, there is no simple way for the user code in the 
MPI procs to 'split' the cores between themselves. I imagine the original 
intention to support this in OpenMPI was something like

mpirun --bind-to subtask_cpuset

with an artificial bind target that would cause OpenMPI to divide the allocated 
cores between the mpi tasks. Is this right? If so, it seems that at this point 
this is not implemented. Is there plans to do this? If no, does anyone know 
another way to achieve that?

Thanks a lot!

Marcin



___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/10/27803.php

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/10/27804.php




Re: [OMPI users] Hybrid OpenMPI+OpenMP tasks using SLURM

2015-10-05 Thread Ralph Castain
Hmmm…okay, try -map-by socket:pe=4

We’ll still hit the asymmetric topology issue, but otherwise this should work


> On Oct 5, 2015, at 1:25 PM, marcin.krotkiewski  
> wrote:
> 
> Ralph,
> 
> Thank you for a fast response! Sounds very good, unfortunately I get an error:
> 
> $ mpirun --map-by core:pe=4 ./affinity
> --
> A request for multiple cpus-per-proc was given, but a directive
> was also give to map to an object level that cannot support that
> directive.
> 
> Please specify a mapping level that has more than one cpu, or
> else let us define a default mapping that will allow multiple
> cpus-per-proc.
> --
> 
> I have allocated my slurm job as
> 
> salloc --ntasks=2 --cpus-per-task=4
> 
> I have checked in 1.10.0 and 1.10.1rc1.
> 
> 
> 
> 
> On 10/05/2015 09:58 PM, Ralph Castain wrote:
>> You would presently do:
>> 
>> mpirun —map-by core:pe=4
>> 
>> to get what you are seeking. If we don’t already set that qualifier when we 
>> see “cpus_per_task”, then we probably should do so as there isn’t any reason 
>> to make you set it twice (well, other than trying to track which envar slurm 
>> is using now).
>> 
>> 
>>> On Oct 5, 2015, at 12:38 PM, marcin.krotkiewski 
>>>  wrote:
>>> 
>>> Yet another question about cpu binding under SLURM environment..
>>> 
>>> Short version: will OpenMPI support SLURM_CPUS_PER_TASK for the purpose of 
>>> cpu binding?
>>> 
>>> 
>>> Full version: When you allocate a job like, e.g., this
>>> 
>>> salloc --ntasks=2 --cpus-per-task=4
>>> 
>>> SLURM will allocate 8 cores in total, 4 for each 'assumed' MPI tasks. This 
>>> is useful for hybrid jobs, where each MPI process spawns some internal 
>>> worker threads (e.g., OpenMP). The intention is that there are 2 MPI procs 
>>> started, each of them 'bound' to 4 cores. SLURM will also set an 
>>> environment variable
>>> 
>>> SLURM_CPUS_PER_TASK=4
>>> 
>>> which should (probably?) be taken into account by the method that launches 
>>> the MPI processes to figure out the cpuset. In case of OpenMPI + mpirun I 
>>> think something should happen in orte/mca/ras/slurm/ras_slurm_module.c, 
>>> where the variable _is_ actually parsed. Unfortunately, it is never really 
>>> used...
>>> 
>>> As a result, cpuset of all tasks started on a given compute node includes 
>>> all CPU cores of all MPI tasks on that node, just as provided by SLURM (in 
>>> the above example - 8). In general, there is no simple way for the user 
>>> code in the MPI procs to 'split' the cores between themselves. I imagine 
>>> the original intention to support this in OpenMPI was something like
>>> 
>>> mpirun --bind-to subtask_cpuset
>>> 
>>> with an artificial bind target that would cause OpenMPI to divide the 
>>> allocated cores between the mpi tasks. Is this right? If so, it seems that 
>>> at this point this is not implemented. Is there plans to do this? If no, 
>>> does anyone know another way to achieve that?
>>> 
>>> Thanks a lot!
>>> 
>>> Marcin
>>> 
>>> 
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/users/2015/10/27803.php
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2015/10/27804.php
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/10/27805.php



Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-05 Thread Jeff Squyres (jsquyres)
I filed an issue to track this problem here:

https://github.com/open-mpi/ompi/issues/978


> On Oct 5, 2015, at 1:01 PM, Ralph Castain  wrote:
> 
> Thanks Marcin. I think we have three things we need to address:
> 
> 1. the warning needs to be emitted regardless of whether or not 
> —report-bindings was given. Not sure how that warning got “covered” by the 
> option, but it is clearly a bug
> 
> 2. improve the warning to include binding info - relatively easy to do
> 
> 3. fix the mapping/binding under asymmetric topologies. Given further info 
> and consideration, I’m increasingly pushed towards the “fallback to the 
> map-by core default” solution. It provides a predictable and consistent 
> pattern. The other solution is technically viable, but leads to an 
> unpredictable “opportunistic” result that might cause odd application 
> behavior. If the user specifies a mapping option and we can’t do it because 
> of asymmetry, then error out.
> 
> HTH
> Ralph
> 
> 
>> On Oct 5, 2015, at 9:36 AM, marcin.krotkiewski 
>>  wrote:
>> 
>> Hi, Gilles
>>> you mentionned you had one failure with 1.10.1rc1 and -bind-to core
>>> could you please send the full details (script, allocation and output)
>>> in your slurm script, you can do
>>> srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none -l grep 
>>> Cpus_allowed_list /proc/self/status
>>> before invoking mpirun
>>> 
>> It was an interactive job allocated with
>> 
>> salloc --account=staff --ntasks=32 --mem-per-cpu=2G --time=120:0:0
>> 
>> The slurm environment is the following
>> 
>> SLURM_JOBID=12714491
>> SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5'
>> SLURM_JOB_ID=12714491
>> SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]'
>> SLURM_JOB_NUM_NODES=7
>> SLURM_JOB_PARTITION=normal
>> SLURM_MEM_PER_CPU=2048
>> SLURM_NNODES=7
>> SLURM_NODELIST='c1-[2,4,8,13,16,23,26]'
>> SLURM_NODE_ALIASES='(null)'
>> SLURM_NPROCS=32
>> SLURM_NTASKS=32
>> SLURM_SUBMIT_DIR=/cluster/home/marcink
>> SLURM_SUBMIT_HOST=login-0-1.local
>> SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5'
>> 
>> The output of the command you asked for is
>> 
>> 0: c1-2.local  Cpus_allowed_list:1-4,17-20
>> 1: c1-4.local  Cpus_allowed_list:1,15,17,31
>> 2: c1-8.local  Cpus_allowed_list:0,5,9,13-14,16,21,25,29-30
>> 3: c1-13.local  Cpus_allowed_list:   3-7,19-23
>> 4: c1-16.local  Cpus_allowed_list:   12-15,28-31
>> 5: c1-23.local  Cpus_allowed_list:   2-4,8,13-15,18-20,24,29-31
>> 6: c1-26.local  Cpus_allowed_list:   1,6,11,13,15,17,22,27,29,31
>> 
>> Running with command
>> 
>> mpirun --mca rmaps_base_verbose 10 --hetero-nodes --bind-to core 
>> --report-bindings --map-by socket -np 32 ./affinity
>> 
>> I have attached two output files: one for the original 1.10.1rc1, one for 
>> the patched version.
>> 
>> When I said 'failed in one case' I was not precise. I got an error on node 
>> c1-8, which was the first one to have different number of MPI processes on 
>> the two sockets. It would also fail on some later nodes, just that because 
>> of the error we never got there.
>> 
>> Let me know if you need more.
>> 
>> Marcin
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> On 10/4/2015 11:55 PM, marcin.krotkiewski wrote:
 Hi, all,
 
 I played a bit more and it seems that the problem results from
 
 trg_obj = opal_hwloc_base_find_min_bound_target_under_obj()
 
 called in rmaps_base_binding.c / bind_downwards being wrong. I do not know 
 the reason, but I think I know when the problem happens (at least on 
 1.10.1rc1). It seems that by default openmpi maps by socket. The error 
 happens when for a given compute node there is a different number of cores 
 used on each socket. Consider previously studied case (the debug outputs I 
 sent in last post). c1-8, which was source of error, has 5 mpi processes 
 assigned, and the cpuset is the following:
 
 0, 5, 9, 13, 14, 16, 21, 25, 29, 30
 
 Cores 0,5 are on socket 0, cores 9, 13, 14 are on socket 1. Binding 
 progresses correctly up to and including core 13 (see end of file 
 out.1.10.1rc2, before the error). That is 2 cores on socket 0, and 2 cores 
 on socket 1. Error is thrown when core 14 should be bound - extra core on 
 socket 1 with no corresponding core on socket 0. At that point the 
 returned trg_obj points to the first core on the node (os_index 0, socket 
 0).
 
 I have submitted a few other jobs and I always had an error in such 
 situation. Moreover, if I now use --map-by core instead of socket, the 
 error is gone, and I get my expected binding:
 
 rank 0 @ compute-1-2.local  1, 17,
 rank 1 @ compute-1-2.local  2, 18,
 rank 2 @ compute-1-2.local  3, 19,
 rank 3 @ compute-1-2.local  4, 20,
 rank 4 @ compute-1-4.local  1, 17,
 rank 5 @ compute-1-4.local  15, 31,
 rank 6 @ compute-1-8.local  0, 16,
 rank 7 @ compute-1-8.local  5, 21,
 rank 8 @ compute-1-8.local  9, 

Re: [OMPI users] [Open MPI Announce] Open MPI v1.10.1rc1 release

2015-10-05 Thread Jeff Squyres (jsquyres)
On Oct 3, 2015, at 9:14 AM, Dimitar Pashov  wrote:
> 
> Hi, I have a pet bug causing silent data corruption here:
>https://github.com/open-mpi/ompi/issues/965 
> which seems to have a fix committed some time later. I've tested v1.10.1rc1 
> now and it still has the issue. I hope the fix makes it in the release.

Marked as a blocker for the 1.10.1 release; thanks for the heads-up.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] Hybrid OpenMPI+OpenMP tasks using SLURM

2015-10-05 Thread tmishima
Hi Ralph, it's been a long time.

The option "map-by core" does not work when pe=N > 1 is specified.
So, you should use "map-by slot:pe=N" as far as I remember.

Regards,
Tetsuya Mishima

On 2015/10/06 5:40:33, "users" wrote in "Re: [OMPI users] Hybrid OpenMPI+OpenMP
tasks using SLURM":
> Hmmm…okay, try -map-by socket:pe=4
>
> We’ll still hit the asymmetric topology issue, but otherwise this should
work
>
>
> > On Oct 5, 2015, at 1:25 PM, marcin.krotkiewski
 wrote:
> >
> > Ralph,
> >
> > Thank you for a fast response! Sounds very good, unfortunately I get an
error:
> >
> > $ mpirun --map-by core:pe=4 ./affinity
> >
--
> > A request for multiple cpus-per-proc was given, but a directive
> > was also give to map to an object level that cannot support that
> > directive.
> >
> > Please specify a mapping level that has more than one cpu, or
> > else let us define a default mapping that will allow multiple
> > cpus-per-proc.
> >
--
> >
> > I have allocated my slurm job as
> >
> > salloc --ntasks=2 --cpus-per-task=4
> >
> > I have checked in 1.10.0 and 1.10.1rc1.
> >
> >
> >
> >
> > On 10/05/2015 09:58 PM, Ralph Castain wrote:
> >> You would presently do:
> >>
> >> mpirun —map-by core:pe=4
> >>
> >> to get what you are seeking. If we don’t already set that qualifier
when we see “cpus_per_task”, then we probably should do so as there isn’t
any reason to make you set it twice (well, other than
> trying to track which envar slurm is using now).
> >>
> >>
> >>> On Oct 5, 2015, at 12:38 PM, marcin.krotkiewski
 wrote:
> >>>
> >>> Yet another question about cpu binding under SLURM environment..
> >>>
> >>> Short version: will OpenMPI support SLURM_CPUS_PER_TASK for the
purpose of cpu binding?
> >>>
> >>>
> >>> Full version: When you allocate a job like, e.g., this
> >>>
> >>> salloc --ntasks=2 --cpus-per-task=4
> >>>
> >>> SLURM will allocate 8 cores in total, 4 for each 'assumed' MPI tasks.
This is useful for hybrid jobs, where each MPI process spawns some internal
worker threads (e.g., OpenMP). The intention is
> that there are 2 MPI procs started, each of them 'bound' to 4 cores.
SLURM will also set an environment variable
> >>>
> >>> SLURM_CPUS_PER_TASK=4
> >>>
> >>> which should (probably?) be taken into account by the method that
launches the MPI processes to figure out the cpuset. In case of OpenMPI +
mpirun I think something should happen in
> orte/mca/ras/slurm/ras_slurm_module.c, where the variable _is_ actually
parsed. Unfortunately, it is never really used...
> >>>
> >>> As a result, cpuset of all tasks started on a given compute node
includes all CPU cores of all MPI tasks on that node, just as provided by
SLURM (in the above example - 8). In general, there is
> no simple way for the user code in the MPI procs to 'split' the cores
between themselves. I imagine the original intention to support this in
OpenMPI was something like
> >>>
> >>> mpirun --bind-to subtask_cpuset
> >>>
> >>> with an artificial bind target that would cause OpenMPI to divide the
allocated cores between the mpi tasks. Is this right? If so, it seems that
at this point this is not implemented. Is there
> plans to do this? If no, does anyone know another way to achieve that?
> >>>
> >>> Thanks a lot!
> >>>
> >>> Marcin
> >>>
> >>>
> >>>
> >>> ___
> >>> users mailing list
> >>> us...@open-mpi.org
> >>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>> Link to this post:
http://www.open-mpi.org/community/lists/users/2015/10/27803.php
> >> ___
> >> users mailing list
> >> us...@open-mpi.org
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> >> Link to this post:
http://www.open-mpi.org/community/lists/users/2015/10/27804.php
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post:
http://www.open-mpi.org/community/lists/users/2015/10/27805.php
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/usersLink to
this post: http://www.open-mpi.org/community/lists/users/2015/10/27806.php

Re: [OMPI users] Hybrid OpenMPI+OpenMP tasks using SLURM

2015-10-05 Thread Ralph Castain
Ah, yes - thanks! It’s been so long since I played with that option I honestly 
forgot :-)
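
In Marcin's allocation (salloc --ntasks=2 --cpus-per-task=4) that would be something
along the lines of

$ mpirun -np 2 --map-by slot:pe=4 -report-bindings ./affinity

which should hand each of the two ranks its own block of 4 cores. Treat it as a
sketch from memory rather than a verified recipe; the socket:pe=4 form quoted below
is the same idea, just mapped one level up.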

Hope you are doing well !
Ralph

> On Oct 5, 2015, at 4:04 PM, tmish...@jcity.maeda.co.jp wrote:
> 
> Hi Ralph, it's been a long time.
> 
> The option "map-by core" does not work when pe=N > 1 is specified.
> So, you should use "map-by slot:pe=N" as far as I remember.
> 
> Regards,
> Tetsuya Mishima
> 
> On 2015/10/06 5:40:33, "users" wrote in "Re: [OMPI users] Hybrid OpenMPI+OpenMP tasks using SLURM":
>> Hmmm…okay, try -map-by socket:pe=4
>> 
>> We’ll still hit the asymmetric topology issue, but otherwise this should work
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/10/27809.php
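
Until mpirun consumes SLURM_CPUS_PER_TASK on its own, the gap can be bridged by hand
in the job script. A sketch, assuming the slot:pe mapping discussed above behaves as
described and that SLURM exports SLURM_NTASKS and SLURM_CPUS_PER_TASK in the batch
environment:

#!/bin/bash
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=4

# give each rank's OpenMP runtime as many threads as SLURM reserved per task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# one rank per SLURM task, each one pinned to cpus-per-task cores
mpirun -np $SLURM_NTASKS --map-by slot:pe=$SLURM_CPUS_PER_TASK ./affinity

Each rank should then be confined to its own block of 4 cores, and the worker threads
it spawns inherit that mask from their parent process.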



Re: [OMPI users] Hybrid OpenMPI+OpenMP tasks using SLURM

2015-10-05 Thread tmishima
I'm doing quite well, thank you. I'm involved in a big project and so very
busy now.

But I still try to keep watching these mailing lists.

Regards,
Tetsuya Mishima

On 2015/10/06 8:17:33, "users" wrote in "Re: [OMPI users] Hybrid OpenMPI+OpenMP tasks using SLURM":
> Ah, yes - thanks! It’s been so long since I played with that option I honestly forgot :-)
>
> Hope you are doing well !
> Ralph
>
> >>
> >> _