[OMPI users] open shmem optimization

2014-08-29 Thread Timur Ismagilov

Hello!
What parameters can I tune to increase performance (scalability) for my app (all-to-all 
pattern with message size = constant/nnodes)?
I can read this FAQ for MPI, but does it also apply to SHMEM?
I have two programs doing the same thing (with the same input): each node sends 
messages (message size = constant/nnodes) to a random set of nodes (but the same 
set in prg1 and prg2):
*  with mpi_isend, mpi_irecv and mpi_waitall
*  with shmem_put and shmem_barrier_all
On 1, 2, 4, 8, 16 and 32 nodes they have the same performance (scalability).
On 64, 128 and 256 nodes the SHMEM program stops scaling, but at 512 nodes the 
SHMEM program gets much better performance than MPI:
nodes     prg1 (perf unit)   prg2 (perf unit)
1         30                 30
2         50                 53
4         75                 85
8         110                130
16        180                200
32        310                350
64        500                400 (strange)
128       830                400 (strange)
256       1350               600 (strange)
512       1770               2350 (wow!)

With scalable SHMEM (OMPI 1.6.5?) I get the same scalability from both programs.
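
For reference, a minimal sketch of the two exchange patterns described above; this 
is illustrative only, not the poster's code (npeers, peers[], the buffers and the 
message size are placeholder names, and in the poster's setup the two variants are 
separate programs, shown together here only for comparison):

#include <mpi.h>
#include <shmem.h>
#include <stdlib.h>

/* Variant 1: non-blocking two-sided MPI, completed locally with MPI_Waitall.
   Assumes each node also expects npeers incoming messages. */
void exchange_mpi(int npeers, const int *peers, char *sendbuf, char *recvbuf,
                  int msg)
{
    MPI_Request *reqs = malloc(2 * (size_t)npeers * sizeof(MPI_Request));
    for (int i = 0; i < npeers; i++) {
        MPI_Irecv(recvbuf + (size_t)i * msg, msg, MPI_CHAR, MPI_ANY_SOURCE, 0,
                  MPI_COMM_WORLD, &reqs[i]);
        MPI_Isend(sendbuf + (size_t)i * msg, msg, MPI_CHAR, peers[i], 0,
                  MPI_COMM_WORLD, &reqs[npeers + i]);
    }
    /* Completes the requests locally; says nothing about remote delivery. */
    MPI_Waitall(2 * npeers, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}

/* Variant 2: one-sided OpenSHMEM puts, completed with a global barrier. */
void exchange_shmem(int npeers, const int *peers, const char *sendbuf,
                    char *recvbuf_sym /* symmetric heap */, int msg, int me)
{
    for (int i = 0; i < npeers; i++)
        shmem_putmem(recvbuf_sym + (size_t)me * msg, sendbuf + (size_t)i * msg,
                     (size_t)msg, peers[i]);
    /* Implies quiet (remote delivery) plus full synchronization of all PEs. */
    shmem_barrier_all();
}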


Re: [OMPI users] How does binding option affect network traffic?

2014-08-29 Thread Reuti
Hi,

On 28.08.2014 at 20:50, McGrattan, Kevin B. Dr. wrote:

> My institute recently purchased a linux cluster with 20 nodes; 2 sockets per 
> node; 6 cores per socket. OpenMPI v 1.8.1 is installed. I want to run 15 
> jobs. Each job requires 16 MPI processes.  For each job, I want to use two 
> cores on each node, mapping by socket. If I use these options:
>  
> #PBS -l nodes=8:ppn=2
> mpirun --report-bindings --bind-to core --map-by socket:PE=1 -np 16 
> 
>  
> The reported bindings are:
>  
> [burn001:09186] MCW rank 0 bound to socket 0[core 0[hwt 0]]: 
> [B/././././.][./././././.]
> [burn001:09186] MCW rank 1 bound to socket 1[core 6[hwt 0]]: 
> [./././././.][B/././././.]
> [burn004:07113] MCW rank 6 bound to socket 0[core 0[hwt 0]]: 
> [B/././././.][./././././.]
> [burn004:07113] MCW rank 7 bound to socket 1[core 6[hwt 0]]: 
> [./././././.][B/././././.]
> and so on…
>  
> These bindings appear to be OK, but when I do a "top -H" on each node, I see 
> that all 15 jobs use core 0 and core 6 on each node. This means, I believe, 
> that I am only using 1/6 of my resources. I want to use 100%. So I try this:
>  
> #PBS -l nodes=8:ppn=2
> mpirun --report-bindings --bind-to socket --map-by socket:PE=1 -np 16 
> 
>  
> Now it appears that I am getting 100% usage of all cores on all nodes. The 
> bindings are:
>  
> [burn004:07244] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 
> 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 
> 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
> [burn004:07244] MCW rank 1 bound to socket 1[core 6[hwt 0]], socket 1[core 
> 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 
> 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
> [burn008:07256] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 
> 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 
> 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
> [burn008:07256] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 
> 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 
> 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
> and so on…
>  
> The problem now is that some of my jobs are hanging. They all start running 
> fine, and produce output. But at some point I lose about 4 out of 15 jobs due 
> to hanging. I suspect that an MPI message is passed and not received. The 
> number of jobs that hang and the time when they hang varies from test to 
> test. We have run these cases successfully on our old cluster dozens of times 
> – they are part of our benchmark suite.
>  
> When I run these jobs using a map by core strategy (that is, the MPI 
> processes are just mapped by core, and each job only uses 16 cores on two 
> nodes), I do not see as much hanging. It still occurs, but less often. This 
> leads me to suspect that there is something about the increased network 
> traffic due to the map-by-socket approach that is the cause of the problem. 
> But I do not know what to do about it. I think that the map-by-socket 
> approach is the right one, but I do not know if I have my OpenMPI options 
> just right.
>  
> Can you tell me what OpenMPI options to use, and can you tell me how I might 
> debug the hanging issue.

BTW: In modern systems the NIC(s) can be attached directly to one CPU, while 
the other CPU first has to forward its data across to that CPU to reach the NIC 
(and integrated NICs may instead be connected to the chipset).

Did anyone ever run benchmarks on whether it makes a difference which CPU in 
the system is used, i.e. the one the network adapter is attached to versus the 
other one - or even a chipset-attached NIC?
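
For anyone wanting to try such a benchmark, the first step is finding out which 
NUMA node (and hence which socket) the adapter hangs off. A minimal sketch that 
reads this from Linux sysfs; the interface name "ib0" is only an example, and a 
value of -1 means the kernel reports no locality (e.g. a chipset-attached device):

#include <stdio.h>

int main(void)
{
    /* For PCIe devices the kernel exposes the NUMA node of the adapter here. */
    const char *path = "/sys/class/net/ib0/device/numa_node";
    FILE *f = fopen(path, "r");
    if (!f) { perror(path); return 1; }

    int node = -1;
    if (fscanf(f, "%d", &node) == 1)
        printf("NIC is local to NUMA node %d\n", node);
    fclose(f);
    return 0;
}

One could then pin ranks to that socket (or to the other one) with mpirun's 
binding options and compare bandwidth/latency between the two placements.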

-- Reuti


> Kevin McGrattan
> National Institute of Standards and Technology
> 100 Bureau Drive, Mail Stop 8664
> Gaithersburg, Maryland 20899
>  
> 301 975 2712
>  



Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-08-29 Thread Matt Thompson
Ralph,

For 1.8.2rc4 I get:

(1003) $
/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun
--leave-session-attached --debug-daemons -np 8 ./helloWorld.182.x
srun.slurm: cluster configuration lacks support for cpu binding
srun.slurm: cluster configuration lacks support for cpu binding
Daemon [[47143,0],5] checking in as pid 10990 on host borg01x154
[borg01x154:10990] [[47143,0],5] orted: up and running - waiting for
commands!
Daemon [[47143,0],1] checking in as pid 23473 on host borg01x143
Daemon [[47143,0],2] checking in as pid 8250 on host borg01x144
[borg01x144:08250] [[47143,0],2] orted: up and running - waiting for
commands!
[borg01x143:23473] [[47143,0],1] orted: up and running - waiting for
commands!
Daemon [[47143,0],3] checking in as pid 12320 on host borg01x145
Daemon [[47143,0],4] checking in as pid 10902 on host borg01x153
[borg01x153:10902] [[47143,0],4] orted: up and running - waiting for
commands!
[borg01x145:12320] [[47143,0],3] orted: up and running - waiting for
commands!
[borg01x142:01629] [[47143,0],0] orted_cmd: received add_local_procs
[borg01x144:08250] [[47143,0],2] orted_cmd: received add_local_procs
[borg01x153:10902] [[47143,0],4] orted_cmd: received add_local_procs
[borg01x143:23473] [[47143,0],1] orted_cmd: received add_local_procs
[borg01x145:12320] [[47143,0],3] orted_cmd: received add_local_procs
[borg01x154:10990] [[47143,0],5] orted_cmd: received add_local_procs
[borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from
local proc [[47143,1],0]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from
local proc [[47143,1],2]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from
local proc [[47143,1],3]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from
local proc [[47143,1],1]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from
local proc [[47143,1],5]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from
local proc [[47143,1],4]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from
local proc [[47143,1],6]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from
local proc [[47143,1],7]
  MPIR_being_debugged = 0
  MPIR_debug_state = 1
  MPIR_partial_attach_ok = 1
  MPIR_i_am_starter = 0
  MPIR_forward_output = 0
  MPIR_proctable_size = 8
  MPIR_proctable:
(i, host, exe, pid) = (0, borg01x142,
/home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1647)
(i, host, exe, pid) = (1, borg01x142,
/home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1648)
(i, host, exe, pid) = (2, borg01x142,
/home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1650)
(i, host, exe, pid) = (3, borg01x142,
/home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1652)
(i, host, exe, pid) = (4, borg01x142,
/home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1654)
(i, host, exe, pid) = (5, borg01x142,
/home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1656)
(i, host, exe, pid) = (6, borg01x142,
/home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1658)
(i, host, exe, pid) = (7, borg01x142,
/home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1660)
MPIR_executable_path: NULL
MPIR_server_arguments: NULL
[borg01x142:01629] [[47143,0],0] orted_cmd: received message_local_procs
[borg01x144:08250] [[47143,0],2] orted_cmd: received message_local_procs
[borg01x143:23473] [[47143,0],1] orted_cmd: received message_local_procs
[borg01x153:10902] [[47143,0],4] orted_cmd: received message_local_procs
[borg01x154:10990] [[47143,0],5] orted_cmd: received message_local_procs
[borg01x145:12320] [[47143,0],3] orted_cmd: received message_local_procs
[borg01x142:01629] [[47143,0],0] orted_cmd: received message_local_procs
[borg01x143:23473] [[47143,0],1] orted_cmd: received message_local_procs
[borg01x144:08250] [[47143,0],2] orted_cmd: received message_local_procs
[borg01x153:10902] [[47143,0],4] orted_cmd: received message_local_procs
[borg01x145:12320] [[47143,0],3] orted_cmd: received message_local_procs
Process2 of8 is on borg01x142
Process5 of8 is on borg01x142
Process4 of8 is on borg01x142
Process1 of8 is on borg01x142
Process0 of8 is on borg01x142
Process3 of8 is on borg01x142
Process6 of8 is on borg01x142
Process7 of8 is on borg01x142
[borg01x154:10990] [[47143,0],5] orted_cmd: received message_local_procs
[borg01x142:01629] [[47143,0],0] orted_cmd: received message_local_procs
[borg01x144:08250] [[47143,0],2] orted_cmd: received message_local_procs
[borg01x143:23473] [[47143,0],1] orted_cmd: received message_local_procs
[borg01x153:10902] [[47143,0],4] orted_cmd: received message_local_procs
[borg01x154:10990] [[47143,0],5] orted_cmd: received message_local_procs
[borg01x145:12320] [[47143,0],3] orted_cmd: received message_local_procs
[borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc
[[47143,1],2]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc
[[4

Re: [OMPI users] open shmem optimization

2014-08-29 Thread Shamis, Pavel
Hi Timur,

I don't think this is an apples-to-apples comparison.

In the OpenSHMEM world, "MPI_waitall" would be mapped to shmem_quiet().
Even with this mapping, shmem_quiet() has *stronger* completion semantics than 
MPI_waitall: quiet guarantees that the data has been delivered to remote memory, 
while MPI_waitall does not provide such a guarantee for isend operations.

shmem_barrier_all is a collective operation with an embedded shmem_quiet, so 
it will not scale the same way as MPI_waitall does.
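
If the goal is a completion step closer to MPI_waitall (no global synchronization, 
just delivery of this PE's outstanding puts), a hedged sketch would replace the 
barrier with shmem_quiet(); the names below are placeholders, not the poster's code:

#include <shmem.h>

void exchange_shmem_quiet(int npeers, const int *peers, const char *sendbuf,
                          char *recvbuf_sym /* symmetric heap */, int msg, int me)
{
    for (int i = 0; i < npeers; i++)
        shmem_putmem(recvbuf_sym + (size_t)me * msg, sendbuf + (size_t)i * msg,
                     (size_t)msg, peers[i]);
    /* Waits until all prior puts from this PE are delivered to remote memory;
       still stronger than MPI_waitall, but without the collective cost of
       shmem_barrier_all(). */
    shmem_quiet();
}

In a real exchange the target PEs would still need some notification that the data 
has arrived, which is what the barrier in the original program provides.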

For more details please see section 8.7.3 of
http://bongo.cs.uh.edu/site/sites/default/site_files/openshmem-specification-1.1.pdf

I hope it helps.

Pavel (Pasha) Shamis
---
Computer Science Research Group
Computer Science and Math Division
Oak Ridge National Laboratory






On Aug 29, 2014, at 5:26 AM, Timur Ismagilov  wrote:

> Hello!
> 
> What parameters can I tune to increase performance (scalability) for my app (all-to-all 
> pattern with message size = constant/nnodes)?
> I can read this FAQ for MPI, but does it also apply to SHMEM?
> 
> I have two programs doing the same thing (with the same input): each node sends 
> messages (message size = constant/nnodes) to a random set of nodes (but the same 
> set in prg1 and prg2):
> 
>   • with mpi_isend, mpi_irecv and mpi_waitall
>   • with shmem_put and shmem_barrier_all
> On 1, 2, 4, 8, 16 and 32 nodes they have the same performance (scalability).
> On 64, 128 and 256 nodes the SHMEM program stops scaling, but at 512 nodes the 
> SHMEM program gets much better performance than MPI:
> nodes     prg1 (perf unit)   prg2 (perf unit)
> 1         30                 30
> 2         50                 53
> 4         75                 85
> 8         110                130
> 16        180                200
> 32        310                350
> 64        500                400 (strange)
> 128       830                400 (strange)
> 256       1350               600 (strange)
> 512       1770               2350 (wow!)
> 
> With scalable SHMEM (OMPI 1.6.5?) I get the same scalability from both programs.
> 
> 



[OMPI users] Weird error with OMPI 1.6.3

2014-08-29 Thread Maxime Boissonneault

Hi,
I am having a weird error with OpenMPI 1.6.3. I run a non-MPI command 
just to exclude any code error. Here is the error I get (I run with set 
-x to get the exact commands that are run).


++ mpiexec -npersocket 1 ls -la
--
The requested stdin target is out of range for this job - it points
to a process rank that is greater than the number of processes in the
job.

Specified target: 0
Number of procs: 0

This could be caused by specifying a negative number for the stdin
target, or by mistyping the desired rank. Remember that MPI ranks begin
with 0, not 1.

Please correct the cmd line and try again.

How can I debug that ?

Thanks,

--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique



Re: [OMPI users] Weird error with OMPI 1.6.3

2014-08-29 Thread Maxime Boissonneault

It looks like
-npersocket 1

cannot be used alone. If I do
mpiexec -npernode 2 -npersocket 1 ls -la

then I get no error message.

Is this expected behavior ?

Maxime


On 2014-08-29 11:53, Maxime Boissonneault wrote:

Hi,
I am having a weird error with OpenMPI 1.6.3. I run a non-MPI command 
just to exclude any code error. Here is the error I get (I run with 
set -x to get the exact commands that are run).


++ mpiexec -npersocket 1 ls -la
-- 


The requested stdin target is out of range for this job - it points
to a process rank that is greater than the number of processes in the
job.

Specified target: 0
Number of procs: 0

This could be caused by specifying a negative number for the stdin
target, or by mistyping the desired rank. Remember that MPI ranks begin
with 0, not 1.

Please correct the cmd line and try again.

How can I debug that ?

Thanks,




--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique



Re: [OMPI users] Weird error with OMPI 1.6.3

2014-08-29 Thread Ralph Castain
No, it isn't - but we aren't really maintaining the 1.6 series any more. You 
might try updating to 1.6.5 and see if it remains there

On Aug 29, 2014, at 9:12 AM, Maxime Boissonneault 
 wrote:

> It looks like
> -npersocket 1
> 
> cannot be used alone. If I do
> mpiexec -npernode 2 -npersocket 1 ls -la
> 
> then I get no error message.
> 
> Is this expected behavior ?
> 
> Maxime
> 
> 
> On 2014-08-29 11:53, Maxime Boissonneault wrote:
>> Hi,
>> I am having a weird error with OpenMPI 1.6.3. I run a non-MPI command just 
>> to exclude any code error. Here is the error I get (I run with set -x to get 
>> the exact commands that are run).
>> 
>> ++ mpiexec -npersocket 1 ls -la
>> -- 
>> The requested stdin target is out of range for this job - it points
>> to a process rank that is greater than the number of processes in the
>> job.
>> 
>> Specified target: 0
>> Number of procs: 0
>> 
>> This could be caused by specifying a negative number for the stdin
>> target, or by mistyping the desired rank. Remember that MPI ranks begin
>> with 0, not 1.
>> 
>> Please correct the cmd line and try again.
>> 
>> How can I debug that ?
>> 
>> Thanks,
>> 
> 
> 
> -- 
> -
> Maxime Boissonneault
> Analyste de calcul - Calcul Québec, Université Laval
> Ph. D. en physique
> 



Re: [OMPI users] Weird error with OMPI 1.6.3

2014-08-29 Thread Maxime Boissonneault

It is still there in 1.6.5 (we also have it).

I am just wondering if there is something wrong in our installation that 
makes MPI unable to detect that there are two sockets per node if we do 
not include an npernode directive.


Maxime

On 2014-08-29 12:31, Ralph Castain wrote:
No, it isn't - but we aren't really maintaining the 1.6 series any 
more. You might try updating to 1.6.5 and see if it remains there


On Aug 29, 2014, at 9:12 AM, Maxime Boissonneault 
> wrote:



It looks like
-npersocket 1

cannot be used alone. If I do
mpiexec -npernode 2 -npersocket 1 ls -la

then I get no error message.

Is this expected behavior ?

Maxime


On 2014-08-29 11:53, Maxime Boissonneault wrote:

Hi,
I am having a weird error with OpenMPI 1.6.3. I run a non-MPI 
command just to exclude any code error. Here is the error I get (I 
run with set -x to get the exact commands that are run).


++ mpiexec -npersocket 1 ls -la
--
The requested stdin target is out of range for this job - it points
to a process rank that is greater than the number of processes in the
job.

Specified target: 0
Number of procs: 0

This could be caused by specifying a negative number for the stdin
target, or by mistyping the desired rank. Remember that MPI ranks begin
with 0, not 1.

Please correct the cmd line and try again.

How can I debug that ?

Thanks,




--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique








--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique



Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-08-29 Thread Ralph Castain
Okay, something quite weird is happening here. I can't replicate using the 
1.8.2 release tarball on a slurm machine, so my guess is that something else is 
going on here.

Could you please rebuild the 1.8.2 code with --enable-debug on the configure 
line (assuming you haven't already done so), and then rerun that version as 
before but adding "--mca oob_base_verbose 10" to the cmd line?


On Aug 29, 2014, at 4:22 AM, Matt Thompson  wrote:

> Ralph,
> 
> For 1.8.2rc4 I get:
> 
> (1003) $ 
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun 
> --leave-session-attached --debug-daemons -np 8 ./helloWorld.182.x
> srun.slurm: cluster configuration lacks support for cpu binding
> srun.slurm: cluster configuration lacks support for cpu binding
> Daemon [[47143,0],5] checking in as pid 10990 on host borg01x154
> [borg01x154:10990] [[47143,0],5] orted: up and running - waiting for commands!
> Daemon [[47143,0],1] checking in as pid 23473 on host borg01x143
> Daemon [[47143,0],2] checking in as pid 8250 on host borg01x144
> [borg01x144:08250] [[47143,0],2] orted: up and running - waiting for commands!
> [borg01x143:23473] [[47143,0],1] orted: up and running - waiting for commands!
> Daemon [[47143,0],3] checking in as pid 12320 on host borg01x145
> Daemon [[47143,0],4] checking in as pid 10902 on host borg01x153
> [borg01x153:10902] [[47143,0],4] orted: up and running - waiting for commands!
> [borg01x145:12320] [[47143,0],3] orted: up and running - waiting for commands!
> [borg01x142:01629] [[47143,0],0] orted_cmd: received add_local_procs
> [borg01x144:08250] [[47143,0],2] orted_cmd: received add_local_procs
> [borg01x153:10902] [[47143,0],4] orted_cmd: received add_local_procs
> [borg01x143:23473] [[47143,0],1] orted_cmd: received add_local_procs
> [borg01x145:12320] [[47143,0],3] orted_cmd: received add_local_procs
> [borg01x154:10990] [[47143,0],5] orted_cmd: received add_local_procs
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local 
> proc [[47143,1],0]
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local 
> proc [[47143,1],2]
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local 
> proc [[47143,1],3]
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local 
> proc [[47143,1],1]
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local 
> proc [[47143,1],5]
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local 
> proc [[47143,1],4]
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local 
> proc [[47143,1],6]
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local 
> proc [[47143,1],7]
>   MPIR_being_debugged = 0
>   MPIR_debug_state = 1
>   MPIR_partial_attach_ok = 1
>   MPIR_i_am_starter = 0
>   MPIR_forward_output = 0
>   MPIR_proctable_size = 8
>   MPIR_proctable:
> (i, host, exe, pid) = (0, borg01x142, 
> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1647)
> (i, host, exe, pid) = (1, borg01x142, 
> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1648)
> (i, host, exe, pid) = (2, borg01x142, 
> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1650)
> (i, host, exe, pid) = (3, borg01x142, 
> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1652)
> (i, host, exe, pid) = (4, borg01x142, 
> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1654)
> (i, host, exe, pid) = (5, borg01x142, 
> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1656)
> (i, host, exe, pid) = (6, borg01x142, 
> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1658)
> (i, host, exe, pid) = (7, borg01x142, 
> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1660)
> MPIR_executable_path: NULL
> MPIR_server_arguments: NULL
> [borg01x142:01629] [[47143,0],0] orted_cmd: received message_local_procs
> [borg01x144:08250] [[47143,0],2] orted_cmd: received message_local_procs
> [borg01x143:23473] [[47143,0],1] orted_cmd: received message_local_procs
> [borg01x153:10902] [[47143,0],4] orted_cmd: received message_local_procs
> [borg01x154:10990] [[47143,0],5] orted_cmd: received message_local_procs
> [borg01x145:12320] [[47143,0],3] orted_cmd: received message_local_procs
> [borg01x142:01629] [[47143,0],0] orted_cmd: received message_local_procs
> [borg01x143:23473] [[47143,0],1] orted_cmd: received message_local_procs
> [borg01x144:08250] [[47143,0],2] orted_cmd: received message_local_procs
> [borg01x153:10902] [[47143,0],4] orted_cmd: received message_local_procs
> [borg01x145:12320] [[47143,0],3] orted_cmd: received message_local_procs
> Process2 of8 is on borg01x142
> Process5 of8 is on borg01x142
> Process4 of8 is on borg01x142
> Process1 of8 is on borg01x142
> Process0 of8 is on borg01x142
> Process3 of8 is on borg01x142
> Process6 of8 is on borg01x142
> Process7 of8 is on borg01x142
> [borg01x154:10990] [[47143,0],5] orted_cmd:

Re: [OMPI users] Weird error with OMPI 1.6.3

2014-08-29 Thread Ralph Castain
Yeah, the old 1.6 series didn't do a very good job of auto-detection of 
#sockets. I believe there is an mca param for telling it how many are there, 
which is probably what you'd need to use.

On Aug 29, 2014, at 9:40 AM, Maxime Boissonneault 
 wrote:

> It is still there in 1.6.5 (we also have it).
> 
> I am just wondering if there is something wrong in our installation that 
> makes MPI unable to detect that there are two sockets per node if we do not 
> include an npernode directive. 
> 
> Maxime
> 
> On 2014-08-29 12:31, Ralph Castain wrote:
>> No, it isn't - but we aren't really maintaining the 1.6 series any more. You 
>> might try updating to 1.6.5 and see if it remains there
>> 
>> On Aug 29, 2014, at 9:12 AM, Maxime Boissonneault 
>>  wrote:
>> 
>>> It looks like
>>> -npersocket 1
>>> 
>>> cannot be used alone. If I do
>>> mpiexec -npernode 2 -npersocket 1 ls -la
>>> 
>>> then I get no error message.
>>> 
>>> Is this expected behavior ?
>>> 
>>> Maxime
>>> 
>>> 
>>> On 2014-08-29 11:53, Maxime Boissonneault wrote:
 Hi,
 I am having a weird error with OpenMPI 1.6.3. I run a non-MPI command just 
 to exclude any code error. Here is the error I get (I run with set -x to 
 get the exact commands that are run).
 
 ++ mpiexec -npersocket 1 ls -la
 -- 
 The requested stdin target is out of range for this job - it points
 to a process rank that is greater than the number of processes in the
 job.
 
 Specified target: 0
 Number of procs: 0
 
 This could be caused by specifying a negative number for the stdin
 target, or by mistyping the desired rank. Remember that MPI ranks begin
 with 0, not 1.
 
 Please correct the cmd line and try again.
 
 How can I debug that ?
 
 Thanks,
 
>>> 
>>> 
>>> -- 
>>> -
>>> Maxime Boissonneault
>>> Analyste de calcul - Calcul Québec, Université Laval
>>> Ph. D. en physique
>>> 
>> 
>> 
>> 
> 
> 
> -- 
> -
> Maxime Boissonneault
> Analyste de calcul - Calcul Québec, Université Laval
> Ph. D. en physique



Re: [OMPI users] How does binding option affect network traffic?

2014-08-29 Thread McGrattan, Kevin B. Dr.
Thanks for the tip. I understand how using the --cpuset option would help me in 
the example I described. However, suppose I have multiple users submitting MPI 
jobs of various sizes? I wouldn't know a priori which cores were in use and 
which weren't. I always assumed that this is what these various schedulers did. 
Is there a way to map-by socket but not allow a single core to be used by more 
than one process? At first glance, I thought that --map-by socket and --bind-to 
core would do this. Would one of these "NOOVERSUBSCRIBE" options help?

Also, in my test case, I have exactly the right number of cores (240) to run 15 
jobs using 16 MPI processes each. I am shaking down a new cluster we just bought. 
This is an extreme case, but not atypical of the way we use our clusters.

--

Date: Thu, 28 Aug 2014 13:27:12 -0700
From: Ralph Castain 
To: Open MPI Users 
Subject: Re: [OMPI users] How does binding option affect network
traffic?
Message-ID: <637caef5-bbb3-46c2-9387-decdf8cbd...@open-mpi.org>
Content-Type: text/plain; charset="windows-1252"


On Aug 28, 2014, at 11:50 AM, McGrattan, Kevin B. Dr. 
 wrote:

> My institute recently purchased a linux cluster with 20 nodes; 2 sockets per 
> node; 6 cores per socket. OpenMPI v 1.8.1 is installed. I want to run 15 
> jobs. Each job requires 16 MPI processes.  For each job, I want to use two 
> cores on each node, mapping by socket. If I use these options:
>  
> #PBS -l nodes=8:ppn=2
> mpirun --report-bindings --bind-to core --map-by socket:PE=1 -np 16 
> 
>  
> The reported bindings are:
>  
> [burn001:09186] MCW rank 0 bound to socket 0[core 0[hwt 0]]: 
> [B/././././.][./././././.] [burn001:09186] MCW rank 1 bound to socket 
> 1[core 6[hwt 0]]: [./././././.][B/././././.] [burn004:07113] MCW rank 
> 6 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.] 
> [burn004:07113] MCW rank 7 bound to socket 1[core 6[hwt 0]]: 
> [./././././.][B/././././.] and so on…
>  
> These bindings appear to be OK, but when I do a "top -H" on each node, I see 
> that all 15 jobs use core 0 and core 6 on each node. This means, I believe, 
> that I am only using 1/6 of my resources.

That is correct. The problem is that each mpirun execution has no idea what the 
others are doing, or even that they exist. Thus, they will each independently 
bind to core zero and core 6, as you observe. You can get around this by 
submitting each with a separate --cpuset argument telling it which cpus it is 
allowed to use - something like this (note that there is no value to having 
pe=1 as that is automatically what happens with bind-to core):

mpirun --cpuset 0,6 --bind-to core  
mpirun --cpuset 1,7 --bind-to core  ...

etc. You specified only two procs/node with your PBS request, so we'll only map 
two on each node. This command line tells the first mpirun to only use cores 0 
and 6, and to bind each proc to one of those cores. The second uses only cores 
1 and 7, and thus is separated from the first command.

However, you should note that you can't run 15 jobs at the same time in the 
manner you describe without overloading some cores as you only have 12 
cores/node. This will create a poor-performance situation.


> I want to use 100%. So I try this:
>  
> #PBS -l nodes=8:ppn=2
> mpirun --report-bindings --bind-to socket --map-by socket:PE=1 -np 16 
> 
>  
> Now it appears that I am getting 100% usage of all cores on all nodes. The 
> bindings are:
>  
> [burn004:07244] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 
> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], 
> socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: 
> [B/B/B/B/B/B][./././././.] [burn004:07244] MCW rank 1 bound to socket 
> 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 
> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: 
> [./././././.][B/B/B/B/B/B] [burn008:07256] MCW rank 3 bound to socket 1[core 
> 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 
> 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: 
> [./././././.][B/B/B/B/B/B] [burn008:07256] MCW rank 2 bound to socket 0[core 
> 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 
> 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: 
>> [B/B/B/B/B/B][./././././.] and so on…
>  
> The problem now is that some of my jobs are hanging. They all start running 
> fine, and produce output. But at some point I lose about 4 out of 15 jobs due 
> to hanging. I suspect that an MPI message is passed and not received. The 
> number of jobs that hang and the time when they hang varies from test to 
> test. We have run these cases successfully on our old cluster dozens of times 
> - they are part of our benchmark suite.

Did you have more cores on your old cluster? I suspect the problem here is 
resource exhaustion, especially if you are using Infiniband as you are 

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-08-29 Thread Matt Thompson
Ralph,

Here you go:

(1080) $
/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun
--leave-session-attached --debug-daemons --mca oob_base_verbose 10 -np 8
./helloWorld.182-debug.x
[borg01x142:29232] mca: base: components_register: registering oob
components
[borg01x142:29232] mca: base: components_register: found loaded component
tcp
[borg01x142:29232] mca: base: components_register: component tcp register
function successful
[borg01x142:29232] mca: base: components_open: opening oob components
[borg01x142:29232] mca: base: components_open: found loaded component tcp
[borg01x142:29232] mca: base: components_open: component tcp open function
successful
[borg01x142:29232] mca:oob:select: checking available component tcp
[borg01x142:29232] mca:oob:select: Querying component [tcp]
[borg01x142:29232] oob:tcp: component_available called
[borg01x142:29232] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[borg01x142:29232] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
[borg01x142:29232] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
[borg01x142:29232] [[52298,0],0] oob:tcp:init adding 10.1.25.142 to our
list of V4 connections
[borg01x142:29232] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
[borg01x142:29232] [[52298,0],0] oob:tcp:init adding 172.31.1.254 to our
list of V4 connections
[borg01x142:29232] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
[borg01x142:29232] [[52298,0],0] oob:tcp:init adding 10.12.25.142 to our
list of V4 connections
[borg01x142:29232] [[52298,0],0] TCP STARTUP
[borg01x142:29232] [[52298,0],0] attempting to bind to IPv4 port 0
[borg01x142:29232] [[52298,0],0] assigned IPv4 port 41686
[borg01x142:29232] mca:oob:select: Adding component to end
[borg01x142:29232] mca:oob:select: Found 1 active transports
srun.slurm: cluster configuration lacks support for cpu binding
srun.slurm: cluster configuration lacks support for cpu binding
[borg01x153:01290] mca: base: components_register: registering oob
components
[borg01x153:01290] mca: base: components_register: found loaded component
tcp
[borg01x143:13793] mca: base: components_register: registering oob
components
[borg01x143:13793] mca: base: components_register: found loaded component
tcp
[borg01x153:01290] mca: base: components_register: component tcp register
function successful
[borg01x153:01290] mca: base: components_open: opening oob components
[borg01x153:01290] mca: base: components_open: found loaded component tcp
[borg01x153:01290] mca: base: components_open: component tcp open function
successful
[borg01x153:01290] mca:oob:select: checking available component tcp
[borg01x153:01290] mca:oob:select: Querying component [tcp]
[borg01x153:01290] oob:tcp: component_available called
[borg01x153:01290] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[borg01x153:01290] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
[borg01x153:01290] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
[borg01x153:01290] [[52298,0],4] oob:tcp:init adding 10.1.25.153 to our
list of V4 connections
[borg01x153:01290] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
[borg01x153:01290] [[52298,0],4] oob:tcp:init adding 172.31.1.254 to our
list of V4 connections
[borg01x153:01290] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
[borg01x153:01290] [[52298,0],4] oob:tcp:init adding 10.12.25.153 to our
list of V4 connections
[borg01x153:01290] [[52298,0],4] TCP STARTUP
[borg01x153:01290] [[52298,0],4] attempting to bind to IPv4 port 0
[borg01x143:13793] mca: base: components_register: component tcp register
function successful
[borg01x153:01290] [[52298,0],4] assigned IPv4 port 38028
[borg01x143:13793] mca: base: components_open: opening oob components
[borg01x143:13793] mca: base: components_open: found loaded component tcp
[borg01x143:13793] mca: base: components_open: component tcp open function
successful
[borg01x143:13793] mca:oob:select: checking available component tcp
[borg01x143:13793] mca:oob:select: Querying component [tcp]
[borg01x143:13793] oob:tcp: component_available called
[borg01x143:13793] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[borg01x143:13793] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
[borg01x143:13793] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
[borg01x143:13793] [[52298,0],1] oob:tcp:init adding 10.1.25.143 to our
list of V4 connections
[borg01x143:13793] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
[borg01x143:13793] [[52298,0],1] oob:tcp:init adding 172.31.1.254 to our
list of V4 connections
[borg01x143:13793] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
[borg01x143:13793] [[52298,0],1] oob:tcp:init adding 10.12.25.143 to our
list of V4 connections
[borg01x143:13793] [[52298,0],1] TCP STARTUP
[borg01x143:13793] [[52298,0],1] attempting to bind to IPv4 port 0
[borg01x153:01290] mca:oob:select: Adding component to end
[borg01x153:01290] mca:oob:select: Found 1 active transports
[borg01x143:13793] [[52298,0],1] assigned IPv4 port 44719
[borg01x143:13793] mca:oob:select: Adding component to end
[borg01x143:13793] mca:oob:select: 

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-08-29 Thread Ralph Castain
Rats - I also need "-mca plm_base_verbose 5" on there so I can see the cmd line 
being executed. Can you add it?


On Aug 29, 2014, at 11:16 AM, Matt Thompson  wrote:

> Ralph,
> 
> Here you go:
> 
> (1080) $ 
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun 
> --leave-session-attached --debug-daemons --mca oob_base_verbose 10 -np 8 
> ./helloWorld.182-debug.x
> [borg01x142:29232] mca: base: components_register: registering oob components
> [borg01x142:29232] mca: base: components_register: found loaded component tcp
> [borg01x142:29232] mca: base: components_register: component tcp register 
> function successful
> [borg01x142:29232] mca: base: components_open: opening oob components
> [borg01x142:29232] mca: base: components_open: found loaded component tcp
> [borg01x142:29232] mca: base: components_open: component tcp open function 
> successful
> [borg01x142:29232] mca:oob:select: checking available component tcp
> [borg01x142:29232] mca:oob:select: Querying component [tcp]
> [borg01x142:29232] oob:tcp: component_available called
> [borg01x142:29232] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> [borg01x142:29232] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
> [borg01x142:29232] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
> [borg01x142:29232] [[52298,0],0] oob:tcp:init adding 10.1.25.142 to our list 
> of V4 connections
> [borg01x142:29232] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
> [borg01x142:29232] [[52298,0],0] oob:tcp:init adding 172.31.1.254 to our list 
> of V4 connections
> [borg01x142:29232] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
> [borg01x142:29232] [[52298,0],0] oob:tcp:init adding 10.12.25.142 to our list 
> of V4 connections
> [borg01x142:29232] [[52298,0],0] TCP STARTUP
> [borg01x142:29232] [[52298,0],0] attempting to bind to IPv4 port 0
> [borg01x142:29232] [[52298,0],0] assigned IPv4 port 41686
> [borg01x142:29232] mca:oob:select: Adding component to end
> [borg01x142:29232] mca:oob:select: Found 1 active transports
> srun.slurm: cluster configuration lacks support for cpu binding
> srun.slurm: cluster configuration lacks support for cpu binding
> [borg01x153:01290] mca: base: components_register: registering oob components
> [borg01x153:01290] mca: base: components_register: found loaded component tcp
> [borg01x143:13793] mca: base: components_register: registering oob components
> [borg01x143:13793] mca: base: components_register: found loaded component tcp
> [borg01x153:01290] mca: base: components_register: component tcp register 
> function successful
> [borg01x153:01290] mca: base: components_open: opening oob components
> [borg01x153:01290] mca: base: components_open: found loaded component tcp
> [borg01x153:01290] mca: base: components_open: component tcp open function 
> successful
> [borg01x153:01290] mca:oob:select: checking available component tcp
> [borg01x153:01290] mca:oob:select: Querying component [tcp]
> [borg01x153:01290] oob:tcp: component_available called
> [borg01x153:01290] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> [borg01x153:01290] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
> [borg01x153:01290] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
> [borg01x153:01290] [[52298,0],4] oob:tcp:init adding 10.1.25.153 to our list 
> of V4 connections
> [borg01x153:01290] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
> [borg01x153:01290] [[52298,0],4] oob:tcp:init adding 172.31.1.254 to our list 
> of V4 connections
> [borg01x153:01290] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
> [borg01x153:01290] [[52298,0],4] oob:tcp:init adding 10.12.25.153 to our list 
> of V4 connections
> [borg01x153:01290] [[52298,0],4] TCP STARTUP
> [borg01x153:01290] [[52298,0],4] attempting to bind to IPv4 port 0
> [borg01x143:13793] mca: base: components_register: component tcp register 
> function successful
> [borg01x153:01290] [[52298,0],4] assigned IPv4 port 38028
> [borg01x143:13793] mca: base: components_open: opening oob components
> [borg01x143:13793] mca: base: components_open: found loaded component tcp
> [borg01x143:13793] mca: base: components_open: component tcp open function 
> successful
> [borg01x143:13793] mca:oob:select: checking available component tcp
> [borg01x143:13793] mca:oob:select: Querying component [tcp]
> [borg01x143:13793] oob:tcp: component_available called
> [borg01x143:13793] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> [borg01x143:13793] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
> [borg01x143:13793] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
> [borg01x143:13793] [[52298,0],1] oob:tcp:init adding 10.1.25.143 to our list 
> of V4 connections
> [borg01x143:13793] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
> [borg01x143:13793] [[52298,0],1] oob:tcp:init adding 172.31.1.254 to our list 
> of V4 connections
> [borg01x143:13793] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
> [borg01x143:13793] [[52298,0],1] oob:tcp:init adding 10.12.25.143 to our list 
> of V4 connections
> [borg01x143:13793] [[52298,0

Re: [OMPI users] How does binding option affect network traffic?

2014-08-29 Thread Ralph Castain

On Aug 29, 2014, at 10:51 AM, McGrattan, Kevin B. Dr. 
 wrote:

> Thanks for the tip. I understand how using the --cpuset option would help me 
> in the example I described. However, suppose I have multiple users submitting 
> MPI jobs of various sizes? I wouldn't know a priori which cores were in use 
> and which weren't. I always assumed that this is what these various 
> schedulers did. Is there a way to map-by socket but not allow a single core 
> to be used by more than one process. At first glance, I thought that --map-by 
> socket and --bind-to core would do this. Would one of these "NOOVERSUBSCRIBE" 
> options help?

I'm afraid not - the issue here is that the mpirun's don't know about each 
other. What you'd need to do is have your scheduler assign cores for our use - 
we'll pick that up and stay inside that envelope. The exact scheduler command 
depends on the scheduler, of course, but the scheduler would then have the more 
global picture and could keep things separated.

> 
> Also, in my test case, I have exactly the right amount of cores (240) to run 
> 15 jobs using 16 MPI processes. I am shaking down a new cluster we just 
> bought. This is an extreme case, but not atypical of the way we use our 
> clusters.

Well, you do, but not exactly the way you showed you were trying to use this. 
If you try to run as you described, with 2ppn for each mpirun and 12 
cores/node, you can run a maximum of 6 mpirun's at a time across a given set of 
nodes. So you'd need to stage your allocations correctly to make it work.



> 
> --
> 
> Date: Thu, 28 Aug 2014 13:27:12 -0700
> From: Ralph Castain 
> To: Open MPI Users 
> Subject: Re: [OMPI users] How does binding option affect network
>   traffic?
> Message-ID: <637caef5-bbb3-46c2-9387-decdf8cbd...@open-mpi.org>
> Content-Type: text/plain; charset="windows-1252"
> 
> 
> On Aug 28, 2014, at 11:50 AM, McGrattan, Kevin B. Dr. 
>  wrote:
> 
>> My institute recently purchased a linux cluster with 20 nodes; 2 sockets per 
>> node; 6 cores per socket. OpenMPI v 1.8.1 is installed. I want to run 15 
>> jobs. Each job requires 16 MPI processes.  For each job, I want to use two 
>> cores on each node, mapping by socket. If I use these options:
>> 
>> #PBS -l nodes=8:ppn=2
>> mpirun --report-bindings --bind-to core --map-by socket:PE=1 -np 16 
>> 
>> 
>> The reported bindings are:
>> 
>> [burn001:09186] MCW rank 0 bound to socket 0[core 0[hwt 0]]: 
>> [B/././././.][./././././.] [burn001:09186] MCW rank 1 bound to socket 
>> 1[core 6[hwt 0]]: [./././././.][B/././././.] [burn004:07113] MCW rank 
>> 6 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.] 
>> [burn004:07113] MCW rank 7 bound to socket 1[core 6[hwt 0]]: 
>> [./././././.][B/././././.] and so on…
>> 
>> These bindings appear to be OK, but when I do a "top -H" on each node, I see 
>> that all 15 jobs use core 0 and core 6 on each node. This means, I believe, 
>> that I am only using 1/6 of my resources.
> 
> That is correct. The problem is that each mpirun execution has no idea what 
> the others are doing, or even that they exist. Thus, they will each 
> independently bind to core zero and core 6, as you observe. You can get 
> around this by submitting each with a separate --cpuset argument telling it 
> which cpus it is allowed to use - something like this (note that there is no 
> value to having pe=1 as that is automatically what happens with bind-to core):
> 
> mpirun --cpuset 0,6 --bind-to core  
> mpirun --cpuset 1,7 --bind-to core  ...
> 
> etc. You specified only two procs/node with your PBS request, so we'll only 
> map two on each node. This command line tells the first mpirun to only use 
> cores 0 and 6, and to bind each proc to one of those cores. The second uses 
> only cores 1 and 7, and thus is separated from the first command.
> 
> However, you should note that you can't run 15 jobs at the same time in the 
> manner you describe without overloading some cores as you only have 12 
> cores/node. This will create a poor-performance situation.
> 
> 
>> I want to use 100%. So I try this:
>> 
>> #PBS -l nodes=8:ppn=2
>> mpirun --report-bindings --bind-to socket --map-by socket:PE=1 -np 16 
>> 
>> 
>> Now it appears that I am getting 100% usage of all cores on all nodes. The 
>> bindings are:
>> 
>> [burn004:07244] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 
>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], 
>> socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: 
>> [B/B/B/B/B/B][./././././.] [burn004:07244] MCW rank 1 bound to socket 
>> 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 
>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: 
>> [./././././.][B/B/B/B/B/B] [burn008:07256] MCW rank 3 bound to socket 1[core 
>> 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 
>> 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: 
>> [./././././.][

Re: [OMPI users] How does binding option affect network traffic?

2014-08-29 Thread McGrattan, Kevin B. Dr.
I am able to run all 15 of my jobs simultaneously; 16 MPI processes per job; 
mapping by socket and binding to socket. On a given socket, 6 MPI processes 
from 6 separate mpiruns share the 6 cores, or at least I assume they are 
sharing. The load for all CPUs and all processes is 100%. I understand that I 
am loading the system to its limits, but is what I am doing OK? My jobs are 
running, and the only problem seems to be that some jobs are hanging at random 
times. This is a new cluster I am shaking down, and I am guessing that the 
message passing traffic is causing packet losses. I am working with the vendor 
to sort this out, but I am curious whether or not I am using OpenMPI 
appropriately.

#PBS -l nodes=8:ppn=2
mpirun --report-bindings --bind-to socket --map-by socket:PE=1 -np 16 


The bindings are:

[burn004:07244] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 
1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 
4[hwt 0]], socket 0[core 5[hwt 0]]:   [B/B/B/B/B/B][./././././.]
[burn004:07244] MCW rank 1 bound to socket 1[core 6[hwt 0]], socket 1[core 
7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 
10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[burn008:07256] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 
7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 
10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[burn008:07256] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 
1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 
4[hwt 0]], socket 0[core 5[hwt 0]]:   [B/B/B/B/B/B][./././././.] and so on.
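
As a cross-check of what the ranks are actually allowed to run on (beyond what 
--report-bindings prints at launch time), each rank can query its own affinity 
mask. A minimal sketch, assuming Linux and glibc's sched_getaffinity; this is 
illustrative, not part of the original posts:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char host[256];
    gethostname(host, sizeof(host));

    /* Query the set of cores this process is currently allowed to run on. */
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        printf("rank %d on %s may run on cores:", rank, host);
        for (int c = 0; c < CPU_SETSIZE; c++)
            if (CPU_ISSET(c, &mask))
                printf(" %d", c);
        printf("\n");
    }

    MPI_Finalize();
    return 0;
}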


From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Friday, August 29, 2014 3:26 PM
To: Open MPI Users
Subject: Re: [OMPI users] How does binding option affect network traffic?


On Aug 29, 2014, at 10:51 AM, McGrattan, Kevin B. Dr. 
mailto:kevin.mcgrat...@nist.gov>> wrote:


Thanks for the tip. I understand how using the --cpuset option would help me in 
the example I described. However, suppose I have multiple users submitting MPI 
jobs of various sizes? I wouldn't know a priori which cores were in use and 
which weren't. I always assumed that this is what these various schedulers did. 
Is there a way to map-by socket but not allow a single core to be used by more 
than one process. At first glance, I thought that --map-by socket and --bind-to 
core would do this. Would one of these "NOOVERSUBSCRIBE" options help?

I'm afraid not - the issue here is that the mpirun's don't know about each 
other. What you'd need to do is have your scheduler assign cores for our use - 
we'll pick that up and stay inside that envelope. The exact scheduler command 
depends on the scheduler, of course, but the scheduler would then have the more 
global picture and could keep things separated.



Also, in my test case, I have exactly the right amount of cores (240) to run 15 
jobs using 16 MPI processes. I am shaking down a new cluster we just bought. 
This is an extreme case, but not atypical of the way we use our clusters.

Well, you do, but not exactly the way you showed you were trying to use this. 
If you try to run as you described, with 2ppn for each mpirun and 12 
cores/node, you can run a maximum of 6 mpirun's at a time across a given set of 
nodes. So you'd need to stage your allocations correctly to make it work.





--

Date: Thu, 28 Aug 2014 13:27:12 -0700
From: Ralph Castain mailto:r...@open-mpi.org>>
To: Open MPI Users mailto:us...@open-mpi.org>>
Subject: Re: [OMPI users] How does binding option affect network
traffic?
Message-ID: 
<637caef5-bbb3-46c2-9387-decdf8cbd...@open-mpi.org>
Content-Type: text/plain; charset="windows-1252"


On Aug 28, 2014, at 11:50 AM, McGrattan, Kevin B. Dr. 
mailto:kevin.mcgrat...@nist.gov>> wrote:


My institute recently purchased a linux cluster with 20 nodes; 2 sockets per 
node; 6 cores per socket. OpenMPI v 1.8.1 is installed. I want to run 15 jobs. 
Each job requires 16 MPI processes.  For each job, I want to use two cores on 
each node, mapping by socket. If I use these options:

#PBS -l nodes=8:ppn=2
mpirun --report-bindings --bind-to core --map-by socket:PE=1 -np 16


The reported bindings are:

[burn001:09186] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
[B/././././.][./././././.] [burn001:09186] MCW rank 1 bound to socket
1[core 6[hwt 0]]: [./././././.][B/././././.] [burn004:07113] MCW rank
6 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[burn004:07113] MCW rank 7 bound to socket 1[core 6[hwt 0]]: 
[./././././.][B/././././.] and so on…

These bindings appear to be OK, but when I do a "top -H" on each node, I see 
that all 15 jobs use core 0 and core 6 on each node. This means, I believe, 
that I am only using 1/6 of my reso

Re: [OMPI users] How does binding option affect network traffic?

2014-08-29 Thread Ralph Castain
Should be okay. I suspect you are correct in that something isn't right in
the fabric.



On Fri, Aug 29, 2014 at 1:06 PM, McGrattan, Kevin B. Dr. <
kevin.mcgrat...@nist.gov> wrote:

>  I am able to run all 15 of my jobs simultaneously; 16 MPI processes per
> job; mapping by socket and binding to socket. On a given socket, 6 MPI
> processes from 6 separate mpiruns share the 6 cores, or at least I assume
> they are sharing. The load for all CPUs and all processes is 100%. I
> understand that I am loading the system to its limits, but is what I am
> doing OK? My jobs are running, and the only problem seems to be that some
> jobs are hanging at random times. This is a new cluster I am shaking down,
> and I am guessing that the message passing traffic is causing packet
> losses. I am working with the vendor to sort this out, but I am curious
> whether or not I am using OpenMPI appropriately.
>
>
>
> #PBS -l nodes=8:ppn=2
> mpirun --report-bindings --bind-to socket --map-by socket:PE=1 -np 16 
>  file name>
>
> The bindings are:
>
> [burn004:07244] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core
> 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket
> 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]:   [B/B/B/B/B/B][./././././.]
>
> [burn004:07244] MCW rank 1 bound to socket 1[core 6[hwt 0]], socket
> 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket
> 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
>
> [burn008:07256] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core
> 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core
> 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
>
> [burn008:07256] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core
> 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core
> 4[hwt 0]], socket 0[core 5[hwt 0]]:   [B/B/B/B/B/B][./././././.] and so on.
>
>
>
> *From:* users [mailto:users-boun...@open-mpi.org] *On Behalf Of *Ralph
> Castain
> *Sent:* Friday, August 29, 2014 3:26 PM
> *To:* Open MPI Users
>
> *Subject:* Re: [OMPI users] How does binding option affect network
> traffic?
>
>
>
>
>
> On Aug 29, 2014, at 10:51 AM, McGrattan, Kevin B. Dr. <
> kevin.mcgrat...@nist.gov> wrote:
>
>
>
>  Thanks for the tip. I understand how using the --cpuset option would
> help me in the example I described. However, suppose I have multiple users
> submitting MPI jobs of various sizes? I wouldn't know a priori which cores
> were in use and which weren't. I always assumed that this is what these
> various schedulers did. Is there a way to map-by socket but not allow a
> single core to be used by more than one process. At first glance, I thought
> that --map-by socket and --bind-to core would do this. Would one of these
> "NOOVERSUBSCRIBE" options help?
>
>
>
> I'm afraid not - the issue here is that the mpirun's don't know about each
> other. What you'd need to do is have your scheduler assign cores for our
> use - we'll pick that up and stay inside that envelope. The exact scheduler
> command depends on the scheduler, of course, but the scheduler would then
> have the more global picture and could keep things separated.
>
>
>
>
> Also, in my test case, I have exactly the right amount of cores (240) to
> run 15 jobs using 16 MPI processes. I am shaking down a new cluster we just
> bought. This is an extreme case, but not atypical of the way we use our
> clusters.
>
>
>
> Well, you do, but not exactly the way you showed you were trying to use
> this. If you try to run as you described, with 2ppn for each mpirun and 12
> cores/node, you can run a maximum of 6 mpirun's at a time across a given
> set of nodes. So you'd need to stage your allocations correctly to make it
> work.
>
>
>
>
>
>
>
>
> --
>
> Date: Thu, 28 Aug 2014 13:27:12 -0700
> From: Ralph Castain 
> To: Open MPI Users 
> Subject: Re: [OMPI users] How does binding option affect network
> traffic?
> Message-ID: <637caef5-bbb3-46c2-9387-decdf8cbd...@open-mpi.org>
> Content-Type: text/plain; charset="windows-1252"
>
>
> On Aug 28, 2014, at 11:50 AM, McGrattan, Kevin B. Dr. <
> kevin.mcgrat...@nist.gov> wrote:
>
>
>  My institute recently purchased a linux cluster with 20 nodes; 2 sockets
> per node; 6 cores per socket. OpenMPI v 1.8.1 is installed. I want to run
> 15 jobs. Each job requires 16 MPI processes.  For each job, I want to use
> two cores on each node, mapping by socket. If I use these options:
>
> #PBS -l nodes=8:ppn=2
> mpirun --report-bindings --bind-to core --map-by socket:PE=1 -np 16
> 
>
> The reported bindings are:
>
> [burn001:09186] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
> [B/././././.][./././././.] [burn001:09186] MCW rank 1 bound to socket
> 1[core 6[hwt 0]]: [./././././.][B/././././.] [burn004:07113] MCW rank
> 6 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
> [burn004:07113] MCW rank 7 bound to socket 1[core 6[hwt 0]

Re: [OMPI users] How does binding option affect network traffic?

2014-08-29 Thread tmishima
Hi,

Your cluster is very similar to ours, where Torque and OpenMPI
are installed.

I would use this cmd line:

#PBS -l nodes=2:ppn=12
mpirun --report-bindings -np 16 

Here --map-by socket:pe=1 and --bind-to core are assumed as the default settings.
Then you can run 10 jobs independently and simultaneously because you
have 20 nodes in total.

While each node in your cluster has 12 cores, only 8 processes run
on each node, which means 66.7% utilization, not 100%.
I think this loss cannot be avoided as long as you use 16*N MPI processes per job;
it is a mismatch with your cluster, which has 12 cores per node.
If you can use 12*N MPI processes per job, that is the most efficient.
Is there any reason why you use 16*N MPI processes per job?

Tetsuya