Re: [OMPI users] Openmpi 1.10.x, mpirun and Slurm 15.08 problem

2016-09-23 Thread marcin.krotkiewski

Thanks for a quick answer, Ralph!

This does not work, because em4 is only defined on the frontend node. 
Now I get errors from the compute nodes:


[compute-1-4.local:12206] found interface lo
[compute-1-4.local:12206] found interface em1
[compute-1-4.local:12206] mca: base: components_open: component 
posix_ipv4 open function successful
[compute-1-4.local:12206] mca: base: components_open: found loaded 
component linux_ipv6
[compute-1-4.local:12206] mca: base: components_open: component 
linux_ipv6 open function successful

--------------------------------------------------------------------------
None of the TCP networks specified to be included for out-of-band
communications could be found:

  Value given: em4

Please revise the specification and try again.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No network interfaces were found for out-of-band communications. We require
at least one available network for out-of-band messaging.
--------------------------------------------------------------------------

But since only the front-end node has a different network config, the 
problem only exists when I run interactive sessions using salloc. If I 
use sbatch to submit the jobs, they are executed correctly. uff.
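
(For the record: since the *_if_include parameters also accept subnets in 
CIDR notation, selecting the interfaces by address range rather than by name 
might sidestep the frontend/compute naming mismatch altogether - untested 
here, and the subnet below is just a placeholder:

mpirun --mca oob_tcp_if_include 10.10.0.0/16 --mca btl_tcp_if_include 10.10.0.0/16 ...
)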


Thanks for your help, now I can make my way through it!

Marcin



On 09/23/2016 04:45 PM, r...@open-mpi.org wrote:

This isn’t an issue with the SLURM integration - this is the problem of our OOB 
not correctly picking the right subnet for connecting back to mpirun. In this 
specific case, you probably want

-mca btl_tcp_if_include em4 -mca oob_tcp_if_include em4

since it is the em4 network that ties the compute nodes together, and the 
compute nodes to the frontend

We are working on the subnet selection logic, but the 1.10 series seems to have 
not been updated with those changes


On Sep 23, 2016, at 6:00 AM, Marcin Krotkiewski  
wrote:

Hi,

I have stumbled upon a similar issue, so I wonder if those might be related. On 
one of our systems I get the following error message, both when using openmpi 
1.8.8 and 1.10.4:

$ mpirun -debug-daemons --mca btl tcp,self --mca mca_base_verbose 100 --mca 
btl_base_verbose 100 ls

[...]
[compute-1-1.local:07302] mca: base: close: unloading component direct
[compute-1-1.local:07302] mca: base: close: unloading component radix
[compute-1-1.local:07302] mca: base: close: unloading component debruijn
[compute-1-1.local:07302] orte_routed_base_select: initializing selected 
component binomial
[compute-1-2.local:13744] [[63041,0],2]: parent 0 num_children 0
Daemon [[63041,0],2] checking in as pid 13744 on host c1-2
[compute-1-2.local:13744] [[63041,0],2] orted: up and running - waiting for 
commands!
[compute-1-2.local:13744] [[63041,0],2] tcp_peer_send_blocking: send() to 
socket 9 failed: Broken pipe (32)
[compute-1-2.local:13744] mca: base: close: unloading component binomial
[compute-1-1.local:07302] [[63041,0],1]: parent 0 num_children 0
Daemon [[63041,0],1] checking in as pid 7302 on host c1-1
[compute-1-1.local:07302] [[63041,0],1] orted: up and running - waiting for 
commands!
[compute-1-1.local:07302] [[63041,0],1] tcp_peer_send_blocking: send() to 
socket 9 failed: Broken pipe (32)
[compute-1-1.local:07302] mca: base: close: unloading component binomial
srun: error: c1-1: task 0: Exited with exit code 1
srun: Terminating job step 4538.1
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: c1-2: task 1: Exited with exit code 1


I have also tested version 2.0.1 - this one works without problems.

In my case the problem appears on one system with slurm versions 15.08.8 and 
15.08.12. On another system running 15.08.8 all is working fine, so I guess it 
is not about SLURM version, but maybe system / network configuration?

Following that thought I have also noticed this thread:

http://users.open-mpi.narkive.com/PwJpWXLm/ompi-users-tcp-peer-send-blocking-send-to-socket-9-failed-broken-pipe-32-on-openvz-containers

As Jeff suggested there, I tried to run with --mca btl_tcp_if_include em1 --mca 
oob_tcp_if_include em1, but got the same error.

Could these problems be related to interface naming / lack of InfiniBand? Or to 
the fact that the front-end node, from which I execute mpirun, has a different 
network configuration? The system on which things don't work only has TCP 
network interfaces:

em1, lo (frontend has em1, em4 - local compute network, and lo)

while the cluster on which openmpi does work uses InfiniBand and has the 
following TCP interfaces:

eth0, eth1, ib0, lo

I would appreciate any hints..

Thanks!

Marcin


On 04/01/2016 04:16 PM, Jeff Squyres (jsquyres) wrote:

Ralph --

What's the state of PMI integration with SLURM in the v1.10.x series?  (I 
haven't kept up with SLURM's recent releases to know if something broke between 
existing Open MPI releases and their new releases...?)





[OMPI users] Performance issues: 1.10.x vs 2.x

2017-05-04 Thread marcin.krotkiewski

Hi, everyone,

I ran some bandwidth tests on two different systems with Mellanox IB 
(FDR and EDR). I compiled the three supported versions of openmpi 
(1.10.6, 2.0.2, 2.1.0) and measured the time it takes to send/receive 
4MB arrays of doubles between two hosts connected to the same IB switch. 
MPI_Send/MPI_Recv were performed 1000 times, and the table below gives 
the average bandwidth obtained [MB/s]:


OpenMPI    FDR       EDR
1.10.6     6203.0    11271.1
2.0.2      5128.4    11948.0
2.1.0      5095.1    11947.2

The openib btl was used to transfer the data. The results are puzzling: it 
seems that something changed starting from version 2.x, and the FDR 
system performs much worse than with the earlier 1.10.x release. On the 
EDR system I see the opposite (v2.x is better), but the difference is 
not as dramatic.
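
(For reference, the measurement loop was essentially of this form - a 
stripped-down sketch, not the exact benchmark code:)

#include <mpi.h>
#include <stdio.h>

#define N    (4*1024*1024/sizeof(double))   /* 4 MB of doubles */
#define REPS 1000

int main(int argc, char **argv)
{
    static double buf[N];
    int rank;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0)
            MPI_Send(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("average bandwidth: %.1f MB/s\n",
               REPS * N * sizeof(double) / (t1 - t0) / 1e6);
    MPI_Finalize();
    return 0;
}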


Did anyone experience similar behavior? Is this due to OpenMPI, or 
something else? The two systems run CentOS (FDR: 6.8, EDR: 7.3) and 
Mellanox OFED with a minor version difference.


I'd appreciate any thoughts.

Thanks a lot!

Marcin Krotkiewski




Re: [OMPI users] Performance issues: 1.10.x vs 2.x

2017-05-05 Thread marcin.krotkiewski
Thanks, Paul. That was useful, although in my case it was enough to 
allocate my own arrays using posix_memalign. The internals of OpenMPI 
did not play any role, which I guess is quite natural assuming OpenMPI 
doesn't reallocate.
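
(Concretely, something along these lines - 64-byte alignment chosen 
arbitrarily here:)

#include <stdlib.h>

/* allocate n doubles on a 64-byte (cache-line) boundary; free() as usual */
static double *alloc_aligned(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, 64, n * sizeof(double)) != 0)
        return NULL;
    return (double *)p;
}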


But since that worked, it means that 1.10.6 somehow deals better with 
unaligned data. Does anyone know the reason for this?


Marcin


On 05/04/2017 04:29 PM, Paul Kapinos wrote:

Note that 2.x lost the memory hooks, cf. the thread
https://www.mail-archive.com/devel@lists.open-mpi.org/msg00039.html

The numbers you have look like the 20% loss we have also seen with 4.x 
vs. 1.10.x versions. Try the dirty hook with 'memalign': LD_PRELOAD this:


$ cat alignmalloc64.c
/* Dirk Schmidl (ds53448b), 01/2012 */
#include <malloc.h>   /* memalign(), size_t */
void* malloc(size_t size){
  return memalign(64, size);
}

$ gcc -c -fPIC alignmalloc64.c
$ gcc -shared -Wl,-soname,$(LIBNAME64) -o $(LIBNAME64) alignmalloc64.o




On 05/04/17 12:27, marcin.krotkiewski wrote:
The results are puzzling: it seems that something changed starting 
from version 2.x, and the FDR system performs much worse than with the 
earlier 1.10.x release.








[OMPI users] Bandwidth efficiency advice

2017-05-26 Thread marcin.krotkiewski

Dear All,

I would appreciate some general advice on how to efficiently implement 
the following scenario.


I am looking into how to send a large amount of data over IB _once_, to 
multiple receivers. The trick is, of course, that while the ping-pong 
benchmark delivers great bandwidth, it does so by re-using the already 
registered memory buffers. Since I need to send the data once, the 
memory registration penalty is not easily avoided. I've been looking 
into the following approaches:


1. have multiple ranks send different parts of the data to different 
receivers, in the hope that the memory registration cost will be hidden
2. pre-register two smaller buffers, into which the data is copied before 
sending


The first approach is the best I've managed so far, but the bandwidth 
reached is still lower than what I observe using the pingpong benchmark. 
Also, the performance depends on the number of sending ranks and drops 
if there are too many.


In the second approach one pays for a data copy. My thinking was that 
since the effective memory bandwidth available on a single modern CPU is 
larger than the IB bandwidth, I could squeeze out some performance by 
combining double buffering and multithreading, e.g.,


Step 1. thread A sends the data in the current buffer. Behind the 
scenes, thread B copies data from memory to the next buffer

Step 2. buffers are switched

A similar idea would be to use MPI_Get on the remote rank. The sender 
would copy the data from the memory to the second buffer while the RMA 
window with the first buffer is exposed. In theory, I would expect those 
two operations to be executed simultaneously, with the memory copy 
hopefully hidden behind the IB transfer.
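
(A bare sketch of the double-buffering idea from steps 1-2, with illustrative 
names and chunk size - not my actual code; the receiver would post matching 
MPI_Recv calls, and the copy only overlaps the transfer if the library makes 
asynchronous progress:)

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK (4UL << 20)   /* 4 MB staging buffers */

/* send nbytes of data to rank dst through two reused staging buffers */
void send_double_buffered(const char *data, size_t nbytes, int dst, MPI_Comm comm)
{
    char *buf[2] = { malloc(CHUNK), malloc(CHUNK) };
    MPI_Request req = MPI_REQUEST_NULL;
    size_t off = 0;
    int cur = 0;

    while (off < nbytes) {
        size_t len = nbytes - off < CHUNK ? nbytes - off : CHUNK;
        memcpy(buf[cur], data + off, len);  /* fill the idle buffer while the
                                               other one may still be sending */
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* previous send (other buffer) done */
        MPI_Isend(buf[cur], (int)len, MPI_BYTE, dst, 0, comm, &req);
        off += len;
        cur = 1 - cur;                      /* switch buffers */
    }
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    free(buf[0]);
    free(buf[1]);
}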


Of course, the experiments didn't really work. While the first 
(multi-rank) approach is OK and shows some improvement, the bandwidth 
could still be better. None of my double-buffering approaches worked 
at all, possibly because of memory bandwidth contention.


So I was wondering, has any of you had any experience with similar 
approaches? In your experience, what would be the best approach?


Thanks a lot!

Marcin



[OMPI users] Wrong distance calculations in multi-rail setup?

2015-08-28 Thread marcin.krotkiewski
I have a 4-socket machine with two dual-port InfiniBand cards (devices 
mlx4_0 and mlx4_1). The cards are connected to PCI slots of different 
CPUs (I hope..), both ports are active on both cards, and everything is 
connected to the same physical network.


I use openmpi-1.10.0 and run the IMB-MPI1 benchmark with 4 MPI ranks 
bound to the 4 sockets, hoping to use both IB cards (and both ports):


mpirun --map-by socket --bind-to core -np 4 --mca btl openib,self 
--mca btl_openib_if_include mlx4_0,mlx4_1 ./IMB-MPI1 SendRecv


but OpenMPI refuses to use the mlx4_1 device

[node1.local:28265] [rank=0] openib: skipping device mlx4_1; it is 
too far away

[ the same for other ranks ]

This is confusing, since I have read that OpenMPI automatically uses the 
closer HCA, so at least one rank should choose mlx4_1. I bind by 
socket; here is the reported map:


[node1.local:28263] MCW rank 2 bound to socket 2[core 24[hwt 0]]: 
[./././././././././././.][./././././././././././.][B/././././././././././.][./././././././././././.]
[node1.local:28263] MCW rank 3 bound to socket 3[core 36[hwt 0]]: 
[./././././././././././.][./././././././././././.][./././././././././././.][B/././././././././././.]
[node1.local:28263] MCW rank 0 bound to socket 0[core  0[hwt 0]]: 
[B/././././././././././.][./././././././././././.][./././././././././././.][./././././././././././.]
[node1.local:28263] MCW rank 1 bound to socket 1[core 12[hwt 0]]: 
[./././././././././././.][B/././././././././././.][./././././././././././.][./././././././././././.]


To check what's going on I have modified btl_openib_component.c to print 
the computed distances.


opal_output_verbose(1, ompi_btl_base_framework.framework_output,
                    "[rank=%d] openib: device %d/%d distance %lf",
                    ORTE_PROC_MY_NAME->vpid,
                    (int)i, (int)num_devs,
                    (double)dev_sorted[i].distance);

Here is what I get:

[node1.local:28265] [rank=0] openib: device 0/2 distance 0.00
[node1.local:28266] [rank=1] openib: device 0/2 distance 0.00
[node1.local:28267] [rank=2] openib: device 0/2 distance 0.00
[node1.local:28268] [rank=3] openib: device 0/2 distance 0.00
[node1.local:28265] [rank=0] openib: device 1/2 distance 2.10
[node1.local:28266] [rank=1] openib: device 1/2 distance 1.00
[node1.local:28267] [rank=2] openib: device 1/2 distance 2.10
[node1.local:28268] [rank=3] openib: device 1/2 distance 2.10

So the computed distance to mlx4_0 is 0 on all ranks. I believe this 
should not be so: the distance should be smaller for one rank and larger 
for the three others, as is the case for mlx4_1. Looks like a bug?


Another question: in my configuration two ranks will have a 'closer' 
IB card, but the other two will not. Since their (correctly computed) 
distances to both devices will likely be equal, which device will they 
choose if the selection is automatic? I'd rather they didn't both choose 
mlx4_0. I guess it would be nice if I could specify by hand the 
device/port to be used by a given MPI rank. Is this (going to be) 
possible with OpenMPI?


Thanks a lot,

Marcin



[OMPI users] runtime MCA parameters

2015-09-15 Thread marcin.krotkiewski
I was wondering whether it is possible, or being considered to make it 
possible, for individual ranks to change the various MCA parameters at 
runtime, in addition to the command line?


I tried to google a bit, but did not find any indication that such a 
topic has even been discussed. It would be a very useful thing, 
especially in multi-threaded applications using MPI_THREAD_MULTIPLE, but 
I can come up with plenty of uses in the usual single-threaded setups as 
well.


Marcin


Re: [OMPI users] runtime MCA parameters

2015-09-16 Thread marcin.krotkiewski

Thanks a lot, that looks right! Looks like some reading to do..

Do you know if in the OpenMPI implementation the MPI_T-interfaced MCA 
settings are thread-local, or rank-local?
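
(If I read the MPI_T chapter right, the kind of pre-MPI_Init tweak Nathan 
describes below would look roughly like this - my own untested sketch, and 
the variable name is just an example:)

#include <mpi.h>
#include <string.h>

int main(int argc, char **argv)
{
    int provided, ncvars, i;

    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_T_cvar_get_num(&ncvars);

    /* find the control variable by name and overwrite its value */
    for (i = 0; i < ncvars; i++) {
        char name[256];
        int namelen = sizeof(name), verbosity, bind, scope;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;

        MPI_T_cvar_get_info(i, name, &namelen, &verbosity, &dtype,
                            &enumtype, NULL, NULL, &bind, &scope);
        if (strcmp(name, "btl_tcp_if_include") == 0) {  /* example name */
            MPI_T_cvar_handle handle;
            int count;
            MPI_T_cvar_handle_alloc(i, NULL, &handle, &count);
            MPI_T_cvar_write(handle, "em1");
            MPI_T_cvar_handle_free(&handle);
        }
    }

    MPI_Init(&argc, &argv);   /* the value set above is picked up here */
    /* ... */
    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}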


Thanks!

Marcin


On 09/15/2015 07:58 PM, Nathan Hjelm wrote:

You can use MPI_T to set any MCA variable before MPI_Init. At this time
we lock down all MCA variables during MPI_Init. You will need to call
MPI_T_init_thread before MPI_Init and make sure to call MPI_T_finalize
any time after you are finished setting MCA variables. For more
information see MPI-3.1 chapter 14.

-Nathan

On Tue, Sep 15, 2015 at 07:40:56PM +0200, marcin.krotkiewski wrote:

I was wondering if it is possible, or considered to make it possible to
change the various MCA parameters by individual ranks during runtime in
addition to the command line?

I tried to google a bit, but did not get any indication that such topic has
even been discussed. It would be a very useful thing, especially in
multi-threaded applications when using MPI_THREAD_MULTIPLE, but I could come
up with plenty uses in usual single-threaded ranks setups.

Marcin






[OMPI users] bug in MPI_Comm_accept?

2015-09-16 Thread marcin.krotkiewski
I have run into a freeze / potential bug when using MPI_Comm_accept in a 
simple client / server implementation. I have attached the two simplest 
programs I could produce:


 1. mpi-receiver.c opens a port using MPI_Open_port, saves the port 
name to a file


 2. mpi-receiver enters infinite loop and waits for connections using 
MPI_Comm_accept


 3. mpi-sender.c connects to that port using MPI_Comm_connect, sends 
one MPI_UNSIGNED_LONG, calls barrier and disconnects using 
MPI_Comm_disconnect


 4. mpi-receiver reads the MPI_UNSIGNED_LONG, prints it, calls barrier 
and disconnects using MPI_Comm_disconnect and goes to point 2 - infinite 
loop


All works fine, but only exactly 5 times. After that the receiver hangs 
in MPI_Recv, after exit from MPI_Comm_accept. That is 100% repeatable. I 
have tried with Intel MPI - no such problem.


I execute the programs using OpenMPI 1.10 as follows

mpirun -np 1 --mca mpi_leave_pinned 0 ./mpi-receiver


Do you have any clues what could be the reason? Am I doing something wrong, or 
is it some problem with the internal state of OpenMPI?


Thanks a lot!

Marcin

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
  MPI_Info info;
  char port_name[MPI_MAX_PORT_NAME];
  MPI_Comm intercomm;

  MPI_Init(&argc, &argv);
  MPI_Info_create(&info);
  MPI_Open_port(info, port_name);
  printf("port name: %s\n", port_name);

  /* write port name to file */   
  {
FILE *fd;
fd = fopen("port.txt", "w+");
fprintf(fd, "%s", port_name);
fclose(fd);
  }

  /* accept connections */
  while(1){
unsigned long data;

/* accept connection */
MPI_Comm_accept(port_name, info, 0, MPI_COMM_WORLD, &intercomm);

/* receive comm size from the sender */
MPI_Recv(&data, 1, MPI_UNSIGNED_LONG, 0, 1, intercomm, MPI_STATUS_IGNORE);
printf("received data: %lx\n", data);

MPI_Barrier(intercomm);
MPI_Comm_disconnect(&intercomm);
printf("client disconnected\n");   
  }
}
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[])
{
  char port_name[MPI_MAX_PORT_NAME+1];
  MPI_Info info;
  MPI_Comm intercomm;
  unsigned long data = 0x12345678;

  /* initialize MPI */
  MPI_Init(&argc, &argv);
  MPI_Info_create(&info);

  /* connect to receiver ranks - port is a string parameter */
  strcpy(port_name, argv[1]);

  /* connect to server - intercomm is the remote communicator */
  MPI_Comm_connect(port_name, info, 0, MPI_COMM_WORLD, &intercomm);
  printf("** connected\n");

  /* send data */
  MPI_Send(&data, 1, MPI_UNSIGNED_LONG, 0, 1, intercomm);
  MPI_Barrier(intercomm);

  /* disconnect */
  MPI_Comm_disconnect(&intercomm);
  MPI_Finalize();
  printf("** disconnected\n");

  return 0;
}


Re: [OMPI users] bug in MPI_Comm_accept?

2015-09-16 Thread marcin.krotkiewski


I have removed the MPI_Barrier, to no avail. Same thing happens. Adding 
verbosity, before the receiver hangs I get the following message


[node2:03928] mca: bml: Using openib btl to [[12620,1],0] on node node3

So it is somewhere in the openib btl module.

Marcin


On 09/16/2015 04:34 PM, Jalel Chergui wrote:
Right; in any case, Finalize is necessary at the end of the receiver. The 
other issue is the Barrier, which is probably invoked after the sender 
has exited, hence changing the size of the intercommunicator. Can you 
comment out that line in both files?


Jalel

On 16/09/2015 16:22, Marcin Krotkiewski wrote:
But where would I put it? If I put it in the while(1), then 
MPI_Comm_accept cannot be called for the second time. If I put it 
outside of the loop it will never be called.



On 09/16/2015 04:18 PM, Jalel Chergui wrote:

Can you check with an MPI_Finalize in the receiver ?
Jalel

On 16/09/2015 16:06, marcin.krotkiewski wrote:
I have run into a freeze / potential bug when using MPI_Comm_accept 
in a simple client / server implementation. I have attached two 
simplest programs I could produce:


 1. mpi-receiver.c opens a port using MPI_Open_port, saves the port 
name to a file


 2. mpi-receiver enters infinite loop and waits for connections 
using MPI_Comm_accept


 3. mpi-sender.c connects to that port using MPI_Comm_connect, 
sends one MPI_UNSIGNED_LONG, calls barrier and disconnects using 
MPI_Comm_disconnect


 4. mpi-receiver reads the MPI_UNSIGNED_LONG, prints it, calls 
barrier and disconnects using MPI_Comm_disconnect and goes to point 
2 - infinite loop


All works fine, but only exactly 5 times. After that the receiver 
hangs in MPI_Recv, after exit from MPI_Comm_accept. That is 100% 
repeatable. I have tried with Intel MPI - no such problem.


I execute the programs using OpenMPI 1.10 as follows

mpirun -np 1 --mca mpi_leave_pinned 0 ./mpi-receiver


Do you have any clues what could be the reason? Am I doing sth 
wrong, or is it some problem with internal state of OpenMPI?


Thanks a lot!

Marcin





--
**
  Jalel CHERGUI, LIMSI-CNRS, Bât. 508 - BP 133, 91403 Orsay cedex, FRANCE
  Tél: (33 1) 69 85 81 27 ; Télécopie: (33 1) 69 85 80 88
  Mél:jalel.cher...@limsi.fr  ; Référence:http://perso.limsi.fr/chergui
**








--
**
  Jalel CHERGUI, LIMSI-CNRS, Bât. 508 - BP 133, 91403 Orsay cedex, FRANCE
  Tél: (33 1) 69 85 81 27 ; Télécopie: (33 1) 69 85 80 88
  Mél:jalel.cher...@limsi.fr  ; Référence:http://perso.limsi.fr/chergui
**






Re: [OMPI users] bug in MPI_Comm_accept? (UNCLASSIFIED)

2015-09-16 Thread marcin.krotkiewski

Thank you all for your replies.

I have now tested the code with various setups and versions. First of 
all, the tcp btl seems to work fine (I had the patience to check ~10 runs); 
openib is the problem. I have also compiled using the Intel compiler, 
and the story is the same as when using gcc.


I have then tested many openmpi versions from 1.7.5 to 1.10.0 using 
bisection ;) Versions up to and including 1.8.3 worked fine (at least 
more than 5 times, around 10), so the problem was likely introduced in 
version 1.8.4. Actually, version 1.8.4 was the only one to spit out an 
interesting warning on the receiver side at the moment it hung:


[warn] opal_libevent2021_event_base_loop: reentrant invocation. Only one 
event_base_loop can run on each event_base at once.


which may or may not be of importance in this particular case ;)

So to summarize, the problem appeared in the openib btl in version 1.8.4.

Does anybody have any more ideas?

Thanks!

Marcin



On 09/16/2015 05:59 PM, Burns, Andrew J CTR USARMY RDECOM ARL (US) wrote:

CLASSIFICATION: UNCLASSIFIED

Have you attempted using 2 cores per process? I have noticed that 
MPI_Comm_accept sometimes behaves strangely on single core variations.

I have a program that makes use of Comm_accept/connect and I also call 
MPI_Comm_merge. So, you may want to look into that call as well.

-Andrew Burns

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jalel Chergui
Sent: Wednesday, September 16, 2015 11:49 AM
To: us...@open-mpi.org
Subject: Re: [OMPI users] bug in MPI_Comm_accept?

With openmpi-1.7.5, the sender segfaults.

Sorry, I cannot see the problem in the codes. Perhaps people out there may help.

Jalel


On 16/09/2015 16:40, marcin.krotkiewski wrote:

I have removed the MPI_Barrier, to no avail. Same thing happens. Adding 
verbosity, before the receiver hangs I get the following message

[node2:03928] mca: bml: Using openib btl to [[12620,1],0] on node node3

So It is somewhere in the openib btl module

Marcin


On 09/16/2015 04:34 PM, Jalel Chergui wrote:
Right, anyway Finalize is necessary at the end of the receiver. The other issue 
is Barrier which is invoked probably when the sender has exited hence changing 
the size of intercom. Can you comment that line in both files ?

Jalel

On 16/09/2015 16:22, Marcin Krotkiewski wrote:
But where would I put it? If I put it in the while(1), then MPI_Comm_Accept 
cannot be called for the second time. If I put it outside of the loop it will 
never be called.


On 09/16/2015 04:18 PM, Jalel Chergui wrote:
Can you check with an MPI_Finalize in the receiver ?
Jalel

On 16/09/2015 16:06, marcin.krotkiewski wrote:
I have run into a freeze / potential bug when using MPI_Comm_accept in a simple 
client / server implementation. I have attached two simplest programs I could 
produce:

  1. mpi-receiver.c opens a port using MPI_Open_port, saves the port name to a 
file

  2. mpi-receiver enters infinite loop and waits for connections using 
MPI_Comm_accept

  3. mpi-sender.c connects to that port using MPI_Comm_connect, sends one 
MPI_UNSIGNED_LONG, calls barrier and disconnects using MPI_Comm_disconnect

  4. mpi-receiver reads the MPI_UNSIGNED_LONG, prints it, calls barrier and 
disconnects using MPI_Comm_disconnect and goes to point 2 - infinite loop

All works fine, but only exactly 5 times. After that the receiver hangs in 
MPI_Recv, after exit from MPI_Comm_accept. That is 100% repeatable. I have 
tried with Intel MPI - no such problem.

I execute the programs using OpenMPI 1.10 as follows

mpirun -np 1 --mca mpi_leave_pinned 0 ./mpi-receiver


Do you have any clues what could be the reason? Am I doing sth wrong, or is it 
some problem with internal state of OpenMPI?

Thanks a lot!

Marcin






--
**
  Jalel CHERGUI, LIMSI-CNRS, Bât. 508 - BP 133, 91403 Orsay cedex, FRANCE
  Tél: (33 1) 69 85 81 27 ; Télécopie: (33 1) 69 85 80 88
  Mél: jalel.cher...@limsi.fr ; Référence: http://perso.limsi.fr/chergui
**





[OMPI users] Using POSIX shared memory as send buffer

2015-09-27 Thread marcin.krotkiewski

Hello, everyone

I am struggling a bit with IB performance when sending data from a POSIX 
shared memory region (/dev/shm). The memory is shared among many MPI 
processes within the same compute node. Essentially, I see somewhat 
erratic performance, but it seems that my code is roughly twice as slow 
as when using a usual, malloc'ed send buffer.


I was wondering - has any of you had experience with sending shared 
memory buffers over InfiniBand? Why would I see such worse results? Is 
it, e.g., because this memory cannot be pinned and OpenMPI is 
reallocating it? Or is it some OS peculiarity?


I would appreciate any hints at all. Thanks a lot !

Marcin



Re: [OMPI users] Using POSIX shared memory as send buffer

2015-09-29 Thread marcin.krotkiewski


I've now run a few more tests and I think I can reasonably confidently 
say that the read-only mmap is the problem. Let me know if you have a 
possible fix - I will gladly test it.
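
(Side note: if I understand Nathan's point below about registering for both 
read and write access, at the verbs level it boils down to something like 
this - my own sketch, not Open MPI's actual code:)

#include <infiniband/verbs.h>
#include <stddef.h>

/* register a send buffer roughly the way a btl might */
static struct ibv_mr *reg_send_buf(struct ibv_pd *pd, void *buf, size_t len)
{
    /* IBV_ACCESS_LOCAL_WRITE needs write permission on the pages, so a
       PROT_READ-only mapping cannot be registered like this and a slower
       (copy-based) path has to be taken instead */
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);
}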


Marcin


On 09/29/2015 04:59 PM, Nathan Hjelm wrote:

We register the memory with the NIC for both read and write access. This
may be the source of the slowdown. We recently added internal support to
allow the point-to-point layer to specify the access flags but the
openib btl does not yet make use of the new support. I plan to make the
necessary changes before the 2.0.0 release. I should have them complete
later this week. I can send you a note when they are ready if you would
like to try it and see if it addresses the problem.

-Nathan

On Tue, Sep 29, 2015 at 10:51:38AM +0200, Marcin Krotkiewski wrote:

Thanks, Dave.

I have verified the memory locality and IB card locality, all's fine.

Quite accidentally I have found that there is a huge penalty if I mmap the
shm with PROT_READ only. Using PROT_READ | PROT_WRITE yields good results,
although I must look at this further. I'll report when I am certain, in case
sb finds this useful.

Is this an OS feature, or is OpenMPI somehow working differently? I don't
suspect you guys write to the send buffer, right? Even if you did, there
would be a segfault. So I guess it could be the OS preventing any writes to
the pointer that introduced the overhead?

Marcin



On 09/28/2015 09:44 PM, Dave Goodell (dgoodell) wrote:

On Sep 27, 2015, at 1:38 PM, marcin.krotkiewski  
wrote:

Hello, everyone

I am struggling a bit with IB performance when sending data from a POSIX shared 
memory region (/dev/shm). The memory is shared among many MPI processes within 
the same compute node. Essentially, I see a bit hectic performance, but it 
seems that my code it is roughly twice slower than when using a usual, malloced 
send buffer.

It may have to do with NUMA effects and the way you're allocating/touching your shared 
memory vs. your private (malloced) memory.  If you have a multi-NUMA-domain system (i.e., 
any 2+ socket server, and even some single-socket servers) then you are likely to run 
into this sort of issue.  The PCI bus on which your IB HCA communicates is almost 
certainly closer to one NUMA domain than the others, and performance will usually be 
worse if you are sending/receiving from/to a "remote" NUMA domain.

"lstopo" and other tools can sometimes help you get a handle on the situation, though I don't 
know if it knows how to show memory affinity.  I think you can find memory affinity for a process via 
"/proc//numa_maps".  There's lots of info about NUMA affinity here: 
https://queue.acm.org/detail.cfm?id=2513149

-Dave








Re: [OMPI users] libfabric/usnic does not compile in 2.x

2015-09-30 Thread marcin.krotkiewski

Thank you, and Jeff, for clarification.

Before I bother you all more without the need, I should probably say I 
was hoping to use libfabric/OpenMPI on an InfiniBand cluster. Somehow 
now I feel I have confused this altogether, so maybe I should go one 
step back:


 1. libfabric is hardware independent, and does support InfiniBand, right?
 2. I read that OpenMPI provides an interface to libfabric through 
btl/usnic and mtl/ofi. Can any of those use libfabric on InfiniBand 
networks?


Please forgive my ignorance, the amount of different options is rather 
overwhelming..


Marcin



On 09/30/2015 04:26 PM, Howard Pritchard wrote:


Hello Marcin

What configure options are you using besides with-libfabric?

Could you post your config.log file tp the list?

It looks like fi_ext_usnic.h is only installed if the usnic libfabric 
provider could be built. When you configured libfabric, what providers 
were listed at the end of the configure run? Maybe attach config.log from 
the libfabric build?


If your cluster has Cisco usNICs you should probably be using 
libfabric/Cisco Open MPI. If you are using Intel Omni-Path you may want 
to try the ofi mtl. It is not selected by default, however.


Howard

--

Sent from my smart phone, so no good typing.

Howard

On Sep 30, 2015 5:35 AM, "Marcin Krotkiewski" 
<marcin.krotkiew...@gmail.com> wrote:


Hi,

I am trying to compile the 2.x branch with libfabric support, but
get this error during configure:

configure:100708: checking rdma/fi_ext_usnic.h presence
configure:100708: gcc -E
-I/cluster/software/VERSIONS/openmpi.gnu.2.x/include

-I/usit/abel/u1/marcink/software/ompi-release-2.x/opal/mca/hwloc/hwloc1110/hwloc/include
conftest.c
conftest.c:688:31: fatal error: rdma/fi_ext_usnic.h: No such file
or directory
[...]
configure:100708: checking for rdma/fi_ext_usnic.h
configure:100708: result: no
configure:101253: checking if MCA component btl:usnic can compile
configure:101255: result: no

Which is correct - the file is not there. I have downloaded fresh
libfabric-1.1.0.tar.bz2 and it does not have this file. Probably
OpenMPI needs some updates?

I am also wondering what is the state of libfabric support in
OpenMPI nowadays. I have seen recent (March) presentation about
it, so it seems to be an actively developed feature. Is this
correct? It seemed from the presentation that there are benefits
to this approach, but is it mature enough in OpenMPI, or it will
yet take some time?

Thanks!

Marcin







Re: [OMPI users] Using POSIX shared memory as send buffer

2015-09-30 Thread marcin.krotkiewski

Hi, Nathan

I have compiled 2.x with your patch. I must say it works _much_ better 
with your changes. I have no idea how you figured that out! A short 
table with my bandwidth calculations (MB/s)


            PROT_READ     PROT_READ | PROT_WRITE
1.10.0      2500          5700
2.x+patch   4800-5200     5700

That is not a very thorough study, but essentially I was getting 
2500MB/s with read-only shm. With your patch it is somewhat shaky (very 
rarely I get 2500 also), but most of the time it is around 5000MB/s.
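
(For reference, the two mappings I am comparing are essentially of this 
form - illustrative code, not the real thing:)

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* map an existing POSIX shm object (name like "/mybuf", lives under
   /dev/shm) either read-only or read-write */
static void *map_shm(const char *name, size_t len, int writable)
{
    int fd = shm_open(name, O_RDWR, 0600);
    int prot = writable ? PROT_READ | PROT_WRITE : PROT_READ;
    void *p = (fd < 0) ? MAP_FAILED : mmap(NULL, len, prot, MAP_SHARED, fd, 0);
    if (fd >= 0) close(fd);          /* the mapping stays valid after close */
    return p == MAP_FAILED ? NULL : p;
}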


It seems that mmapping the memory read-write still yields marginally 
better results. Again, I do not have very solid data to support it - just 
a bunch of runs.


Do you have an idea as to why such performance difference exists?

Thanks a lot!

Marcin


On 09/30/2015 12:37 AM, Nathan Hjelm wrote:

There was a bug in that patch that affected IB systems. Updated patch:

https://github.com/hjelmn/ompi/commit/c53df23c0bcf8d1c531e04d22b96c8c19f9b3fd1.patch

-Nathan

On Tue, Sep 29, 2015 at 03:35:21PM -0600, Nathan Hjelm wrote:

I have a branch with the changes available at:

https://github.com/hjelmn/ompi.git

in the mpool_update branch. If you prefer you can apply this patch to
either a 2.x or a master tarball.

https://github.com/hjelmn/ompi/commit/8839dbfae85ba8f443b2857f9bbefdc36c4ebc1a.patch

Let me know if this resolves the performance issues.

-Nathan

On Tue, Sep 29, 2015 at 09:57:54PM +0200, marcin.krotkiewski wrote:

I've now run a few more tests and I think I can reasonably confidently say
that the read only mmap is a problem. Let me know if you have a possible
fix - I will gladly test it.

Marcin

On 09/29/2015 04:59 PM, Nathan Hjelm wrote:

  We register the memory with the NIC for both read and write access. This
  may be the source of the slowdown. We recently added internal support to
  allow the point-to-point layer to specify the access flags but the
  openib btl does not yet make use of the new support. I plan to make the
  necessary changes before the 2.0.0 release. I should have them complete
  later this week. I can send you a note when they are ready if you would
  like to try it and see if it addresses the problem.

  -Nathan

  On Tue, Sep 29, 2015 at 10:51:38AM +0200, Marcin Krotkiewski wrote:

  Thanks, Dave.

  I have verified the memory locality and IB card locality, all's fine.

  Quite accidentally I have found that there is a huge penalty if I mmap the
  shm with PROT_READ only. Using PROT_READ | PROT_WRITE yields good results,
  although I must look at this further. I'll report when I am certain, in case
  sb finds this useful.

  Is this an OS feature, or is OpenMPI somehow working differently? I don't
  suspect you guys write to the send buffer, right? Even if you would there
  would be a segfault. So I guess this could be OS preventing any writes to
  the pointer that introduced the overhead?

  Marcin



  On 09/28/2015 09:44 PM, Dave Goodell (dgoodell) wrote:

  On Sep 27, 2015, at 1:38 PM, marcin.krotkiewski 
 wrote:

  Hello, everyone

  I am struggling a bit with IB performance when sending data from a POSIX 
shared memory region (/dev/shm). The memory is shared among many MPI processes 
within the same compute node. Essentially, I see a bit hectic performance, but 
it seems that my code it is roughly twice slower than when using a usual, 
malloced send buffer.

  It may have to do with NUMA effects and the way you're allocating/touching your shared 
memory vs. your private (malloced) memory.  If you have a multi-NUMA-domain system (i.e., 
any 2+ socket server, and even some single-socket servers) then you are likely to run 
into this sort of issue.  The PCI bus on which your IB HCA communicates is almost 
certainly closer to one NUMA domain than the others, and performance will usually be 
worse if you are sending/receiving from/to a "remote" NUMA domain.

  "lstopo" and other tools can sometimes help you get a handle on the situation, though I don't 
know if it knows how to show memory affinity.  I think you can find memory affinity for a process via 
"/proc//numa_maps".  There's lots of info about NUMA affinity here: 
https://queue.acm.org/detail.cfm?id=2513149

  -Dave




Re: [OMPI users] libfabric/usnic does not compile in 2.x

2015-09-30 Thread marcin.krotkiewski


Thank you for this clear explanation. I do not have True Scale on 'my' 
machine, so unless Mellanox gets involved - no juice for me.


Makes me wonder. libfabric is marketed as a next-generation solution. 
Clearly it has some reported advantage for Cisco usnic, but since you 
claim no improvement over psm, then I guess it is nothing to look 
forward to, is it?


Anyway, thanks a lot for clearing this up

Marcin


On 09/30/2015 08:13 PM, Howard Pritchard wrote:

Hi Marcin,


2015-09-30 9:19 GMT-06:00 marcin.krotkiewski 
<marcin.krotkiew...@gmail.com>:


Thank you, and Jeff, for clarification.

Before I bother you all more without the need, I should probably
say I was hoping to use libfabric/OpenMPI on an InfiniBand
cluster. Somehow now I feel I have confused this altogether, so
maybe I should go one step back:

 1. libfabric is hardware independent, and does support
Infiniband, right?


The short answer is yes, libfabric is hardware independent (and does 
work on good days on OS X as well as Linux). The longer answer is that 
there has been more or less work on implementing providers (the plugins 
into libfabric that interface to different networks) for different 
networks.

There is a sockets provider. That gets a good amount of attention 
because it's the base reference provider. psm/psm2 providers are 
available. I have used the psm provider some on a TrueScale cluster. It 
doesn't offer better performance than just using psm directly, but it 
does appear to work.


There is an mxm provider, but it was not implemented by Mellanox, and I 
can't get it to compile on my ConnectX-3 system using mxm 1.5.

There is a vanilla verbs provider, but it doesn't support the FI_EP_RDM 
endpoint type, which is used by the non-Cisco libfabric component of 
Open MPI that is available (the ofi mtl).

When you build and install libfabric, there should be an fi_info 
binary installed in $(LIBFABRIC_INSTALL_DIR)/bin.

On my TrueScale cluster the output is:

psm: psm
    version: 0.9
    type: FI_EP_RDM
    protocol: FI_PROTO_PSMX
verbs: IB-0x80fe
    version: 1.0
    type: FI_EP_MSG
    protocol: FI_PROTO_RDMA_CM_IB_RC
sockets: IP
    version: 1.0
    type: FI_EP_MSG
    protocol: FI_PROTO_SOCK_TCP
sockets: IP
    version: 1.0
    type: FI_EP_DGRAM
    protocol: FI_PROTO_SOCK_TCP
sockets: IP
    version: 1.0
    type: FI_EP_RDM
    protocol: FI_PROTO_SOCK_TCP

In order to use the mtl/ofi, at a minimum a provider needs to support 
the FI_EP_RDM type (see above). Note that on the TrueScale cluster the 
verbs provider is built, but it only supports FI_EP_MSG endpoint types, 
so mtl/ofi can't use that.


 2. I read that OpenMPI provides interface to libfabric through
btl/usnic and mtl/ofi.  can any of those use libfabric on
Infiniband networks?


If you have Intel TrueScale or its follow-on, then the answer is yes, 
although the default is for Open MPI to use mtl/psm on that network.



Please forgive my ignorance, the amount of different options is
rather overwhelming..

Marcin



On 09/30/2015 04:26 PM, Howard Pritchard wrote:


Hello Marcin

What configure options are you using besides with-libfabric?

Could you post your config.log file tp the list?

Looks like you only install fi_ext_usnic.h if you could build the
usnic libfab provider.  When you configured libfabric what
providers were listed at the end of configure run? Maybe attach
config.log from the libfabric build ?

If your cluster has cisco usnics you should probably be using
libfabric/cisco openmpi. If you are using intel omnipath you may
want to try the ofi mtl.  Its not selected by default however.

Howard

--

sent from my smart phonr so no good type.

Howard

On Sep 30, 2015 5:35 AM, "Marcin Krotkiewski"
<marcin.krotkiew...@gmail.com> wrote:

Hi,

I am trying to compile the 2.x branch with libfabric support,
but get this error during configure:

configure:100708: checking rdma/fi_ext_usnic.h presence
configure:100708: gcc -E
-I/cluster/software/VERSIONS/openmpi.gnu.2.x/include

-I/usit/abel/u1/marcink/software/ompi-release-2.x/opal/mca/hwloc/hwloc1110/hwloc/include
conftest.c
conftest.c:688:31: fatal error: rdma/fi_ext_usnic.h: No such
file or directory
[...]
configure:100708: checking for rdma/fi_ext_usnic.h
configure:100708: result: no
configure:101253: checking if MCA component btl:usnic can compile
configure:101255: result: no

Which is correct - the file is not there. I have downloaded
fresh libfabric-1.1.0.tar.bz2 and it does not have this file.
Probably OpenMPI needs some updates?

I am also wondering what is the state of libfabric support in
OpenMPI nowadays. I have seen recent (March) presentation
about

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread marcin.krotkiewski

Hi, Ralph,

I submit my slurm job as follows

salloc --ntasks=64 --mem-per-cpu=2G --time=1:0:0

Effectively, the allocated CPU cores are spread among many cluster 
nodes. SLURM uses cgroups to limit the CPU cores available for MPI 
processes running on a given cluster node. Compute nodes are 2-socket, 
8-core E5-2670 systems with HyperThreading on:


node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node distances:
node   0   1
  0:  10  21
  1:  21  10

I run the MPI program with the command

mpirun  --report-bindings --bind-to core -np 64 ./affinity

The program simply runs sched_getaffinity for each process and prints 
out the result.
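
(The affinity.c attachment is not reproduced here, but it is essentially 
along these lines - a reconstruction, not the exact file:)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    cpu_set_t set;
    char host[256];
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    gethostname(host, sizeof(host));

    CPU_ZERO(&set);
    sched_getaffinity(0, sizeof(set), &set);   /* 0 = calling process */

    printf("rank %d @ %s ", rank, host);
    for (i = 0; i < CPU_SETSIZE; i++)
        if (CPU_ISSET(i, &set))
            printf(" %d,", i);
    printf("\n");

    MPI_Finalize();
    return 0;
}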


---
TEST RUN 1
---
For this particular job the problem is more severe: openmpi fails to run 
at all with error


--
Open MPI tried to bind a new process, but something went wrong.  The
process was killed without launching the target application.  Your job
will now abort.

  Local host:c6-6
  Application name:  ./affinity
  Error message: hwloc_set_cpubind returned "Error" for bitmap "8,24"
  Location:  odls_default_module.c:551
--

This is SLURM environment variables:

SLURM_JOBID=12712225
SLURM_JOB_CPUS_PER_NODE='3(x2),2,1(x3),2(x2),1,3(x3),5,1,4,1,3,2,3,7,1,5,6,1'
SLURM_JOB_ID=12712225
SLURM_JOB_NODELIST='c6-[3,6-8,12,14,17,22-23],c8-[4,7,9,17,20,28],c15-[5,10,18,20,22-24,28],c16-11'
SLURM_JOB_NUM_NODES=24
SLURM_JOB_PARTITION=normal
SLURM_MEM_PER_CPU=2048
SLURM_NNODES=24
SLURM_NODELIST='c6-[3,6-8,12,14,17,22-23],c8-[4,7,9,17,20,28],c15-[5,10,18,20,22-24,28],c16-11'
SLURM_NODE_ALIASES='(null)'
SLURM_NPROCS=64
SLURM_NTASKS=64
SLURM_SUBMIT_DIR=/cluster/home/marcink
SLURM_SUBMIT_HOST=login-0-2.local
SLURM_TASKS_PER_NODE='3(x2),2,1(x3),2(x2),1,3(x3),5,1,4,1,3,2,3,7,1,5,6,1'

There are also a lot of warnings like

[compute-6-6.local:20158] MCW rank 4 is not bound (or bound to all 
available processors)



---
TEST RUN 2
---

In another allocation I got a different error

--
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to: CORE
   Node:c6-19
   #processes:  2
   #cpus:   1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--

and the allocation was the following

SLURM_JOBID=12712250
SLURM_JOB_CPUS_PER_NODE='3(x2),2,1,15,1,3,16,2,1,3(x2),2,5,4'
SLURM_JOB_ID=12712250
SLURM_JOB_NODELIST='c6-[3,6-8,12,14,17,19,22-23],c8-[4,7,9,17,28]'
SLURM_JOB_NUM_NODES=15
SLURM_JOB_PARTITION=normal
SLURM_MEM_PER_CPU=2048
SLURM_NNODES=15
SLURM_NODELIST='c6-[3,6-8,12,14,17,19,22-23],c8-[4,7,9,17,28]'
SLURM_NODE_ALIASES='(null)'
SLURM_NPROCS=64
SLURM_NTASKS=64
SLURM_SUBMIT_DIR=/cluster/home/marcink
SLURM_SUBMIT_HOST=login-0-2.local
SLURM_TASKS_PER_NODE='3(x2),2,1,15,1,3,16,2,1,3(x2),2,5,4'


If in this case I run on only 32 cores

mpirun  --report-bindings --bind-to core -np 32 ./affinity

the process starts, but I get the original binding problem:

[compute-6-8.local:31414] MCW rank 8 is not bound (or bound to all 
available processors)


Running with --hetero-nodes yields exactly the same results





Hope the above is useful. The problem with binding under SLURM, with CPU 
cores spread over nodes, seems to be very reproducible. It actually 
happens very often that OpenMPI dies with an error like the above. These 
tests were run with openmpi-1.8.8 and 1.10.0, both giving the same 
results.

One more suggestion. The warning message (MCW rank 8 is not bound...) is 
ONLY displayed when I use --report-bindings. It is never shown if I 
leave out this option, and although the binding is wrong the user is not 
notified. I think it would be better to show this warning in all cases 
where binding fails.


Let me know if you need more information. I can help to debug this - it 
is a rather crucial issue.


Thanks!

Marcin






On 10/02/2015 11:49 PM, Ralph Castain wrote:

Can you please send me the allocation request you made (so I can see what you 
specified on the cmd line), and the mpirun cmd line?

Thanks
Ralph


On Oct 2, 2015, at 8:25 AM, Marcin Krotkiewski  
wrote:

Hi,

I fail to make OpenMPI bind to cores correctly when running from within 
SLURM-allocated CPU resources spread over a range of compute nodes in an 
otherwise homogeneous cluster. I have found this thread

http://www.open-mpi.org/community/lists/users/2014/06/24682.php

and did try to use what Ralph suggested there (--hetero-nodes), but it does not 
work (v. 1.10.0). When running with --report-bindings I get messages like

[compute-9-11.local:27571] MCW rank 10 is not bound (or bound to all available 
processors)

for all ranks outside 

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread marcin.krotkiewski


On 10/03/2015 01:06 PM, Ralph Castain wrote:

Thanks Marcin. Looking at this, I’m guessing that Slurm may be treating HTs as 
“cores” - i.e., as independent cpus. Any chance that is true?
Not to the best of my knowledge, and at least not intentionally. SLURM 
starts as many processes as there are physical cores, not threads. To 
verify this, consider this test case:


SLURM_JOB_CPUS_PER_NODE='6,8(x2),10'
SLURM_JOB_NODELIST='c1-[30-31],c2-[32,34]'

If I now execute only one mpi process WITH NO BINDING, it will go onto 
c1-30 and should have a map with 6 CPUs (12 hw threads). I run


mpirun --bind-to none -np 1 ./affinity
rank 0 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,

I have attached the affinity.c program FYI. Clearly, sched_getaffinity 
in my test code returns the correct map.


Now if I try to start all 32 processes in this example (still no binding):

rank 0 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 1 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 10 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
27, 28, 29, 30, 31,
rank 11 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
27, 28, 29, 30, 31,
rank 12 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
27, 28, 29, 30, 31,
rank 13 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
27, 28, 29, 30, 31,
rank 6 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
27, 28, 29, 30, 31,

rank 2 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 7 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
27, 28, 29, 30, 31,
rank 8 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
27, 28, 29, 30, 31,

rank 3 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 14 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
26, 27, 28, 29, 30,

rank 4 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 15 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
26, 27, 28, 29, 30,
rank 9 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
27, 28, 29, 30, 31,

rank 5 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 16 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
26, 27, 28, 29, 30,
rank 17 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
26, 27, 28, 29, 30,
rank 29 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
18, 19, 20, 21, 22, 23, 30, 31,
rank 30 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
18, 19, 20, 21, 22, 23, 30, 31,
rank 18 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
26, 27, 28, 29, 30,
rank 19 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
26, 27, 28, 29, 30,
rank 31 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
18, 19, 20, 21, 22, 23, 30, 31,
rank 20 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
26, 27, 28, 29, 30,
rank 22 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
18, 19, 20, 21, 22, 23, 30, 31,
rank 21 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
26, 27, 28, 29, 30,
rank 23 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
18, 19, 20, 21, 22, 23, 30, 31,
rank 24 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
18, 19, 20, 21, 22, 23, 30, 31,
rank 25 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
18, 19, 20, 21, 22, 23, 30, 31,
rank 26 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
18, 19, 20, 21, 22, 23, 30, 31,
rank 27 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
18, 19, 20, 21, 22, 23, 30, 31,
rank 28 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
18, 19, 20, 21, 22, 23, 30, 31,



Still looks ok to me. If I now turn the binding on, openmpi fails:


--
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to: CORE
   Node:c1-31
   #processes:  2
   #cpus:   1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--

The above tests were done with 1.10.1rc1, so it does not fix the problem.

Marcin



I’m wondering because bind-to core will attempt to bind your proc to both HTs 
on the core. For some reason, we thought that 8,24 were HTs on the same core, 
which is why we tried to bind to that pair of HTs. We got an error because HT 
#24 was not allocated to us on node c6, but HT #8 was.



On Oct 3, 2015, at 2:43 AM, marcin.krotkiewski  
wrote:

Hi, Ralph,

I submit my slurm job as follows

salloc --ntasks=64 --mem-per-cpu=2G --time=1:0:0

Effectively, the allocated CPU cores are spread amount many cluster nodes. 
SLURM uses cgroups to limit t

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread marcin.krotkiewski


On 10/03/2015 04:38 PM, Ralph Castain wrote:
If mpirun isn’t trying to do any binding, then you will of course get 
the right mapping as we’ll just inherit whatever we received. 
Yes. I meant that whatever you received (what SLURM gives) is a correct 
cpu map and assigns _whole_ CPUs, not a single HT, to MPI processes. In 
the case mentioned earlier openmpi should start 6 tasks on c1-30. If HTs 
were treated as separate and independent cores, sched_getaffinity of 
an MPI process started on c1-30 would return a map with 6 entries only. 
In my case it returns a map with 12 entries - 2 for each core. So one 
process is in fact allocated both HTs, not only one. Is what I'm saying 
correct?


Looking at your output, it’s pretty clear that you are getting 
independent HTs assigned and not full cores. 
How do you mean? Is the above understanding wrong? I would expect that 
on c1-30 with --bind-to core openmpi should bind to logical cores 0 and 
16 (rank 0), 1 and 17 (rank 2) and so on. All those logical cores are 
available in sched_getaffinity map, and there is twice as many logical 
cores as there are MPI processes started on the node.


My guess is that something in slurm has changed such that it detects 
that HT has been enabled, and then begins treating the HTs as 
completely independent cpus.


Try changing “-bind-to core” to “-bind-to hwthread -use-hwthread-cpus” 
and see if that works



I have and the binding is wrong. For example, I got this output

rank 0 @ compute-1-30.local  0,
rank 1 @ compute-1-30.local  16,

Which means that two ranks have been bound to the same physical core 
(logical cores 0 and 16 are two HTs of the same core). If I use 
--bind-to core, I get the following correct binding


rank 0 @ compute-1-30.local  0, 16,

The problem is that many other ranks get a bad binding, with the 'rank 
XXX is not bound (or bound to all available processors)' warning.


But I think I was not entirely correct saying that 1.10.1rc1 did not fix 
things. It still might have improved something, but not everything. 
Consider this job:


SLURM_JOB_CPUS_PER_NODE='5,4,6,5(x2),7,5,9,5,7,6'
SLURM_JOB_NODELIST='c8-[31,34],c9-[30-32,35-36],c10-[31-34]'

If I run 32 tasks as follows (with 1.10.1rc1)

mpirun --hetero-nodes --report-bindings --bind-to core -np 32 ./affinity

I get the following error:

--
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to: CORE
   Node:c9-31
   #processes:  2
   #cpus:   1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--


If I now use --bind-to core:overload-allowed, then openmpi starts and 
_most_ of the threads are bound correctly (i.e., map contains two 
logical cores in ALL cases), except this case that required the overload 
flag:


rank 15 @ compute-9-31.local   1, 17,
rank 16 @ compute-9-31.local  11, 27,
rank 17 @ compute-9-31.local   2, 18,
rank 18 @ compute-9-31.local  12, 28,
rank 19 @ compute-9-31.local   1, 17,

Note pair 1,17 is used twice. The original SLURM delivered map (no 
binding) on this node is


rank 15 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
rank 16 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
rank 17 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
rank 18 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
rank 19 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29,

Why does openmpi use cores (1,17) twice instead of using core (13,29)? 
Clearly, the original SLURM-delivered map has 5 CPUs included, enough 
for 5 MPI processes.


Cheers,

Marcin




On Oct 3, 2015, at 7:12 AM, marcin.krotkiewski 
<marcin.krotkiew...@gmail.com> wrote:



On 10/03/2015 01:06 PM, Ralph Castain wrote:
Thanks Marcin. Looking at this, I’m guessing that Slurm may be 
treating HTs as “cores” - i.e., as independent cpus. Any chance that 
is true?
Not to the best of my knowledge, and at least not intentionally. 
SLURM starts as many processes as there are physical cores, not 
threads. To verify this, consider this test case:


SLURM_JOB_CPUS_PER_NODE='6,8(x2),10'
SLURM_JOB_NODELIST='c1-[30-31],c2-[32,34]'

If I now execute only one mpi process WITH NO BINDING, it will go 
onto c1-30 and should have a map with 6 CPUs (12 hw threads). I run


mpirun --bind-to none -np 1 ./affinity
rank 0 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,

I have attached the affinity.c program FYI. Clearly, 
sched_getaffinity in my test code returns the correct map.


Now if I try to start all 32 processes in this example (still no 
binding):


rank 0 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 1 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread marcin.krotkiewski
Here is the output of lstopo. In short, (0,16) are core 0, (1,17) - core 
1 etc.


Machine (64GB)
  NUMANode L#0 (P#0 32GB)
Socket L#0 + L3 L#0 (20MB)
  L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#16)
  L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
PU L#2 (P#1)
PU L#3 (P#17)
  L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
PU L#4 (P#2)
PU L#5 (P#18)
  L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
PU L#6 (P#3)
PU L#7 (P#19)
  L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
PU L#8 (P#4)
PU L#9 (P#20)
  L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
PU L#10 (P#5)
PU L#11 (P#21)
  L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
PU L#12 (P#6)
PU L#13 (P#22)
  L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
PU L#14 (P#7)
PU L#15 (P#23)
HostBridge L#0
  PCIBridge
PCI 8086:1521
  Net L#0 "eth0"
PCI 8086:1521
  Net L#1 "eth1"
  PCIBridge
PCI 15b3:1003
  Net L#2 "ib0"
  OpenFabrics L#3 "mlx4_0"
  PCIBridge
PCI 102b:0532
  PCI 8086:1d02
Block L#4 "sda"
  NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
  PU L#16 (P#8)
  PU L#17 (P#24)
L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
  PU L#18 (P#9)
  PU L#19 (P#25)
L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
  PU L#20 (P#10)
  PU L#21 (P#26)
L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
  PU L#22 (P#11)
  PU L#23 (P#27)
L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
  PU L#24 (P#12)
  PU L#25 (P#28)
L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
  PU L#26 (P#13)
  PU L#27 (P#29)
L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
  PU L#28 (P#14)
  PU L#29 (P#30)
L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
  PU L#30 (P#15)
  PU L#31 (P#31)
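
(For anyone who wants to check the core/HT pairing on their own nodes 
without reading lstopo output by hand, a small hwloc program along these 
lines - only a sketch, compiled with -lhwloc - prints the OS indices (P#) 
of the hardware threads under each core:)

#include <stdio.h>
#include <hwloc.h>

int main(void)
{
  hwloc_topology_t topo;
  hwloc_topology_init(&topo);
  hwloc_topology_load(topo);

  int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
  for (int c = 0; c < ncores; c++) {
    hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, c);
    printf("Core L#%u:", core->logical_index);
    /* walk the PUs (hardware threads) contained in this core */
    hwloc_obj_t pu = NULL;
    while ((pu = hwloc_get_next_obj_inside_cpuset_by_type(
                     topo, core->cpuset, HWLOC_OBJ_PU, pu)) != NULL)
      printf(" P#%u", pu->os_index);
    printf("\n");
  }

  hwloc_topology_destroy(topo);
  return 0;
}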



On 10/03/2015 05:46 PM, Ralph Castain wrote:
Maybe I’m just misreading your HT map - that slurm nodelist syntax is 
a new one to me, but they tend to change things around. Could you run 
lstopo on one of those compute nodes and send the output?


I’m just suspicious because I’m not seeing a clear pairing of HT 
numbers in your output, but HT numbering is BIOS-specific and I may 
just not be understanding your particular pattern. Our error message 
is clearly indicating that we are seeing individual HTs (and not 
complete cores) assigned, and I don’t know the source of that confusion.



On Oct 3, 2015, at 8:28 AM, marcin.krotkiewski 
mailto:marcin.krotkiew...@gmail.com>> 
wrote:



On 10/03/2015 04:38 PM, Ralph Castain wrote:
If mpirun isn’t trying to do any binding, then you will of course 
get the right mapping as we’ll just inherit whatever we received.
Yes. I meant that whatever you received (what SLURM gives) is a 
correct cpu map and assigns _whole_ CPUs, not single HTs, to MPI 
processes. In the case mentioned earlier openmpi should start 6 tasks 
on c1-30. If HTs were treated as separate and independent cores, 
sched_getaffinity of an MPI process started on c1-30 would return a 
map with 6 entries only. In my case it returns a map with 12 entries 
- 2 for each core. So one process is in fact allocated both HTs, not 
only one. Is what I'm saying correct?


Looking at your output, it’s pretty clear that you are getting 
independent HTs assigned and not full cores.
How do you mean? Is the above understanding wrong? I would expect 
that on c1-30 with --bind-to core openmpi should bind to logical 
cores 0 and 16 (rank 0), 1 and 17 (rank 2) and so on. All those 
logical cores are available in sched_getaffinity map, and there is 
twice as many logical cores as there are MPI processes started on the 
node.


My guess is that something in slurm has changed such that it detects 
that HT has been enabled, and then begins treating the HTs as 
completely independent cpus.


Try changing “-bind-to core” to “-bind-to hwthread 
 -use-hwthread-cpus” and see if that works



I have and the binding is wrong. For example, I got this output

rank 0 @ compute-1-30.local  0,
rank 1 @ compute-1-30.local  16,

Which means that two ranks have been bound to the same physical core 
(logical cores 0 and 16 are two HTs of the same core). If I use 
--bind-to core, I get the following correct binding


rank 0 @ compute-1-30.local  0, 16,

The problem is many other ranks get bad binding with 'rank XXX is not 
bound (or bound to all available processors)' warning.


But I think I was not entirely correct saying that 1.10.1r

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread marcin.krotkiewski


Done. I have compiled 1.10.0 and 1.10.1rc1 with --enable-debug and executed

mpirun --mca rmaps_base_verbose 10 --hetero-nodes --report-bindings 
--bind-to core -np 32 ./affinity


In case of 1.10.1rc1 I have also added :overload-allowed - output in a 
separate file. This option did not make much difference for 1.10.0, so I 
did not attach it here.


First thing I noted for 1.10.0 are lines like

[login-0-1.local:03399] [[37945,0],0] GOT 1 CPUS
[login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] BITMAP
[login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] ON c1-26 IS 
NOT BOUND


with an empty BITMAP.

The SLURM environment is

set | grep SLURM
SLURM_JOBID=12714491
SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5'
SLURM_JOB_ID=12714491
SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]'
SLURM_JOB_NUM_NODES=7
SLURM_JOB_PARTITION=normal
SLURM_MEM_PER_CPU=2048
SLURM_NNODES=7
SLURM_NODELIST='c1-[2,4,8,13,16,23,26]'
SLURM_NODE_ALIASES='(null)'
SLURM_NPROCS=32
SLURM_NTASKS=32
SLURM_SUBMIT_DIR=/cluster/home/marcink
SLURM_SUBMIT_HOST=login-0-1.local
SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5'

I have submitted an interactive job on screen for 120 hours now to work 
with one example, and not change it for every post :)


If you need anything else, let me know. I could introduce some 
patch/printfs and recompile, if you need it.


Marcin



On 10/03/2015 07:17 PM, Ralph Castain wrote:
Rats - just realized I have no way to test this as none of the 
machines I can access are setup for cgroup-based multi-tenant. Is this 
a debug version of OMPI? If not, can you rebuild OMPI with --enable-debug?


Then please run it with --mca rmaps_base_verbose 10 and pass along the 
output.


Thanks
Ralph


On Oct 3, 2015, at 10:09 AM, Ralph Castain <mailto:r...@open-mpi.org>> wrote:


What version of slurm is this? I might try to debug it here. I’m not 
sure where the problem lies just yet.



On Oct 3, 2015, at 8:59 AM, marcin.krotkiewski 
mailto:marcin.krotkiew...@gmail.com>> 
wrote:


Here is the output of lstopo. In short, (0,16) are core 0, (1,17) - 
core 1 etc.


Machine (64GB)
  NUMANode L#0 (P#0 32GB)
Socket L#0 + L3 L#0 (20MB)
  L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#16)
  L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
PU L#2 (P#1)
PU L#3 (P#17)
  L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
PU L#4 (P#2)
PU L#5 (P#18)
  L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
PU L#6 (P#3)
PU L#7 (P#19)
  L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
PU L#8 (P#4)
PU L#9 (P#20)
  L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
PU L#10 (P#5)
PU L#11 (P#21)
  L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
PU L#12 (P#6)
PU L#13 (P#22)
  L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
PU L#14 (P#7)
PU L#15 (P#23)
HostBridge L#0
  PCIBridge
PCI 8086:1521
  Net L#0 "eth0"
PCI 8086:1521
  Net L#1 "eth1"
  PCIBridge
PCI 15b3:1003
  Net L#2 "ib0"
  OpenFabrics L#3 "mlx4_0"
  PCIBridge
PCI 102b:0532
  PCI 8086:1d02
Block L#4 "sda"
  NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
  PU L#16 (P#8)
  PU L#17 (P#24)
L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
  PU L#18 (P#9)
  PU L#19 (P#25)
L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
  PU L#20 (P#10)
  PU L#21 (P#26)
L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
  PU L#22 (P#11)
  PU L#23 (P#27)
L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
  PU L#24 (P#12)
  PU L#25 (P#28)
L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
  PU L#26 (P#13)
  PU L#27 (P#29)
L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
  PU L#28 (P#14)
  PU L#29 (P#30)
L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
  PU L#30 (P#15)
  PU L#31 (P#31)



On 10/03/2015 05:46 PM, Ralph Castain wrote:
Maybe I’m just misreading your HT map - that slurm nodelist syntax 
is a new one to me, but they tend to change things around. Could 
you run lstopo on one of those compute nodes and send the output?


I’m just suspicious because I’m not seeing a clear pairing of HT 
numbers in your output, but HT numbering is BIOS-specific and I may 
just not be understanding your particular pattern. Our error 
message is clearly indicating that we are seeing individual HTs 
(and not complete cores) assigned, and I don’t know the source of 
that confusion.



On Oct 3, 2015, at 8:28 AM, marcin.k

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-04 Thread marcin.krotkiewski

Hi, all,

I played a bit more and it seems that the problem is that

trg_obj = opal_hwloc_base_find_min_bound_target_under_obj()

called in rmaps_base_binding.c / bind_downwards, returns the wrong object. I do not 
know the reason, but I think I know when the problem happens (at least 
on 1.10.1rc1). It seems that by default openmpi maps by socket. The 
error happens when, for a given compute node, a different number 
of cores is used on each socket. Consider the previously studied case (the 
debug outputs I sent in the last post). c1-8, which was the source of the error, has 
5 mpi processes assigned, and the cpuset is the following:


0, 5, 9, 13, 14, 16, 21, 25, 29, 30

Cores 0,5 are on socket 0, cores 9, 13, 14 are on socket 1. Binding 
progresses correctly up to and including core 13 (see end of file 
out.1.10.1rc2, before the error). That is 2 cores on socket 0, and 2 
cores on socket 1. Error is thrown when core 14 should be bound - extra 
core on socket 1 with no corresponding core on socket 0. At that point 
the returned trg_obj points to the first core on the node (os_index 0, 
socket 0).


I have submitted a few other jobs and I always had an error in such a 
situation. Moreover, if I now use --map-by core instead of socket, the 
error is gone, and I get my expected binding:


rank 0 @ compute-1-2.local  1, 17,
rank 1 @ compute-1-2.local  2, 18,
rank 2 @ compute-1-2.local  3, 19,
rank 3 @ compute-1-2.local  4, 20,
rank 4 @ compute-1-4.local  1, 17,
rank 5 @ compute-1-4.local  15, 31,
rank 6 @ compute-1-8.local  0, 16,
rank 7 @ compute-1-8.local  5, 21,
rank 8 @ compute-1-8.local  9, 25,
rank 9 @ compute-1-8.local  13, 29,
rank 10 @ compute-1-8.local  14, 30,
rank 11 @ compute-1-13.local  3, 19,
rank 12 @ compute-1-13.local  4, 20,
rank 13 @ compute-1-13.local  5, 21,
rank 14 @ compute-1-13.local  6, 22,
rank 15 @ compute-1-13.local  7, 23,
rank 16 @ compute-1-16.local  12, 28,
rank 17 @ compute-1-16.local  13, 29,
rank 18 @ compute-1-16.local  14, 30,
rank 19 @ compute-1-16.local  15, 31,
rank 20 @ compute-1-23.local  2, 18,
rank 29 @ compute-1-26.local  11, 27,
rank 21 @ compute-1-23.local  3, 19,
rank 30 @ compute-1-26.local  13, 29,
rank 22 @ compute-1-23.local  4, 20,
rank 31 @ compute-1-26.local  15, 31,
rank 23 @ compute-1-23.local  8, 24,
rank 27 @ compute-1-26.local  1, 17,
rank 24 @ compute-1-23.local  13, 29,
rank 28 @ compute-1-26.local  6, 22,
rank 25 @ compute-1-23.local  14, 30,
rank 26 @ compute-1-23.local  15, 31,

Using --map-by core seems to fix the issue on 1.8.8, 1.10.0 and 
1.10.1rc1. However, there is still a difference in behavior between 
1.10.1rc1 and earlier versions. In the SLURM job described in last post, 
1.10.1rc1 fails to bind only in 1 case, while the earlier versions fail 
in 21 out of 32 cases. You mentioned there was a bug in hwloc. Not sure 
if it can explain the difference in behavior.


Hope this helps to nail this down.

Marcin




On 10/04/2015 09:55 AM, Gilles Gouaillardet wrote:

Ralph,

I suspect ompi tries to bind to threads outside the cpuset.
this could be pretty similar to a previous issue when ompi tried to 
bind to cores outside the cpuset.
/* when a core has more than one thread, would ompi assume all the 
threads are available if the core is available ? */

I will investigate this from tomorrow

Cheers,

Gilles

On Sunday, October 4, 2015, Ralph Castain <mailto:r...@open-mpi.org>> wrote:


Thanks - please go ahead and release that allocation as I’m not
going to get to this immediately. I’ve got several hot irons in
the fire right now, and I’m not sure when I’ll get a chance to
track this down.

Gilles or anyone else who might have time - feel free to take a
gander and see if something pops out at you.

Ralph



On Oct 3, 2015, at 11:05 AM, marcin.krotkiewski
>
wrote:


Done. I have compiled 1.10.0 and 1.10.rc1 with --enable-debug and
executed

mpirun --mca rmaps_base_verbose 10 --hetero-nodes
--report-bindings --bind-to core -np 32 ./affinity

In case of 1.10.rc1 I have also added :overload-allowed - output
in a separate file. This option did not make much difference for
1.10.0, so I did not attach it here.

First thing I noted for 1.10.0 are lines like

[login-0-1.local:03399] [[37945,0],0] GOT 1 CPUS
[login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] BITMAP
[login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] ON
c1-26 IS NOT BOUND

with an empty BITMAP.

The SLURM environment is

set | grep SLURM
SLURM_JOBID=12714491
SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5'
SLURM_JOB_ID=12714491
SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]'
SLURM_JOB_NUM_NODES=7
SLURM_JOB_PARTITION=normal
SLURM_MEM_PER_CPU=2048
SLURM_NNODES=7
SLURM_NODELIST='c1-[2,4,8,13,16,23,26]'
SLURM_NODE_ALIASES='(null)'
SLURM_NPROCS=32
SLURM_NTASKS=32
SLURM_SUBMIT_DIR=/cluste

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-05 Thread marcin.krotkiewski
Gilles Gouaillardet <mailto:gil...@rist.or.jp> wrote:


Marcin,

i ran a simple test with v1.10.1rc1 under a cpuset with
- one core (two threads 0,16) on socket 0
- two cores (two threads each 8,9,24,25) on socket 1

$ mpirun -np 3 -bind-to core ./hello_c
--
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to: CORE
   Node:rapid
   #processes:  2
   #cpus:   1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--

as you already pointed out, the default mapping is by socket.

so on one hand, we can consider this behavior a feature:
we try to bind two processes to socket 0, so the --oversubscribe 
option is required

(and it does what it should :
$ mpirun -np 3 -bind-to core --oversubscribe -report-bindings 
./hello_c
[rapid:16278] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: 
[BB/../../../../../../..][../../../../../../../..]
[rapid:16278] MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: 
[../../../../../../../..][BB/../../../../../../..]
[rapid:16278] MCW rank 2 bound to socket 0[core 0[hwt 0-1]]: 
[BB/../../../../../../..][../../../../../../../..]
Hello, world, I am 1 of 3, (Open MPI v1.10.1rc1, package: Open MPI 
gilles@rapid Distribution, ident: 1.10.1rc1, repo rev: 
v1.10.0-84-g15ae63f, Oct 03, 2015, 128)
Hello, world, I am 2 of 3, (Open MPI v1.10.1rc1, package: Open MPI 
gilles@rapid Distribution, ident: 1.10.1rc1, repo rev: 
v1.10.0-84-g15ae63f, Oct 03, 2015, 128)
Hello, world, I am 0 of 3, (Open MPI v1.10.1rc1, package: Open MPI 
gilles@rapid Distribution, ident: 1.10.1rc1, repo rev: 
v1.10.0-84-g15ae63f, Oct 03, 2015, 128)


and on the other hand, we could consider that ompi should be a bit 
smarter and use socket 1 for task 2, since socket 0 is fully 
allocated and there is room on socket 1.


Ralph, any thoughts ? bug or feature ?


Marcin,

you mentioned you had one failure with 1.10.1rc1 and -bind-to core
could you please send the full details (script, allocation and output)
in your slurm script, you can do
srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none -l grep 
Cpus_allowed_list /proc/self/status

before invoking mpirun

Cheers,

Gilles

On 10/4/2015 11:55 PM, marcin.krotkiewski wrote:

Hi, all,

I played a bit more and it seems that the problem results from

trg_obj = opal_hwloc_base_find_min_bound_target_under_obj()

called in rmaps_base_binding.c / bind_downwards being wrong. I do 
not know the reason, but I think I know when the problem happens 
(at least on 1.10.1rc1). It seems that by default openmpi maps by 
socket. The error happens when for a given compute node there is 
a different number of cores used on each socket. Consider 
previously studied case (the debug outputs I sent in last post). 
c1-8, which was source of error, has 5 mpi processes assigned, 
and the cpuset is the following:


0, 5, 9, 13, 14, 16, 21, 25, 29, 30

Cores 0,5 are on socket 0, cores 9, 13, 14 are on socket 1. 
Binding progresses correctly up to and including core 13 (see end 
of file out.1.10.1rc2, before the error). That is 2 cores on 
socket 0, and 2 cores on socket 1. Error is thrown when core 14 
should be bound - extra core on socket 1 with no corresponding 
core on socket 0. At that point the returned trg_obj points to 
the first core on the node (os_index 0, socket 0).


I have submitted a few other jobs and I always had an error in 
such situation. Moreover, if I now use --map-by core instead of 
socket, the error is gone, and I get my expected binding:


rank 0 @ compute-1-2.local  1, 17,
rank 1 @ compute-1-2.local  2, 18,
rank 2 @ compute-1-2.local  3, 19,
rank 3 @ compute-1-2.local  4, 20,
rank 4 @ compute-1-4.local  1, 17,
rank 5 @ compute-1-4.local  15, 31,
rank 6 @ compute-1-8.local  0, 16,
rank 7 @ compute-1-8.local  5, 21,
rank 8 @ compute-1-8.local  9, 25,
rank 9 @ compute-1-8.local  13, 29,
rank 10 @ compute-1-8.local  14, 30,
rank 11 @ compute-1-13.local  3, 19,
rank 12 @ compute-1-13.local  4, 20,
rank 13 @ compute-1-13.local  5, 21,
rank 14 @ compute-1-13.local  6, 22,
rank 15 @ compute-1-13.local  7, 23,
rank 16 @ compute-1-16.local  12, 28,
rank 17 @ compute-1-16.local  13, 29,
rank 18 @ compute-1-16.local  14, 30,
rank 19 @ compute-1-16.local  15, 31,
rank 20 @ compute-1-23.local  2, 18,
rank 29 @ compute-1-26.local  11, 27,
rank 21 @ compute-1-23.local  3, 19,
rank 30 @ compute-1-26.local  13, 29,
rank 22 @ compute-1-23.local  4, 20,
rank 31 @ compute-1-26.local  15, 31,
rank 23 @ compute-1-23.local  8, 24,
rank 27 @ compute-1-26.local  1, 17,
rank 24 @ compute-1-23.local  13, 29,
rank 28 @ compute-1-26.local  6, 22,
rank 25 @ compute-1-23.local  14, 30,
rank 26 @ compute-1-23.local  15, 31,

Using --map-by core seems to fix the issue on 1.8.8, 1.10.0 and 
1.10.1rc1. However, there is still a difference in behavior 
bet

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-05 Thread marcin.krotkiewski

Hi, Gilles

you mentioned you had one failure with 1.10.1rc1 and -bind-to core
could you please send the full details (script, allocation and output)
in your slurm script, you can do
srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none -l grep 
Cpus_allowed_list /proc/self/status

before invoking mpirun


It was an interactive job allocated with

salloc --account=staff --ntasks=32 --mem-per-cpu=2G --time=120:0:0

The slurm environment is the following

SLURM_JOBID=12714491
SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5'
SLURM_JOB_ID=12714491
SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]'
SLURM_JOB_NUM_NODES=7
SLURM_JOB_PARTITION=normal
SLURM_MEM_PER_CPU=2048
SLURM_NNODES=7
SLURM_NODELIST='c1-[2,4,8,13,16,23,26]'
SLURM_NODE_ALIASES='(null)'
SLURM_NPROCS=32
SLURM_NTASKS=32
SLURM_SUBMIT_DIR=/cluster/home/marcink
SLURM_SUBMIT_HOST=login-0-1.local
SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5'

The output of the command you asked for is

0: c1-2.local  Cpus_allowed_list:1-4,17-20
1: c1-4.local  Cpus_allowed_list:1,15,17,31
2: c1-8.local  Cpus_allowed_list:0,5,9,13-14,16,21,25,29-30
3: c1-13.local  Cpus_allowed_list:   3-7,19-23
4: c1-16.local  Cpus_allowed_list:   12-15,28-31
5: c1-23.local  Cpus_allowed_list:   2-4,8,13-15,18-20,24,29-31
6: c1-26.local  Cpus_allowed_list:   1,6,11,13,15,17,22,27,29,31

Running with command

mpirun --mca rmaps_base_verbose 10 --hetero-nodes --bind-to core 
--report-bindings --map-by socket -np 32 ./affinity


I have attached two output files: one for the original 1.10.1rc1, one 
for the patched version.


When I said 'failed in one case' I was not precise. I got an error on 
node c1-8, which was the first one to have a different number of MPI 
processes on the two sockets. It would also fail on some later nodes, 
just that because of the error we never got there.


Let me know if you need more.

Marcin








Cheers,

Gilles

On 10/4/2015 11:55 PM, marcin.krotkiewski wrote:

Hi, all,

I played a bit more and it seems that the problem results from

trg_obj = opal_hwloc_base_find_min_bound_target_under_obj()

called in rmaps_base_binding.c / bind_downwards being wrong. I do not 
know the reason, but I think I know when the problem happens (at 
least on 1.10.1rc1). It seems that by default openmpi maps by socket. 
The error happens when for a given compute node there is a different 
number of cores used on each socket. Consider previously studied case 
(the debug outputs I sent in last post). c1-8, which was source of 
error, has 5 mpi processes assigned, and the cpuset is the following:


0, 5, 9, 13, 14, 16, 21, 25, 29, 30

Cores 0,5 are on socket 0, cores 9, 13, 14 are on socket 1. Binding 
progresses correctly up to and including core 13 (see end of file 
out.1.10.1rc2, before the error). That is 2 cores on socket 0, and 2 
cores on socket 1. Error is thrown when core 14 should be bound - 
extra core on socket 1 with no corresponding core on socket 0. At 
that point the returned trg_obj points to the first core on the node 
(os_index 0, socket 0).


I have submitted a few other jobs and I always had an error in such 
situation. Moreover, if I now use --map-by core instead of socket, 
the error is gone, and I get my expected binding:


rank 0 @ compute-1-2.local  1, 17,
rank 1 @ compute-1-2.local  2, 18,
rank 2 @ compute-1-2.local  3, 19,
rank 3 @ compute-1-2.local  4, 20,
rank 4 @ compute-1-4.local  1, 17,
rank 5 @ compute-1-4.local  15, 31,
rank 6 @ compute-1-8.local  0, 16,
rank 7 @ compute-1-8.local  5, 21,
rank 8 @ compute-1-8.local  9, 25,
rank 9 @ compute-1-8.local  13, 29,
rank 10 @ compute-1-8.local  14, 30,
rank 11 @ compute-1-13.local  3, 19,
rank 12 @ compute-1-13.local  4, 20,
rank 13 @ compute-1-13.local  5, 21,
rank 14 @ compute-1-13.local  6, 22,
rank 15 @ compute-1-13.local  7, 23,
rank 16 @ compute-1-16.local  12, 28,
rank 17 @ compute-1-16.local  13, 29,
rank 18 @ compute-1-16.local  14, 30,
rank 19 @ compute-1-16.local  15, 31,
rank 20 @ compute-1-23.local  2, 18,
rank 29 @ compute-1-26.local  11, 27,
rank 21 @ compute-1-23.local  3, 19,
rank 30 @ compute-1-26.local  13, 29,
rank 22 @ compute-1-23.local  4, 20,
rank 31 @ compute-1-26.local  15, 31,
rank 23 @ compute-1-23.local  8, 24,
rank 27 @ compute-1-26.local  1, 17,
rank 24 @ compute-1-23.local  13, 29,
rank 28 @ compute-1-26.local  6, 22,
rank 25 @ compute-1-23.local  14, 30,
rank 26 @ compute-1-23.local  15, 31,

Using --map-by core seems to fix the issue on 1.8.8, 1.10.0 and 
1.10.1rc1. However, there is still a difference in behavior between 
1.10.1rc1 and earlier versions. In the SLURM job described in last 
post, 1.10.1rc1 fails to bind only in 1 case, while the earlier 
versions fail in 21 out of 32 cases. You mentioned there was a bug in 
hwloc. Not sure if it can explain the difference in behavior.


Hope this helps to nail this down.

Marcin




On 10/04/2015 09:55 AM, Gilles Gouaillardet wrote:

Ralph,

[OMPI users] Hybrid OpenMPI+OpenMP tasks using SLURM

2015-10-05 Thread marcin.krotkiewski

Yet another question about cpu binding under SLURM environment..

Short version: will OpenMPI support SLURM_CPUS_PER_TASK for the purpose 
of cpu binding?



Full version: When you allocate a job like, e.g., this

salloc --ntasks=2 --cpus-per-task=4

SLURM will allocate 8 cores in total, 4 for each 'assumed' MPI task. 
This is useful for hybrid jobs, where each MPI process spawns some 
internal worker threads (e.g., OpenMP). The intention is that there are 
2 MPI procs started, each of them 'bound' to 4 cores. SLURM will also 
set an environment variable


SLURM_CPUS_PER_TASK=4

which should (probably?) be taken into account by the method that 
launches the MPI processes to figure out the cpuset. In case of OpenMPI 
+ mpirun I think something should happen in 
orte/mca/ras/slurm/ras_slurm_module.c, where the variable _is_ actually 
parsed. Unfortunately, it is never really used...


As a result, the cpuset of each task started on a given compute node 
includes all CPU cores of all MPI tasks on that node, just as provided 
by SLURM (in the above example - 8). In general, there is no simple way 
for the user code in the MPI procs to 'split' the cores between 
themselves. I imagine the original intention to support this in OpenMPI 
was something like


mpirun --bind-to subtask_cpuset

with an artificial bind target that would cause OpenMPI to divide the 
allocated cores between the mpi tasks. Is this right? If so, it seems 
that at this point this is not implemented. Are there plans to do this? 
If not, does anyone know another way to achieve that?
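
(For what it is worth, one user-space workaround is to let the ranks that 
share a node split the SLURM-provided cpuset among themselves and re-bind 
with sched_setaffinity before spawning OpenMP threads. The sketch below 
only illustrates the idea - it deals the logical CPUs out round-robin and 
ignores HT pairing and NUMA locality, which a real implementation would 
take from hwloc:)

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  /* communicator of the ranks sharing this compute node */
  MPI_Comm node;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &node);
  int lrank, lsize;
  MPI_Comm_rank(node, &lrank);
  MPI_Comm_size(node, &lsize);

  /* the combined cpuset SLURM gave the job step on this node */
  cpu_set_t full, mine;
  CPU_ZERO(&full);
  sched_getaffinity(0, sizeof(full), &full);

  /* deal the logical CPUs out round-robin among the local ranks */
  CPU_ZERO(&mine);
  int idx = 0;
  for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
    if (!CPU_ISSET(cpu, &full))
      continue;
    if (idx % lsize == lrank)
      CPU_SET(cpu, &mine);
    idx++;
  }
  sched_setaffinity(0, sizeof(mine), &mine); /* OpenMP threads inherit this */

  MPI_Comm_free(&node);
  MPI_Finalize();
  return 0;
}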


Thanks a lot!

Marcin





Re: [OMPI users] Hybrid OpenMPI+OpenMP tasks using SLURM

2015-10-05 Thread marcin.krotkiewski

Ralph,

Thank you for a fast response! Sounds very good, unfortunately I get an 
error:


$ mpirun --map-by core:pe=4 ./affinity
--
A request for multiple cpus-per-proc was given, but a directive
was also give to map to an object level that cannot support that
directive.

Please specify a mapping level that has more than one cpu, or
else let us define a default mapping that will allow multiple
cpus-per-proc.
--

I have allocated my slurm job as

salloc --ntasks=2 --cpus-per-task=4

I have checked in 1.10.0 and 1.10.1rc1.




On 10/05/2015 09:58 PM, Ralph Castain wrote:

You would presently do:

mpirun --map-by core:pe=4

to get what you are seeking. If we don’t already set that qualifier when we see 
“cpus_per_task”, then we probably should do so as there isn’t any reason to 
make you set it twice (well, other than trying to track which envar slurm is 
using now).



On Oct 5, 2015, at 12:38 PM, marcin.krotkiewski  
wrote:

Yet another question about cpu binding under SLURM environment..

Short version: will OpenMPI support SLURM_CPUS_PER_TASK for the purpose of cpu 
binding?


Full version: When you allocate a job like, e.g., this

salloc --ntasks=2 --cpus-per-task=4

SLURM will allocate 8 cores in total, 4 for each 'assumed' MPI tasks. This is 
useful for hybrid jobs, where each MPI process spawns some internal worker 
threads (e.g., OpenMP). The intention is that there are 2 MPI procs started, 
each of them 'bound' to 4 cores. SLURM will also set an environment variable

SLURM_CPUS_PER_TASK=4

which should (probably?) be taken into account by the method that launches the 
MPI processes to figure out the cpuset. In case of OpenMPI + mpirun I think 
something should happen in orte/mca/ras/slurm/ras_slurm_module.c, where the 
variable _is_ actually parsed. Unfortunately, it is never really used...

As a result, cpuset of all tasks started on a given compute node includes all 
CPU cores of all MPI tasks on that node, just as provided by SLURM (in the 
above example - 8). In general, there is no simple way for the user code in the 
MPI procs to 'split' the cores between themselves. I imagine the original 
intention to support this in OpenMPI was something like

mpirun --bind-to subtask_cpuset

with an artificial bind target that would cause OpenMPI to divide the allocated 
cores between the mpi tasks. Is this right? If so, it seems that at this point 
this is not implemented. Is there plans to do this? If no, does anyone know 
another way to achieve that?

Thanks a lot!

Marcin



___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/10/27803.php

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/10/27804.php




Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-06 Thread marcin.krotkiewski

Gilles,

Yes, it seemed that all was fine with binding in the patched 1.10.1rc1 - 
thank you. Eagerly waiting for the other patches, let me know and I will 
test them later this week.


Marcin



On 10/06/2015 12:09 PM, Gilles Gouaillardet wrote:

Marcin,

my understanding is that in this case, patched v1.10.1rc1 is working 
just fine.

am I right ?

I prepared two patches
one to remove the warning when binding on one core if only one core is 
available,
an other one to add a warning if the user asks a binding policy that 
makes no sense with the required mapping policy


I will finalize them tomorrow hopefully

Cheers,

Gilles

On Tuesday, October 6, 2015, marcin.krotkiewski 
mailto:marcin.krotkiew...@gmail.com>> 
wrote:


Hi, Gilles

you mentionned you had one failure with 1.10.1rc1 and -bind-to core
could you please send the full details (script, allocation and
output)
in your slurm script, you can do
srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none -l grep
Cpus_allowed_list /proc/self/status
before invoking mpirun


It was an interactive job allocated with

salloc --account=staff --ntasks=32 --mem-per-cpu=2G --time=120:0:0

The slurm environment is the following

SLURM_JOBID=12714491
SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5'
SLURM_JOB_ID=12714491
SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]'
SLURM_JOB_NUM_NODES=7
SLURM_JOB_PARTITION=normal
SLURM_MEM_PER_CPU=2048
SLURM_NNODES=7
SLURM_NODELIST='c1-[2,4,8,13,16,23,26]'
SLURM_NODE_ALIASES='(null)'
SLURM_NPROCS=32
SLURM_NTASKS=32
SLURM_SUBMIT_DIR=/cluster/home/marcink
SLURM_SUBMIT_HOST=login-0-1.local
SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5'

The output of the command you asked for is

0: c1-2.local  Cpus_allowed_list:1-4,17-20
1: c1-4.local  Cpus_allowed_list:1,15,17,31
2: c1-8.local  Cpus_allowed_list: 0,5,9,13-14,16,21,25,29-30
3: c1-13.local  Cpus_allowed_list:   3-7,19-23
4: c1-16.local  Cpus_allowed_list:   12-15,28-31
5: c1-23.local  Cpus_allowed_list: 2-4,8,13-15,18-20,24,29-31
6: c1-26.local  Cpus_allowed_list: 1,6,11,13,15,17,22,27,29,31

Running with command

mpirun --mca rmaps_base_verbose 10 --hetero-nodes --bind-to core
--report-bindings --map-by socket -np 32 ./affinity

I have attached two output files: one for the original 1.10.1rc1,
one for the patched version.

When I said 'failed in one case' I was not precise. I got an error
on node c1-8, which was the first one to have different number of
MPI processes on the two sockets. It would also fail on some later
nodes, just that because of the error we never got there.

Let me know if you need more.

Marcin








Cheers,

Gilles

On 10/4/2015 11:55 PM, marcin.krotkiewski wrote:

Hi, all,

I played a bit more and it seems that the problem results from

trg_obj = opal_hwloc_base_find_min_bound_target_under_obj()

called in rmaps_base_binding.c / bind_downwards being wrong. I
do not know the reason, but I think I know when the problem
happens (at least on 1.10.1rc1). It seems that by default
openmpi maps by socket. The error happens when for a given
compute node there is a different number of cores used on each
socket. Consider previously studied case (the debug outputs I
sent in last post). c1-8, which was source of error, has 5 mpi
processes assigned, and the cpuset is the following:

0, 5, 9, 13, 14, 16, 21, 25, 29, 30

Cores 0,5 are on socket 0, cores 9, 13, 14 are on socket 1.
Binding progresses correctly up to and including core 13 (see
end of file out.1.10.1rc2, before the error). That is 2 cores on
socket 0, and 2 cores on socket 1. Error is thrown when core 14
should be bound - extra core on socket 1 with no corresponding
core on socket 0. At that point the returned trg_obj points to
the first core on the node (os_index 0, socket 0).

I have submitted a few other jobs and I always had an error in
such situation. Moreover, if I now use --map-by core instead of
socket, the error is gone, and I get my expected binding:

rank 0 @ compute-1-2.local  1, 17,
rank 1 @ compute-1-2.local  2, 18,
rank 2 @ compute-1-2.local  3, 19,
rank 3 @ compute-1-2.local  4, 20,
rank 4 @ compute-1-4.local  1, 17,
rank 5 @ compute-1-4.local  15, 31,
rank 6 @ compute-1-8.local  0, 16,
rank 7 @ compute-1-8.local  5, 21,
rank 8 @ compute-1-8.local  9, 25,
rank 9 @ compute-1-8.local  13, 29,
rank 10 @ compute-1-8.local  14, 30,
rank 11 @ compute-1-13.local  3, 19,
rank 12 @ compute-1-13.local  4, 20,
rank 13 @ compute-1-13.local  5, 21,
rank 14 @ compute-1-13.local  6, 22,
rank 15 @ compute-1-13.local  7, 23,
rank 16 @ compute-1-16.local  12, 28,
rank 17 @ compute-1-16.lo

Re: [OMPI users] Hybrid OpenMPI+OpenMP tasks using SLURM

2015-10-06 Thread marcin.krotkiewski
Thank you both for your suggestions. I still cannot make this work 
though, and I think - as Ralph predicted - most problems are likely 
related to non-homogeneous mapping of cpus to jobs. But there are 
problems even before that part...


If I reserve one entire compute node with SLURM:

salloc --ntasks=16 --tasks-per-node=16

I can run my code as you suggested with _any_ N (including odd 
numbers!). OpenMPI will figure out the maximum number of tasks that fit 
and launch them. This also works for many complete nodes, but this is 
the only case in which I managed to get it to work.


If I specify cpus per task, also allocating one full node

salloc --ntasks=4 --cpus-per-task=4 --tasks-per-node=4

things go astray:

mpirun --map-by slot:pe=4 ./affinity
rank 0 @ compute-1-6.local  0, 1, 2, 3, 16, 17, 18, 19,

Yes, only one MPI process was started. Running what Gilles previously 
suggested:


$ srun grep Cpus_allowed_list /proc/self/status
Cpus_allowed_list:0-31
Cpus_allowed_list:0-31
Cpus_allowed_list:0-31
Cpus_allowed_list:0-31

So the allocation seems fine. The SLURM environment is also correct, as 
far as I can tell:


SLURM_CPUS_PER_TASK=4
SLURM_JOB_CPUS_PER_NODE=16
SLURM_JOB_NODELIST=c1-6
SLURM_JOB_NUM_NODES=1
SLURM_NNODES=1
SLURM_NODELIST=c1-6
SLURM_NPROCS=4
SLURM_NTASKS=4
SLURM_NTASKS_PER_NODE=4
SLURM_TASKS_PER_NODE=4

I do not understand why openmpi does not want to start more than 1 
process. If I try to force it (-n 4) I of course get an error:


mpirun --map-by slot:pe=4 -n 4 ./affinity

--
There are not enough slots available in the system to satisfy the 4 slots
that were requested by the application:
  ./affinity

Either request fewer slots for your application, or make more slots 
available

for use.
--


For clarity, I will not describe other cases / non-contiguous cpu sets / 
heterogeneous nodes. Clearly something is wrong already with the simple 
ones..


Does anyone have any ideas? Should I record some logs to see what's 
going on?


Thanks a lot!

Marcin






On 10/06/2015 01:04 AM, tmish...@jcity.maeda.co.jp wrote:

Hi Ralph, it's been a long time.

The option "map-by core" does not work when pe=N > 1 is specified.
So, you should use "map-by slot:pe=N" as far as I remember.

Regards,
Tetsuya Mishima

2015/10/06 5:40:33、"users"さんは「Re: [OMPI users] Hybrid OpenMPI+OpenMP
tasks using SLURM」で書きました

Hmmm…okay, try -map-by socket:pe=4

We’ll still hit the asymmetric topology issue, but otherwise this should

work



On Oct 5, 2015, at 1:25 PM, marcin.krotkiewski

 wrote:

Ralph,

Thank you for a fast response! Sounds very good, unfortunately I get an

error:

$ mpirun --map-by core:pe=4 ./affinity


--

A request for multiple cpus-per-proc was given, but a directive
was also give to map to an object level that cannot support that
directive.

Please specify a mapping level that has more than one cpu, or
else let us define a default mapping that will allow multiple
cpus-per-proc.


--

I have allocated my slurm job as

salloc --ntasks=2 --cpus-per-task=4

I have checked in 1.10.0 and 1.10.1rc1.




On 10/05/2015 09:58 PM, Ralph Castain wrote:

You would presently do:

mpirun --map-by core:pe=4

to get what you are seeking. If we don’t already set that qualifier

when we see “cpus_per_task”, then we probably should do so as there isn’t
any reason to make you set it twice (well, other than

trying to track which envar slurm is using now).



On Oct 5, 2015, at 12:38 PM, marcin.krotkiewski

 wrote:

Yet another question about cpu binding under SLURM environment..

Short version: will OpenMPI support SLURM_CPUS_PER_TASK for the

purpose of cpu binding?


Full version: When you allocate a job like, e.g., this

salloc --ntasks=2 --cpus-per-task=4

SLURM will allocate 8 cores in total, 4 for each 'assumed' MPI tasks.

This is useful for hybrid jobs, where each MPI process spawns some internal
worker threads (e.g., OpenMP). The intention is

that there are 2 MPI procs started, each of them 'bound' to 4 cores.

SLURM will also set an environment variable

SLURM_CPUS_PER_TASK=4

which should (probably?) be taken into account by the method that

launches the MPI processes to figure out the cpuset. In case of OpenMPI +
mpirun I think something should happen in

orte/mca/ras/slurm/ras_slurm_module.c, where the variable _is_ actually

parsed. Unfortunately, it is never really used...

As a result, cpuset of all tasks started on a given compute node

includes all CPU cores of all MPI tasks on that node, just as provided by
SLURM (in the above example - 8). In general, there is

no simple way for the user code in the MPI procs to 'split' the cores

Re: [OMPI users] Hybrid OpenMPI+OpenMP tasks using SLURM

2015-10-06 Thread marcin.krotkiewski


Thanks, Gilles. This is a good suggestion and I will pursue this 
direction. The problem is that currently SLURM does not support 
--cpu_bind on my system for whatever reasons. I may work towards turning 
this option on if that will be necessary, but it would also be good to 
be able to do it with pure openmpi..


Marcin


On 10/06/2015 08:01 AM, Gilles Gouaillardet wrote:

Marcin,

did you investigate direct launch (e.g. srun) instead of mpirun ?

for example, you can do
srun --ntasks=2 --cpus-per-task=4 -l grep Cpus_allowed_list 
/proc/self/status


note, you might have to use the srun --cpu_bind option, and make sure 
your slurm config does support that :
srun --ntasks=2 --cpus-per-task=4 --cpu_bind=core,verbose -l grep 
Cpus_allowed_list /proc/self/status


Cheers,

Gilles

On 10/6/2015 4:38 AM, marcin.krotkiewski wrote:

Yet another question about cpu binding under SLURM environment..

Short version: will OpenMPI support SLURM_CPUS_PER_TASK for the 
purpose of cpu binding?



Full version: When you allocate a job like, e.g., this

salloc --ntasks=2 --cpus-per-task=4

SLURM will allocate 8 cores in total, 4 for each 'assumed' MPI tasks. 
This is useful for hybrid jobs, where each MPI process spawns some 
internal worker threads (e.g., OpenMP). The intention is that there 
are 2 MPI procs started, each of them 'bound' to 4 cores. SLURM will 
also set an environment variable


SLURM_CPUS_PER_TASK=4

which should (probably?) be taken into account by the method that 
launches the MPI processes to figure out the cpuset. In case of 
OpenMPI + mpirun I think something should happen in 
orte/mca/ras/slurm/ras_slurm_module.c, where the variable _is_ 
actually parsed. Unfortunately, it is never really used...


As a result, cpuset of all tasks started on a given compute node 
includes all CPU cores of all MPI tasks on that node, just as 
provided by SLURM (in the above example - 8). In general, there is no 
simple way for the user code in the MPI procs to 'split' the cores 
between themselves. I imagine the original intention to support this 
in OpenMPI was something like


mpirun --bind-to subtask_cpuset

with an artificial bind target that would cause OpenMPI to divide the 
allocated cores between the mpi tasks. Is this right? If so, it seems 
that at this point this is not implemented. Is there plans to do 
this? If no, does anyone know another way to achieve that?


Thanks a lot!

Marcin



___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/10/27803.php




___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/10/27812.php




Re: [OMPI users] Hybrid OpenMPI+OpenMP tasks using SLURM

2015-10-06 Thread marcin.krotkiewski


Ralph, maybe I was not precise - most likely --cpu_bind does not work on 
my system because it is disabled in SLURM, not because of any 
problem in OpenMPI. I am not certain and I will have to investigate this 
further, so please do not waste your time on this.


What do you mean by 'loss of dynamics support'?

Thanks,

Marcin


On 10/06/2015 09:35 PM, Ralph Castain wrote:

I’ll have to fix it later this week - out due to eye surgery today. Looks like 
something didn’t get across to 1.10 as it should have. There are other 
tradeoffs that occur when you go to direct launch (e.g., loss of dynamics 
support) - may or may not be of concern to your usage.



On Oct 6, 2015, at 11:57 AM, marcin.krotkiewski  
wrote:


Thanks, Gilles. This is a good suggestion and I will pursue this direction. The 
problem is that currently SLURM does not support --cpu_bind on my system for 
whatever reasons. I may work towards turning this option on if that will be 
necessary, but it would also be good to be able to do it with pure openmpi..

Marcin


On 10/06/2015 08:01 AM, Gilles Gouaillardet wrote:

Marcin,

did you investigate direct launch (e.g. srun) instead of mpirun ?

for example, you can do
srun --ntasks=2 --cpus-per-task=4 -l grep Cpus_allowed_list /proc/self/status

note, you might have to use the srun --cpu_bind option, and make sure your 
slurm config does support that :
srun --ntasks=2 --cpus-per-task=4 --cpu_bind=core,verbose -l grep 
Cpus_allowed_list /proc/self/status

Cheers,

Gilles

On 10/6/2015 4:38 AM, marcin.krotkiewski wrote:

Yet another question about cpu binding under SLURM environment..

Short version: will OpenMPI support SLURM_CPUS_PER_TASK for the purpose of cpu 
binding?


Full version: When you allocate a job like, e.g., this

salloc --ntasks=2 --cpus-per-task=4

SLURM will allocate 8 cores in total, 4 for each 'assumed' MPI tasks. This is 
useful for hybrid jobs, where each MPI process spawns some internal worker 
threads (e.g., OpenMP). The intention is that there are 2 MPI procs started, 
each of them 'bound' to 4 cores. SLURM will also set an environment variable

SLURM_CPUS_PER_TASK=4

which should (probably?) be taken into account by the method that launches the 
MPI processes to figure out the cpuset. In case of OpenMPI + mpirun I think 
something should happen in orte/mca/ras/slurm/ras_slurm_module.c, where the 
variable _is_ actually parsed. Unfortunately, it is never really used...

As a result, cpuset of all tasks started on a given compute node includes all 
CPU cores of all MPI tasks on that node, just as provided by SLURM (in the 
above example - 8). In general, there is no simple way for the user code in the 
MPI procs to 'split' the cores between themselves. I imagine the original 
intention to support this in OpenMPI was something like

mpirun --bind-to subtask_cpuset

with an artificial bind target that would cause OpenMPI to divide the allocated 
cores between the mpi tasks. Is this right? If so, it seems that at this point 
this is not implemented. Is there plans to do this? If no, does anyone know 
another way to achieve that?

Thanks a lot!

Marcin



___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/10/27803.php


___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/10/27812.php

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/10/27818.php

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/10/27820.php




Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-08 Thread marcin.krotkiewski

Dear Ralph, Gilles, and Jeff

Thanks a lot for your effort. Understanding this problem has been a 
very interesting exercise for me and has let me understand OpenMPI much 
better (I think :).


I have given it all a little more thought, and done some more tests on 
our production system, and I think that this is not exactly a 
corner-case. First of all, I suspect all of this holds for other job 
scheduling systems besides SLURM (to be thought about..). Moreover, on 
our system a rather common usage scenario involves SLURM job allocation 
using, e.g.,


salloc --ntasks=32

which results in very fragmented allocations - that's specific for the 
type of problems users use this cluster for, but it's a fact. Users then 
run the job using


mpirun ./program

For versions up to 1.10.0, with uneven resource allocation among compute 
nodes, the default binding options used in OpenMPI in most cases result 
in some CPU cores not being present in the used cpuset at all, and others 
being over/under-subscribed. This certainly is job-specific and depends 
on how fragmented the SLURM allocations are, but to give a scary number: 
in one case I started 512 tasks (1 per core), and the OpenMPI binding 
created a cpuset that used only 271 cores, some of them being 
over/under-subscribed on top of that. Effectively, the user gets 50% of what 
he asked for. As already discussed, this happens quietly - the user has 
no idea.


For version 1.10.1rc1 and up the situation is a bit different: it seems 
that in many cases all cores are present in the cpuset, just that the 
binding does not take place in a lot of cases. Instead, processes are 
bound to all cores allocated by SLURM. In other scenarios, as discussed 
before, some cores are over/under-subscribed. Again, this is done quietly.


In all cases what is needed is the --hetero-nodes switch. If I apply the 
patch that Gilles has posted, it seems to be enough for 1.10.1rc1 and 
up. The switch is not enough for earlier versions of OpenMPI and one 
needs --map-by core in addition.


Given all that I think some sort of fix would be in order soon. I agree 
with Ralph that to address this issue quickly a simplified fix would be 
a good choice. As Ralph has already pointed out (or at least how I 
understood it :) this would essentially involve activating 
--hetero-nodes by default, and using --map-by core in cases where the 
architecture is not homogeneous. Uncovering the warning so that the 
failure to bind is not silent is the last piece of puzzle. Maybe adding 
a sanity check to make sure all allocated resources are in use would be 
helpful - if not by default, then maybe with some flag.


Does all this make sense?

Again, thank you all for your help,

Marcin





On 10/07/2015 04:03 PM, Ralph Castain wrote:
I’m a little nervous about this one, Gilles. It’s doing a lot more 
than just addressing the immediate issue, and I’m concerned about any 
potential side-effects that we don’t fully uncover prior to release.


I’d suggest a two-pronged approach:

1. use my alternative method for 1.10.1 to solve the immediate issue. 
It only affects this one, rather unusual, corner-case that was 
reported here. So the impact can be easily contained and won’t impact 
anything else.


2. push your proposed solution to the master where it can soak for 
awhile and give us a chance to fully discover the secondary effects. 
Removing the unused and “not-allowed” cpus from the topology means a 
substantial scrub of the code base in a number of places, and your 
patch doesn’t really get them all. It’s going to take time to ensure 
everything is working correctly again.


HTH
Ralph

On Oct 7, 2015, at 4:29 AM, Gilles Gouaillardet 
<mailto:gilles.gouaillar...@gmail.com>> wrote:


Jeff,

there are quite a lot of changes, I did not update master yet (need 
extra pairs of eyes to review this...)
so unless you want to make rc2 today and rc3 a week later, it is imho 
way safer to wait for v1.10.2


Ralph,
any thoughts ?

Cheers,

Gilles

On Wednesday, October 7, 2015, Jeff Squyres (jsquyres) 
mailto:jsquy...@cisco.com>> wrote:


Is this something that needs to go into v1.10.1?

If so, a PR needs to be filed ASAP.  We were supposed to make the
next 1.10.1 RC yesterday, but slipped to today due to some last
second patches.


> On Oct 7, 2015, at 4:32 AM, Gilles Gouaillardet
> wrote:
>
> Marcin,
>
> here is a patch for the master, hopefully it fixes all the
issues we discussed
> i will make sure it applies fine vs latest 1.10 tarball from
tomorrow
>
> Cheers,
>
> Gilles
>
>
> On 10/6/2015 7:22 PM, marcin.krotkiewski wrote:
>> Gilles,
>>
>> Yes, it seemed that all was fine with binding in the patched
1.10.1rc1 - thank you. Eagerly waiting for the other patches, let
me know and I will test them later this week.
>>
>> Mar

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-08 Thread marcin.krotkiewski

Sorry, I think I confused one thing:

On 10/08/2015 09:15 PM, marcin.krotkiewski wrote:


For version 1.10.1rc1 and up the situation is a bit different: it 
seems that in many cases all cores are present in the cpuset, just 
that the binding does not take place in a lot of cases. Instead, 
processes are bound to all cores allocated by SLURM. In other 
scenarios, as discussed before, some cores are over/under-subscribed. 
Again, this is done quietly.


The problem here was in fact a failure to run with an error message, not 
under/over-subscription. Sorry for this - I wanted to cover too much at 
the same time...


Marcin






In all cases what is needed is the --hetero-nodes switch. If I apply 
the patch that Gilles has posted, it seems to be enough for 1.10.1rc1 
and up. The switch is not enough for earlier versions of OpenMPI and 
one needs --map-by core in addition.


Given all that I think some sort of fix would be in order soon. I 
agree with Ralph that to address this issue quickly a simplified fix 
would be a good choice. As Ralph has already pointed out (or at least 
how I understood it :) this would essentially involve activating 
--hetero-nodes by default, and using --map-by core in cases where the 
architecture is not homogeneous. Uncovering the warning so that the 
failure to bind is not silent is the last piece of puzzle. Maybe 
adding a sanity check to make sure all allocated resources are in use 
would be helpful - if not by default, then maybe with some flag.


Does all this make sense?

Again, thank you all for your help,

Marcin





On 10/07/2015 04:03 PM, Ralph Castain wrote:
I’m a little nervous about this one, Gilles. It’s doing a lot more 
than just addressing the immediate issue, and I’m concerned about any 
potential side-effects that we don’t fully uncover prior to release.


I’d suggest a two-pronged approach:

1. use my alternative method for 1.10.1 to solve the immediate issue. 
It only affects this one, rather unusual, corner-case that was 
reported here. So the impact can be easily contained and won’t impact 
anything else.


2. push your proposed solution to the master where it can soak for 
awhile and give us a chance to fully discover the secondary effects. 
Removing the unused and “not-allowed” cpus from the topology means a 
substantial scrub of the code base in a number of places, and your 
patch doesn’t really get them all. It’s going to take time to ensure 
everything is working correctly again.


HTH
Ralph

On Oct 7, 2015, at 4:29 AM, Gilles Gouaillardet 
<mailto:gilles.gouaillar...@gmail.com>> wrote:


Jeff,

there are quite a lot of changes, I did not update master yet (need 
extra pairs of eyes to review this...)
so unless you want to make rc2 today and rc3 a week later, it is 
imho way safer to wait for v1.10.2


Ralph,
any thoughts ?

Cheers,

Gilles

On Wednesday, October 7, 2015, Jeff Squyres (jsquyres) 
mailto:jsquy...@cisco.com>> wrote:


Is this something that needs to go into v1.10.1?

If so, a PR needs to be filed ASAP.  We were supposed to make
the next 1.10.1 RC yesterday, but slipped to today due to some
last second patches.


> On Oct 7, 2015, at 4:32 AM, Gilles Gouaillardet
> wrote:
>
> Marcin,
>
> here is a patch for the master, hopefully it fixes all the
issues we discussed
> i will make sure it applies fine vs latest 1.10 tarball from
tomorrow
>
> Cheers,
>
> Gilles
>
    >
> On 10/6/2015 7:22 PM, marcin.krotkiewski wrote:
>> Gilles,
>>
>> Yes, it seemed that all was fine with binding in the patched
1.10.1rc1 - thank you. Eagerly waiting for the other patches,
let me know and I will test them later this week.
>>
>> Marcin
>>
>>
>>
>> On 10/06/2015 12:09 PM, Gilles Gouaillardet wrote:
>>> Marcin,
>>>
>>> my understanding is that in this case, patched v1.10.1rc1 is
working just fine.
>>> am I right ?
>>>
>>> I prepared two patches
>>> one to remove the warning when binding on one core if only
one core is available,
>>> an other one to add a warning if the user asks a binding
policy that makes no sense with the required mapping policy
>>>
>>> I will finalize them tomorrow hopefully
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Tuesday, October 6, 2015, marcin.krotkiewski
> wrote:
>>> Hi, Gilles
>>>> you mentionned you had one failure with 1.10.1rc1 and
-bind-to core
>>>> could you please send the full details (script, allocation
and output)
>>>> in your slurm script, you can do
>>>> srun -N $SLURM_NNOD

[OMPI users] UCX and multithreading

2018-04-17 Thread marcin.krotkiewski

Hi, all,

I'm reading in the changelog 3.0.0 that

- Use UCX multi-threaded API in the UCX PML.  Requires UCX 1.0 or later.

Also, the changelog for 3.1.0 it says that

- UCX PML improvements: add multi-threading support.

Could anyone briefly explain what these mean in terms of 
functionality / performance? Does this mean that the UCX PML has an 
internal progress thread for asynchronous comm, or is this only related 
to supporting multi-threaded MPI applications?


BTW, are there any plans to support progress threads in the future?

Thanks!

Marcin Krotkiewski

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


[OMPI users] OSHMEM: shmem_ptr always returns NULL

2018-04-18 Thread marcin.krotkiewski

Hi,

I'm running the below example from the OpenMPI documentation:

#include <stdio.h>
#include <shmem.h>

int main(void)
{
  static int bigd[100];
  int *ptr;
  int i;
  shmem_init();
  if (shmem_my_pe() == 0) {
    /* initialize PE 1's bigd array */
    ptr = (int *) shmem_ptr(bigd, 1);
    if (!ptr) {
      fprintf(stderr, "get external pointer failed!\n");
      shmem_global_exit(-1);
    }
    for (i = 0; i < 100; i++)
      *ptr++ = i + 1;
  }
  shmem_barrier_all();
  if (shmem_my_pe() == 1) {
    printf("bigd on PE 1 is:\n");
    for (i = 0; i < 100; i++)
      printf(" %d\n", bigd[i]);
    printf("\n");
  }
  return 0;
}

but shmem_ptr always returns NULL for me. I tried with OpenMPI versions 
from 2.0.1 up to 3.1.0rc4, compiled with HPCX 2.1, running on a 
ConnectX-4 system. This is the command line:


$ shmemrun -mca spml ucx -mca spml_base_verbose 100 -np 2 -map-by node 
-report-bindings ./a.out


[c11-1:36505] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: 
[BB/../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../..]
[c11-2:105580] MCW rank 1 bound to socket 0[core 0[hwt 0-1]]: 
[BB/../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../..]
[c11-1:36522] mca: base: components_register: registering framework spml 
components

[c11-1:36522] mca: base: components_register: found loaded component ucx
[c11-1:36522] mca: base: components_register: component ucx register 
function successful

[c11-1:36522] mca: base: components_open: opening spml components
[c11-1:36522] mca: base: components_open: found loaded component ucx
[c11-2:105590] mca: base: components_register: registering framework 
spml components

[c11-2:105590] mca: base: components_register: found loaded component ucx
[c11-2:105590] mca: base: components_register: component ucx register 
function successful

[c11-2:105590] mca: base: components_open: opening spml components
[c11-2:105590] mca: base: components_open: found loaded component ucx
[c11-1:36522] mca: base: components_open: component ucx open function 
successful
[c11-2:105590] mca: base: components_open: component ucx open function 
successful
[c11-1:36522] base/spml_base_select.c:107 - mca_spml_base_select() 
select: initializing spml component ucx
[c11-1:36522] spml_ucx_component.c:173 - mca_spml_ucx_component_init() 
in ucx, my priority is 21
[c11-2:105590] base/spml_base_select.c:107 - mca_spml_base_select() 
select: initializing spml component ucx
[c11-2:105590] spml_ucx_component.c:173 - mca_spml_ucx_component_init() 
in ucx, my priority is 21
[c11-1:36522] spml_ucx_component.c:184 - mca_spml_ucx_component_init() 
*** ucx initialized 
[c11-1:36522] base/spml_base_select.c:119 - mca_spml_base_select() 
select: init returned priority 21
[c11-1:36522] base/spml_base_select.c:160 - mca_spml_base_select() 
selected ucx best priority 21
[c11-1:36522] base/spml_base_select.c:194 - mca_spml_base_select() 
select: component ucx selected

[c11-1:36522] spml_ucx.c:82 - mca_spml_ucx_enable() *** ucx ENABLED 
[c11-2:105590] spml_ucx_component.c:184 - mca_spml_ucx_component_init() 
*** ucx initialized 
[c11-2:105590] base/spml_base_select.c:119 - mca_spml_base_select() 
select: init returned priority 21
[c11-2:105590] base/spml_base_select.c:160 - mca_spml_base_select() 
selected ucx best priority 21
[c11-2:105590] base/spml_base_select.c:194 - mca_spml_base_select() 
select: component ucx selected

[c11-2:105590] spml_ucx.c:82 - mca_spml_ucx_enable() *** ucx ENABLED 
[c11-1:36522] spml_ucx.c:305 - mca_spml_ucx_add_procs() *** ADDED PROCS ***
[c11-2:105590] spml_ucx.c:305 - mca_spml_ucx_add_procs() *** ADDED PROCS ***
shared_mr flags are not supported
shared_mr flags are not supported
get external pointer failed!


So everything looks fine, except perhaps the 'shared_mr flags are not 
supported' message.


Does anyone have an idea why I get NULL? The same happens if I start two 
ranks on the same compute node, and if I use a shmem_malloc'ed pointer 
instead of a static array.
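
(For completeness, the shmem_malloc variant I tried looks roughly like 
this - a minimal sketch, not my exact test code:)

#include <stdio.h>
#include <shmem.h>

int main(void)
{
  int i;

  shmem_init();
  /* symmetric-heap allocation instead of a static (data-segment) array */
  int *bigd = (int *) shmem_malloc(100 * sizeof(int));

  if (shmem_my_pe() == 0) {
    /* ask for a locally usable pointer to PE 1's copy of bigd */
    int *ptr = (int *) shmem_ptr(bigd, 1);
    if (!ptr)
      fprintf(stderr, "shmem_ptr returned NULL here as well\n");
    else
      for (i = 0; i < 100; i++)
        ptr[i] = i + 1;
  }

  shmem_barrier_all();
  shmem_free(bigd);
  shmem_finalize();
  return 0;
}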


Thank you,

Marcin

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] OpenMPI + custom glibc, a mini HOWTO

2018-05-22 Thread marcin.krotkiewski

Hi, all

I have gone through quite some effort to compile OpenMPI against a 
custom (non-native) glibc. The reason I need this is that GCC can then 
use the vectorized libm (libmvec), which was introduced in glibc 2.22. 
And of course no HPC OS ships with 2.22 - they are all a few years 
behind!
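
(To illustrate why this matters - a minimal sketch, and the exact flags 
depend on your compiler and CPU: with glibc >= 2.22, GCC can auto-vectorize 
a plain libm loop like the one below into calls to the vectorized math 
library, e.g. when building with something like -O3 -march=native 
-ffast-math.)

#include <math.h>
#include <stdio.h>

#define N 1024

int main(void)
{
  static float x[N], y[N];
  int i;

  for (i = 0; i < N; i++)
    x[i] = (float) i / N;

  /* with a new enough glibc, GCC may replace this loop's expf calls with
     the SIMD variants from the vectorized libm */
  for (i = 0; i < N; i++)
    y[i] = expf(x[i]);

  printf("y[N-1] = %f\n", y[N - 1]);
  return 0;
}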


While using a custom glibc for a single user program is doable, getting 
a working OpenMPI environment to go with it posed a challenge. I have 
attached a PDF 
with a short description of the procedure, in case someone finds it 
useful as well. I'd appreciate comments / suggestions as to what can be 
done better, and what I might have overlooked. The subject is tricky, 
and I might well have missed something.


Thanks!

Marcin




openmpi_glibc.pdf
Description: Adobe PDF document
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] A hang in Rmpi at PMIx_Disconnect

2018-06-04 Thread marcin.krotkiewski

Hi,

I have some problems running R + Rmpi with OpenMPI 3.1.0 + PMIx 2.1.1. A 
simple R script, which starts a few tasks, hangs at the end on 
disconnect. Here is the script:


library(parallel)
numWorkers <- as.numeric(Sys.getenv("SLURM_NTASKS")) - 1
myCluster <- makeCluster(numWorkers, type = "MPI")
stopCluster(myCluster)

And here is how I run it:

SLURM_NTASKS=5 mpirun -np 1 -mca pml ^yalla -mca mtl ^mxm -mca coll 
^hcoll R --slave < mk.R


Notice -np 1 - this is apparently how you start Rmpi jobs: ranks are 
spawned by R dynamically inside the script. So I ran into a number of 
issues here:


1. with HPCX it seems that dynamic starting of ranks is not supported, 
hence I had to turn off all of yalla/mxm/hcoll


--
Your application has invoked an MPI function that is not supported in
this environment.

  MPI function: MPI_Comm_spawn
  Reason:   the Yalla (MXM) PML does not support MPI dynamic 
process functionality

--

2. when I do that, the program does create a 'cluster' and starts the 
ranks, but hangs in PMIx at MPI Disconnect. Here is the top of the trace 
from gdb:


#0  0x7f66b1e1e995 in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/libpthread.so.0
#1  0x7f669eaeba5b in PMIx_Disconnect (procs=procs@entry=0x2e25d20, 
nprocs=nprocs@entry=10, info=info@entry=0x0, ninfo=ninfo@entry=0) at 
client/pmix_client_connect.c:232
#2  0x7f669ed6239c in ext2x_disconnect (procs=0x7ffd58322440) at 
ext2x_client.c:1432
#3  0x7f66a13bc286 in ompi_dpm_disconnect (comm=0x2cc0810) at 
dpm/dpm.c:596
#4  0x7f66a13e8668 in PMPI_Comm_disconnect (comm=0x2cbe058) at 
pcomm_disconnect.c:67
#5  0x7f66a16799e9 in mpi_comm_disconnect () from 
/cluster/software/R-packages/3.5/Rmpi/libs/Rmpi.so
#6  0x7f66b2563de5 in do_dotcall () from 
/cluster/software/R/3.5.0/lib64/R/lib/libR.so
#7  0x7f66b25a207b in bcEval () from 
/cluster/software/R/3.5.0/lib64/R/lib/libR.so
#8  0x7f66b25b0fd0 in Rf_eval.localalias.34 () from 
/cluster/software/R/3.5.0/lib64/R/lib/libR.so
#9  0x7f66b25b2c62 in R_execClosure () from 
/cluster/software/R/3.5.0/lib64/R/lib/libR.so


Might this also be related to the dynamic rank creation in R?
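
(For reference, my understanding is that Rmpi's makeCluster / stopCluster 
boils down to roughly the following spawn + disconnect pattern - a minimal 
C sketch with a hypothetical ./worker binary, not the actual Rmpi code - 
and it is the disconnect step that hangs for me:)

#include <mpi.h>

int main(int argc, char **argv)
{
  MPI_Comm children;

  MPI_Init(&argc, &argv);

  /* start 4 workers dynamically, as Rmpi does from inside the script */
  MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL, 0,
                 MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);

  /* ... exchange work over the intercommunicator 'children' ... */

  /* stopCluster() should end up doing the equivalent of this */
  MPI_Comm_disconnect(&children);

  MPI_Finalize();
  return 0;
}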

Thanks!

Marcin

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] A hang in Rmpi at PMIx_Disconnect

2018-06-04 Thread marcin.krotkiewski

Thanks, Ralph!

Your code finishes normally, so I guess the reason might lie in R. 
Running the R code with -mca pmix_base_verbose 1, I see that each rank 
calls ext2x:client disconnect twice (each PID prints the line twice):


[...]
    3 slaves are spawned successfully. 0 failed.
[localhost.localdomain:11659] ext2x:client disconnect
[localhost.localdomain:11661] ext2x:client disconnect
[localhost.localdomain:11658] ext2x:client disconnect
[localhost.localdomain:11646] ext2x:client disconnect
[localhost.localdomain:11658] ext2x:client disconnect
[localhost.localdomain:11659] ext2x:client disconnect
[localhost.localdomain:11661] ext2x:client disconnect
[localhost.localdomain:11646] ext2x:client disconnect

In your example it's only called once per process.

Do you have any suspicion where the second call comes from? Might this 
be the reason for the hang?


Thanks!

Marcin


On 06/04/2018 03:16 PM, r...@open-mpi.org wrote:

Try running the attached example dynamic code - if that works, then it likely 
is something to do with how R operates.






On Jun 4, 2018, at 3:43 AM, marcin.krotkiewski  
wrote:

Hi,

I have some problems running R + Rmpi with OpenMPI 3.1.0 + PMIx 2.1.1. A simple 
R script, which starts a few tasks, hangs at the end on disconnect. Here is the 
script:

library(parallel)
numWorkers <- as.numeric(Sys.getenv("SLURM_NTASKS")) - 1
myCluster <- makeCluster(numWorkers, type = "MPI")
stopCluster(myCluster)

And here is how I run it:

SLURM_NTASKS=5 mpirun -np 1 -mca pml ^yalla -mca mtl ^mxm -mca coll ^hcoll R 
--slave < mk.R

Notice -np 1 - this is apparently how you start Rmpi jobs: ranks are spawned by 
R dynamically inside the script. So I ran into a number of issues here:

1. with HPCX it seems that dynamic starting of ranks is not supported, hence I 
had to turn off all of yalla/mxm/hcoll

--
Your application has invoked an MPI function that is not supported in
this environment.

   MPI function: MPI_Comm_spawn
   Reason:   the Yalla (MXM) PML does not support MPI dynamic process 
functionality
--

2. when I do that, the program does create a 'cluster' and starts the ranks, 
but hangs in PMIx at MPI Disconnect. Here is the top of the trace from gdb:

#0  0x7f66b1e1e995 in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/libpthread.so.0
#1  0x7f669eaeba5b in PMIx_Disconnect (procs=procs@entry=0x2e25d20, 
nprocs=nprocs@entry=10, info=info@entry=0x0, ninfo=ninfo@entry=0) at 
client/pmix_client_connect.c:232
#2  0x7f669ed6239c in ext2x_disconnect (procs=0x7ffd58322440) at 
ext2x_client.c:1432
#3  0x7f66a13bc286 in ompi_dpm_disconnect (comm=0x2cc0810) at dpm/dpm.c:596
#4  0x7f66a13e8668 in PMPI_Comm_disconnect (comm=0x2cbe058) at 
pcomm_disconnect.c:67
#5  0x7f66a16799e9 in mpi_comm_disconnect () from 
/cluster/software/R-packages/3.5/Rmpi/libs/Rmpi.so
#6  0x7f66b2563de5 in do_dotcall () from 
/cluster/software/R/3.5.0/lib64/R/lib/libR.so
#7  0x7f66b25a207b in bcEval () from 
/cluster/software/R/3.5.0/lib64/R/lib/libR.so
#8  0x7f66b25b0fd0 in Rf_eval.localalias.34 () from 
/cluster/software/R/3.5.0/lib64/R/lib/libR.so
#9  0x7f66b25b2c62 in R_execClosure () from 
/cluster/software/R/3.5.0/lib64/R/lib/libR.so

Might this also be related to the dynamic rank creation in R?

Thanks!

Marcin



___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] A hang in Rmpi at PMIx_Disconnect

2018-06-04 Thread marcin.krotkiewski
Huh. This code also runs, but it too only displays 4 connect / 
disconnect messages. I should add that the test R script shows 4 
connects, but 8 disconnects. Looks like a bug to me, but where? I 
guess we will try to contact the R forums and ask there.


Bennet: I tried to use doMPI + startMPIcluster / closeCluster. In this 
case I get a warning about fork being used:


--
A process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:  [[36000,2],1] (PID 23617)

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--

And the process hangs as well - no change.

Marcin



On 06/04/2018 05:27 PM, r...@open-mpi.org wrote:

It might call disconnect more than once if it creates multiple communicators. 
Here’s another test case for that behavior:






On Jun 4, 2018, at 7:08 AM, Bennet Fauber  wrote:

Just out of curiosity, but would using Rmpi and/or doMPI help in any way?

-- bennet


On Mon, Jun 4, 2018 at 10:00 AM, marcin.krotkiewski
 wrote:

Thanks, Ralph!

Your code finishes normally, I guess then the reason might be lying in R.
Running the R code with -mca pmix_base_verbose 1 i see that each rank calls
ext2x:client disconnect twice (each PID prints the line twice)

[...]
3 slaves are spawned successfully. 0 failed.
[localhost.localdomain:11659] ext2x:client disconnect
[localhost.localdomain:11661] ext2x:client disconnect
[localhost.localdomain:11658] ext2x:client disconnect
[localhost.localdomain:11646] ext2x:client disconnect
[localhost.localdomain:11658] ext2x:client disconnect
[localhost.localdomain:11659] ext2x:client disconnect
[localhost.localdomain:11661] ext2x:client disconnect
[localhost.localdomain:11646] ext2x:client disconnect

In your example it's only called once per process.

Do you have any suspicion where the second call comes from? Might this be
the reason for the hang?

Thanks!

Marcin


On 06/04/2018 03:16 PM, r...@open-mpi.org wrote:

Try running the attached example dynamic code - if that works, then it
likely is something to do with how R operates.





On Jun 4, 2018, at 3:43 AM, marcin.krotkiewski
 wrote:

Hi,

I have some problems running R + Rmpi with OpenMPI 3.1.0 + PMIx 2.1.1. A
simple R script, which starts a few tasks, hangs at the end on disconnect.
Here is the script:

library(parallel)
numWorkers <- as.numeric(Sys.getenv("SLURM_NTASKS")) - 1
myCluster <- makeCluster(numWorkers, type = "MPI")
stopCluster(myCluster)

And here is how I run it:

SLURM_NTASKS=5 mpirun -np 1 -mca pml ^yalla -mca mtl ^mxm -mca coll ^hcoll R
--slave < mk.R

Notice -np 1 - this is apparently how you start Rmpi jobs: ranks are spawned
by R dynamically inside the script. So I ran into a number of issues here:

1. with HPCX it seems that dynamic starting of ranks is not supported, hence
I had to turn off all of yalla/mxm/hcoll

--
Your application has invoked an MPI function that is not supported in
this environment.

  MPI function: MPI_Comm_spawn
  Reason:   the Yalla (MXM) PML does not support MPI dynamic process
functionality
--

2. when I do that, the program does create a 'cluster' and starts the ranks,
but hangs in PMIx at MPI Disconnect. Here is the top of the trace from gdb:

#0  0x7f66b1e1e995 in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0
#1  0x7f669eaeba5b in PMIx_Disconnect (procs=procs@entry=0x2e25d20,
nprocs=nprocs@entry=10, info=info@entry=0x0, ninfo=ninfo@entry=0) at
client/pmix_client_connect.c:232
#2  0x7f669ed6239c in ext2x_disconnect (procs=0x7ffd58322440) at
ext2x_client.c:1432
#3  0x7f66a13bc286 in ompi_dpm_disconnect (comm=0x2cc0810) at
dpm/dpm.c:596
#4  0x7f66a13e8668 in PMPI_Comm_disconnect (comm=0x2cbe058) at
pcomm_disconnect.c:67
#5  0x7f66a16799e9 in mpi_comm_disconnect () from
/cluster/software/R-packages/3.5/Rmpi/libs/Rmpi.so
#6  0x7f66b2563de5 in do_dotcall () from
/cluster/software/R/3.5.0/lib64/R/lib/libR.so
#7  0x7f66b25a207b in bcEval () from
/cluster/software/R/3.5.0/lib64/R/lib/libR.so
#8  0x7f66b25b0fd0 in Rf_eval.localalias.34 () from
/cluster/software/R/3.5.0/lib64/R/lib/libR.so
#9  0x7f66b25b2c62 in R_execClosure () from
/cluster/software/R/3.5.0/

Re: [OMPI users] A hang in Rmpi at PMIx_Disconnect

2018-06-05 Thread marcin.krotkiewski
Well, I tried with 3.0.1, and it also hangs. I guess we will try to 
write to the R community about this.


m


On 06/04/2018 11:42 PM, Ben Menadue wrote:

Hi All,

This looks very much like what I reported a couple of weeks ago with 
Rmpi and doMPI — the trace looks the same.  But as far as I could see, 
doMPI does exactly what simple_spawn.c does — use MPI_Comm_spawn to 
create the workers and then MPI_Comm_disconnect them when you call 
closeCluster, and it’s here that it hung.


Ralph suggested trying master, but I haven’t had a chance to try this 
yet. I’ll try it today and see if it works for me now.


Cheers,
Ben


On 5 Jun 2018, at 6:28 am, r...@open-mpi.org wrote:


Yes, that does sound like a bug - the #connects must equal the 
#disconnects.



On Jun 4, 2018, at 1:17 PM, marcin.krotkiewski wrote:


huh. This code also runs, but it also only displays 4 connect / 
disconnect messages. I should add that the test R script shows 4 
connect, but 8 disconnect messages. Looks like a bug to me, but 
where? I guess we will try to contact R forums and ask there.


Bennet: I tried to use doMPI + startMPIcluster / closeCluster. In 
this case I get a warning about fork being used:


--
A process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:  [[36000,2],1] (PID 23617)

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--

And the process hangs as well - no change.

Marcin



On 06/04/2018 05:27 PM, r...@open-mpi.org wrote:

It might call disconnect more than once if it creates multiple communicators. 
Here’s another test case for that behavior:




On Jun 4, 2018, at 7:08 AM, Bennet Fauber  wrote:

Just out of curiosity, but would using Rmpi and/or doMPI help in any way?

-- bennet


On Mon, Jun 4, 2018 at 10:00 AM, marcin.krotkiewski
  wrote:

Thanks, Ralph!

Your code finishes normally, I guess then the reason might be lying in R.
Running the R code with -mca pmix_base_verbose 1 i see that each rank calls
ext2x:client disconnect twice (each PID prints the line twice)

[...]
3 slaves are spawned successfully. 0 failed.
[localhost.localdomain:11659] ext2x:client disconnect
[localhost.localdomain:11661] ext2x:client disconnect
[localhost.localdomain:11658] ext2x:client disconnect
[localhost.localdomain:11646] ext2x:client disconnect
[localhost.localdomain:11658] ext2x:client disconnect
[localhost.localdomain:11659] ext2x:client disconnect
[localhost.localdomain:11661] ext2x:client disconnect
[localhost.localdomain:11646] ext2x:client disconnect

In your example it's only called once per process.

Do you have any suspicion where the second call comes from? Might this be
the reason for the hang?

Thanks!

Marcin


On 06/04/2018 03:16 PM, r...@open-mpi.org wrote:

Try running the attached example dynamic code - if that works, then it
likely is something to do with how R operates.





On Jun 4, 2018, at 3:43 AM, marcin.krotkiewski
  wrote:

Hi,

I have some problems running R + Rmpi with OpenMPI 3.1.0 + PMIx 2.1.1. A
simple R script, which starts a few tasks, hangs at the end on disconnect.
Here is the script:

library(parallel)
numWorkers <- as.numeric(Sys.getenv("SLURM_NTASKS")) - 1
myCluster <- makeCluster(numWorkers, type = "MPI")
stopCluster(myCluster)

And here is how I run it:

SLURM_NTASKS=5 mpirun -np 1 -mca pml ^yalla -mca mtl ^mxm -mca coll ^hcoll R
--slave < mk.R

Notice -np 1 - this is apparently how you start Rmpi jobs: ranks are spawned
by R dynamically inside the script. So I ran into a number of issues here:

1. with HPCX it seems that dynamic starting of ranks is not supported, hence
I had to turn off all of yalla/mxm/hcoll

--
Your application has invoked an MPI function that is not supported in
this environment.

  MPI function: MPI_Comm_spawn
  Reason:   the Yalla (MXM) PML does not support MPI dynamic process
functionality
--

2. when I do that, the program does create a 'cluster' and starts the ranks,
but hangs in PMIx at MPI Disconnect. Here is the top of the trace from gdb:

#0  0x7f66b1e1e995 in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib64/libp