Re: [OMPI users] Openmpi 1.10.x, mpirun and Slurm 15.08 problem

2016-09-23 Thread marcin.krotkiewski

Thanks for a quick answer, Ralph!

This does not work, because em4 is only defined on the frontend node. 
Now I get errors from the compute nodes:


[compute-1-4.local:12206] found interface lo
[compute-1-4.local:12206] found interface em1
[compute-1-4.local:12206] mca: base: components_open: component 
posix_ipv4 open function successful
[compute-1-4.local:12206] mca: base: components_open: found loaded 
component linux_ipv6
[compute-1-4.local:12206] mca: base: components_open: component 
linux_ipv6 open function successful

--------------------------------------------------------------------------
None of the TCP networks specified to be included for out-of-band
communications could be found:

  Value given: em4

Please revise the specification and try again.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No network interfaces were found for out-of-band communications. We require
at least one available network for out-of-band messaging.
--------------------------------------------------------------------------

But since only the front-end node has a different network config, the 
problem only exists when I run interactive sessions using salloc. If I 
use sbatch to submit the jobs, they are executed correctly. uff.
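
(For the record: since the *_if_include parameters also accept subnets in 
CIDR notation, selecting the interfaces by address range rather than by name 
might sidestep the frontend/compute naming mismatch altogether - untested 
here, and the subnet below is just a placeholder:

mpirun --mca oob_tcp_if_include 10.10.0.0/16 --mca btl_tcp_if_include 10.10.0.0/16 ...
)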


Thanks for your help, now I can make my way through it!

Marcin



On 09/23/2016 04:45 PM, r...@open-mpi.org wrote:

This isn’t an issue with the SLURM integration - this is the problem of our OOB 
not correctly picking the right subnet for connecting back to mpirun. In this 
specific case, you probably want

-mca btl_tcp_if_include em4 -mca oob_tcp_if_include em4

since it is the em4 network that ties the compute nodes together, and the 
compute nodes to the frontend

We are working on the subnet selection logic, but the 1.10 series seems to have 
not been updated with those changes


On Sep 23, 2016, at 6:00 AM, Marcin Krotkiewski  
wrote:

Hi,

I have stumbled upon a similar issue, so I wonder if those might be related. On 
one of our systems I get the following error message, both when using openmpi 
1.8.8 and 1.10.4:

$ mpirun -debug-daemons --mca btl tcp,self --mca mca_base_verbose 100 --mca 
btl_base_verbose 100 ls

[...]
[compute-1-1.local:07302] mca: base: close: unloading component direct
[compute-1-1.local:07302] mca: base: close: unloading component radix
[compute-1-1.local:07302] mca: base: close: unloading component debruijn
[compute-1-1.local:07302] orte_routed_base_select: initializing selected 
component binomial
[compute-1-2.local:13744] [[63041,0],2]: parent 0 num_children 0
Daemon [[63041,0],2] checking in as pid 13744 on host c1-2
[compute-1-2.local:13744] [[63041,0],2] orted: up and running - waiting for 
commands!
[compute-1-2.local:13744] [[63041,0],2] tcp_peer_send_blocking: send() to 
socket 9 failed: Broken pipe (32)
[compute-1-2.local:13744] mca: base: close: unloading component binomial
[compute-1-1.local:07302] [[63041,0],1]: parent 0 num_children 0
Daemon [[63041,0],1] checking in as pid 7302 on host c1-1
[compute-1-1.local:07302] [[63041,0],1] orted: up and running - waiting for 
commands!
[compute-1-1.local:07302] [[63041,0],1] tcp_peer_send_blocking: send() to 
socket 9 failed: Broken pipe (32)
[compute-1-1.local:07302] mca: base: close: unloading component binomial
srun: error: c1-1: task 0: Exited with exit code 1
srun: Terminating job step 4538.1
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: c1-2: task 1: Exited with exit code 1


I have also tested version 2.0.1 - this one works without problems.

In my case the problem appears on one system with slurm versions 15.08.8 and 
15.08.12. On another system running 15.08.8 all is working fine, so I guess it 
is not about SLURM version, but maybe system / network configuration?

Following that thought I have also noticed this thread:

http://users.open-mpi.narkive.com/PwJpWXLm/ompi-users-tcp-peer-send-blocking-send-to-socket-9-failed-broken-pipe-32-on-openvz-containers

As Jeff suggested there, I tried to run with --mca btl_tcp_if_include em1 --mca 
oob_tcp_if_include em1, but got the same error.

Could these problems be related to interface naming / lack of InfiniBand? Or to 
the fact that the front-end node, from which I execute mpirun, has a different 
network configuration? The system on which things don't work only has TCP 
network interfaces:

em1, lo (frontend has em1, em4 - local compute network, and lo)

while the cluster on which openmpi does work uses InfiniBand and has the 
following TCP interfaces:

eth0, eth1, ib0, lo

I would appreciate any hints..

Thanks!

Marcin


On 04/01/2016 04:16 PM, Jeff Squyres (jsquyres) wrote:

Ralph --

What's the state of PMI integration with SLURM in the v1.10.x series?  (I 
haven't kept up with SLURM's recent releases to know if something broke between 
existing Open MPI releases and their new releases...?)





[OMPI users] Performance issues: 1.10.x vs 2.x

2017-05-04 Thread marcin.krotkiewski

Hi, everyone,

I ran some bandwidth tests on two different systems with Mellanox IB 
(FDR and EDR). I compiled the three supported versions of openmpi 
(1.10.6, 2.0.2, 2.1.0) and measured the time it takes to send/receive 
4MB arrays of doubles between two hosts connected to the same IB switch. 
MPI_Send/MPI_Recv were performed 1000 times, and the table below gives 
the average bandwidth obtained [MB/s]:


OpenMPI    FDR       EDR
1.10.6     6203.0    11271.1
2.0.2      5128.4    11948.0
2.1.0      5095.1    11947.2

The openib btl was used to transfer the data. The results are puzzling: it 
seems that something changed starting from version 2.x, and the FDR 
system performs much worse than with the earlier 1.10.x release. On the 
EDR system I see the opposite (v2.x is better), but the difference is 
not as dramatic.
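
(For reference, the measurement loop was essentially of this form - a 
stripped-down sketch, not the exact benchmark code:)

#include <mpi.h>
#include <stdio.h>

#define N    (4*1024*1024/sizeof(double))   /* 4 MB of doubles */
#define REPS 1000

int main(int argc, char **argv)
{
    static double buf[N];
    int rank;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0)
            MPI_Send(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("average bandwidth: %.1f MB/s\n",
               REPS * N * sizeof(double) / (t1 - t0) / 1e6);
    MPI_Finalize();
    return 0;
}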


Did anyone experience similar behavior? Is this due to OpenMPI, or 
something else? The two systems run CentOS (FDR: 6.8, EDR: 7.3) and 
Mellanox OFED with a minor version difference.


I'd appreciate any thoughts.

Thanks a lot!

Marcin Krotkiewski




Re: [OMPI users] Performance issues: 1.10.x vs 2.x

2017-05-05 Thread marcin.krotkiewski
Thanks, Paul. That was useful, although in my case it was enough to 
allocate my own arrays using posix_memalign. The internals of OpenMPI 
did not play any role, which I guess is quite natural assuming OpenMPI 
doesn't reallocate.
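
(Concretely, something along these lines - 64-byte alignment chosen 
arbitrarily here:)

#include <stdlib.h>

/* allocate n doubles on a 64-byte (cache-line) boundary; free() as usual */
static double *alloc_aligned(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, 64, n * sizeof(double)) != 0)
        return NULL;
    return (double *)p;
}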


But since that worked, it means that 1.10.6 somehow deals better with 
unaligned data. Does anyone know the reason for this?


Marcin


On 05/04/2017 04:29 PM, Paul Kapinos wrote:

Note that 2.x lost the memory hooks, cf. the thread
https://www.mail-archive.com/devel@lists.open-mpi.org/msg00039.html

The numbers you have look like the 20% loss we have also seen with 4.x 
vs. 1.10.x versions. Try the dirty hook with 'memalign': LD_PRELOAD this:


$ cat alignmalloc64.c
/* Dirk Schmidl (ds53448b), 01/2012 */
#include <malloc.h>   /* memalign(), size_t */
void* malloc(size_t size){
  return memalign(64, size);
}

$ gcc -c -fPIC alignmalloc64.c
$ gcc -shared -Wl,-soname,$(LIBNAME64) -o $(LIBNAME64) alignmalloc64.o




On 05/04/17 12:27, marcin.krotkiewski wrote:
The results are puzzling: it seems that something changed starting 
from version 2.x, and the FDR system performs much worse than with the 
earlier 1.10.x release.








[OMPI users] Bandwidth efficiency advice

2017-05-26 Thread marcin.krotkiewski

Dear All,

I would appreciate some general advice on how to efficiently implement 
the following scenario.


I am looking into how to send a large amount of data over IB _once_, to 
multiple receivers. The trick is, of course, that while the ping-pong 
benchmark delivers great bandwidth, it does so by re-using the already 
registered memory buffers. Since I need to send the data once, the 
memory registration penalty is not easily avoided. I've been looking 
into the following approaches:


1. have multiple ranks send different parts of the data to different 
receivers, in the hope that the memory registration cost will be hidden
2. pre-register two smaller buffers, into which the data is copied before 
sending


The first approach is the best I've managed so far, but the bandwidth 
reached is still lower than what I observe using the pingpong benchmark. 
Also, the performance depends on the number of sending ranks and drops 
if there are too many.


In the second approach one pays for a data copy. My thinking was that 
since the effective memory bandwidth available on a single modern CPU is 
larger than the IB bandwidth, I could squeeze out some performance by 
combining double buffering and multithreading, e.g.,


Step 1. thread A sends the data in the current buffer. Behind the 
scenes, thread B copies data from memory to the next buffer

Step 2. buffers are switched

A similar idea would be to use MPI_Get on the remote rank. The sender 
would copy the data from the memory to the second buffer while the RMA 
window with the first buffer is exposed. In theory, I would expect those 
two operations to be executed simultaneously, with the memory copy 
hopefully hidden behind the IB transfer.
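
(A bare sketch of the double-buffering idea from steps 1-2, with illustrative 
names and chunk size - not my actual code; the receiver would post matching 
MPI_Recv calls, and the copy only overlaps the transfer if the library makes 
asynchronous progress:)

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK (4UL << 20)   /* 4 MB staging buffers */

/* send nbytes of data to rank dst through two reused staging buffers */
void send_double_buffered(const char *data, size_t nbytes, int dst, MPI_Comm comm)
{
    char *buf[2] = { malloc(CHUNK), malloc(CHUNK) };
    MPI_Request req = MPI_REQUEST_NULL;
    size_t off = 0;
    int cur = 0;

    while (off < nbytes) {
        size_t len = nbytes - off < CHUNK ? nbytes - off : CHUNK;
        memcpy(buf[cur], data + off, len);  /* fill the idle buffer while the
                                               other one may still be sending */
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* previous send (other buffer) done */
        MPI_Isend(buf[cur], (int)len, MPI_BYTE, dst, 0, comm, &req);
        off += len;
        cur = 1 - cur;                      /* switch buffers */
    }
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    free(buf[0]);
    free(buf[1]);
}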


Of course, the experiments didn't really work. While the first 
(multi-rank) approach is OK and shows some improvement, the bandwidth 
could still be better. None of my double-buffering approaches worked 
at all, possibly because of memory bandwidth contention.


So I was wondering, has any of you had any experience with similar 
approaches? In your experience, what would be the best approach?


Thanks a lot!

Marcin



[OMPI users] Wrong distance calculations in multi-rail setup?

2015-08-28 Thread marcin.krotkiewski
I have a 4-socket machine with two dual-port InfiniBand cards (devices 
mlx4_0 and mlx4_1). The cards are connected to PCI slots of different 
CPUs (I hope..), both ports are active on both cards, and everything is 
connected to the same physical network.


I use openmpi-1.10.0 and run the IMB-MPI1 benchmark with 4 MPI ranks 
bound to the 4 sockets, hoping to use both IB cards (and both ports):


mpirun --map-by socket --bind-to core -np 4 --mca btl openib,self 
--mca btl_openib_if_include mlx4_0,mlx4_1 ./IMB-MPI1 SendRecv


but OpenMPI refuses to use the mlx4_1 device

[node1.local:28265] [rank=0] openib: skipping device mlx4_1; it is 
too far away

[ the same for other ranks ]

This is confusing, since I have read that OpenMPI automatically uses the 
closer HCA, so at least one rank should choose mlx4_1. I bind by 
socket; here is the reported map:


[node1.local:28263] MCW rank 2 bound to socket 2[core 24[hwt 0]]: 
[./././././././././././.][./././././././././././.][B/././././././././././.][./././././././././././.]
[node1.local:28263] MCW rank 3 bound to socket 3[core 36[hwt 0]]: 
[./././././././././././.][./././././././././././.][./././././././././././.][B/././././././././././.]
[node1.local:28263] MCW rank 0 bound to socket 0[core  0[hwt 0]]: 
[B/././././././././././.][./././././././././././.][./././././././././././.][./././././././././././.]
[node1.local:28263] MCW rank 1 bound to socket 1[core 12[hwt 0]]: 
[./././././././././././.][B/././././././././././.][./././././././././././.][./././././././././././.]


To check what's going on I have modified btl_openib_component.c to print 
the computed distances.


opal_output_verbose(1, ompi_btl_base_framework.framework_output,
                    "[rank=%d] openib: device %d/%d distance %lf",
                    ORTE_PROC_MY_NAME->vpid,
                    (int)i, (int)num_devs,
                    (double)dev_sorted[i].distance);

Here is what I get:

[node1.local:28265] [rank=0] openib: device 0/2 distance 0.00
[node1.local:28266] [rank=1] openib: device 0/2 distance 0.00
[node1.local:28267] [rank=2] openib: device 0/2 distance 0.00
[node1.local:28268] [rank=3] openib: device 0/2 distance 0.00
[node1.local:28265] [rank=0] openib: device 1/2 distance 2.10
[node1.local:28266] [rank=1] openib: device 1/2 distance 1.00
[node1.local:28267] [rank=2] openib: device 1/2 distance 2.10
[node1.local:28268] [rank=3] openib: device 1/2 distance 2.10

So the computed distance to mlx4_0 is 0 on all ranks. I believe this 
should not be so: the distance should be smaller for one rank and larger 
for the three others, as is the case for mlx4_1. Looks like a bug?


Another question: in my configuration two ranks will have a 'closer' 
IB card, but the other two will not. Since their (correctly computed) 
distances to both devices will likely be equal, which device will they 
choose if the selection is automatic? I'd rather they didn't both choose 
mlx4_0. I guess it would be nice if I could specify by hand the 
device/port to be used by a given MPI rank. Is this (going to be) 
possible with OpenMPI?


Thanks a lot,

Marcin



[OMPI users] runtime MCA parameters

2015-09-15 Thread marcin.krotkiewski
I was wondering whether it is possible, or being considered to make it 
possible, for individual ranks to change the various MCA parameters at 
runtime, in addition to the command line?


I tried to google a bit, but did not find any indication that such a 
topic has even been discussed. It would be a very useful thing, 
especially in multi-threaded applications using MPI_THREAD_MULTIPLE, but 
I can come up with plenty of uses in the usual single-threaded setups as 
well.


Marcin


Re: [OMPI users] runtime MCA parameters

2015-09-16 Thread marcin.krotkiewski

Thanks a lot, that looks right! Looks like some reading to do..

Do you know if in the OpenMPI implementation the MPI_T-interfaced MCA 
settings are thread-local, or rank-local?
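
(If I read the MPI_T chapter right, the kind of pre-MPI_Init tweak Nathan 
describes below would look roughly like this - my own untested sketch, and 
the variable name is just an example:)

#include <mpi.h>
#include <string.h>

int main(int argc, char **argv)
{
    int provided, ncvars, i;

    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_T_cvar_get_num(&ncvars);

    /* find the control variable by name and overwrite its value */
    for (i = 0; i < ncvars; i++) {
        char name[256];
        int namelen = sizeof(name), verbosity, bind, scope;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;

        MPI_T_cvar_get_info(i, name, &namelen, &verbosity, &dtype,
                            &enumtype, NULL, NULL, &bind, &scope);
        if (strcmp(name, "btl_tcp_if_include") == 0) {  /* example name */
            MPI_T_cvar_handle handle;
            int count;
            MPI_T_cvar_handle_alloc(i, NULL, &handle, &count);
            MPI_T_cvar_write(handle, "em1");
            MPI_T_cvar_handle_free(&handle);
        }
    }

    MPI_Init(&argc, &argv);   /* the value set above is picked up here */
    /* ... */
    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}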


Thanks!

Marcin


On 09/15/2015 07:58 PM, Nathan Hjelm wrote:

You can use MPI_T to set any MCA variable before MPI_Init. At this time
we lock down all MCA variables during MPI_Init. You will need to call
MPI_T_init_thread before MPI_Init and make sure to call MPI_T_finalize
any time after you are finished setting MCA variables. For more
information see MPI-3.1 chapter 14.

-Nathan

On Tue, Sep 15, 2015 at 07:40:56PM +0200, marcin.krotkiewski wrote:

I was wondering if it is possible, or considered to make it possible to
change the various MCA parameters by individual ranks during runtime in
addition to the command line?

I tried to google a bit, but did not get any indication that such topic has
even been discussed. It would be a very useful thing, especially in
multi-threaded applications when using MPI_THREAD_MULTIPLE, but I could come
up with plenty uses in usual single-threaded ranks setups.

Marcin






[OMPI users] bug in MPI_Comm_accept?

2015-09-16 Thread marcin.krotkiewski
I have run into a freeze / potential bug when using MPI_Comm_accept in a 
simple client / server implementation. I have attached the two simplest 
programs I could produce:


 1. mpi-receiver.c opens a port using MPI_Open_port, saves the port 
name to a file


 2. mpi-receiver enters infinite loop and waits for connections using 
MPI_Comm_accept


 3. mpi-sender.c connects to that port using MPI_Comm_connect, sends 
one MPI_UNSIGNED_LONG, calls barrier and disconnects using 
MPI_Comm_disconnect


 4. mpi-receiver reads the MPI_UNSIGNED_LONG, prints it, calls barrier 
and disconnects using MPI_Comm_disconnect and goes to point 2 - infinite 
loop


All works fine, but only exactly 5 times. After that the receiver hangs 
in MPI_Recv, after exit from MPI_Comm_accept. That is 100% repeatable. I 
have tried with Intel MPI - no such problem.


I execute the programs using OpenMPI 1.10 as follows

mpirun -np 1 --mca mpi_leave_pinned 0 ./mpi-receiver


Do you have any clues what could be the reason? Am I doing something wrong, or 
is it some problem with the internal state of OpenMPI?


Thanks a lot!

Marcin

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
  MPI_Info info;
  char port_name[MPI_MAX_PORT_NAME];
  MPI_Comm intercomm;

  MPI_Init(&argc, &argv);
  MPI_Info_create(&info);
  MPI_Open_port(info, port_name);
  printf("port name: %s\n", port_name);

  /* write port name to file */   
  {
FILE *fd;
fd = fopen("port.txt", "w+");
fprintf(fd, "%s", port_name);
fclose(fd);
  }

  /* accept connections */
  while(1){
unsigned long data;

/* accept connection */
MPI_Comm_accept(port_name, info, 0, MPI_COMM_WORLD, &intercomm);

/* receive comm size from the sender */
MPI_Recv(&data, 1, MPI_UNSIGNED_LONG, 0, 1, intercomm, MPI_STATUS_IGNORE);
printf("received data: %lx\n", data);

MPI_Barrier(intercomm);
MPI_Comm_disconnect(&intercomm);
printf("client disconnected\n");   
  }
}
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[])
{
  char port_name[MPI_MAX_PORT_NAME+1];
  MPI_Info info;
  MPI_Comm intercomm;
  unsigned long data = 0x12345678;

  /* initialize MPI */
  MPI_Init(&argc, &argv);
  MPI_Info_create(&info);

  /* connect to receiver ranks - port is a string parameter */
  strcpy(port_name, argv[1]);

  /* connect to server - intercomm is the remote communicator */
  MPI_Comm_connect(port_name, info, 0, MPI_COMM_WORLD, &intercomm);
  printf("** connected\n");

  /* send data */
  MPI_Send(&data, 1, MPI_UNSIGNED_LONG, 0, 1, intercomm);
  MPI_Barrier(intercomm);

  /* disconnect */
  MPI_Comm_disconnect(&intercomm);
  MPI_Finalize();
  printf("** disconnected\n");

  return 0;
}


Re: [OMPI users] bug in MPI_Comm_accept?

2015-09-16 Thread marcin.krotkiewski


I have removed the MPI_Barrier, to no avail. Same thing happens. Adding 
verbosity, before the receiver hangs I get the following message


[node2:03928] mca: bml: Using openib btl to [[12620,1],0] on node node3

So it is somewhere in the openib btl module.

Marcin


On 09/16/2015 04:34 PM, Jalel Chergui wrote:
Right; in any case, Finalize is necessary at the end of the receiver. The 
other issue is the Barrier, which is probably invoked after the sender 
has exited, hence changing the size of the intercommunicator. Can you 
comment out that line in both files?


Jalel

On 16/09/2015 16:22, Marcin Krotkiewski wrote:
But where would I put it? If I put it in the while(1), then 
MPI_Comm_accept cannot be called for the second time. If I put it 
outside of the loop it will never be called.



On 09/16/2015 04:18 PM, Jalel Chergui wrote:

Can you check with an MPI_Finalize in the receiver ?
Jalel

On 16/09/2015 16:06, marcin.krotkiewski wrote:
I have run into a freeze / potential bug when using MPI_Comm_accept 
in a simple client / server implementation. I have attached two 
simplest programs I could produce:


 1. mpi-receiver.c opens a port using MPI_Open_port, saves the port 
name to a file


 2. mpi-receiver enters infinite loop and waits for connections 
using MPI_Comm_accept


 3. mpi-sender.c connects to that port using MPI_Comm_connect, 
sends one MPI_UNSIGNED_LONG, calls barrier and disconnects using 
MPI_Comm_disconnect


 4. mpi-receiver reads the MPI_UNSIGNED_LONG, prints it, calls 
barrier and disconnects using MPI_Comm_disconnect and goes to point 
2 - infinite loop


All works fine, but only exactly 5 times. After that the receiver 
hangs in MPI_Recv, after exit from MPI_Comm_accept. That is 100% 
repeatable. I have tried with Intel MPI - no such problem.


I execute the programs using OpenMPI 1.10 as follows

mpirun -np 1 --mca mpi_leave_pinned 0 ./mpi-receiver


Do you have any clues what could be the reason? Am I doing sth 
wrong, or is it some problem with internal state of OpenMPI?


Thanks a lot!

Marcin





--
**
  Jalel CHERGUI, LIMSI-CNRS, Bât. 508 - BP 133, 91403 Orsay cedex, FRANCE
  Tél: (33 1) 69 85 81 27 ; Télécopie: (33 1) 69 85 80 88
  Mél:jalel.cher...@limsi.fr  ; Référence:http://perso.limsi.fr/chergui
**








--
**
  Jalel CHERGUI, LIMSI-CNRS, Bât. 508 - BP 133, 91403 Orsay cedex, FRANCE
  Tél: (33 1) 69 85 81 27 ; Télécopie: (33 1) 69 85 80 88
  Mél:jalel.cher...@limsi.fr  ; Référence:http://perso.limsi.fr/chergui
**






Re: [OMPI users] bug in MPI_Comm_accept? (UNCLASSIFIED)

2015-09-16 Thread marcin.krotkiewski

Thank you all for your replies.

I have now tested the code with various setups and versions. First of 
all, the tcp btl seems to work fine (I had the patience to check ~10 runs); 
openib is the problem. I have also compiled using the Intel compiler, 
and the story is the same as when using gcc.


I have then tested many openmpi versions from 1.7.5 to 1.10.0 using 
bisection ;) Versions up to and including 1.8.3 worked fine (at least 
more than 5 times, around 10), so the problem was likely introduced in 
version 1.8.4. Actually, version 1.8.4 was the only one to spit out an 
interesting warning on the receiver side at the moment it hung:


[warn] opal_libevent2021_event_base_loop: reentrant invocation. Only one 
event_base_loop can run on each event_base at once.


which may or may not be of importance in this particular case ;)

So to summarize, the problem appeared in the openib btl in version 1.8.4.

Does anybody have any more ideas?

Thanks!

Marcin



On 09/16/2015 05:59 PM, Burns, Andrew J CTR USARMY RDECOM ARL (US) wrote:

CLASSIFICATION: UNCLASSIFIED

Have you attempted using 2 cores per process? I have noticed that 
MPI_Comm_accept sometimes behaves strangely on single core variations.

I have a program that makes use of Comm_accept/connect and I also call 
MPI_Comm_merge. So, you may want to look into that call as well.

-Andrew Burns

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jalel Chergui
Sent: Wednesday, September 16, 2015 11:49 AM
To: us...@open-mpi.org
Subject: Re: [OMPI users] bug in MPI_Comm_accept?

With openmpi-1.7.5, the sender segfaults.

Sorry, I cannot see the problem in the codes. Perhaps people out there may help.

Jalel


On 16/09/2015 16:40, marcin.krotkiewski wrote:

I have removed the MPI_Barrier, to no avail. Same thing happens. Adding 
verbosity, before the receiver hangs I get the following message

[node2:03928] mca: bml: Using openib btl to [[12620,1],0] on node node3

So It is somewhere in the openib btl module

Marcin


On 09/16/2015 04:34 PM, Jalel Chergui wrote:
Right, anyway Finalize is necessary at the end of the receiver. The other issue 
is Barrier which is invoked probably when the sender has exited hence changing 
the size of intercom. Can you comment that line in both files ?

Jalel

On 16/09/2015 16:22, Marcin Krotkiewski wrote:
But where would I put it? If I put it in the while(1), then MPI_Comm_Accept 
cannot be called for the second time. If I put it outside of the loop it will 
never be called.


On 09/16/2015 04:18 PM, Jalel Chergui wrote:
Can you check with an MPI_Finalize in the receiver ?
Jalel

On 16/09/2015 16:06, marcin.krotkiewski wrote:
I have run into a freeze / potential bug when using MPI_Comm_accept in a simple 
client / server implementation. I have attached two simplest programs I could 
produce:

  1. mpi-receiver.c opens a port using MPI_Open_port, saves the port name to a 
file

  2. mpi-receiver enters infinite loop and waits for connections using 
MPI_Comm_accept

  3. mpi-sender.c connects to that port using MPI_Comm_connect, sends one 
MPI_UNSIGNED_LONG, calls barrier and disconnects using MPI_Comm_disconnect

  4. mpi-receiver reads the MPI_UNSIGNED_LONG, prints it, calls barrier and 
disconnects using MPI_Comm_disconnect and goes to point 2 - infinite loop

All works fine, but only exactly 5 times. After that the receiver hangs in 
MPI_Recv, after exit from MPI_Comm_accept. That is 100% repeatable. I have 
tried with Intel MPI - no such problem.

I execute the programs using OpenMPI 1.10 as follows

mpirun -np 1 --mca mpi_leave_pinned 0 ./mpi-receiver


Do you have any clues what could be the reason? Am I doing sth wrong, or is it 
some problem with internal state of OpenMPI?

Thanks a lot!

Marcin






--
**
  Jalel CHERGUI, LIMSI-CNRS, Bât. 508 - BP 133, 91403 Orsay cedex, FRANCE
  Tél: (33 1) 69 85 81 27 ; Télécopie: (33 1) 69 85 80 88
  Mél: jalel.cher...@limsi.fr ; Référence: http://perso.limsi.fr/chergui
**





[OMPI users] Using POSIX shared memory as send buffer

2015-09-27 Thread marcin.krotkiewski

Hello, everyone

I am struggling a bit with IB performance when sending data from a POSIX 
shared memory region (/dev/shm). The memory is shared among many MPI 
processes within the same compute node. Essentially, I see somewhat 
erratic performance, but it seems that my code is roughly twice as slow 
as when using a usual, malloc'ed send buffer.


I was wondering - has any of you had experience with sending shared 
memory buffers over InfiniBand? Why would I see such worse results? Is 
it, e.g., because this memory cannot be pinned and OpenMPI is 
reallocating it? Or is it some OS peculiarity?


I would appreciate any hints at all. Thanks a lot !

Marcin



Re: [OMPI users] Using POSIX shared memory as send buffer

2015-09-29 Thread marcin.krotkiewski


I've now run a few more tests and I think I can reasonably confidently 
say that the read-only mmap is the problem. Let me know if you have a 
possible fix - I will gladly test it.
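
(Side note: if I understand Nathan's point below about registering for both 
read and write access, at the verbs level it boils down to something like 
this - my own sketch, not Open MPI's actual code:)

#include <infiniband/verbs.h>
#include <stddef.h>

/* register a send buffer roughly the way a btl might */
static struct ibv_mr *reg_send_buf(struct ibv_pd *pd, void *buf, size_t len)
{
    /* IBV_ACCESS_LOCAL_WRITE needs write permission on the pages, so a
       PROT_READ-only mapping cannot be registered like this and a slower
       (copy-based) path has to be taken instead */
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);
}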


Marcin


On 09/29/2015 04:59 PM, Nathan Hjelm wrote:

We register the memory with the NIC for both read and write access. This
may be the source of the slowdown. We recently added internal support to
allow the point-to-point layer to specify the access flags but the
openib btl does not yet make use of the new support. I plan to make the
necessary changes before the 2.0.0 release. I should have them complete
later this week. I can send you a note when they are ready if you would
like to try it and see if it addresses the problem.

-Nathan

On Tue, Sep 29, 2015 at 10:51:38AM +0200, Marcin Krotkiewski wrote:

Thanks, Dave.

I have verified the memory locality and IB card locality, all's fine.

Quite accidentally I have found that there is a huge penalty if I mmap the
shm with PROT_READ only. Using PROT_READ | PROT_WRITE yields good results,
although I must look at this further. I'll report when I am certain, in case
sb finds this useful.

Is this an OS feature, or is OpenMPI somehow working differently? I don't
suspect you guys write to the send buffer, right? Even if you did, there
would be a segfault. So I guess it could be the OS preventing any writes to
the pointer that introduced the overhead?

Marcin



On 09/28/2015 09:44 PM, Dave Goodell (dgoodell) wrote:

On Sep 27, 2015, at 1:38 PM, marcin.krotkiewski  
wrote:

Hello, everyone

I am struggling a bit with IB performance when sending data from a POSIX shared 
memory region (/dev/shm). The memory is shared among many MPI processes within 
the same compute node. Essentially, I see a bit hectic performance, but it 
seems that my code it is roughly twice slower than when using a usual, malloced 
send buffer.

It may have to do with NUMA effects and the way you're allocating/touching your shared 
memory vs. your private (malloced) memory.  If you have a multi-NUMA-domain system (i.e., 
any 2+ socket server, and even some single-socket servers) then you are likely to run 
into this sort of issue.  The PCI bus on which your IB HCA communicates is almost 
certainly closer to one NUMA domain than the others, and performance will usually be 
worse if you are sending/receiving from/to a "remote" NUMA domain.

"lstopo" and other tools can sometimes help you get a handle on the situation, though I don't 
know if it knows how to show memory affinity.  I think you can find memory affinity for a process via 
"/proc//numa_maps".  There's lots of info about NUMA affinity here: 
https://queue.acm.org/detail.cfm?id=2513149

-Dave








Re: [OMPI users] libfabric/usnic does not compile in 2.x

2015-09-30 Thread marcin.krotkiewski

Thank you, and Jeff, for clarification.

Before I bother you all more without the need, I should probably say I 
was hoping to use libfabric/OpenMPI on an InfiniBand cluster. Somehow 
now I feel I have confused this altogether, so maybe I should go one 
step back:


 1. libfabric is hardware independent, and does support InfiniBand, right?
 2. I read that OpenMPI provides an interface to libfabric through 
btl/usnic and mtl/ofi. Can any of those use libfabric on InfiniBand 
networks?


Please forgive my ignorance, the amount of different options is rather 
overwhelming..


Marcin



On 09/30/2015 04:26 PM, Howard Pritchard wrote:


Hello Marcin

What configure options are you using besides with-libfabric?

Could you post your config.log file tp the list?

It looks like fi_ext_usnic.h is only installed if the usnic libfabric 
provider could be built. When you configured libfabric, what providers 
were listed at the end of the configure run? Maybe attach config.log from 
the libfabric build?


If your cluster has Cisco usNICs you should probably be using 
libfabric/Cisco Open MPI. If you are using Intel Omni-Path you may want 
to try the ofi mtl. It is not selected by default, however.


Howard

--

Sent from my smart phone, so no good typing.

Howard

On Sep 30, 2015 5:35 AM, "Marcin Krotkiewski" 
<marcin.krotkiew...@gmail.com> wrote:


Hi,

I am trying to compile the 2.x branch with libfabric support, but
get this error during configure:

configure:100708: checking rdma/fi_ext_usnic.h presence
configure:100708: gcc -E
-I/cluster/software/VERSIONS/openmpi.gnu.2.x/include

-I/usit/abel/u1/marcink/software/ompi-release-2.x/opal/mca/hwloc/hwloc1110/hwloc/include
conftest.c
conftest.c:688:31: fatal error: rdma/fi_ext_usnic.h: No such file
or directory
[...]
configure:100708: checking for rdma/fi_ext_usnic.h
configure:100708: result: no
configure:101253: checking if MCA component btl:usnic can compile
configure:101255: result: no

Which is correct - the file is not there. I have downloaded fresh
libfabric-1.1.0.tar.bz2 and it does not have this file. Probably
OpenMPI needs some updates?

I am also wondering what is the state of libfabric support in
OpenMPI nowadays. I have seen recent (March) presentation about
it, so it seems to be an actively developed feature. Is this
correct? It seemed from the presentation that there are benefits
to this approach, but is it mature enough in OpenMPI, or it will
yet take some time?

Thanks!

Marcin







Re: [OMPI users] Using POSIX shared memory as send buffer

2015-09-30 Thread marcin.krotkiewski

Hi, Nathan

I have compiled 2.x with your patch. I must say it works _much_ better 
with your changes. I have no idea how you figured that out! A short 
table with my bandwidth calculations (MB/s)


            PROT_READ     PROT_READ | PROT_WRITE
1.10.0      2500          5700
2.x+patch   4800-5200     5700

That is not a very thorough study, but essentially I was getting 
2500MB/s with read-only shm. With your patch it is somewhat shaky (very 
rarely I get 2500 also), but most of the time it is around 5000MB/s.
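
(For reference, the two mappings I am comparing are essentially of this 
form - illustrative code, not the real thing:)

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* map an existing POSIX shm object (name like "/mybuf", lives under
   /dev/shm) either read-only or read-write */
static void *map_shm(const char *name, size_t len, int writable)
{
    int fd = shm_open(name, O_RDWR, 0600);
    int prot = writable ? PROT_READ | PROT_WRITE : PROT_READ;
    void *p = (fd < 0) ? MAP_FAILED : mmap(NULL, len, prot, MAP_SHARED, fd, 0);
    if (fd >= 0) close(fd);          /* the mapping stays valid after close */
    return p == MAP_FAILED ? NULL : p;
}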


It seems that mmapping the memory read-write still yields marginally 
better results. Again, I do not have very solid data to support it - just 
a bunch of runs.


Do you have an idea as to why such performance difference exists?

Thanks a lot!

Marcin


On 09/30/2015 12:37 AM, Nathan Hjelm wrote:

There was a bug in that patch that affected IB systems. Updated patch:

https://github.com/hjelmn/ompi/commit/c53df23c0bcf8d1c531e04d22b96c8c19f9b3fd1.patch

-Nathan

On Tue, Sep 29, 2015 at 03:35:21PM -0600, Nathan Hjelm wrote:

I have a branch with the changes available at:

https://github.com/hjelmn/ompi.git

in the mpool_update branch. If you prefer you can apply this patch to
either a 2.x or a master tarball.

https://github.com/hjelmn/ompi/commit/8839dbfae85ba8f443b2857f9bbefdc36c4ebc1a.patch

Let me know if this resolves the performance issues.

-Nathan

On Tue, Sep 29, 2015 at 09:57:54PM +0200, marcin.krotkiewski wrote:

I've now run a few more tests and I think I can reasonably confidently say
that the read only mmap is a problem. Let me know if you have a possible
fix - I will gladly test it.

Marcin

On 09/29/2015 04:59 PM, Nathan Hjelm wrote:

  We register the memory with the NIC for both read and write access. This
  may be the source of the slowdown. We recently added internal support to
  allow the point-to-point layer to specify the access flags but the
  openib btl does not yet make use of the new support. I plan to make the
  necessary changes before the 2.0.0 release. I should have them complete
  later this week. I can send you a note when they are ready if you would
  like to try it and see if it addresses the problem.

  -Nathan

  On Tue, Sep 29, 2015 at 10:51:38AM +0200, Marcin Krotkiewski wrote:

  Thanks, Dave.

  I have verified the memory locality and IB card locality, all's fine.

  Quite accidentally I have found that there is a huge penalty if I mmap the
  shm with PROT_READ only. Using PROT_READ | PROT_WRITE yields good results,
  although I must look at this further. I'll report when I am certain, in case
  sb finds this useful.

  Is this an OS feature, or is OpenMPI somehow working differently? I don't
  suspect you guys write to the send buffer, right? Even if you would there
  would be a segfault. So I guess this could be OS preventing any writes to
  the pointer that introduced the overhead?

  Marcin



  On 09/28/2015 09:44 PM, Dave Goodell (dgoodell) wrote:

  On Sep 27, 2015, at 1:38 PM, marcin.krotkiewski 
 wrote:

  Hello, everyone

  I am struggling a bit with IB performance when sending data from a POSIX 
shared memory region (/dev/shm). The memory is shared among many MPI processes 
within the same compute node. Essentially, I see a bit hectic performance, but 
it seems that my code it is roughly twice slower than when using a usual, 
malloced send buffer.

  It may have to do with NUMA effects and the way you're allocating/touching your shared 
memory vs. your private (malloced) memory.  If you have a multi-NUMA-domain system (i.e., 
any 2+ socket server, and even some single-socket servers) then you are likely to run 
into this sort of issue.  The PCI bus on which your IB HCA communicates is almost 
certainly closer to one NUMA domain than the others, and performance will usually be 
worse if you are sending/receiving from/to a "remote" NUMA domain.

  "lstopo" and other tools can sometimes help you get a handle on the situation, though I don't 
know if it knows how to show memory affinity.  I think you can find memory affinity for a process via 
"/proc//numa_maps".  There's lots of info about NUMA affinity here: 
https://queue.acm.org/detail.cfm?id=2513149

  -Dave




Re: [OMPI users] libfabric/usnic does not compile in 2.x

2015-09-30 Thread marcin.krotkiewski


Thank you for this clear explanation. I do not have True Scale on 'my' 
machine, so unless Mellanox gets involved - no juice for me.


Makes me wonder. libfabric is marketed as a next-generation solution. 
Clearly it has some reported advantage for Cisco usnic, but since you 
claim no improvement over psm, then I guess it is nothing to look 
forward to, is it?


Anyway, thanks a lot for clearing this up

Marcin


On 09/30/2015 08:13 PM, Howard Pritchard wrote:

Hi Marcin,


2015-09-30 9:19 GMT-06:00 marcin.krotkiewski 
<marcin.krotkiew...@gmail.com>:


Thank you, and Jeff, for clarification.

Before I bother you all more without the need, I should probably
say I was hoping to use libfabric/OpenMPI on an InfiniBand
cluster. Somehow now I feel I have confused this altogether, so
maybe I should go one step back:

 1. libfabric is hardware independent, and does support
Infiniband, right?


The short answer is yes, libfabric is hardware independent (and does 
work on good days on OS X as well as Linux). The longer answer is that 
there has been more or less work on implementing providers (the plugins 
into libfabric that interface to different networks) for different 
networks.

There is a sockets provider. That gets a good amount of attention 
because it's the base reference provider. psm/psm2 providers are 
available. I have used the psm provider some on a TrueScale cluster. It 
doesn't offer better performance than just using psm directly, but it 
does appear to work.


There is an mxm provider, but it was not implemented by Mellanox, and I 
can't get it to compile on my ConnectX-3 system using mxm 1.5.

There is a vanilla verbs provider, but it doesn't support the FI_EP_RDM 
endpoint type, which is used by the non-Cisco libfabric component of 
Open MPI that is available (the ofi mtl).

When you build and install libfabric, there should be an fi_info 
binary installed in $(LIBFABRIC_INSTALL_DIR)/bin.

On my TrueScale cluster the output is:

psm: psm
    version: 0.9
    type: FI_EP_RDM
    protocol: FI_PROTO_PSMX
verbs: IB-0x80fe
    version: 1.0
    type: FI_EP_MSG
    protocol: FI_PROTO_RDMA_CM_IB_RC
sockets: IP
    version: 1.0
    type: FI_EP_MSG
    protocol: FI_PROTO_SOCK_TCP
sockets: IP
    version: 1.0
    type: FI_EP_DGRAM
    protocol: FI_PROTO_SOCK_TCP
sockets: IP
    version: 1.0
    type: FI_EP_RDM
    protocol: FI_PROTO_SOCK_TCP

In order to use the mtl/ofi, at a minimum a provider needs to support 
the FI_EP_RDM type (see above). Note that on the TrueScale cluster the 
verbs provider is built, but it only supports FI_EP_MSG endpoint types, 
so mtl/ofi can't use that.


 2. I read that OpenMPI provides interface to libfabric through
btl/usnic and mtl/ofi.  can any of those use libfabric on
Infiniband networks?


If you have Intel TrueScale or its follow-on, then the answer is yes, 
although the default is for Open MPI to use mtl/psm on that network.



Please forgive my ignorance, the amount of different options is
rather overwhelming..

Marcin



On 09/30/2015 04:26 PM, Howard Pritchard wrote:


Hello Marcin

What configure options are you using besides with-libfabric?

Could you post your config.log file tp the list?

Looks like you only install fi_ext_usnic.h if you could build the
usnic libfab provider.  When you configured libfabric what
providers were listed at the end of configure run? Maybe attach
config.log from the libfabric build ?

If your cluster has cisco usnics you should probably be using
libfabric/cisco openmpi. If you are using intel omnipath you may
want to try the ofi mtl.  Its not selected by default however.

Howard

--

sent from my smart phonr so no good type.

Howard

On Sep 30, 2015 5:35 AM, "Marcin Krotkiewski"
<marcin.krotkiew...@gmail.com> wrote:

Hi,

I am trying to compile the 2.x branch with libfabric support,
but get this error during configure:

configure:100708: checking rdma/fi_ext_usnic.h presence
configure:100708: gcc -E
-I/cluster/software/VERSIONS/openmpi.gnu.2.x/include

-I/usit/abel/u1/marcink/software/ompi-release-2.x/opal/mca/hwloc/hwloc1110/hwloc/include
conftest.c
conftest.c:688:31: fatal error: rdma/fi_ext_usnic.h: No such
file or directory
[...]
configure:100708: checking for rdma/fi_ext_usnic.h
configure:100708: result: no
configure:101253: checking if MCA component btl:usnic can compile
configure:101255: result: no

Which is correct - the file is not there. I have downloaded
fresh libfabric-1.1.0.tar.bz2 and it does not have this file.
Probably OpenMPI needs some updates?

I am also wondering what is the state of libfabric support in
OpenMPI nowadays. I have seen recent (March) presentation
about

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread marcin.krotkiewski

Hi, Ralph,

I submit my slurm job as follows

salloc --ntasks=64 --mem-per-cpu=2G --time=1:0:0

Effectively, the allocated CPU cores are spread among many cluster 
nodes. SLURM uses cgroups to limit the CPU cores available for MPI 
processes running on a given cluster node. Compute nodes are 2-socket, 
8-core E5-2670 systems with HyperThreading on:


node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node distances:
node   0   1
  0:  10  21
  1:  21  10

I run the MPI program with the command

mpirun  --report-bindings --bind-to core -np 64 ./affinity

The program simply runs sched_getaffinity for each process and prints 
out the result.
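
(The affinity.c attachment is not reproduced here, but it is essentially 
along these lines - a reconstruction, not the exact file:)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    cpu_set_t set;
    char host[256];
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    gethostname(host, sizeof(host));

    CPU_ZERO(&set);
    sched_getaffinity(0, sizeof(set), &set);   /* 0 = calling process */

    printf("rank %d @ %s ", rank, host);
    for (i = 0; i < CPU_SETSIZE; i++)
        if (CPU_ISSET(i, &set))
            printf(" %d,", i);
    printf("\n");

    MPI_Finalize();
    return 0;
}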


---
TEST RUN 1
---
For this particular job the problem is more severe: openmpi fails to run 
at all with error


--
Open MPI tried to bind a new process, but something went wrong.  The
process was killed without launching the target application.  Your job
will now abort.

  Local host:c6-6
  Application name:  ./affinity
  Error message: hwloc_set_cpubind returned "Error" for bitmap "8,24"
  Location:  odls_default_module.c:551
--

This is SLURM environment variables:

SLURM_JOBID=12712225
SLURM_JOB_CPUS_PER_NODE='3(x2),2,1(x3),2(x2),1,3(x3),5,1,4,1,3,2,3,7,1,5,6,1'
SLURM_JOB_ID=12712225
SLURM_JOB_NODELIST='c6-[3,6-8,12,14,17,22-23],c8-[4,7,9,17,20,28],c15-[5,10,18,20,22-24,28],c16-11'
SLURM_JOB_NUM_NODES=24
SLURM_JOB_PARTITION=normal
SLURM_MEM_PER_CPU=2048
SLURM_NNODES=24
SLURM_NODELIST='c6-[3,6-8,12,14,17,22-23],c8-[4,7,9,17,20,28],c15-[5,10,18,20,22-24,28],c16-11'
SLURM_NODE_ALIASES='(null)'
SLURM_NPROCS=64
SLURM_NTASKS=64
SLURM_SUBMIT_DIR=/cluster/home/marcink
SLURM_SUBMIT_HOST=login-0-2.local
SLURM_TASKS_PER_NODE='3(x2),2,1(x3),2(x2),1,3(x3),5,1,4,1,3,2,3,7,1,5,6,1'

There are also a lot of warnings like

[compute-6-6.local:20158] MCW rank 4 is not bound (or bound to all 
available processors)



---
TEST RUN 2
---

In another allocation I got a different error

--
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to: CORE
   Node:c6-19
   #processes:  2
   #cpus:   1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--

and the allocation was the following

SLURM_JOBID=12712250
SLURM_JOB_CPUS_PER_NODE='3(x2),2,1,15,1,3,16,2,1,3(x2),2,5,4'
SLURM_JOB_ID=12712250
SLURM_JOB_NODELIST='c6-[3,6-8,12,14,17,19,22-23],c8-[4,7,9,17,28]'
SLURM_JOB_NUM_NODES=15
SLURM_JOB_PARTITION=normal
SLURM_MEM_PER_CPU=2048
SLURM_NNODES=15
SLURM_NODELIST='c6-[3,6-8,12,14,17,19,22-23],c8-[4,7,9,17,28]'
SLURM_NODE_ALIASES='(null)'
SLURM_NPROCS=64
SLURM_NTASKS=64
SLURM_SUBMIT_DIR=/cluster/home/marcink
SLURM_SUBMIT_HOST=login-0-2.local
SLURM_TASKS_PER_NODE='3(x2),2,1,15,1,3,16,2,1,3(x2),2,5,4'


If in this case I run on only 32 cores

mpirun  --report-bindings --bind-to core -np 32 ./affinity

the process starts, but I get the original binding problem:

[compute-6-8.local:31414] MCW rank 8 is not bound (or bound to all 
available processors)


Running with --hetero-nodes yields exactly the same results





Hope the above is useful. The problem with binding under SLURM, with CPU 
cores spread over nodes, seems to be very reproducible. It actually 
happens very often that OpenMPI dies with an error like the above. These 
tests were run with openmpi-1.8.8 and 1.10.0, both giving the same 
results.

One more suggestion. The warning message (MCW rank 8 is not bound...) is 
ONLY displayed when I use --report-bindings. It is never shown if I 
leave out this option, and although the binding is wrong the user is not 
notified. I think it would be better to show this warning in all cases 
where binding fails.


Let me know if you need more information. I can help to debug this - it 
is a rather crucial issue.


Thanks!

Marcin






On 10/02/2015 11:49 PM, Ralph Castain wrote:

Can you please send me the allocation request you made (so I can see what you 
specified on the cmd line), and the mpirun cmd line?

Thanks
Ralph


On Oct 2, 2015, at 8:25 AM, Marcin Krotkiewski  
wrote:

Hi,

I fail to make OpenMPI bind to cores correctly when running from within 
SLURM-allocated CPU resources spread over a range of compute nodes in an 
otherwise homogeneous cluster. I have found this thread

http://www.open-mpi.org/community/lists/users/2014/06/24682.php

and did try to use what Ralph suggested there (--hetero-nodes), but it does not 
work (v. 1.10.0). When running with --report-bindings I get messages like

[compute-9-11.local:27571] MCW rank 10 is not bound (or bound to all available 
processors)

for all ranks outside 

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread marcin.krotkiewski


On 10/03/2015 01:06 PM, Ralph Castain wrote:

Thanks Marcin. Looking at this, I’m guessing that Slurm may be treating HTs as 
“cores” - i.e., as independent cpus. Any chance that is true?
Not to the best of my knowledge, and at least not intentionally. SLURM 
starts as many processes as there are physical cores, not threads. To 
verify this, consider this test case:


SLURM_JOB_CPUS_PER_NODE='6,8(x2),10'
SLURM_JOB_NODELIST='c1-[30-31],c2-[32,34]'

If I now execute only one mpi process WITH NO BINDING, it will go onto 
c1-30 and should have a map with 6 CPUs (12 hw threads). I run


mpirun --bind-to none -np 1 ./affinity
rank 0 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,

I have attached the affinity.c program FYI. Clearly, sched_getaffinity 
in my test code returns the correct map.


Now if I try to start all 32 processes in this example (still no binding):

rank 0 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 1 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 10 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
27, 28, 29, 30, 31,
rank 11 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
27, 28, 29, 30, 31,
rank 12 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
27, 28, 29, 30, 31,
rank 13 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
27, 28, 29, 30, 31,
rank 6 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
27, 28, 29, 30, 31,

rank 2 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 7 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
27, 28, 29, 30, 31,
rank 8 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
27, 28, 29, 30, 31,

rank 3 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 14 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
26, 27, 28, 29, 30,

rank 4 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 15 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
26, 27, 28, 29, 30,
rank 9 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
27, 28, 29, 30, 31,

rank 5 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 16 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
26, 27, 28, 29, 30,
rank 17 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
26, 27, 28, 29, 30,
rank 29 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
18, 19, 20, 21, 22, 23, 30, 31,
rank 30 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
18, 19, 20, 21, 22, 23, 30, 31,
rank 18 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
26, 27, 28, 29, 30,
rank 19 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
26, 27, 28, 29, 30,
rank 31 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
18, 19, 20, 21, 22, 23, 30, 31,
rank 20 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
26, 27, 28, 29, 30,
rank 22 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
18, 19, 20, 21, 22, 23, 30, 31,
rank 21 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
26, 27, 28, 29, 30,
rank 23 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
18, 19, 20, 21, 22, 23, 30, 31,
rank 24 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
18, 19, 20, 21, 22, 23, 30, 31,
rank 25 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
18, 19, 20, 21, 22, 23, 30, 31,
rank 26 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
18, 19, 20, 21, 22, 23, 30, 31,
rank 27 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
18, 19, 20, 21, 22, 23, 30, 31,
rank 28 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
18, 19, 20, 21, 22, 23, 30, 31,



Still looks ok to me. If I now turn the binding on, openmpi fails:


--
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to: CORE
   Node:c1-31
   #processes:  2
   #cpus:   1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--

The above tests were done with 1.10.1rc1, so it does not fix the problem.

Marcin



I’m wondering because bind-to core will attempt to bind your proc to both HTs 
on the core. For some reason, we thought that 8,24 were HTs on the same core, 
which is why we tried to bind to that pair of HTs. We got an error because HT 
#24 was not allocated to us on node c6, but HT #8 was.



On Oct 3, 2015, at 2:43 AM, marcin.krotkiewski  
wrote:

Hi, Ralph,

I submit my slurm job as follows

salloc --ntasks=64 --mem-per-cpu=2G --time=1:0:0

Effectively, the allocated CPU cores are spread amount many cluster nodes. 
SLURM uses cgroups to limit t

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread marcin.krotkiewski


On 10/03/2015 04:38 PM, Ralph Castain wrote:
If mpirun isn’t trying to do any binding, then you will of course get 
the right mapping as we’ll just inherit whatever we received. 
Yes. I meant that whatever you received (what SLURM gives) is a correct 
cpu map and assigns _whole_ CPUs, not a single HT, to MPI processes. In 
the case mentioned earlier openmpi should start 6 tasks on c1-30. If HTs 
were treated as separate and independent cores, sched_getaffinity of 
an MPI process started on c1-30 would return a map with 6 entries only. 
In my case it returns a map with 12 entries - 2 for each core. So one 
process is in fact allocated both HTs, not only one. Is what I'm saying 
correct?


Looking at your output, it’s pretty clear that you are getting 
independent HTs assigned and not full cores. 
How do you mean? Is the above understanding wrong? I would expect that 
on c1-30 with --bind-to core openmpi should bind to logical cores 0 and 
16 (rank 0), 1 and 17 (rank 2) and so on. All those logical cores are 
available in sched_getaffinity map, and there is twice as many logical 
cores as there are MPI processes started on the node.


My guess is that something in slurm has changed such that it detects 
that HT has been enabled, and then begins treating the HTs as 
completely independent cpus.


Try changing “-bind-to core” to “-bind-to hwthread -use-hwthread-cpus” 
and see if that works



I have and the binding is wrong. For example, I got this output

rank 0 @ compute-1-30.local  0,
rank 1 @ compute-1-30.local  16,

Which means that two ranks have been bound to the same physical core 
(logical cores 0 and 16 are two HTs of the same core). If I use 
--bind-to core, I get the following correct binding


rank 0 @ compute-1-30.local  0, 16,

The problem is that many other ranks get a bad binding, with the 'rank 
XXX is not bound (or bound to all available processors)' warning.


But I think I was not entirely correct saying that 1.10.1rc1 did not fix 
things. It still might have improved something, but not everything. 
Consider this job:


SLURM_JOB_CPUS_PER_NODE='5,4,6,5(x2),7,5,9,5,7,6'
SLURM_JOB_NODELIST='c8-[31,34],c9-[30-32,35-36],c10-[31-34]'

If I run 32 tasks as follows (with 1.10.1rc1)

mpirun --hetero-nodes --report-bindings --bind-to core -np 32 ./affinity

I get the following error:

--
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to: CORE
   Node:c9-31
   #processes:  2
   #cpus:   1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--


If I now use --bind-to core:overload-allowed, then openmpi starts and 
_most_ of the threads are bound correctly (i.e., map contains two 
logical cores in ALL cases), except this case that required the overload 
flag:


rank 15 @ compute-9-31.local   1, 17,
rank 16 @ compute-9-31.local  11, 27,
rank 17 @ compute-9-31.local   2, 18,
rank 18 @ compute-9-31.local  12, 28,
rank 19 @ compute-9-31.local   1, 17,

Note pair 1,17 is used twice. The original SLURM delivered map (no 
binding) on this node is


rank 15 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
rank 16 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
rank 17 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
rank 18 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
rank 19 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29,

Why does openmpi use cores (1,17) twice instead of using core (13,29)? 
Clearly, the original SLURM-delivered map has 5 CPUs included, enough 
for 5 MPI processes.


Cheers,

Marcin




On Oct 3, 2015, at 7:12 AM, marcin.krotkiewski 
<marcin.krotkiew...@gmail.com> wrote:



On 10/03/2015 01:06 PM, Ralph Castain wrote:
Thanks Marcin. Looking at this, I’m guessing that Slurm may be 
treating HTs as “cores” - i.e., as independent cpus. Any chance that 
is true?
Not to the best of my knowledge, and at least not intentionally. 
SLURM starts as many processes as there are physical cores, not 
threads. To verify this, consider this test case:


SLURM_JOB_CPUS_PER_NODE='6,8(x2),10'
SLURM_JOB_NODELIST='c1-[30-31],c2-[32,34]'

If I now execute only one mpi process WITH NO BINDING, it will go 
onto c1-30 and should have a map with 6 CPUs (12 hw threads). I run


mpirun --bind-to none -np 1 ./affinity
rank 0 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,

I have attached the affinity.c program FYI. Clearly, 
sched_getaffinity in my test code returns the correct map.


Now if I try to start all 32 processes in this example (still no 
binding):


rank 0 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 1 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread marcin.krotkiewski
Here is the output of lstopo. In short, (0,16) are core 0, (1,17) - core 
1 etc.


Machine (64GB)
  NUMANode L#0 (P#0 32GB)
Socket L#0 + L3 L#0 (20MB)
  L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#16)
  L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
PU L#2 (P#1)
PU L#3 (P#17)
  L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
PU L#4 (P#2)
PU L#5 (P#18)
  L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
PU L#6 (P#3)
PU L#7 (P#19)
  L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
PU L#8 (P#4)
PU L#9 (P#20)
  L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
PU L#10 (P#5)
PU L#11 (P#21)
  L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
PU L#12 (P#6)
PU L#13 (P#22)
  L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
PU L#14 (P#7)
PU L#15 (P#23)
HostBridge L#0
  PCIBridge
PCI 8086:1521
  Net L#0 "eth0"
PCI 8086:1521
  Net L#1 "eth1"
  PCIBridge
PCI 15b3:1003
  Net L#2 "ib0"
  OpenFabrics L#3 "mlx4_0"
  PCIBridge
PCI 102b:0532
  PCI 8086:1d02
Block L#4 "sda"
  NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
  PU L#16 (P#8)
  PU L#17 (P#24)
L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
  PU L#18 (P#9)
  PU L#19 (P#25)
L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
  PU L#20 (P#10)
  PU L#21 (P#26)
L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
  PU L#22 (P#11)
  PU L#23 (P#27)
L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
  PU L#24 (P#12)
  PU L#25 (P#28)
L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
  PU L#26 (P#13)
  PU L#27 (P#29)
L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
  PU L#28 (P#14)
  PU L#29 (P#30)
L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
  PU L#30 (P#15)
  PU L#31 (P#31)
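
(For anyone who wants to check the core/HT pairing on their own nodes 
without reading lstopo output by hand, a small hwloc program along these 
lines - only a sketch, compiled with -lhwloc - prints the OS indices (P#) 
of the hardware threads under each core:)

#include <stdio.h>
#include <hwloc.h>

int main(void)
{
  hwloc_topology_t topo;
  hwloc_topology_init(&topo);
  hwloc_topology_load(topo);

  int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
  for (int c = 0; c < ncores; c++) {
    hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, c);
    printf("Core L#%u:", core->logical_index);
    /* walk the PUs (hardware threads) contained in this core */
    hwloc_obj_t pu = NULL;
    while ((pu = hwloc_get_next_obj_inside_cpuset_by_type(
                     topo, core->cpuset, HWLOC_OBJ_PU, pu)) != NULL)
      printf(" P#%u", pu->os_index);
    printf("\n");
  }

  hwloc_topology_destroy(topo);
  return 0;
}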



On 10/03/2015 05:46 PM, Ralph Castain wrote:
Maybe I’m just misreading your HT map - that slurm nodelist syntax is 
a new one to me, but they tend to change things around. Could you run 
lstopo on one of those compute nodes and send the output?


I’m just suspicious because I’m not seeing a clear pairing of HT 
numbers in your output, but HT numbering is BIOS-specific and I may 
just not be understanding your particular pattern. Our error message 
is clearly indicating that we are seeing individual HTs (and not 
complete cores) assigned, and I don’t know the source of that confusion.



On Oct 3, 2015, at 8:28 AM, marcin.krotkiewski 
mailto:marcin.krotkiew...@gmail.com>> 
wrote:



On 10/03/2015 04:38 PM, Ralph Castain wrote:
If mpirun isn’t trying to do any binding, then you will of course 
get the right mapping as we’ll just inherit whatever we received.
Yes. I meant that whatever you received (what SLURM gives) is a 
correct cpu map and assigns _whole_ CPUs, not single HTs, to MPI 
processes. In the case mentioned earlier openmpi should start 6 tasks 
on c1-30. If HTs were treated as separate and independent cores, 
sched_getaffinity of an MPI process started on c1-30 would return a 
map with 6 entries only. In my case it returns a map with 12 entries 
- 2 for each core. So one process is in fact allocated both HTs, not 
only one. Is what I'm saying correct?


Looking at your output, it’s pretty clear that you are getting 
independent HTs assigned and not full cores.
How do you mean? Is the above understanding wrong? I would expect 
that on c1-30 with --bind-to core openmpi should bind to logical 
cores 0 and 16 (rank 0), 1 and 17 (rank 2) and so on. All those 
logical cores are available in sched_getaffinity map, and there is 
twice as many logical cores as there are MPI processes started on the 
node.


My guess is that something in slurm has changed such that it detects 
that HT has been enabled, and then begins treating the HTs as 
completely independent cpus.


Try changing “-bind-to core” to “-bind-to hwthread 
 -use-hwthread-cpus” and see if that works



I have and the binding is wrong. For example, I got this output

rank 0 @ compute-1-30.local  0,
rank 1 @ compute-1-30.local  16,

Which means that two ranks have been bound to the same physical core 
(logical cores 0 and 16 are two HTs of the same core). If I use 
--bind-to core, I get the following correct binding


rank 0 @ compute-1-30.local  0, 16,

The problem is many other ranks get bad binding with 'rank XXX is not 
bound (or bound to all available processors)' warning.


But I think I was not entirely correct saying that 1.10.1r

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread marcin.krotkiewski


Done. I have compiled 1.10.0 and 1.10.1rc1 with --enable-debug and executed

mpirun --mca rmaps_base_verbose 10 --hetero-nodes --report-bindings 
--bind-to core -np 32 ./affinity


In case of 1.10.1rc1 I have also added :overload-allowed - output in a 
separate file. This option did not make much difference for 1.10.0, so I 
did not attach it here.


First thing I noted for 1.10.0 are lines like

[login-0-1.local:03399] [[37945,0],0] GOT 1 CPUS
[login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] BITMAP
[login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] ON c1-26 IS 
NOT BOUND


with an empty BITMAP.

The SLURM environment is

set | grep SLURM
SLURM_JOBID=12714491
SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5'
SLURM_JOB_ID=12714491
SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]'
SLURM_JOB_NUM_NODES=7
SLURM_JOB_PARTITION=normal
SLURM_MEM_PER_CPU=2048
SLURM_NNODES=7
SLURM_NODELIST='c1-[2,4,8,13,16,23,26]'
SLURM_NODE_ALIASES='(null)'
SLURM_NPROCS=32
SLURM_NTASKS=32
SLURM_SUBMIT_DIR=/cluster/home/marcink
SLURM_SUBMIT_HOST=login-0-1.local
SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5'

I have submitted an interactive job on screen for 120 hours now to work 
with one example, and not change it for every post :)


If you need anything else, let me know. I could introduce some 
patch/printfs and recompile, if you need it.


Marcin



On 10/03/2015 07:17 PM, Ralph Castain wrote:
Rats - just realized I have no way to test this as none of the 
machines I can access are setup for cgroup-based multi-tenant. Is this 
a debug version of OMPI? If not, can you rebuild OMPI with --enable-debug?


Then please run it with --mca rmaps_base_verbose 10 and pass along the 
output.


Thanks
Ralph


On Oct 3, 2015, at 10:09 AM, Ralph Castain <mailto:r...@open-mpi.org>> wrote:


What version of slurm is this? I might try to debug it here. I’m not 
sure where the problem lies just yet.



On Oct 3, 2015, at 8:59 AM, marcin.krotkiewski 
mailto:marcin.krotkiew...@gmail.com>> 
wrote:


Here is the output of lstopo. In short, (0,16) are core 0, (1,17) - 
core 1 etc.


Machine (64GB)
  NUMANode L#0 (P#0 32GB)
Socket L#0 + L3 L#0 (20MB)
  L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#16)
  L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
PU L#2 (P#1)
PU L#3 (P#17)
  L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
PU L#4 (P#2)
PU L#5 (P#18)
  L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
PU L#6 (P#3)
PU L#7 (P#19)
  L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
PU L#8 (P#4)
PU L#9 (P#20)
  L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
PU L#10 (P#5)
PU L#11 (P#21)
  L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
PU L#12 (P#6)
PU L#13 (P#22)
  L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
PU L#14 (P#7)
PU L#15 (P#23)
HostBridge L#0
  PCIBridge
PCI 8086:1521
  Net L#0 "eth0"
PCI 8086:1521
  Net L#1 "eth1"
  PCIBridge
PCI 15b3:1003
  Net L#2 "ib0"
  OpenFabrics L#3 "mlx4_0"
  PCIBridge
PCI 102b:0532
  PCI 8086:1d02
Block L#4 "sda"
  NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
  PU L#16 (P#8)
  PU L#17 (P#24)
L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
  PU L#18 (P#9)
  PU L#19 (P#25)
L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
  PU L#20 (P#10)
  PU L#21 (P#26)
L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
  PU L#22 (P#11)
  PU L#23 (P#27)
L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
  PU L#24 (P#12)
  PU L#25 (P#28)
L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
  PU L#26 (P#13)
  PU L#27 (P#29)
L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
  PU L#28 (P#14)
  PU L#29 (P#30)
L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
  PU L#30 (P#15)
  PU L#31 (P#31)



On 10/03/2015 05:46 PM, Ralph Castain wrote:
Maybe I’m just misreading your HT map - that slurm nodelist syntax 
is a new one to me, but they tend to change things around. Could 
you run lstopo on one of those compute nodes and send the output?


I’m just suspicious because I’m not seeing a clear pairing of HT 
numbers in your output, but HT numbering is BIOS-specific and I may 
just not be understanding your particular pattern. Our error 
message is clearly indicating that we are seeing individual HTs 
(and not complete cores) assigned, and I don’t know the source of 
that confusion.



On Oct 3, 2015, at 8:28 AM, marcin.k

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-04 Thread marcin.krotkiewski

Hi, all,

I played a bit more and it seems that the problem is that

trg_obj = opal_hwloc_base_find_min_bound_target_under_obj()

called in rmaps_base_binding.c / bind_downwards, returns the wrong object. I do not 
know the reason, but I think I know when the problem happens (at least 
on 1.10.1rc1). It seems that by default openmpi maps by socket. The 
error happens when, for a given compute node, a different number 
of cores is used on each socket. Consider the previously studied case (the 
debug outputs I sent in the last post). c1-8, which was the source of the error, has 
5 mpi processes assigned, and the cpuset is the following:


0, 5, 9, 13, 14, 16, 21, 25, 29, 30

Cores 0,5 are on socket 0, cores 9, 13, 14 are on socket 1. Binding 
progresses correctly up to and including core 13 (see end of file 
out.1.10.1rc2, before the error). That is 2 cores on socket 0, and 2 
cores on socket 1. Error is thrown when core 14 should be bound - extra 
core on socket 1 with no corresponding core on socket 0. At that point 
the returned trg_obj points to the first core on the node (os_index 0, 
socket 0).


I have submitted a few other jobs and I always had an error in such a 
situation. Moreover, if I now use --map-by core instead of socket, the 
error is gone, and I get my expected binding:


rank 0 @ compute-1-2.local  1, 17,
rank 1 @ compute-1-2.local  2, 18,
rank 2 @ compute-1-2.local  3, 19,
rank 3 @ compute-1-2.local  4, 20,
rank 4 @ compute-1-4.local  1, 17,
rank 5 @ compute-1-4.local  15, 31,
rank 6 @ compute-1-8.local  0, 16,
rank 7 @ compute-1-8.local  5, 21,
rank 8 @ compute-1-8.local  9, 25,
rank 9 @ compute-1-8.local  13, 29,
rank 10 @ compute-1-8.local  14, 30,
rank 11 @ compute-1-13.local  3, 19,
rank 12 @ compute-1-13.local  4, 20,
rank 13 @ compute-1-13.local  5, 21,
rank 14 @ compute-1-13.local  6, 22,
rank 15 @ compute-1-13.local  7, 23,
rank 16 @ compute-1-16.local  12, 28,
rank 17 @ compute-1-16.local  13, 29,
rank 18 @ compute-1-16.local  14, 30,
rank 19 @ compute-1-16.local  15, 31,
rank 20 @ compute-1-23.local  2, 18,
rank 29 @ compute-1-26.local  11, 27,
rank 21 @ compute-1-23.local  3, 19,
rank 30 @ compute-1-26.local  13, 29,
rank 22 @ compute-1-23.local  4, 20,
rank 31 @ compute-1-26.local  15, 31,
rank 23 @ compute-1-23.local  8, 24,
rank 27 @ compute-1-26.local  1, 17,
rank 24 @ compute-1-23.local  13, 29,
rank 28 @ compute-1-26.local  6, 22,
rank 25 @ compute-1-23.local  14, 30,
rank 26 @ compute-1-23.local  15, 31,

Using --map-by core seems to fix the issue on 1.8.8, 1.10.0 and 
1.10.1rc1. However, there is still a difference in behavior between 
1.10.1rc1 and earlier versions. In the SLURM job described in last post, 
1.10.1rc1 fails to bind only in 1 case, while the earlier versions fail 
in 21 out of 32 cases. You mentioned there was a bug in hwloc. Not sure 
if it can explain the difference in behavior.


Hope this helps to nail this down.

Marcin




On 10/04/2015 09:55 AM, Gilles Gouaillardet wrote:

Ralph,

I suspect ompi tries to bind to threads outside the cpuset.
this could be pretty similar to a previous issue when ompi tried to 
bind to cores outside the cpuset.
/* when a core has more than one thread, would ompi assume all the 
threads are available if the core is available ? */

I will investigate this from tomorrow

Cheers,

Gilles

On Sunday, October 4, 2015, Ralph Castain <mailto:r...@open-mpi.org>> wrote:


Thanks - please go ahead and release that allocation as I’m not
going to get to this immediately. I’ve got several hot irons in
the fire right now, and I’m not sure when I’ll get a chance to
track this down.

Gilles or anyone else who might have time - feel free to take a
gander and see if something pops out at you.

Ralph



On Oct 3, 2015, at 11:05 AM, marcin.krotkiewski
>
wrote:


Done. I have compiled 1.10.0 and 1.10.rc1 with --enable-debug and
executed

mpirun --mca rmaps_base_verbose 10 --hetero-nodes
--report-bindings --bind-to core -np 32 ./affinity

In case of 1.10.rc1 I have also added :overload-allowed - output
in a separate file. This option did not make much difference for
1.10.0, so I did not attach it here.

First thing I noted for 1.10.0 are lines like

[login-0-1.local:03399] [[37945,0],0] GOT 1 CPUS
[login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] BITMAP
[login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] ON
c1-26 IS NOT BOUND

with an empty BITMAP.

The SLURM environment is

set | grep SLURM
SLURM_JOBID=12714491
SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5'
SLURM_JOB_ID=12714491
SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]'
SLURM_JOB_NUM_NODES=7
SLURM_JOB_PARTITION=normal
SLURM_MEM_PER_CPU=2048
SLURM_NNODES=7
SLURM_NODELIST='c1-[2,4,8,13,16,23,26]'
SLURM_NODE_ALIASES='(null)'
SLURM_NPROCS=32
SLURM_NTASKS=32
SLURM_SUBMIT_DIR=/cluste

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-05 Thread marcin.krotkiewski
Gilles Gouaillardet <mailto:gil...@rist.or.jp> wrote:


Marcin,

i ran a simple test with v1.10.1rc1 under a cpuset with
- one core (two threads 0,16) on socket 0
- two cores (two threads each 8,9,24,25) on socket 1

$ mpirun -np 3 -bind-to core ./hello_c
--
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to: CORE
   Node:rapid
   #processes:  2
   #cpus:   1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--

as you already pointed out, the default mapping is by socket.

so on one hand, we can consider this behavior a feature:
we try to bind two processes to socket 0, so the --oversubscribe 
option is required

(and it does what it should :
$ mpirun -np 3 -bind-to core --oversubscribe -report-bindings 
./hello_c
[rapid:16278] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: 
[BB/../../../../../../..][../../../../../../../..]
[rapid:16278] MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: 
[../../../../../../../..][BB/../../../../../../..]
[rapid:16278] MCW rank 2 bound to socket 0[core 0[hwt 0-1]]: 
[BB/../../../../../../..][../../../../../../../..]
Hello, world, I am 1 of 3, (Open MPI v1.10.1rc1, package: Open MPI 
gilles@rapid Distribution, ident: 1.10.1rc1, repo rev: 
v1.10.0-84-g15ae63f, Oct 03, 2015, 128)
Hello, world, I am 2 of 3, (Open MPI v1.10.1rc1, package: Open MPI 
gilles@rapid Distribution, ident: 1.10.1rc1, repo rev: 
v1.10.0-84-g15ae63f, Oct 03, 2015, 128)
Hello, world, I am 0 of 3, (Open MPI v1.10.1rc1, package: Open MPI 
gilles@rapid Distribution, ident: 1.10.1rc1, repo rev: 
v1.10.0-84-g15ae63f, Oct 03, 2015, 128)


and on the other hand, we could consider that ompi should be a bit 
smarter and use socket 1 for task 2, since socket 0 is fully 
allocated and there is room on socket 1.


Ralph, any thoughts ? bug or feature ?


Marcin,

you mentioned you had one failure with 1.10.1rc1 and -bind-to core
could you please send the full details (script, allocation and output)
in your slurm script, you can do
srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none -l grep 
Cpus_allowed_list /proc/self/status

before invoking mpirun

Cheers,

Gilles

On 10/4/2015 11:55 PM, marcin.krotkiewski wrote:

Hi, all,

I played a bit more and it seems that the problem results from

trg_obj = opal_hwloc_base_find_min_bound_target_under_obj()

called in rmaps_base_binding.c / bind_downwards being wrong. I do 
not know the reason, but I think I know when the problem happens 
(at least on 1.10.1rc1). It seems that by default openmpi maps by 
socket. The error happens when for a given compute node there is 
a different number of cores used on each socket. Consider 
previously studied case (the debug outputs I sent in last post). 
c1-8, which was source of error, has 5 mpi processes assigned, 
and the cpuset is the following:


0, 5, 9, 13, 14, 16, 21, 25, 29, 30

Cores 0,5 are on socket 0, cores 9, 13, 14 are on socket 1. 
Binding progresses correctly up to and including core 13 (see end 
of file out.1.10.1rc2, before the error). That is 2 cores on 
socket 0, and 2 cores on socket 1. Error is thrown when core 14 
should be bound - extra core on socket 1 with no corresponding 
core on socket 0. At that point the returned trg_obj points to 
the first core on the node (os_index 0, socket 0).


I have submitted a few other jobs and I always had an error in 
such situation. Moreover, if I now use --map-by core instead of 
socket, the error is gone, and I get my expected binding:


rank 0 @ compute-1-2.local  1, 17,
rank 1 @ compute-1-2.local  2, 18,
rank 2 @ compute-1-2.local  3, 19,
rank 3 @ compute-1-2.local  4, 20,
rank 4 @ compute-1-4.local  1, 17,
rank 5 @ compute-1-4.local  15, 31,
rank 6 @ compute-1-8.local  0, 16,
rank 7 @ compute-1-8.local  5, 21,
rank 8 @ compute-1-8.local  9, 25,
rank 9 @ compute-1-8.local  13, 29,
rank 10 @ compute-1-8.local  14, 30,
rank 11 @ compute-1-13.local  3, 19,
rank 12 @ compute-1-13.local  4, 20,
rank 13 @ compute-1-13.local  5, 21,
rank 14 @ compute-1-13.local  6, 22,
rank 15 @ compute-1-13.local  7, 23,
rank 16 @ compute-1-16.local  12, 28,
rank 17 @ compute-1-16.local  13, 29,
rank 18 @ compute-1-16.local  14, 30,
rank 19 @ compute-1-16.local  15, 31,
rank 20 @ compute-1-23.local  2, 18,
rank 29 @ compute-1-26.local  11, 27,
rank 21 @ compute-1-23.local  3, 19,
rank 30 @ compute-1-26.local  13, 29,
rank 22 @ compute-1-23.local  4, 20,
rank 31 @ compute-1-26.local  15, 31,
rank 23 @ compute-1-23.local  8, 24,
rank 27 @ compute-1-26.local  1, 17,
rank 24 @ compute-1-23.local  13, 29,
rank 28 @ compute-1-26.local  6, 22,
rank 25 @ compute-1-23.local  14, 30,
rank 26 @ compute-1-23.local  15, 31,

Using --map-by core seems to fix the issue on 1.8.8, 1.10.0 and 
1.10.1rc1. However, there is still a difference in behavior 
bet

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-05 Thread marcin.krotkiewski

Hi, Gilles

you mentioned you had one failure with 1.10.1rc1 and -bind-to core
could you please send the full details (script, allocation and output)
in your slurm script, you can do
srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none -l grep 
Cpus_allowed_list /proc/self/status

before invoking mpirun


It was an interactive job allocated with

salloc --account=staff --ntasks=32 --mem-per-cpu=2G --time=120:0:0

The slurm environment is the following

SLURM_JOBID=12714491
SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5'
SLURM_JOB_ID=12714491
SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]'
SLURM_JOB_NUM_NODES=7
SLURM_JOB_PARTITION=normal
SLURM_MEM_PER_CPU=2048
SLURM_NNODES=7
SLURM_NODELIST='c1-[2,4,8,13,16,23,26]'
SLURM_NODE_ALIASES='(null)'
SLURM_NPROCS=32
SLURM_NTASKS=32
SLURM_SUBMIT_DIR=/cluster/home/marcink
SLURM_SUBMIT_HOST=login-0-1.local
SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5'

The output of the command you asked for is

0: c1-2.local  Cpus_allowed_list:1-4,17-20
1: c1-4.local  Cpus_allowed_list:1,15,17,31
2: c1-8.local  Cpus_allowed_list:0,5,9,13-14,16,21,25,29-30
3: c1-13.local  Cpus_allowed_list:   3-7,19-23
4: c1-16.local  Cpus_allowed_list:   12-15,28-31
5: c1-23.local  Cpus_allowed_list:   2-4,8,13-15,18-20,24,29-31
6: c1-26.local  Cpus_allowed_list:   1,6,11,13,15,17,22,27,29,31

Running with command

mpirun --mca rmaps_base_verbose 10 --hetero-nodes --bind-to core 
--report-bindings --map-by socket -np 32 ./affinity


I have attached two output files: one for the original 1.10.1rc1, one 
for the patched version.


When I said 'failed in one case' I was not precise. I got an error on 
node c1-8, which was the first one to have a different number of MPI 
processes on the two sockets. It would also fail on some later nodes, 
just that because of the error we never got there.


Let me know if you need more.

Marcin








Cheers,

Gilles

On 10/4/2015 11:55 PM, marcin.krotkiewski wrote:

Hi, all,

I played a bit more and it seems that the problem results from

trg_obj = opal_hwloc_base_find_min_bound_target_under_obj()

called in rmaps_base_binding.c / bind_downwards being wrong. I do not 
know the reason, but I think I know when the problem happens (at 
least on 1.10.1rc1). It seems that by default openmpi maps by socket. 
The error happens when for a given compute node there is a different 
number of cores used on each socket. Consider previously studied case 
(the debug outputs I sent in last post). c1-8, which was source of 
error, has 5 mpi processes assigned, and the cpuset is the following:


0, 5, 9, 13, 14, 16, 21, 25, 29, 30

Cores 0,5 are on socket 0, cores 9, 13, 14 are on socket 1. Binding 
progresses correctly up to and including core 13 (see end of file 
out.1.10.1rc2, before the error). That is 2 cores on socket 0, and 2 
cores on socket 1. Error is thrown when core 14 should be bound - 
extra core on socket 1 with no corresponding core on socket 0. At 
that point the returned trg_obj points to the first core on the node 
(os_index 0, socket 0).


I have submitted a few other jobs and I always had an error in such 
situation. Moreover, if I now use --map-by core instead of socket, 
the error is gone, and I get my expected binding:


rank 0 @ compute-1-2.local  1, 17,
rank 1 @ compute-1-2.local  2, 18,
rank 2 @ compute-1-2.local  3, 19,
rank 3 @ compute-1-2.local  4, 20,
rank 4 @ compute-1-4.local  1, 17,
rank 5 @ compute-1-4.local  15, 31,
rank 6 @ compute-1-8.local  0, 16,
rank 7 @ compute-1-8.local  5, 21,
rank 8 @ compute-1-8.local  9, 25,
rank 9 @ compute-1-8.local  13, 29,
rank 10 @ compute-1-8.local  14, 30,
rank 11 @ compute-1-13.local  3, 19,
rank 12 @ compute-1-13.local  4, 20,
rank 13 @ compute-1-13.local  5, 21,
rank 14 @ compute-1-13.local  6, 22,
rank 15 @ compute-1-13.local  7, 23,
rank 16 @ compute-1-16.local  12, 28,
rank 17 @ compute-1-16.local  13, 29,
rank 18 @ compute-1-16.local  14, 30,
rank 19 @ compute-1-16.local  15, 31,
rank 20 @ compute-1-23.local  2, 18,
rank 29 @ compute-1-26.local  11, 27,
rank 21 @ compute-1-23.local  3, 19,
rank 30 @ compute-1-26.local  13, 29,
rank 22 @ compute-1-23.local  4, 20,
rank 31 @ compute-1-26.local  15, 31,
rank 23 @ compute-1-23.local  8, 24,
rank 27 @ compute-1-26.local  1, 17,
rank 24 @ compute-1-23.local  13, 29,
rank 28 @ compute-1-26.local  6, 22,
rank 25 @ compute-1-23.local  14, 30,
rank 26 @ compute-1-23.local  15, 31,

Using --map-by core seems to fix the issue on 1.8.8, 1.10.0 and 
1.10.1rc1. However, there is still a difference in behavior between 
1.10.1rc1 and earlier versions. In the SLURM job described in last 
post, 1.10.1rc1 fails to bind only in 1 case, while the earlier 
versions fail in 21 out of 32 cases. You mentioned there was a bug in 
hwloc. Not sure if it can explain the difference in behavior.


Hope this helps to nail this down.

Marcin




On 10/04/2015 09:55 AM, Gilles Gouaillardet wrote:

Ralph,

[OMPI users] Hybrid OpenMPI+OpenMP tasks using SLURM

2015-10-05 Thread marcin.krotkiewski

Yet another question about cpu binding under SLURM environment..

Short version: will OpenMPI support SLURM_CPUS_PER_TASK for the purpose 
of cpu binding?



Full version: When you allocate a job like, e.g., this

salloc --ntasks=2 --cpus-per-task=4

SLURM will allocate 8 cores in total, 4 for each 'assumed' MPI task. 
This is useful for hybrid jobs, where each MPI process spawns some 
internal worker threads (e.g., OpenMP). The intention is that there are 
2 MPI procs started, each of them 'bound' to 4 cores. SLURM will also 
set an environment variable


SLURM_CPUS_PER_TASK=4

which should (probably?) be taken into account by the method that 
launches the MPI processes to figure out the cpuset. In case of OpenMPI 
+ mpirun I think something should happen in 
orte/mca/ras/slurm/ras_slurm_module.c, where the variable _is_ actually 
parsed. Unfortunately, it is never really used...


As a result, the cpuset of each task started on a given compute node 
includes all CPU cores of all MPI tasks on that node, just as provided 
by SLURM (in the above example - 8). In general, there is no simple way 
for the user code in the MPI procs to 'split' the cores between 
themselves. I imagine the original intention to support this in OpenMPI 
was something like


mpirun --bind-to subtask_cpuset

with an artificial bind target that would cause OpenMPI to divide the 
allocated cores between the mpi tasks. Is this right? If so, it seems 
that at this point this is not implemented. Are there plans to do this? 
If not, does anyone know another way to achieve that?
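
(For what it is worth, one user-space workaround is to let the ranks that 
share a node split the SLURM-provided cpuset among themselves and re-bind 
with sched_setaffinity before spawning OpenMP threads. The sketch below 
only illustrates the idea - it deals the logical CPUs out round-robin and 
ignores HT pairing and NUMA locality, which a real implementation would 
take from hwloc:)

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  /* communicator of the ranks sharing this compute node */
  MPI_Comm node;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &node);
  int lrank, lsize;
  MPI_Comm_rank(node, &lrank);
  MPI_Comm_size(node, &lsize);

  /* the combined cpuset SLURM gave the job step on this node */
  cpu_set_t full, mine;
  CPU_ZERO(&full);
  sched_getaffinity(0, sizeof(full), &full);

  /* deal the logical CPUs out round-robin among the local ranks */
  CPU_ZERO(&mine);
  int idx = 0;
  for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
    if (!CPU_ISSET(cpu, &full))
      continue;
    if (idx % lsize == lrank)
      CPU_SET(cpu, &mine);
    idx++;
  }
  sched_setaffinity(0, sizeof(mine), &mine); /* OpenMP threads inherit this */

  MPI_Comm_free(&node);
  MPI_Finalize();
  return 0;
}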


Thanks a lot!

Marcin





Re: [OMPI users] Hybrid OpenMPI+OpenMP tasks using SLURM

2015-10-05 Thread marcin.krotkiewski

Ralph,

Thank you for a fast response! Sounds very good, unfortunately I get an 
error:


$ mpirun --map-by core:pe=4 ./affinity
--
A request for multiple cpus-per-proc was given, but a directive
was also give to map to an object level that cannot support that
directive.

Please specify a mapping level that has more than one cpu, or
else let us define a default mapping that will allow multiple
cpus-per-proc.
--

I have allocated my slurm job as

salloc --ntasks=2 --cpus-per-task=4

I have checked in 1.10.0 and 1.10.1rc1.




On 10/05/2015 09:58 PM, Ralph Castain wrote:

You would presently do:

mpirun --map-by core:pe=4

to get what you are seeking. If we don’t already set that qualifier when we see 
“cpus_per_task”, then we probably should do so as there isn’t any reason to 
make you set it twice (well, other than trying to track which envar slurm is 
using now).



On Oct 5, 2015, at 12:38 PM, marcin.krotkiewski  
wrote:

Yet another question about cpu binding under SLURM environment..

Short version: will OpenMPI support SLURM_CPUS_PER_TASK for the purpose of cpu 
binding?


Full version: When you allocate a job like, e.g., this

salloc --ntasks=2 --cpus-per-task=4

SLURM will allocate 8 cores in total, 4 for each 'assumed' MPI tasks. This is 
useful for hybrid jobs, where each MPI process spawns some internal worker 
threads (e.g., OpenMP). The intention is that there are 2 MPI procs started, 
each of them 'bound' to 4 cores. SLURM will also set an environment variable

SLURM_CPUS_PER_TASK=4

which should (probably?) be taken into account by the method that launches the 
MPI processes to figure out the cpuset. In case of OpenMPI + mpirun I think 
something should happen in orte/mca/ras/slurm/ras_slurm_module.c, where the 
variable _is_ actually parsed. Unfortunately, it is never really used...

As a result, cpuset of all tasks started on a given compute node includes all 
CPU cores of all MPI tasks on that node, just as provided by SLURM (in the 
above example - 8). In general, there is no simple way for the user code in the 
MPI procs to 'split' the cores between themselves. I imagine the original 
intention to support this in OpenMPI was something like

mpirun --bind-to subtask_cpuset

with an artificial bind target that would cause OpenMPI to divide the allocated 
cores between the mpi tasks. Is this right? If so, it seems that at this point 
this is not implemented. Is there plans to do this? If no, does anyone know 
another way to achieve that?

Thanks a lot!

Marcin



___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/10/27803.php

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/10/27804.php




Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-06 Thread marcin.krotkiewski

Gilles,

Yes, it seemed that all was fine with binding in the patched 1.10.1rc1 - 
thank you. Eagerly waiting for the other patches, let me know and I will 
test them later this week.


Marcin



On 10/06/2015 12:09 PM, Gilles Gouaillardet wrote:

Marcin,

my understanding is that in this case, patched v1.10.1rc1 is working 
just fine.

am I right ?

I prepared two patches
one to remove the warning when binding on one core if only one core is 
available,
an other one to add a warning if the user asks a binding policy that 
makes no sense with the required mapping policy


I will finalize them tomorrow hopefully

Cheers,

Gilles

On Tuesday, October 6, 2015, marcin.krotkiewski 
mailto:marcin.krotkiew...@gmail.com>> 
wrote:


Hi, Gilles

you mentionned you had one failure with 1.10.1rc1 and -bind-to core
could you please send the full details (script, allocation and
output)
in your slurm script, you can do
srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none -l grep
Cpus_allowed_list /proc/self/status
before invoking mpirun


It was an interactive job allocated with

salloc --account=staff --ntasks=32 --mem-per-cpu=2G --time=120:0:0

The slurm environment is the following

SLURM_JOBID=12714491
SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5'
SLURM_JOB_ID=12714491
SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]'
SLURM_JOB_NUM_NODES=7
SLURM_JOB_PARTITION=normal
SLURM_MEM_PER_CPU=2048
SLURM_NNODES=7
SLURM_NODELIST='c1-[2,4,8,13,16,23,26]'
SLURM_NODE_ALIASES='(null)'
SLURM_NPROCS=32
SLURM_NTASKS=32
SLURM_SUBMIT_DIR=/cluster/home/marcink
SLURM_SUBMIT_HOST=login-0-1.local
SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5'

The output of the command you asked for is

0: c1-2.local  Cpus_allowed_list:1-4,17-20
1: c1-4.local  Cpus_allowed_list:1,15,17,31
2: c1-8.local  Cpus_allowed_list: 0,5,9,13-14,16,21,25,29-30
3: c1-13.local  Cpus_allowed_list:   3-7,19-23
4: c1-16.local  Cpus_allowed_list:   12-15,28-31
5: c1-23.local  Cpus_allowed_list: 2-4,8,13-15,18-20,24,29-31
6: c1-26.local  Cpus_allowed_list: 1,6,11,13,15,17,22,27,29,31

Running with command

mpirun --mca rmaps_base_verbose 10 --hetero-nodes --bind-to core
--report-bindings --map-by socket -np 32 ./affinity

I have attached two output files: one for the original 1.10.1rc1,
one for the patched version.

When I said 'failed in one case' I was not precise. I got an error
on node c1-8, which was the first one to have different number of
MPI processes on the two sockets. It would also fail on some later
nodes, just that because of the error we never got there.

Let me know if you need more.

Marcin








Cheers,

Gilles

On 10/4/2015 11:55 PM, marcin.krotkiewski wrote:

Hi, all,

I played a bit more and it seems that the problem results from

trg_obj = opal_hwloc_base_find_min_bound_target_under_obj()

called in rmaps_base_binding.c / bind_downwards being wrong. I
do not know the reason, but I think I know when the problem
happens (at least on 1.10.1rc1). It seems that by default
openmpi maps by socket. The error happens when for a given
compute node there is a different number of cores used on each
socket. Consider previously studied case (the debug outputs I
sent in last post). c1-8, which was source of error, has 5 mpi
processes assigned, and the cpuset is the following:

0, 5, 9, 13, 14, 16, 21, 25, 29, 30

Cores 0,5 are on socket 0, cores 9, 13, 14 are on socket 1.
Binding progresses correctly up to and including core 13 (see
end of file out.1.10.1rc2, before the error). That is 2 cores on
socket 0, and 2 cores on socket 1. Error is thrown when core 14
should be bound - extra core on socket 1 with no corresponding
core on socket 0. At that point the returned trg_obj points to
the first core on the node (os_index 0, socket 0).

I have submitted a few other jobs and I always had an error in
such situation. Moreover, if I now use --map-by core instead of
socket, the error is gone, and I get my expected binding:

rank 0 @ compute-1-2.local  1, 17,
rank 1 @ compute-1-2.local  2, 18,
rank 2 @ compute-1-2.local  3, 19,
rank 3 @ compute-1-2.local  4, 20,
rank 4 @ compute-1-4.local  1, 17,
rank 5 @ compute-1-4.local  15, 31,
rank 6 @ compute-1-8.local  0, 16,
rank 7 @ compute-1-8.local  5, 21,
rank 8 @ compute-1-8.local  9, 25,
rank 9 @ compute-1-8.local  13, 29,
rank 10 @ compute-1-8.local  14, 30,
rank 11 @ compute-1-13.local  3, 19,
rank 12 @ compute-1-13.local  4, 20,
rank 13 @ compute-1-13.local  5, 21,
rank 14 @ compute-1-13.local  6, 22,
rank 15 @ compute-1-13.local  7, 23,
rank 16 @ compute-1-16.local  12, 28,
rank 17 @ compute-1-16.lo

Re: [OMPI users] Hybrid OpenMPI+OpenMP tasks using SLURM

2015-10-06 Thread marcin.krotkiewski
Thank you both for your suggestions. I still cannot make this work 
though, and I think - as Ralph predicted - most problems are likely 
related to non-homogeneous mapping of cpus to jobs. But there are 
problems even before that part...


If I reserve one entire compute node with SLURM:

salloc --ntasks=16 --tasks-per-node=16

I can run my code as you suggested with _any_ N (including odd 
numbers!). OpenMPI will figure out the maximum number of tasks that fit 
and launch them. This also works for many complete nodes, but this is 
the only case in which I managed to get it to work.


If I specify cpus per task, also allocating one full node

salloc --ntasks=4 --cpus-per-task=4 --tasks-per-node=4

things go astray:

mpirun --map-by slot:pe=4 ./affinity
rank 0 @ compute-1-6.local  0, 1, 2, 3, 16, 17, 18, 19,

Yes, only one MPI process was started. Running what Gilles previously 
suggested:


$ srun grep Cpus_allowed_list /proc/self/status
Cpus_allowed_list:0-31
Cpus_allowed_list:0-31
Cpus_allowed_list:0-31
Cpus_allowed_list:0-31

So the allocation seems fine. The SLURM environment is also correct, as 
far as I can tell:


SLURM_CPUS_PER_TASK=4
SLURM_JOB_CPUS_PER_NODE=16
SLURM_JOB_NODELIST=c1-6
SLURM_JOB_NUM_NODES=1
SLURM_NNODES=1
SLURM_NODELIST=c1-6
SLURM_NPROCS=4
SLURM_NTASKS=4
SLURM_NTASKS_PER_NODE=4
SLURM_TASKS_PER_NODE=4

I do not understand why openmpi does not want to start more than 1 
process. If I try to force it (-n 4) I of course get an error:


mpirun --map-by slot:pe=4 -n 4 ./affinity

--
There are not enough slots available in the system to satisfy the 4 slots
that were requested by the application:
  ./affinity

Either request fewer slots for your application, or make more slots 
available

for use.
--


For clarity, I will not describe other cases / non-contiguous cpu sets / 
heterogeneous nodes. Clearly something is wrong already with the simple 
ones..


Does anyone have any ideas? Should I record some logs to see what's 
going on?


Thanks a lot!

Marcin






On 10/06/2015 01:04 AM, tmish...@jcity.maeda.co.jp wrote:

Hi Ralph, it's been a long time.

The option "map-by core" does not work when pe=N > 1 is specified.
So, you should use "map-by slot:pe=N" as far as I remember.

Regards,
Tetsuya Mishima

2015/10/06 5:40:33、"users"さんは「Re: [OMPI users] Hybrid OpenMPI+OpenMP
tasks using SLURM」で書きました

Hmmm…okay, try -map-by socket:pe=4

We’ll still hit the asymmetric topology issue, but otherwise this should

work



On Oct 5, 2015, at 1:25 PM, marcin.krotkiewski

 wrote:

Ralph,

Thank you for a fast response! Sounds very good, unfortunately I get an

error:

$ mpirun --map-by core:pe=4 ./affinity


--

A request for multiple cpus-per-proc was given, but a directive
was also give to map to an object level that cannot support that
directive.

Please specify a mapping level that has more than one cpu, or
else let us define a default mapping that will allow multiple
cpus-per-proc.


--

I have allocated my slurm job as

salloc --ntasks=2 --cpus-per-task=4

I have checked in 1.10.0 and 1.10.1rc1.




On 10/05/2015 09:58 PM, Ralph Castain wrote:

You would presently do:

mpirun --map-by core:pe=4

to get what you are seeking. If we don’t already set that qualifier

when we see “cpus_per_task”, then we probably should do so as there isn’t
any reason to make you set it twice (well, other than

trying to track which envar slurm is using now).



On Oct 5, 2015, at 12:38 PM, marcin.krotkiewski

 wrote:

Yet another question about cpu binding under SLURM environment..

Short version: will OpenMPI support SLURM_CPUS_PER_TASK for the

purpose of cpu binding?


Full version: When you allocate a job like, e.g., this

salloc --ntasks=2 --cpus-per-task=4

SLURM will allocate 8 cores in total, 4 for each 'assumed' MPI tasks.

This is useful for hybrid jobs, where each MPI process spawns some internal
worker threads (e.g., OpenMP). The intention is

that there are 2 MPI procs started, each of them 'bound' to 4 cores.

SLURM will also set an environment variable

SLURM_CPUS_PER_TASK=4

which should (probably?) be taken into account by the method that

launches the MPI processes to figure out the cpuset. In case of OpenMPI +
mpirun I think something should happen in

orte/mca/ras/slurm/ras_slurm_module.c, where the variable _is_ actually

parsed. Unfortunately, it is never really used...

As a result, cpuset of all tasks started on a given compute node

includes all CPU cores of all MPI tasks on that node, just as provided by
SLURM (in the above example - 8). In general, there is

no simple way for the user code in the MPI procs to 'split' the cores

Re: [OMPI users] Hybrid OpenMPI+OpenMP tasks using SLURM

2015-10-06 Thread marcin.krotkiewski


Thanks, Gilles. This is a good suggestion and I will pursue this 
direction. The problem is that currently SLURM does not support 
--cpu_bind on my system for whatever reasons. I may work towards turning 
this option on if that will be necessary, but it would also be good to 
be able to do it with pure openmpi..


Marcin


On 10/06/2015 08:01 AM, Gilles Gouaillardet wrote:

Marcin,

did you investigate direct launch (e.g. srun) instead of mpirun ?

for example, you can do
srun --ntasks=2 --cpus-per-task=4 -l grep Cpus_allowed_list 
/proc/self/status


note, you might have to use the srun --cpu_bind option, and make sure 
your slurm config does support that :
srun --ntasks=2 --cpus-per-task=4 --cpu_bind=core,verbose -l grep 
Cpus_allowed_list /proc/self/status


Cheers,

Gilles

On 10/6/2015 4:38 AM, marcin.krotkiewski wrote:

Yet another question about cpu binding under SLURM environment..

Short version: will OpenMPI support SLURM_CPUS_PER_TASK for the 
purpose of cpu binding?



Full version: When you allocate a job like, e.g., this

salloc --ntasks=2 --cpus-per-task=4

SLURM will allocate 8 cores in total, 4 for each 'assumed' MPI tasks. 
This is useful for hybrid jobs, where each MPI process spawns some 
internal worker threads (e.g., OpenMP). The intention is that there 
are 2 MPI procs started, each of them 'bound' to 4 cores. SLURM will 
also set an environment variable


SLURM_CPUS_PER_TASK=4

which should (probably?) be taken into account by the method that 
launches the MPI processes to figure out the cpuset. In case of 
OpenMPI + mpirun I think something should happen in 
orte/mca/ras/slurm/ras_slurm_module.c, where the variable _is_ 
actually parsed. Unfortunately, it is never really used...


As a result, cpuset of all tasks started on a given compute node 
includes all CPU cores of all MPI tasks on that node, just as 
provided by SLURM (in the above example - 8). In general, there is no 
simple way for the user code in the MPI procs to 'split' the cores 
between themselves. I imagine the original intention to support this 
in OpenMPI was something like


mpirun --bind-to subtask_cpuset

with an artificial bind target that would cause OpenMPI to divide the 
allocated cores between the mpi tasks. Is this right? If so, it seems 
that at this point this is not implemented. Is there plans to do 
this? If no, does anyone know another way to achieve that?


Thanks a lot!

Marcin



___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/10/27803.php




___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/10/27812.php




Re: [OMPI users] Hybrid OpenMPI+OpenMP tasks using SLURM

2015-10-06 Thread marcin.krotkiewski


Ralph, maybe I was not precise - most likely --cpu_bind does not work on 
my system because it is disabled in SLURM, not because of any 
problem in OpenMPI. I am not certain and I will have to investigate this 
further, so please do not waste your time on this.


What do you mean by 'loss of dynamics support'?

Thanks,

Marcin


On 10/06/2015 09:35 PM, Ralph Castain wrote:

I’ll have to fix it later this week - out due to eye surgery today. Looks like 
something didn’t get across to 1.10 as it should have. There are other 
tradeoffs that occur when you go to direct launch (e.g., loss of dynamics 
support) - may or may not be of concern to your usage.



On Oct 6, 2015, at 11:57 AM, marcin.krotkiewski  
wrote:


Thanks, Gilles. This is a good suggestion and I will pursue this direction. The 
problem is that currently SLURM does not support --cpu_bind on my system for 
whatever reasons. I may work towards turning this option on if that will be 
necessary, but it would also be good to be able to do it with pure openmpi..

Marcin


On 10/06/2015 08:01 AM, Gilles Gouaillardet wrote:

Marcin,

did you investigate direct launch (e.g. srun) instead of mpirun ?

for example, you can do
srun --ntasks=2 --cpus-per-task=4 -l grep Cpus_allowed_list /proc/self/status

note, you might have to use the srun --cpu_bind option, and make sure your 
slurm config does support that :
srun --ntasks=2 --cpus-per-task=4 --cpu_bind=core,verbose -l grep 
Cpus_allowed_list /proc/self/status

Cheers,

Gilles

On 10/6/2015 4:38 AM, marcin.krotkiewski wrote:

Yet another question about cpu binding under SLURM environment..

Short version: will OpenMPI support SLURM_CPUS_PER_TASK for the purpose of cpu 
binding?


Full version: When you allocate a job like, e.g., this

salloc --ntasks=2 --cpus-per-task=4

SLURM will allocate 8 cores in total, 4 for each 'assumed' MPI tasks. This is 
useful for hybrid jobs, where each MPI process spawns some internal worker 
threads (e.g., OpenMP). The intention is that there are 2 MPI procs started, 
each of them 'bound' to 4 cores. SLURM will also set an environment variable

SLURM_CPUS_PER_TASK=4

which should (probably?) be taken into account by the method that launches the 
MPI processes to figure out the cpuset. In case of OpenMPI + mpirun I think 
something should happen in orte/mca/ras/slurm/ras_slurm_module.c, where the 
variable _is_ actually parsed. Unfortunately, it is never really used...

As a result, cpuset of all tasks started on a given compute node includes all 
CPU cores of all MPI tasks on that node, just as provided by SLURM (in the 
above example - 8). In general, there is no simple way for the user code in the 
MPI procs to 'split' the cores between themselves. I imagine the original 
intention to support this in OpenMPI was something like

mpirun --bind-to subtask_cpuset

with an artificial bind target that would cause OpenMPI to divide the allocated 
cores between the mpi tasks. Is this right? If so, it seems that at this point 
this is not implemented. Is there plans to do this? If no, does anyone know 
another way to achieve that?

Thanks a lot!

Marcin



___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/10/27803.php


___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/10/27812.php

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/10/27818.php

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/10/27820.php




Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-08 Thread marcin.krotkiewski

Dear Ralph, Gilles, and Jeff

Thanks a lot for your effort. Understanding this problem has been a 
very interesting exercise for me and has let me understand OpenMPI much 
better (I think :).


I have given it all a little more thought, and done some more tests on 
our production system, and I think that this is not exactly a 
corner-case. First of all, I suspect all of this holds for other job 
scheduling systems besides SLURM (to be thought about..). Moreover, on 
our system a rather common usage scenario involves SLURM job allocation 
using, e.g.,


salloc --ntasks=32

which results in very fragmented allocations - that's specific for the 
type of problems users use this cluster for, but it's a fact. Users then 
run the job using


mpirun ./program

For versions up to 1.10.0, with uneven resource allocation among compute 
nodes, the default binding options used in OpenMPI in most cases result 
in some CPU cores not being present in the used cpuset at all, and others 
being over/under-subscribed. This certainly is job-specific and depends 
on how fragmented the SLURM allocations are, but to give a scary number: 
in one case I started 512 tasks (1 per core), and the OpenMPI binding 
created a cpuset that used only 271 cores, some of them being 
over/under-subscribed on top of that. Effectively, the user gets 50% of what 
he asked for. As already discussed, this happens quietly - the user has 
no idea.


For version 1.10.1rc1 and up the situation is a bit different: it seems 
that in many cases all cores are present in the cpuset, just that the 
binding does not take place in a lot of cases. Instead, processes are 
bound to all cores allocated by SLURM. In other scenarios, as discussed 
before, some cores are over/under-subscribed. Again, this is done quietly.


In all cases what is needed is the --hetero-nodes switch. If I apply the 
patch that Gilles has posted, it seems to be enough for 1.10.1rc1 and 
up. The switch is not enough for earlier versions of OpenMPI and one 
needs --map-by core in addition.


Given all that I think some sort of fix would be in order soon. I agree 
with Ralph that to address this issue quickly a simplified fix would be 
a good choice. As Ralph has already pointed out (or at least how I 
understood it :) this would essentially involve activating 
--hetero-nodes by default, and using --map-by core in cases where the 
architecture is not homogeneous. Uncovering the warning so that the 
failure to bind is not silent is the last piece of puzzle. Maybe adding 
a sanity check to make sure all allocated resources are in use would be 
helpful - if not by default, then maybe with some flag.


Does all this make sense?

Again, thank you all for your help,

Marcin





On 10/07/2015 04:03 PM, Ralph Castain wrote:
I’m a little nervous about this one, Gilles. It’s doing a lot more 
than just addressing the immediate issue, and I’m concerned about any 
potential side-effects that we don’t fully uncover prior to release.


I’d suggest a two-pronged approach:

1. use my alternative method for 1.10.1 to solve the immediate issue. 
It only affects this one, rather unusual, corner-case that was 
reported here. So the impact can be easily contained and won’t impact 
anything else.


2. push your proposed solution to the master where it can soak for 
awhile and give us a chance to fully discover the secondary effects. 
Removing the unused and “not-allowed” cpus from the topology means a 
substantial scrub of the code base in a number of places, and your 
patch doesn’t really get them all. It’s going to take time to ensure 
everything is working correctly again.


HTH
Ralph

On Oct 7, 2015, at 4:29 AM, Gilles Gouaillardet 
<mailto:gilles.gouaillar...@gmail.com>> wrote:


Jeff,

there are quite a lot of changes, I did not update master yet (need 
extra pairs of eyes to review this...)
so unless you want to make rc2 today and rc3 a week later, it is imho 
way safer to wait for v1.10.2


Ralph,
any thoughts ?

Cheers,

Gilles

On Wednesday, October 7, 2015, Jeff Squyres (jsquyres) 
mailto:jsquy...@cisco.com>> wrote:


Is this something that needs to go into v1.10.1?

If so, a PR needs to be filed ASAP.  We were supposed to make the
next 1.10.1 RC yesterday, but slipped to today due to some last
second patches.


> On Oct 7, 2015, at 4:32 AM, Gilles Gouaillardet
> wrote:
>
> Marcin,
>
> here is a patch for the master, hopefully it fixes all the
issues we discussed
> i will make sure it applies fine vs latest 1.10 tarball from
tomorrow
>
> Cheers,
>
> Gilles
>
>
> On 10/6/2015 7:22 PM, marcin.krotkiewski wrote:
>> Gilles,
>>
>> Yes, it seemed that all was fine with binding in the patched
1.10.1rc1 - thank you. Eagerly waiting for the other patches, let
me know and I will test them later this week.
>>
>> Mar

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-08 Thread marcin.krotkiewski

Sorry, I think I confused one thing:

On 10/08/2015 09:15 PM, marcin.krotkiewski wrote:


For version 1.10.1rc1 and up the situation is a bit different: it 
seems that in many cases all cores are present in the cpuset, just 
that the binding does not take place in a lot of cases. Instead, 
processes are bound to all cores allocated by SLURM. In other 
scenarios, as discussed before, some cores are over/under-subscribed. 
Again, this is done quietly.


The problem here was in fact a failure to run with an error message, not 
under/over-subscription. Sorry for this - I wanted to cover too much at 
the same time...


Marcin






In all cases what is needed is the --hetero-nodes switch. If I apply 
the patch that Gilles has posted, it seems to be enough for 1.10.1rc1 
and up. The switch is not enough for earlier versions of OpenMPI and 
one needs --map-by core in addition.


Given all that I think some sort of fix would be in order soon. I 
agree with Ralph that to address this issue quickly a simplified fix 
would be a good choice. As Ralph has already pointed out (or at least 
how I understood it :) this would essentially involve activating 
--hetero-nodes by default, and using --map-by core in cases where the 
architecture is not homogeneous. Uncovering the warning so that the 
failure to bind is not silent is the last piece of puzzle. Maybe 
adding a sanity check to make sure all allocated resources are in use 
would be helpful - if not by default, then maybe with some flag.


Does all this make sense?

Again, thank you all for your help,

Marcin





On 10/07/2015 04:03 PM, Ralph Castain wrote:
I’m a little nervous about this one, Gilles. It’s doing a lot more 
than just addressing the immediate issue, and I’m concerned about any 
potential side-effects that we don’t fully uncover prior to release.


I’d suggest a two-pronged approach:

1. use my alternative method for 1.10.1 to solve the immediate issue. 
It only affects this one, rather unusual, corner-case that was 
reported here. So the impact can be easily contained and won’t impact 
anything else.


2. push your proposed solution to the master where it can soak for 
awhile and give us a chance to fully discover the secondary effects. 
Removing the unused and “not-allowed” cpus from the topology means a 
substantial scrub of the code base in a number of places, and your 
patch doesn’t really get them all. It’s going to take time to ensure 
everything is working correctly again.


HTH
Ralph

On Oct 7, 2015, at 4:29 AM, Gilles Gouaillardet 
<mailto:gilles.gouaillar...@gmail.com>> wrote:


Jeff,

there are quite a lot of changes, I did not update master yet (need 
extra pairs of eyes to review this...)
so unless you want to make rc2 today and rc3 a week later, it is 
imho way safer to wait for v1.10.2


Ralph,
any thoughts ?

Cheers,

Gilles

On Wednesday, October 7, 2015, Jeff Squyres (jsquyres) 
mailto:jsquy...@cisco.com>> wrote:


Is this something that needs to go into v1.10.1?

If so, a PR needs to be filed ASAP.  We were supposed to make
the next 1.10.1 RC yesterday, but slipped to today due to some
last second patches.


> On Oct 7, 2015, at 4:32 AM, Gilles Gouaillardet
> wrote:
>
> Marcin,
>
> here is a patch for the master, hopefully it fixes all the
issues we discussed
> i will make sure it applies fine vs latest 1.10 tarball from
tomorrow
>
> Cheers,
>
> Gilles
>
    >
> On 10/6/2015 7:22 PM, marcin.krotkiewski wrote:
>> Gilles,
>>
>> Yes, it seemed that all was fine with binding in the patched
1.10.1rc1 - thank you. Eagerly waiting for the other patches,
let me know and I will test them later this week.
>>
>> Marcin
>>
>>
>>
>> On 10/06/2015 12:09 PM, Gilles Gouaillardet wrote:
>>> Marcin,
>>>
>>> my understanding is that in this case, patched v1.10.1rc1 is
working just fine.
>>> am I right ?
>>>
>>> I prepared two patches
>>> one to remove the warning when binding on one core if only
one core is available,
>>> an other one to add a warning if the user asks a binding
policy that makes no sense with the required mapping policy
>>>
>>> I will finalize them tomorrow hopefully
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Tuesday, October 6, 2015, marcin.krotkiewski
> wrote:
>>> Hi, Gilles
>>>> you mentionned you had one failure with 1.10.1rc1 and
-bind-to core
>>>> could you please send the full details (script, allocation
and output)
>>>> in your slurm script, you can do
>>>> srun -N $SLURM_NNOD

[OMPI users] UCX and multithreading

2018-04-17 Thread marcin.krotkiewski

Hi, all,

I'm reading in the changelog 3.0.0 that

- Use UCX multi-threaded API in the UCX PML.  Requires UCX 1.0 or later.

Also, the changelog for 3.1.0 it says that

- UCX PML improvements: add multi-threading support.

Could anyone briefly explain what these mean in terms of 
functionality / performance? Does this mean that the UCX PML has an 
internal progress thread for asynchronous comm, or is this only related 
to supporting multi-threaded MPI applications?


BTW, are there any plans to support progress threads in the future?

Thanks!

Marcin Krotkiewski

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


[OMPI users] OSHMEM: shmem_ptr always returns NULL

2018-04-18 Thread marcin.krotkiewski

Hi,

I'm running the below example from the OpenMPI documentation:

#include <stdio.h>
#include <shmem.h>

int main(void)
{
  static int bigd[100];
  int *ptr;
  int i;
  shmem_init();
  if (shmem_my_pe() == 0) {
    /* initialize PE 1's bigd array */
    ptr = (int *) shmem_ptr(bigd, 1);
    if (!ptr) {
      fprintf(stderr, "get external pointer failed!\n");
      shmem_global_exit(-1);
    }
    for (i = 0; i < 100; i++)
      *ptr++ = i + 1;
  }
  shmem_barrier_all();
  if (shmem_my_pe() == 1) {
    printf("bigd on PE 1 is:\n");
    for (i = 0; i < 100; i++)
      printf(" %d\n", bigd[i]);
    printf("\n");
  }
  return 0;
}

but shmem_ptr always returns NULL for me. I tried with OpenMPI versions 
from 2.0.1 up to 3.1.0rc4, compiled with HPCX 2.1, running on a 
ConnectX-4 system. This is the command line:


$ shmemrun -mca spml ucx -mca spml_base_verbose 100 -np 2 -map-by node 
-report-bindings ./a.out


[c11-1:36505] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: 
[BB/../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../..]
[c11-2:105580] MCW rank 1 bound to socket 0[core 0[hwt 0-1]]: 
[BB/../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../..]
[c11-1:36522] mca: base: components_register: registering framework spml 
components

[c11-1:36522] mca: base: components_register: found loaded component ucx
[c11-1:36522] mca: base: components_register: component ucx register 
function successful

[c11-1:36522] mca: base: components_open: opening spml components
[c11-1:36522] mca: base: components_open: found loaded component ucx
[c11-2:105590] mca: base: components_register: registering framework 
spml components

[c11-2:105590] mca: base: components_register: found loaded component ucx
[c11-2:105590] mca: base: components_register: component ucx register 
function successful

[c11-2:105590] mca: base: components_open: opening spml components
[c11-2:105590] mca: base: components_open: found loaded component ucx
[c11-1:36522] mca: base: components_open: component ucx open function 
successful
[c11-2:105590] mca: base: components_open: component ucx open function 
successful
[c11-1:36522] base/spml_base_select.c:107 - mca_spml_base_select() 
select: initializing spml component ucx
[c11-1:36522] spml_ucx_component.c:173 - mca_spml_ucx_component_init() 
in ucx, my priority is 21
[c11-2:105590] base/spml_base_select.c:107 - mca_spml_base_select() 
select: initializing spml component ucx
[c11-2:105590] spml_ucx_component.c:173 - mca_spml_ucx_component_init() 
in ucx, my priority is 21
[c11-1:36522] spml_ucx_component.c:184 - mca_spml_ucx_component_init() 
*** ucx initialized 
[c11-1:36522] base/spml_base_select.c:119 - mca_spml_base_select() 
select: init returned priority 21
[c11-1:36522] base/spml_base_select.c:160 - mca_spml_base_select() 
selected ucx best priority 21
[c11-1:36522] base/spml_base_select.c:194 - mca_spml_base_select() 
select: component ucx selected

[c11-1:36522] spml_ucx.c:82 - mca_spml_ucx_enable() *** ucx ENABLED 
[c11-2:105590] spml_ucx_component.c:184 - mca_spml_ucx_component_init() 
*** ucx initialized 
[c11-2:105590] base/spml_base_select.c:119 - mca_spml_base_select() 
select: init returned priority 21
[c11-2:105590] base/spml_base_select.c:160 - mca_spml_base_select() 
selected ucx best priority 21
[c11-2:105590] base/spml_base_select.c:194 - mca_spml_base_select() 
select: component ucx selected

[c11-2:105590] spml_ucx.c:82 - mca_spml_ucx_enable() *** ucx ENABLED 
[c11-1:36522] spml_ucx.c:305 - mca_spml_ucx_add_procs() *** ADDED PROCS ***
[c11-2:105590] spml_ucx.c:305 - mca_spml_ucx_add_procs() *** ADDED PROCS ***
shared_mr flags are not supported
shared_mr flags are not supported
get external pointer failed!


So everything looks fine, except perhaps the 'shared_mr flags are not 
supported' message.


Does anyone have an idea why I get NULL? The same happens if I start two 
ranks on the same compute node, and if I use a shmem_malloc'ed pointer 
instead of a static array.
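
(For completeness, the shmem_malloc variant I tried looks roughly like 
this - a minimal sketch, not my exact test code:)

#include <stdio.h>
#include <shmem.h>

int main(void)
{
  int i;

  shmem_init();
  /* symmetric-heap allocation instead of a static (data-segment) array */
  int *bigd = (int *) shmem_malloc(100 * sizeof(int));

  if (shmem_my_pe() == 0) {
    /* ask for a locally usable pointer to PE 1's copy of bigd */
    int *ptr = (int *) shmem_ptr(bigd, 1);
    if (!ptr)
      fprintf(stderr, "shmem_ptr returned NULL here as well\n");
    else
      for (i = 0; i < 100; i++)
        ptr[i] = i + 1;
  }

  shmem_barrier_all();
  shmem_free(bigd);
  shmem_finalize();
  return 0;
}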


Thank you,

Marcin

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] OpenMPI + custom glibc, a mini HOWTO

2018-05-22 Thread marcin.krotkiewski

Hi, all

I have gone through quite some effort to compile OpenMPI against a 
custom (non-native) glibc. The reason I need this is that GCC can then 
use the vectorized libm (libmvec), which was introduced in glibc 2.22. 
And of course no HPC OS ships with 2.22 - they are all a few years 
behind!
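
(To illustrate why this matters - a minimal sketch, and the exact flags 
depend on your compiler and CPU: with glibc >= 2.22, GCC can auto-vectorize 
a plain libm loop like the one below into calls to the vectorized math 
library, e.g. when building with something like -O3 -march=native 
-ffast-math.)

#include <math.h>
#include <stdio.h>

#define N 1024

int main(void)
{
  static float x[N], y[N];
  int i;

  for (i = 0; i < N; i++)
    x[i] = (float) i / N;

  /* with a new enough glibc, GCC may replace this loop's expf calls with
     the SIMD variants from the vectorized libm */
  for (i = 0; i < N; i++)
    y[i] = expf(x[i]);

  printf("y[N-1] = %f\n", y[N - 1]);
  return 0;
}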


While using a custom glibc for a single user program is doable, getting 
a working OpenMPI environment to go with it posed a challenge. I have 
attached a PDF 
with a short description of the procedure, in case someone finds it 
useful as well. I'd appreciate comments / suggestions as to what can be 
done better, and what I might have overlooked. The subject is tricky, 
and I might well have missed something.


Thanks!

Marcin




openmpi_glibc.pdf
Description: Adobe PDF document
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] A hang in Rmpi at PMIx_Disconnect

2018-06-04 Thread marcin.krotkiewski

Hi,

I have some problems running R + Rmpi with OpenMPI 3.1.0 + PMIx 2.1.1. A 
simple R script, which starts a few tasks, hangs at the end on 
disconnect. Here is the script:


library(parallel)
numWorkers <- as.numeric(Sys.getenv("SLURM_NTASKS")) - 1
myCluster <- makeCluster(numWorkers, type = "MPI")
stopCluster(myCluster)

And here is how I run it:

SLURM_NTASKS=5 mpirun -np 1 -mca pml ^yalla -mca mtl ^mxm -mca coll 
^hcoll R --slave < mk.R


Notice -np 1 - this is apparently how you start Rmpi jobs: ranks are 
spawned by R dynamically inside the script. So I ran into a number of 
issues here:


1. with HPCX it seems that dynamic starting of ranks is not supported, 
hence I had to turn off all of yalla/mxm/hcoll


--
Your application has invoked an MPI function that is not supported in
this environment.

  MPI function: MPI_Comm_spawn
  Reason:   the Yalla (MXM) PML does not support MPI dynamic 
process functionality

--

2. when I do that, the program does create a 'cluster' and starts the 
ranks, but hangs in PMIx at MPI Disconnect. Here is the top of the trace 
from gdb:


#0  0x7f66b1e1e995 in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/libpthread.so.0
#1  0x7f669eaeba5b in PMIx_Disconnect (procs=procs@entry=0x2e25d20, 
nprocs=nprocs@entry=10, info=info@entry=0x0, ninfo=ninfo@entry=0) at 
client/pmix_client_connect.c:232
#2  0x7f669ed6239c in ext2x_disconnect (procs=0x7ffd58322440) at 
ext2x_client.c:1432
#3  0x7f66a13bc286 in ompi_dpm_disconnect (comm=0x2cc0810) at 
dpm/dpm.c:596
#4  0x7f66a13e8668 in PMPI_Comm_disconnect (comm=0x2cbe058) at 
pcomm_disconnect.c:67
#5  0x7f66a16799e9 in mpi_comm_disconnect () from 
/cluster/software/R-packages/3.5/Rmpi/libs/Rmpi.so
#6  0x7f66b2563de5 in do_dotcall () from 
/cluster/software/R/3.5.0/lib64/R/lib/libR.so
#7  0x7f66b25a207b in bcEval () from 
/cluster/software/R/3.5.0/lib64/R/lib/libR.so
#8  0x7f66b25b0fd0 in Rf_eval.localalias.34 () from 
/cluster/software/R/3.5.0/lib64/R/lib/libR.so
#9  0x7f66b25b2c62 in R_execClosure () from 
/cluster/software/R/3.5.0/lib64/R/lib/libR.so


Might this also be related to the dynamic rank creation in R?
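
(For reference, my understanding is that Rmpi's makeCluster / stopCluster 
boils down to roughly the following spawn + disconnect pattern - a minimal 
C sketch with a hypothetical ./worker binary, not the actual Rmpi code - 
and it is the disconnect step that hangs for me:)

#include <mpi.h>

int main(int argc, char **argv)
{
  MPI_Comm children;

  MPI_Init(&argc, &argv);

  /* start 4 workers dynamically, as Rmpi does from inside the script */
  MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL, 0,
                 MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);

  /* ... exchange work over the intercommunicator 'children' ... */

  /* stopCluster() should end up doing the equivalent of this */
  MPI_Comm_disconnect(&children);

  MPI_Finalize();
  return 0;
}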

Thanks!

Marcin

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] A hang in Rmpi at PMIx_Disconnect

2018-06-04 Thread marcin.krotkiewski

Thanks, Ralph!

Your code finishes normally, so I guess the reason might lie in R. 
Running the R code with -mca pmix_base_verbose 1, I see that each rank 
calls ext2x:client disconnect twice (each PID prints the line twice):


[...]
    3 slaves are spawned successfully. 0 failed.
[localhost.localdomain:11659] ext2x:client disconnect
[localhost.localdomain:11661] ext2x:client disconnect
[localhost.localdomain:11658] ext2x:client disconnect
[localhost.localdomain:11646] ext2x:client disconnect
[localhost.localdomain:11658] ext2x:client disconnect
[localhost.localdomain:11659] ext2x:client disconnect
[localhost.localdomain:11661] ext2x:client disconnect
[localhost.localdomain:11646] ext2x:client disconnect

In your example it's only called once per process.

Do you have any suspicion where the second call comes from? Might this 
be the reason for the hang?


Thanks!

Marcin


On 06/04/2018 03:16 PM, r...@open-mpi.org wrote:

Try running the attached example dynamic code - if that works, then it likely 
is something to do with how R operates.






On Jun 4, 2018, at 3:43 AM, marcin.krotkiewski  
wrote:

Hi,

I have some problems running R + Rmpi with OpenMPI 3.1.0 + PMIx 2.1.1. A simple 
R script, which starts a few tasks, hangs at the end on disconnect. Here is the 
script:

library(parallel)
numWorkers <- as.numeric(Sys.getenv("SLURM_NTASKS")) - 1
myCluster <- makeCluster(numWorkers, type = "MPI")
stopCluster(myCluster)

And here is how I run it:

SLURM_NTASKS=5 mpirun -np 1 -mca pml ^yalla -mca mtl ^mxm -mca coll ^hcoll R 
--slave < mk.R

Notice -np 1 - this is apparently how you start Rmpi jobs: ranks are spawned by 
R dynamically inside the script. So I ran into a number of issues here:

1. with HPCX it seems that dynamic starting of ranks is not supported, hence I 
had to turn off all of yalla/mxm/hcoll

--
Your application has invoked an MPI function that is not supported in
this environment.

   MPI function: MPI_Comm_spawn
   Reason:   the Yalla (MXM) PML does not support MPI dynamic process 
functionality
--

2. when I do that, the program does create a 'cluster' and starts the ranks, 
but hangs in PMIx at MPI Disconnect. Here is the top of the trace from gdb:

#0  0x7f66b1e1e995 in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/libpthread.so.0
#1  0x7f669eaeba5b in PMIx_Disconnect (procs=procs@entry=0x2e25d20, 
nprocs=nprocs@entry=10, info=info@entry=0x0, ninfo=ninfo@entry=0) at 
client/pmix_client_connect.c:232
#2  0x7f669ed6239c in ext2x_disconnect (procs=0x7ffd58322440) at 
ext2x_client.c:1432
#3  0x7f66a13bc286 in ompi_dpm_disconnect (comm=0x2cc0810) at dpm/dpm.c:596
#4  0x7f66a13e8668 in PMPI_Comm_disconnect (comm=0x2cbe058) at 
pcomm_disconnect.c:67
#5  0x7f66a16799e9 in mpi_comm_disconnect () from 
/cluster/software/R-packages/3.5/Rmpi/libs/Rmpi.so
#6  0x7f66b2563de5 in do_dotcall () from 
/cluster/software/R/3.5.0/lib64/R/lib/libR.so
#7  0x7f66b25a207b in bcEval () from 
/cluster/software/R/3.5.0/lib64/R/lib/libR.so
#8  0x7f66b25b0fd0 in Rf_eval.localalias.34 () from 
/cluster/software/R/3.5.0/lib64/R/lib/libR.so
#9  0x7f66b25b2c62 in R_execClosure () from 
/cluster/software/R/3.5.0/lib64/R/lib/libR.so

Might this also be related to the dynamic rank creation in R?

Thanks!

Marcin



___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] A hang in Rmpi at PMIx_Disconnect

2018-06-04 Thread marcin.krotkiewski
Huh. This code also runs, but it too only displays 4 connect / 
disconnect messages. I should add that the test R script shows 4 
connects, but 8 disconnects. Looks like a bug to me, but where? I 
guess we will try to contact the R forums and ask there.


Bennet: I tried to use doMPI + startMPIcluster / closeCluster. In this 
case I get a warning about fork being used:


--
A process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:  [[36000,2],1] (PID 23617)

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--

And the process hangs as well - no change.

Marcin



On 06/04/2018 05:27 PM, r...@open-mpi.org wrote:

It might call disconnect more than once if it creates multiple communicators. 
Here’s another test case for that behavior:






On Jun 4, 2018, at 7:08 AM, Bennet Fauber  wrote:

Just out of curiosity, but would using Rmpi and/or doMPI help in any way?

-- bennet


On Mon, Jun 4, 2018 at 10:00 AM, marcin.krotkiewski
 wrote:

Thanks, Ralph!

Your code finishes normally, I guess then the reason might be lying in R.
Running the R code with -mca pmix_base_verbose 1 i see that each rank calls
ext2x:client disconnect twice (each PID prints the line twice)

[...]
3 slaves are spawned successfully. 0 failed.
[localhost.localdomain:11659] ext2x:client disconnect
[localhost.localdomain:11661] ext2x:client disconnect
[localhost.localdomain:11658] ext2x:client disconnect
[localhost.localdomain:11646] ext2x:client disconnect
[localhost.localdomain:11658] ext2x:client disconnect
[localhost.localdomain:11659] ext2x:client disconnect
[localhost.localdomain:11661] ext2x:client disconnect
[localhost.localdomain:11646] ext2x:client disconnect

In your example it's only called once per process.

Do you have any suspicion where the second call comes from? Might this be
the reason for the hang?

Thanks!

Marcin


On 06/04/2018 03:16 PM, r...@open-mpi.org wrote:

Try running the attached example dynamic code - if that works, then it
likely is something to do with how R operates.





On Jun 4, 2018, at 3:43 AM, marcin.krotkiewski
 wrote:

Hi,

I have some problems running R + Rmpi with OpenMPI 3.1.0 + PMIx 2.1.1. A
simple R script, which starts a few tasks, hangs at the end on disconnect.
Here is the script:

library(parallel)
numWorkers <- as.numeric(Sys.getenv("SLURM_NTASKS")) - 1
myCluster <- makeCluster(numWorkers, type = "MPI")
stopCluster(myCluster)

And here is how I run it:

SLURM_NTASKS=5 mpirun -np 1 -mca pml ^yalla -mca mtl ^mxm -mca coll ^hcoll R
--slave < mk.R

Notice -np 1 - this is apparently how you start Rmpi jobs: ranks are spawned
by R dynamically inside the script. So I ran into a number of issues here:

1. with HPCX it seems that dynamic starting of ranks is not supported, hence
I had to turn off all of yalla/mxm/hcoll

--
Your application has invoked an MPI function that is not supported in
this environment.

  MPI function: MPI_Comm_spawn
  Reason:   the Yalla (MXM) PML does not support MPI dynamic process
functionality
--

2. when I do that, the program does create a 'cluster' and starts the ranks,
but hangs in PMIx at MPI Disconnect. Here is the top of the trace from gdb:

#0  0x7f66b1e1e995 in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0
#1  0x7f669eaeba5b in PMIx_Disconnect (procs=procs@entry=0x2e25d20,
nprocs=nprocs@entry=10, info=info@entry=0x0, ninfo=ninfo@entry=0) at
client/pmix_client_connect.c:232
#2  0x7f669ed6239c in ext2x_disconnect (procs=0x7ffd58322440) at
ext2x_client.c:1432
#3  0x7f66a13bc286 in ompi_dpm_disconnect (comm=0x2cc0810) at
dpm/dpm.c:596
#4  0x7f66a13e8668 in PMPI_Comm_disconnect (comm=0x2cbe058) at
pcomm_disconnect.c:67
#5  0x7f66a16799e9 in mpi_comm_disconnect () from
/cluster/software/R-packages/3.5/Rmpi/libs/Rmpi.so
#6  0x7f66b2563de5 in do_dotcall () from
/cluster/software/R/3.5.0/lib64/R/lib/libR.so
#7  0x7f66b25a207b in bcEval () from
/cluster/software/R/3.5.0/lib64/R/lib/libR.so
#8  0x7f66b25b0fd0 in Rf_eval.localalias.34 () from
/cluster/software/R/3.5.0/lib64/R/lib/libR.so
#9  0x7f66b25b2c62 in R_execClosure () from
/cluster/software/R/3.5.0/

Re: [OMPI users] A hang in Rmpi at PMIx_Disconnect

2018-06-05 Thread marcin.krotkiewski
Well, I tried with 3.0.1, and it also hangs. I guess we will try to 
write to the R community about this.


m


On 06/04/2018 11:42 PM, Ben Menadue wrote:

Hi All,

This looks very much like what I reported a couple of weeks ago with 
Rmpi and doMPI — the trace looks the same.  But as far as I could see, 
doMPI does exactly what simple_spawn.c does — use MPI_Comm_spawn to 
create the workers and then MPI_Comm_disconnect them when you call 
closeCluster, and it’s here that it hung.


Ralph suggested trying master, but I haven’t had a chance to try this 
yet. I’ll try it today and see if it works for me now.


Cheers,
Ben


On 5 Jun 2018, at 6:28 am, r...@open-mpi.org wrote:


Yes, that does sound like a bug - the #connects must equal the 
#disconnects.



On Jun 4, 2018, at 1:17 PM, marcin.krotkiewski wrote:


huh. This code also runs, but it also only displays 4 connect / 
disconnect messages. I should add that the test R script shows 4 
connect, but 8 disconnect messages. Looks like a bug to me, but 
where? I guess we will try to contact R forums and ask there.


Bennet: I tried to use doMPI + startMPIcluster / closeCluster. In 
this case I get a warning about fork being used:


--
A process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:  [[36000,2],1] (PID 23617)

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--

And the process hangs as well - no change.

Marcin



On 06/04/2018 05:27 PM, r...@open-mpi.org wrote:

It might call disconnect more than once if it creates multiple communicators. 
Here’s another test case for that behavior:




On Jun 4, 2018, at 7:08 AM, Bennet Fauber  wrote:

Just out of curiosity, but would using Rmpi and/or doMPI help in any way?

-- bennet


On Mon, Jun 4, 2018 at 10:00 AM, marcin.krotkiewski
  wrote:

Thanks, Ralph!

Your code finishes normally, I guess then the reason might be lying in R.
Running the R code with -mca pmix_base_verbose 1 i see that each rank calls
ext2x:client disconnect twice (each PID prints the line twice)

[...]
3 slaves are spawned successfully. 0 failed.
[localhost.localdomain:11659] ext2x:client disconnect
[localhost.localdomain:11661] ext2x:client disconnect
[localhost.localdomain:11658] ext2x:client disconnect
[localhost.localdomain:11646] ext2x:client disconnect
[localhost.localdomain:11658] ext2x:client disconnect
[localhost.localdomain:11659] ext2x:client disconnect
[localhost.localdomain:11661] ext2x:client disconnect
[localhost.localdomain:11646] ext2x:client disconnect

In your example it's only called once per process.

Do you have any suspicion where the second call comes from? Might this be
the reason for the hang?

Thanks!

Marcin


On 06/04/2018 03:16 PM, r...@open-mpi.org wrote:

Try running the attached example dynamic code - if that works, then it
likely is something to do with how R operates.





On Jun 4, 2018, at 3:43 AM, marcin.krotkiewski
  wrote:

Hi,

I have some problems running R + Rmpi with OpenMPI 3.1.0 + PMIx 2.1.1. A
simple R script, which starts a few tasks, hangs at the end on disconnect.
Here is the script:

library(parallel)
numWorkers <- as.numeric(Sys.getenv("SLURM_NTASKS")) - 1
myCluster <- makeCluster(numWorkers, type = "MPI")
stopCluster(myCluster)

And here is how I run it:

SLURM_NTASKS=5 mpirun -np 1 -mca pml ^yalla -mca mtl ^mxm -mca coll ^hcoll R
--slave < mk.R

Notice -np 1 - this is apparently how you start Rmpi jobs: ranks are spawned
by R dynamically inside the script. So I ran into a number of issues here:

1. with HPCX it seems that dynamic starting of ranks is not supported, hence
I had to turn off all of yalla/mxm/hcoll

--
Your application has invoked an MPI function that is not supported in
this environment.

  MPI function: MPI_Comm_spawn
  Reason:   the Yalla (MXM) PML does not support MPI dynamic process
functionality
--

2. when I do that, the program does create a 'cluster' and starts the ranks,
but hangs in PMIx at MPI Disconnect. Here is the top of the trace from gdb:

#0  0x7f66b1e1e995 in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib64/libp