Re: [OMPI users] OpenMPI-1.10.0 bind-to core error

2015-09-24 Thread Ralph Castain
Ha! I finally tracked it down - a new code path that bypassed the prior error output. I have a fix going into master shortly, and will then port it to 1.10.1. Thanks for your patience! Ralph > On Sep 24, 2015, at 1:12 AM, Patrick Begou wrote: > Sorry for the delay. Running mpirun with wr…

Re: [OMPI users] OpenMPI-1.10.0 bind-to core error

2015-09-24 Thread Patrick Begou
Sorry for the delay. Running mpirun with a wrong OMPI_MCA_plm_rsh_agent doesn't give any explicit message in OpenMPI-1.10.0. How I can show the problem: I request 2 nodes, 1 CPU on each node, 4 cores on each CPU (so 8 cores available with cpusets). Node file is: [begou@frog7 MPI_TESTS]$ cat $O…
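
A minimal sketch of the reproduction Patrick describes, assuming an OAR allocation ($OAR_NODEFILE) and a hypothetical test binary ./hello_mpi:

    # point the rsh agent at a command that is NOT on $PATH
    export OMPI_MCA_plm_rsh_agent=oarshmost
    # allocation: 2 nodes x 4 cores = 8 slots, restricted by cpusets
    cat $OAR_NODEFILE
    # OpenMPI-1.10.0 starts all ranks locally, with no error about the agent
    mpirun -np 8 -machinefile $OAR_NODEFILE --bind-to core ./hello_mpi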

Re: [OMPI users] OpenMPI-1.10.0 bind-to core error

2015-09-23 Thread Ralph Castain
I’m really puzzled by that one - we very definitely will report an error and exit if the user specifies that MCA param and we don’t find the given agent. Could you please send us the actual cmd line plus the hostfile you gave, and verify that the MCA param was set? > On Sep 21, 2015, at 8:42 A…
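
One way to verify what Ralph asks for, hedged since the exact setup lives in Patrick's environment:

    # confirm the MCA param really is set in the shell that runs mpirun
    printenv OMPI_MCA_plm_rsh_agent        # expected: oarshmost
    # and check the value mpirun will actually use
    ompi_info --all | grep plm_rsh_agent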

Re: [OMPI users] OpenMPI-1.10.0 bind-to core error

2015-09-21 Thread Gilles Gouaillardet
Patrick, thanks for the report. Can you confirm what happened was: you defined OMPI_MCA_plm_rsh_agent=oarshmost, oarshmost was not in the $PATH, and mpirun silently ignored the remote nodes. If that is correct, then I think mpirun should have reported an error (oarshmost not found, or cannot remot…
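
A quick pre-flight check matching Gilles's hypothesis (sketch):

    # if this fails, 1.10.0 falls back to a local launch without any message
    type oarshmost || echo "oarshmost not found in PATH"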

Re: [OMPI users] OpenMPI-1.10.0 bind-to core error

2015-09-21 Thread Patrick Begou
Hi Gilles, I've made a big mistake! Compiling the patched version of OpenMPI and creating a new module, I forgot to add the path to the oarshmost command while OMPI_MCA_plm_rsh_agent=oarshmost was set. OpenMPI was silently ignoring the oarshmost command as it was unable to find it, and so only…

Re: [OMPI users] OpenMPI-1.10.0 bind-to core error

2015-09-18 Thread Gilles Gouaillardet
Thanks Patrick, could you please try again with the --hetero-nodes mpirun option? (I am afk, and not 100% sure about the syntax.) Could you also submit a job with 2 nodes and 4 cores on each node that does: cat /proc/self/status; oarshmost cat /proc/self/status. Btw, is there any reason why do yo…
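
Roughly what Gilles requests, narrowed to the Cpus_allowed_list line for brevity; the oarshmost invocation is assumed to mirror oarsh, and $REMOTE_NODE is a placeholder:

    # run with heterogeneous-node support enabled
    mpirun --hetero-nodes -np 8 -machinefile $OAR_NODEFILE --bind-to core ./hello_mpi
    # compare the cpuset seen locally and through the agent
    grep Cpus_allowed_list /proc/self/status
    oarshmost $REMOTE_NODE grep Cpus_allowed_list /proc/self/status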

Re: [OMPI users] OpenMPI-1.10.0 bind-to core error

2015-09-18 Thread Patrick Begou
Gilles Gouaillardet wrote: Patrick, by the way, this will work when running on a single node. I do not know what will happen when you run on multiple nodes... Since there is no OAR integration in openmpi, I guess you are using ssh to start orted on the remote nodes (unless you instructed ompi…

Re: [OMPI users] OpenMPI-1.10.0 bind-to core error

2015-09-18 Thread Gilles Gouaillardet
Patrick, by the way, this will work when running on a single node. I do not know what will happen when you run on multiple nodes... Since there is no OAR integration in openmpi, I guess you are using ssh to start orted on the remote nodes (unless you instructed ompi to use an OARified version…

Re: [OMPI users] OpenMPI-1.10.0 bind-to core error

2015-09-18 Thread Gilles Gouaillardet
Patrick, I just filed PR 586 https://github.com/open-mpi/ompi-release/pull/586 for the v1.10 series. This is only a three-line patch. Could you please give it a try? Cheers, Gilles On 9/18/2015 4:54 PM, Patrick Begou wrote: Ralph Castain wrote: As I said, if you don't provide an explicit…
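
One way to try the PR against a 1.10 source tree; the .diff URL pattern is GitHub's, and this assumes the tree matches the repository layout:

    cd openmpi-1.10.0
    # apply the PR 586 diff, then rebuild and reinstall
    curl -L https://github.com/open-mpi/ompi-release/pull/586.diff | patch -p1
    make -j4 && make install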

Re: [OMPI users] OpenMPI-1.10.0 bind-to core error

2015-09-18 Thread Patrick Begou
Ralph Castain wrote: As I said, if you don't provide an explicit slot count in your hostfile, we default to allowing oversubscription. We don't have OAR integration in OMPI, and so mpirun isn't recognizing that you are running under a resource manager - it thinks this is just being controlled b…

Re: [OMPI users] OpenMPI-1.10.0 bind-to core error

2015-09-17 Thread Ralph Castain
Thanks Gilles!! On Wed, Sep 16, 2015 at 9:21 PM, Gilles Gouaillardet wrote: > Ralph, you can reproduce this with master by manually creating a cpuset with fewer cores than available, and invoking mpirun with -bind-to core from within the cpuset. I made PR 904 https://github.com/open-mpi/ompi/pull/904…

Re: [OMPI users] OpenMPI-1.10.0 bind-to core error

2015-09-17 Thread Gilles Gouaillardet
Ralph, you can reproduce this with master by manually creating a cpuset with fewer cores than available, and invoking mpirun with -bind-to core from within the cpuset. I made PR 904 https://github.com/open-mpi/ompi/pull/904 Brice, can you please double-check the hwloc_bitmap_isincluded invocati…
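
A hedged sketch of that reproduction on a Linux box with cgroup-v1 cpusets; the cgroup path, name, and core counts are assumptions:

    # expose only 4 of the 8 cores to the current shell
    sudo mkdir /sys/fs/cgroup/cpuset/half
    echo 0-3 | sudo tee /sys/fs/cgroup/cpuset/half/cpuset.cpus
    echo 0   | sudo tee /sys/fs/cgroup/cpuset/half/cpuset.mems
    echo $$  | sudo tee /sys/fs/cgroup/cpuset/half/tasks
    # 1.10.0 binds 8 ranks onto the 4 allowed cores instead of erroring out
    mpirun -np 8 --bind-to core ./a.out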

Re: [OMPI users] OpenMPI-1.10.0 bind-to core error

2015-09-16 Thread Ralph Castain
As I said, if you don’t provide an explicit slot count in your hostfile, we default to allowing oversubscription. We don’t have OAR integration in OMPI, and so mpirun isn’t recognizing that you are running under a resource manager - it thinks this is just being controlled by a hostfile. If you…
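
The two hostfile cases Ralph distinguishes, as a sketch (node names assumed; frog7 is taken from the thread, frog8 is hypothetical):

    # explicit slot counts: mpirun defaults to NO oversubscription
    cat > hosts_slots <<EOF
    frog7 slots=4
    frog8 slots=4
    EOF
    # no slot counts: mpirun defaults to ALLOWING oversubscription
    printf 'frog7\nfrog8\n' > hosts_noslots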

Re: [OMPI users] OpenMPI-1.10.0 bind-to core error

2015-09-16 Thread Patrick Begou
Thanks all for your answers, I've added some details about the tests I have run. See below. Ralph Castain wrote: Not precisely correct. It depends on the environment. If there is a resource manager allocating nodes, or you provide a hostfile that specifies the number of slots on the nodes,…

Re: [OMPI users] OpenMPI-1.10.0 bind-to core error

2015-09-15 Thread Ralph Castain
“We” do check the available cores - which is why I asked for details :-) > On Sep 15, 2015, at 7:10 PM, Gilles Gouaillardet wrote: > Ralph, my guess is that cpuset is set by the batch manager (slurm?) so I think this is an ompi bug/missing feature: "we" should check the available…

Re: [OMPI users] OpenMPI-1.10.0 bind-to core error

2015-09-15 Thread Gilles Gouaillardet
Ralph, my guess is that cpuset is set by the batch manager (slurm?) so I think this is an ompi bug/missing feature: "we" should check the available cores (4 in this case because of the cpuset) instead of the online cores (8 in this case). I wrote "we" because it could either be ompi or hwloc, or ompi…
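
The available-vs-online distinction Gilles draws is observable from inside the cpuset; a sketch, with values matching this thread's 4-of-8 case:

    grep Cpus_allowed_list /proc/self/status   # e.g. 0-3: the available cores
    nproc                                      # 4: honours the cpuset
    nproc --all                                # 8: all online cores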

Re: [OMPI users] OpenMPI-1.10.0 bind-to core error

2015-09-15 Thread Ralph Castain
Not precisely correct. It depends on the environment. If there is a resource manager allocating nodes, or you provide a hostfile that specifies the number of slots on the nodes, or you use -host, then we default to no-oversubscribe. If you provide a hostfile that doesn’t specify slots, then we…

Re: [OMPI users] OpenMPI-1.10.0 bind-to core error

2015-09-15 Thread Matt Thompson
Looking at the Open MPI 1.10.0 man page: https://www.open-mpi.org/doc/v1.10/man1/mpirun.1.php it looks like perhaps -oversubscribe (which was an option) is now the default behavior. Instead we have: *-nooversubscribe, --nooversubscribe* Do not oversubscribe any nodes; error (without starting an…
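
Usage of the option Matt quotes from the man page, as a sketch with an assumed binary:

    # ask mpirun to error out rather than oversubscribe any node
    mpirun -np 8 --nooversubscribe --bind-to core ./a.out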

[OMPI users] OpenMPI-1.10.0 bind-to core error

2015-09-15 Thread Patrick Begou
Hi, I'm running OpenMPI 1.10.0 built with Intel 2015 compilers on a Bullx system. I have some trouble with the bind-to core option when using a cpuset. If the cpuset is less than all the cores of a CPU (ex: 4 cores allowed on an 8-core CPU), OpenMPI 1.10.0 allows overloading these cores until the m…
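
The failing command from this original report, as a sketch with an assumed binary: inside a cpuset that allows only 4 of the CPU's 8 cores,

    # expected: an error, since only 4 cores are available; observed with
    # 1.10.0: all 8 ranks start, overloading the 4 allowed cores
    mpirun -np 8 --bind-to core ./my_app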