--bind-to none worked; the run completed just fine.  Additionally, --hetero-nodes also ran 
without error, though it didn’t allow threading properly, while --bind-to none did.

Is the best path forward to add that flag to every mpirun command line, or to set 
some environment variables?  Or alternatively, would the following work, avoiding 
both command-line flags and environment variables?:

When you install OMPI, an "etc" directory gets created under the prefix 
location. In that directory is a file "openmpi-mca-params.conf". This is your 
default MCA param file that mpirun (and every OMPI process) reads on startup. 
You can put any params in there that you want. In this case, you'd add a line:

hwloc_base_binding_policy = none
(from http://www.open-mpi.org/community/lists/users/2014/05/24467.php)
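
For reference, a minimal sketch of both approaches, assuming the 1.8.4 install 
prefix from the configure line quoted later in this thread:

  # system-wide default, read by mpirun and every OMPI process on startup:
  echo "hwloc_base_binding_policy = none" >> /opt/openmpi/openmpi-1.8.4/etc/openmpi-mca-params.conf

  # or set per-shell, via Open MPI's OMPI_MCA_ environment-variable convention:
  export OMPI_MCA_hwloc_base_binding_policy=none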

Thanks for the help,
--Jack


From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, February 24, 2015 3:24 PM
To: Open MPI Users
Subject: Re: [OMPI users] machinefile binding error

It looks to me like some of the nodes don’t have the required numactl packages 
installed. Why don’t you try launching the job without binding, just to see if 
everything works?

Just add “--bind-to none” to your cmd line and see if things work.
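
For example, with the machinefile from your message, something along the lines of:

  mpirun --bind-to none -np 361 -machinefile mach_burn_24s hostname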


On Feb 24, 2015, at 2:21 PM, Galloway, Jack D <ja...@lanl.gov> wrote:

I think the error may be due to a new architecture change (brought on perhaps 
by the intel compilers?).  Bad wording here, but I’m really stumbling.  As I 
add processors to the mpirun hostname call, at ~100 processors I get the 
following error, which may be informative to more seasoned eyes.  Additionally, 
I’ve attached the config.log in case something stands out; grepping on 
“catastrophic error” doesn’t turn up many results, but I don’t know whether the 
error is there or something more subtle.

--------------------------------------------------------------------------
WARNING: a request was made to bind a process. While the system
supports binding the process itself, at least one node does NOT
support binding memory to the process location.

  Node:  tebow124

This usually is due to not having the required NUMA support installed
on the node. In some Linux distributions, the required support is
contained in the libnumactl and libnumactl-devel packages.
This is a warning only; your job will continue, though performance may be 
degraded.
--------------------------------------------------------------------------
tebow
--------------------------------------------------------------------------
Open MPI tried to bind a new process, but something went wrong.  The
process was killed without launching the target application.  Your job
will now abort.

  Local host:        tebow125
  Application name:  /bin/hostname
  Error message:     hwloc_set_cpubind returned "Error" for bitmap "8,24"
  Location:          odls_default_module.c:551
--------------------------------------------------------------------------

Thanks,
--Jack




From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Galloway, Jack D
Sent: Tuesday, February 24, 2015 2:31 PM
To: Open MPI Users
Subject: Re: [OMPI users] machinefile binding error

Thank you sir, that fixed the first problem; hopefully the second is as easy!

I still get the second error when trying to farm out on a “large” number of 
processors:

machine file (“mach_burn_24s”):
tebow
tebow121 slots=24
tebow122 slots=24
tebow123 slots=24
tebow124 slots=24
tebow125 slots=24
tebow126 slots=24
tebow127 slots=24
tebow128 slots=24
tebow129 slots=24
tebow130 slots=24
tebow131 slots=24
tebow132 slots=24
tebow133 slots=24
tebow134 slots=24
tebow135 slots=24

mpirun -np 361 -machinefile mach_burn_24s hostname

--------------------------------------------------------------------------
WARNING: a request was made to bind a process. While the system
supports binding the process itself, at least one node does NOT
support binding memory to the process location.

  Node:  tebow124

This usually is due to not having the required NUMA support installed
on the node. In some Linux distributions, the required support is
contained in the libnumactl and libnumactl-devel packages.
This is a warning only; your job will continue, though performance may be 
degraded.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     NONE
   Node:        tebow125
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
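
(For reference, the “overload-allowed” qualifier the message refers to would be 
attached to a binding directive, presumably something like:

  mpirun --bind-to core:overload-allowed -np 361 -machinefile mach_burn_24s hostname

though that would only suppress the protection rather than fix the miscounted slots.)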

All the compute nodes (tebow121-135) have 24+ cores on them.

Any ideas?  Thanks!

--Jack


From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, February 24, 2015 1:57 PM
To: Open MPI Users
Subject: Re: [OMPI users] machinefile binding error

Ah, now that’s a “feature” :-)

Seriously, it *is* actually a new feature of the 1.8 series. We now go out and 
actually sense the number of cores on the system and set the number of slots to 
that value unless you tell us otherwise. It was something people continually 
nagged us about, and so we made the change.

In your case, you could just put slots=1 on the first line of your machine file 
and we should respect it.
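
That is, the machinefile would start:

  tebow slots=1
  tebow121 slots=24
  ...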


On Feb 24, 2015, at 12:49 PM, Galloway, Jack D <ja...@lanl.gov> wrote:

I recently upgraded my CentOS kernel and am running 2.6.32-504.8.1.el6.x86_64; 
as part of this upgrade I also decided to upgrade my intel/openmpi codes.

I upgraded from intel version 13.1.2 with openmpi 1.6.5 to intel version 15.0.2 
with openmpi 1.8.4.

Previously a command of “mpirun -np NP -machinefile MACH executable” would 
return expected results, particularly in how the machinefile was mapped to mpi 
tasks.  However, now when I try to run a code (which worked in the 13.1.2/1.6.5 
paradigm) things behave anomalously.

For instance, if I have machine file (“mach_burn_24s”) that consists of:
tebow
tebow121 slots=24
tebow122 slots=24
tebow123 slots=24
tebow124 slots=24
tebow125 slots=24
tebow126 slots=24
tebow127 slots=24
tebow128 slots=24
tebow129 slots=24
tebow130 slots=24
tebow131 slots=24
tebow132 slots=24
tebow133 slots=24
tebow134 slots=24
tebow135 slots=24

Before, the allocation would follow as expected: e.g., -np 25 with the machinefile 
above would give 1 task on tebow and 24 on tebow121, and if I assigned 361 the 
entire machinefile would be filled up.

However now it’s not the case.  If I type “mpirun -np 24 -machinefile 
burn_machs/mach_burn_24s hostname”, I get the following result:
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow121
tebow
tebow121
tebow121
tebow121
tebow121
tebow121
tebow121
tebow121

Now, there are 16 cores on “tebow”, but I only requested one task there in the 
machinefile (or so I assumed).  Furthermore, if I request 361 I get the 
following catastrophic error:

--------------------------------------------------------------------------
WARNING: a request was made to bind a process. While the system
supports binding the process itself, at least one node does NOT
support binding memory to the process location.

  Node:  tebow124

This usually is due to not having the required NUMA support installed
on the node. In some Linux distributions, the required support is
contained in the libnumactl and libnumactl-devel packages.
This is a warning only; your job will continue, though performance may be 
degraded.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     NONE
   Node:        tebow125
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------

All the compute nodes (tebow121-135) have 24+ cores on them.  I believe some 
configuration change has occurred which has duped the system into trying to go 
off the reported number of cores, but even then it seems to be getting things 
wrong (i.e. not pulling the right number of cores).

The configure line used previously (which worked without issue according to the 
machinefile specification) was:
  $ ./configure --prefix=/opt/openmpi/openmpi-1.6.5 --with-openib --with-openib-libdir=/usr/lib64 CC=icc F77=ifort FC=ifort CXX=icpc

The configure line I now use is:
  $ ./configure --prefix=/opt/openmpi/openmpi-1.8.4 --with-verbs --with-verbs-libdir=/usr/lib64 CC=icc F77=ifort FC=ifort CXX=icpc

I’m at a loss as to where to look for a solution; any help is appreciated.

--Jack


<config.log.bz2>
