I recently upgraded my CentOS kernel and am now running 2.6.32-504.8.1.el6.x86_64. As part of this upgrade I also decided to upgrade my Intel/Open MPI installation.
I upgraded from Intel 13.1.2 with Open MPI 1.6.5 to Intel 15.0.2 with Open MPI 1.8.4. Previously a command of "mpirun -np NP -machinefile MACH executable" would return the expected results, particularly in how the machinefile was mapped to MPI tasks. Now, however, a code that worked fine under the 13.1.2/1.6.5 setup behaves anomalously.

For instance, I have a machinefile ("mach_burn_24s") that consists of:

tebow
tebow121 slots=24
tebow122 slots=24
tebow123 slots=24
tebow124 slots=24
tebow125 slots=24
tebow126 slots=24
tebow127 slots=24
tebow128 slots=24
tebow129 slots=24
tebow130 slots=24
tebow131 slots=24
tebow132 slots=24
tebow133 slots=24
tebow134 slots=24
tebow135 slots=24

Before, the allocation followed the machinefile as expected: "-np 25 -machinefile mach_burn_24s" would put 1 task on tebow and 24 on tebow121, and requesting 361 tasks would fill the entire machinefile. That is no longer the case. If I type "mpirun -np 24 -machinefile burn_machs/mach_burn_24s hostname", I get the following result:

tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow121
tebow
tebow121
tebow121
tebow121
tebow121
tebow121
tebow121
tebow121

Now, there are 16 cores on "tebow", but I only requested one task on it in the machinefile (or so I assumed). Furthermore, if I request 361 tasks I get the following catastrophic error:

--------------------------------------------------------------------------
WARNING: a request was made to bind a process. While the system
supports binding the process itself, at least one node does NOT
support binding memory to the process location.

  Node:  tebow124

This usually is due to not having the required NUMA support installed
on the node. In some Linux distributions, the required support is
contained in the libnumactl and libnumactl-devel packages.

This is a warning only; your job will continue, though performance may
be degraded.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     NONE
   Node:        tebow125
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------

All the compute nodes (tebow121-135) have 24+ cores on them. I believe some configuration change has occurred that has duped the system into going off the number of cores it detects on each node, but even then it seems to be getting things wrong (i.e. not pulling the right number of cores).

The configure line I used previously (which worked without issue according to the machinefile specification) was:

./configure --prefix=/opt/openmpi/openmpi-1.6.5 --with-openib --with-openib-libdir=/usr/lib64 CC=icc F77=ifort FC=ifort CXX=icpc

The configure line I now use is:

./configure --prefix=/opt/openmpi/openmpi-1.8.4 --with-verbs --with-verbs-libdir=/usr/lib64 CC=icc F77=ifort FC=ifort CXX=icpc

I'm at a loss as to where to look for the solution; any help is appreciated.

--Jack
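P.S. In case it helps with diagnosis, the command below is simply how I plan to double-check what 1.8.4 is actually doing with the machinefile before the job launches; --display-map and --report-bindings are standard mpirun options, and the machinefile path and -np count are the ones from my setup above:

mpirun -np 25 -machinefile burn_machs/mach_burn_24s --display-map --report-bindings hostname

I can post that output as well if it would be useful.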