Jeff,
Thanks for the detailed explanation. After some googling, it seemed like
this might be a bug in 1.10.3 that only reveals itself on certain
hardware. Since my user isn't interested in using a newer OpenMPI (but
he will be forced to soon enough when we upgrade our cluster!), he has
been using Slurm's exclude feature to keep his jobs off those problem nodes.
The good news is that in the fall we will have a new, homogeneous
cluster with all new hardware.
Prentice
On 7/6/20 7:47 AM, Jeff Squyres (jsquyres) wrote:
Greetings Prentice.
This is a very generic error; it's basically just indicating "somewhere in the
program, we got a bad pointer address."
It's very difficult to know if this issue is in Open MPI or in the application
itself (e.g., memory corruption by the application eventually leads to bad data
being used as a pointer, and then... kaboom).
You *may* be able to upgrade to at least the latest version of the 1.10 series:
1.10.7. It should be ABI compatible with 1.10.3; if the user's application is
dynamically linked against 1.10.3, you might just be able to change
LD_LIBRARY_PATH and point to a 1.10.7 installation. In this way, if the bus
error was caused by Open MPI itself, upgrading to v1.10.7 may fix it.
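For example, something along these lines (the 1.10.7 install path and the binary
name are just placeholders for whatever you actually have on your system):

    # Check which MPI library the binary is dynamically linked against
    ldd ./his_app | grep libmpi
    # Point the run at a 1.10.7 install instead (path is illustrative)
    export LD_LIBRARY_PATH=/path/to/openmpi-1.10.7/lib:$LD_LIBRARY_PATH
    mpirun ./his_app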
Other than that, based on the situation you're describing: if the problem consistently
happens only on nodes of a specific type in your cluster, it could also be that the
application was compiled on a machine with a newer architecture than the "problem"
nodes. In that case, the compiler/assembler may have emitted instructions in the
Open MPI library and/or the executable that simply do not exist on the "problem"
nodes. When the older/problem nodes try to execute those instructions... kaboom.
This is admittedly unlikely; I would expect to see a different kind of error message in these kinds
of situations, but given the nature of your heterogeneous cluster, such things are definitely
possible (e.g., an invalid instruction causes a failure on the MPI processes on the
"problem" nodes, causing them to abort, but before Open MPI can kill all surviving
processes, other MPI processes end up in error states because of the unexpected failure from the
"problem" node processes, and at least one of them results in a bus error).
The rule of thumb for jobs that span heterogeneous nodes in a cluster is to
compile/link everything on the oldest node to make sure that the
compiler/linker don't put in instructions that won't work on old machines. You
can compile on newer nodes and use specific compiler/linker flags to restrict
generated instructions, too, but it can be difficult to track down the precise
flags that you need.
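For instance, with GCC, something like the following restricts code generation to a
generic x86-64 baseline (the flags and file names are only an example; other
compilers, such as the Intel compilers, use different options):

    # Build for a conservative baseline so the binary also runs on the oldest nodes
    mpicc -O2 -march=x86-64 -mtune=generic -o app app.c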
On Jul 2, 2020, at 10:22 AM, Prentice Bisbal via users
<users@lists.open-mpi.org> wrote:
I manage a very heterogeneous cluster. I have nodes of different ages with
different processors, different amounts of RAM, etc. One user is reporting that
on certain nodes, his jobs keep crashing with the errors below. His application
is using OpenMPI 1.10.3, which I know is an ancient version of OpenMPI, but
someone else in his research group built the code with that, so that's what
he's stuck with.
I did a Google search of "Signal code: Non-existant physical address", and it
appears that this may be a bug in 1.10.3 that happens on certain hardware. Can anyone
else confirm this? The full error message is below:
[dawson120:29064] *** Process received signal ***
[dawson120:29062] *** Process received signal ***
[dawson120:29062] Signal: Bus error (7)
[dawson120:29062] Signal code: Non-existant physical address (2)
[dawson120:29062] Failing at address: 0x7ff3f030f180
[dawson120:29067] *** Process received signal ***
[dawson120:29067] Signal: Bus error (7)
[dawson120:29067] Signal code: Non-existant physical address (2)
[dawson120:29067] Failing at address: 0x7fb2b8a61d18
[dawson120:29077] *** Process received signal ***
[dawson120:29078] *** Process received signal ***
[dawson120:29078] Signal: Bus error (7)
[dawson120:29078] Signal code: Non-existant physical address (2)
[dawson120:29078] Failing at address: 0x7f60a13d2c98
[dawson120:29078] [ 0] /lib64/libpthread.so.0(+0xf7e0)[0x7f60b7efd7e0]
[dawson120:29078] [ 1]
/usr/pppl/intel/2015-pkgs/openmpi-1.10.3/lib/openmpi/mca_allocator_bucket.so(mca_allocator_bucket_alloc_align+0x84)[0x7f60b20f6ea4]
I've asked the user to switch to a newer version of OpenMPI, but since his research group
is all using the same application and someone else built it, he's not in a position to do
that. For now, he's excluding the "bad" nodes with Slurm's -x option.
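For reference, that exclusion looks something like this (node names here are just
placeholders):

    sbatch --exclude=dawson120,dawson121 job.sh
    # -x is the short form of --exclude; srun accepts the same option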
I just want to know if this is in fact a bug in 1.10.3, or if there's something
we can do to fix this error.
Thanks,
--
Prentice
--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov