Jeff,

Thanks for the detailed explanation. After some googling, it seems like this might be a bug in 1.10.3 that only reveals itself on certain hardware. Since my user isn't interested in using a newer OpenMPI (but he will be forced to soon enough when we upgrade our cluster!), he has been using Slurm's exclude feature to avoid those problem nodes.

The good news is that in the fall we will have a new, homogeneous cluster with all new hardware.

Prentice

On 7/6/20 7:47 AM, Jeff Squyres (jsquyres) wrote:
Greetings Prentice.

This is a very generic error; it basically just means "somewhere in the 
program, we got a bad pointer address."

It's very difficult to know whether this issue is in Open MPI or in the application 
itself (e.g., memory corruption by the application eventually leads to bad data 
being used as a pointer, and then... kaboom).

You *may* be able to upgrade to at least the latest version of the 1.10 series: 
1.10.7.  It should be ABI compatible with 1.10.3; if the user's application is 
dynamically linked against 1.10.3, you might just be able to change 
LD_LIBRARY_PATH and point to a 1.10.7 installation.  In this way, if the bus 
error was caused by Open MPI itself, upgrading to v1.10.7 may fix it.
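A minimal sketch of that LD_LIBRARY_PATH swap, assuming a 1.10.7 install prefix alongside the existing 1.10.3 one (the 1.10.7 path below is hypothetical, modeled on the 1.10.3 path in the error output):

```shell
# Hypothetical install prefix for Open MPI 1.10.7 -- adjust to your site's layout.
OMPI_1107_PREFIX=/usr/pppl/intel/2015-pkgs/openmpi-1.10.7

# Prepend its lib dir so the dynamic linker resolves libmpi from 1.10.7
# instead of the 1.10.3 install the application was linked against.
export LD_LIBRARY_PATH="${OMPI_1107_PREFIX}/lib:${LD_LIBRARY_PATH}"

# Sanity check: confirm which libmpi the binary would now pick up
# (my_mpi_app is a placeholder for the user's executable):
#   ldd ./my_mpi_app | grep libmpi
```

This only works for dynamically linked executables, and only because the 1.10.x series promises ABI compatibility within the series; it would not be safe across major versions.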

Other than that, based on the situation you're describing: if the problem consistently happens 
only on nodes of a specific type in your cluster, it could also be that the application was 
compiled on a machine with a newer architecture than the "problem" nodes.  In 
that case, the compiler/assembler may have emitted instructions into the Open MPI library and/or 
executable that simply do not exist on the "problem" nodes.  When execution reaches 
those instructions on the older/problem nodes... kaboom.

This is admittedly unlikely; I would expect to see a different kind of error message in these 
situations.  But given the nature of your heterogeneous cluster, such things are definitely 
possible.  For example: an invalid instruction causes the MPI processes on the 
"problem" nodes to abort; before Open MPI can kill all surviving 
processes, other MPI processes end up in error states because of that unexpected failure, 
and at least one of them results in a bus error.

The rule of thumb for jobs that span heterogeneous nodes in a cluster is to 
compile/link everything on the oldest node to make sure that the 
compiler/linker don't put in instructions that won't work on old machines.  You 
can compile on newer nodes and use specific compiler/linker flags to restrict 
generated instructions, too, but it can be difficult to track down the precise 
flags that you need.



On Jul 2, 2020, at 10:22 AM, Prentice Bisbal via users 
<users@lists.open-mpi.org> wrote:

I manage a very heterogeneous cluster. I have nodes of different ages with 
different processors, different amounts of RAM, etc. One user is reporting that 
on certain nodes, his jobs keep crashing with the errors below. His application 
is using OpenMPI 1.10.3, which I know is an ancient version of OpenMPI, but 
someone else in his research group built the code with that, so that's what 
he's stuck with.

I did a Google search of "Signal code: Non-existant physical address", and it 
appears that this may be a bug in 1.10.3 that happens on certain hardware. Can anyone 
else confirm this? The full error message is below:

[dawson120:29064] *** Process received signal ***
[dawson120:29062] *** Process received signal ***
[dawson120:29062] Signal: Bus error (7)
[dawson120:29062] Signal code: Non-existant physical address (2)
[dawson120:29062] Failing at address: 0x7ff3f030f180
[dawson120:29067] *** Process received signal ***
[dawson120:29067] Signal: Bus error (7)
[dawson120:29067] Signal code: Non-existant physical address (2)
[dawson120:29067] Failing at address: 0x7fb2b8a61d18
[dawson120:29077] *** Process received signal ***
[dawson120:29078] *** Process received signal ***
[dawson120:29078] Signal: Bus error (7)
[dawson120:29078] Signal code: Non-existant physical address (2)
[dawson120:29078] Failing at address: 0x7f60a13d2c98
[dawson120:29078] [ 0] /lib64/libpthread.so.0(+0xf7e0)[0x7f60b7efd7e0]
[dawson120:29078] [ 1] 
/usr/pppl/intel/2015-pkgs/openmpi-1.10.3/lib/openmpi/mca_allocator_bucket.so(mca_allocator_bucket_alloc_align+0x84)[0x7f60b20f6ea4]

I've asked the user to switch to a newer version of OpenMPI, but since his research group 
all uses the same application and someone else built it, he's not in a position to do 
that. For now, he's excluding the "bad" nodes with Slurm's -x option.

I just want to know if this is in fact a bug in 1.10.3, or if there's something 
we can do to fix this error.

Thanks,

--
Prentice


--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov
