Greetings Prentice.

This is a very generic error; it basically just indicates "somewhere in the 
program, we got a bad pointer address."

It's very difficult to know whether this issue is in Open MPI or in the 
application itself (e.g., memory corruption by the application eventually leads 
to bad data being used as a pointer, and then... kaboom).

You *may* be able to upgrade to the latest version of the 1.10 series: 1.10.7.  
It should be ABI compatible with 1.10.3; if the user's application is 
dynamically linked against 1.10.3, you might be able to simply change 
LD_LIBRARY_PATH to point to a 1.10.7 installation.  That way, if the bus error 
was caused by Open MPI itself, upgrading to v1.10.7 may fix it.
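Something like this, for example -- the install path here is just a 
placeholder for wherever you put the 1.10.7 build:

```shell
# Point the runtime loader at a 1.10.7 installation instead of 1.10.3,
# without recompiling the application.  /opt/openmpi-1.10.7 is a
# hypothetical path; substitute your actual install prefix.
export LD_LIBRARY_PATH=/opt/openmpi-1.10.7/lib:${LD_LIBRARY_PATH:-}

# Then confirm which libmpi the binary will actually resolve, e.g.:
#   ldd ./my_mpi_app | grep libmpi
```

If `ldd` still shows the 1.10.3 tree, the binary may have an RPATH baked 
in, in which case LD_LIBRARY_PATH alone won't be enough.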

Other than that, based on the situation you're describing: if the problem 
consistently happens only on nodes of a specific type in your cluster, it could 
also be that the application was compiled on a machine with a newer 
architecture than the "problem" nodes.  If so, the compiler/assembler may have 
emitted instructions in the Open MPI library and/or the executable that simply 
do not exist on the "problem" nodes.  When those instructions are executed (or 
attempted) on the older/problem nodes... kaboom.

This is admittedly unlikely; I would expect a different kind of error message 
in that situation.  But given the nature of your heterogeneous cluster, such 
things are definitely possible.  For example: an invalid instruction causes the 
MPI processes on the "problem" nodes to fail and abort; before Open MPI can 
kill all surviving processes, other MPI processes end up in error states 
because of that unexpected failure, and at least one of them results in a bus 
error.
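One way to check this theory is to compare the CPU feature flags between a 
known-good node and a "problem" node.  A rough sketch (the hostnames are 
placeholders, of course):

```shell
# Collect the CPU feature flags from one node of each type.
# "good-node" and "problem-node" are hypothetical hostnames.
ssh good-node    'grep -m1 ^flags /proc/cpuinfo | tr " " "\n" | sort' > good.flags
ssh problem-node 'grep -m1 ^flags /proc/cpuinfo | tr " " "\n" | sort' > problem.flags

# Flags present on the good node but missing on the problem node are
# candidate instructions (e.g. avx2, fma) that would fault there:
comm -23 good.flags problem.flags
```

If that list is non-empty and the application was built on a newer node, 
the instruction-mismatch theory becomes much more plausible.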

The rule of thumb for jobs that span heterogeneous nodes in a cluster is to 
compile/link everything on the oldest node, so that the compiler/linker don't 
emit instructions that won't work on the older machines.  You can also compile 
on newer nodes and use specific compiler/linker flags to restrict the generated 
instructions, but it can be difficult to track down the precise flags that you 
need.



> On Jul 2, 2020, at 10:22 AM, Prentice Bisbal via users 
> <users@lists.open-mpi.org> wrote:
> 
> I manage a very heterogeneous cluster. I have nodes of different ages with 
> different processors, different amounts of RAM, etc. One user is reporting 
> that on certain nodes, his jobs keep crashing with the errors below. His 
> application is using OpenMPI 1.10.3, which I know is an ancient version of 
> OpenMPI, but someone else in his research group built the code with that, so 
> that's what he's stuck with.
> 
> I did a Google search of "Signal code: Non-existant physical address", and it 
> appears that this may be a bug in 1.10.3 that happens on certain hardware. 
> Can anyone else confirm this? The full error message is below:
> 
> [dawson120:29064] *** Process received signal ***
> [dawson120:29062] *** Process received signal ***
> [dawson120:29062] Signal: Bus error (7)
> [dawson120:29062] Signal code: Non-existant physical address (2)
> [dawson120:29062] Failing at address: 0x7ff3f030f180
> [dawson120:29067] *** Process received signal ***
> [dawson120:29067] Signal: Bus error (7)
> [dawson120:29067] Signal code: Non-existant physical address (2)
> [dawson120:29067] Failing at address: 0x7fb2b8a61d18
> [dawson120:29077] *** Process received signal ***
> [dawson120:29078] *** Process received signal ***
> [dawson120:29078] Signal: Bus error (7)
> [dawson120:29078] Signal code: Non-existant physical address (2)
> [dawson120:29078] Failing at address: 0x7f60a13d2c98
> [dawson120:29078] [ 0] /lib64/libpthread.so.0(+0xf7e0)[0x7f60b7efd7e0]
> [dawson120:29078] [ 1] 
> /usr/pppl/intel/2015-pkgs/openmpi-1.10.3/lib/openmpi/mca_allocator_bucket.so(mca_allocator_bucket_alloc_align+0x84)[0x7f60b20f6ea4]
> 
> I've asked the user to switch to a newer version of OpenMPI, but since his 
> research group is all using the same application and someone else built it, 
> he's not in a position to do that. For now, he's excluding the "bad" nodes 
> with Slurm -x option.
> 
> I just want to know if this is in fact a bug in 1.10.3, or if there's 
> something we can do to fix this error.
> 
> Thanks,
> 
> -- 
> Prentice
> 


-- 
Jeff Squyres
jsquy...@cisco.com
