Greetings Prentice. This is a very generic error; it basically just indicates "somewhere in the program, we got a bad pointer address."
It's very difficult to know whether this issue is in Open MPI or in the application itself (e.g., memory corruption by the application eventually leads to bad data being used as a pointer, and then... kaboom).

You *may* be able to upgrade to at least the latest version of the 1.10 series: 1.10.7. It should be ABI compatible with 1.10.3, so if the user's application is dynamically linked against 1.10.3, you might just be able to change LD_LIBRARY_PATH to point to a 1.10.7 installation. That way, if the bus error was caused by Open MPI itself, upgrading to v1.10.7 may fix it.

Other than that, based on the situation you're describing: if the problem consistently happens only on nodes of a specific type in your cluster, it could also be that the application was compiled on a machine with a newer architecture than the "problem" nodes. In that case, the compiler/assembler may have emitted instructions in the Open MPI library and/or the executable that simply do not exist on the "problem" nodes. When those instructions are executed on the older nodes... kaboom.

This is admittedly unlikely; I would expect a different kind of error message in these situations. But given the nature of your heterogeneous cluster, such things are definitely possible (e.g., an invalid instruction causes a failure in the MPI processes on the "problem" nodes, causing them to abort; before Open MPI can kill all surviving processes, other MPI processes end up in error states because of the unexpected failure from the "problem" node processes, and at least one of them results in a bus error).

The rule of thumb for jobs that span heterogeneous nodes in a cluster is to compile/link everything on the oldest node, to make sure that the compiler/linker don't emit instructions that won't work on the old machines.
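If you want to try the LD_LIBRARY_PATH swap, it could look something like the sketch below. The 1.10.7 install prefix is just a placeholder; substitute wherever the 1.10.7 build actually lives on your systems.

```shell
# Sketch: redirect a dynamically linked application to a 1.10.7 install
# without relinking.  The prefix is a placeholder, not a real path.
OMPI_1107_PREFIX=/opt/openmpi-1.10.7

# First confirm the binary is dynamically linked against libmpi; a
# statically linked executable cannot be redirected this way:
#   ldd ./your_app | grep libmpi

# Prepend the 1.10.7 libraries so the dynamic loader resolves libmpi
# (and friends) there before any 1.10.3 copies:
export LD_LIBRARY_PATH="${OMPI_1107_PREFIX}/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
echo "$LD_LIBRARY_PATH"
```

Note this only works because the 1.10.x series is ABI compatible within itself; swapping in a library from a different major series this way would not be safe.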
You can compile on newer nodes and use specific compiler/linker flags to restrict the generated instructions, too, but it can be difficult to track down the precise flags that you need.

> On Jul 2, 2020, at 10:22 AM, Prentice Bisbal via users <users@lists.open-mpi.org> wrote:
>
> I manage a very heterogeneous cluster. I have nodes of different ages with different processors, different amounts of RAM, etc. One user is reporting that on certain nodes, his jobs keep crashing with the errors below. His application is using OpenMPI 1.10.3, which I know is an ancient version of OpenMPI, but someone else in his research group built the code with that, so that's what he's stuck with.
>
> I did a Google search of "Signal code: Non-existant physical address", and it appears that this may be a bug in 1.10.3 that happens on certain hardware. Can anyone else confirm this? The full error message is below:
>
> [dawson120:29064] *** Process received signal ***
> [dawson120:29062] *** Process received signal ***
> [dawson120:29062] Signal: Bus error (7)
> [dawson120:29062] Signal code: Non-existant physical address (2)
> [dawson120:29062] Failing at address: 0x7ff3f030f180
> [dawson120:29067] *** Process received signal ***
> [dawson120:29067] Signal: Bus error (7)
> [dawson120:29067] Signal code: Non-existant physical address (2)
> [dawson120:29067] Failing at address: 0x7fb2b8a61d18
> [dawson120:29077] *** Process received signal ***
> [dawson120:29078] *** Process received signal ***
> [dawson120:29078] Signal: Bus error (7)
> [dawson120:29078] Signal code: Non-existant physical address (2)
> [dawson120:29078] Failing at address: 0x7f60a13d2c98
> [dawson120:29078] [ 0] /lib64/libpthread.so.0(+0xf7e0)[0x7f60b7efd7e0]
> [dawson120:29078] [ 1] /usr/pppl/intel/2015-pkgs/openmpi-1.10.3/lib/openmpi/mca_allocator_bucket.so(mca_allocator_bucket_alloc_align+0x84)[0x7f60b20f6ea4]
>
> I've asked the user to switch to a newer version of OpenMPI, but since his research group is all using the same application and someone else built it, he's not in a position to do that. For now, he's excluding the "bad" nodes with Slurm's -x option.
>
> I just want to know if this is in fact a bug in 1.10.3, or if there's something we can do to fix this error.
>
> Thanks,
>
> --
> Prentice

--
Jeff Squyres
jsquy...@cisco.com
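P.S. For what it's worth, the flag-restriction approach could look like the sketch below with a GCC-style compiler. The "nehalem" value is only illustrative; you'd substitute the oldest CPU family actually present in your cluster.

```shell
# Illustrative only: pin the target microarchitecture instead of
# letting the compiler tune for the (newer) build host.  "nehalem" is
# a placeholder for the oldest CPU family in the cluster.
#
#   mpicc -O2 -march=nehalem -o app app.c    # GCC/Clang-style flag
#
# To see which instruction-set extensions a given node actually
# supports, inspect its CPU flags (Linux, x86):
awk '/^flags/ {print (/avx2/ ? "avx2: yes" : "avx2: no"); exit}' /proc/cpuinfo
```

Running the check on both the build node and a "problem" node and diffing the flag sets is one way to spot exactly which extensions the older hardware is missing.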