Re: [OMPI users] [Open MPI Announce] Online presentation: the ABCs of Open MPI

2020-07-06 Thread Jeff Squyres (jsquyres) via users
Gentle reminder that part 2 of "The ABCs of Open MPI" will be this Wednesday, 8 
July, 2020 at:

- 8am US Pacific time
- 11am US Eastern time
- 3pm UTC
- 5pm CEST

Ralph and I will be continuing our discussion and explanations of the Open MPI 
ecosystem.  The Webex link to join is on the event wiki page:


https://github.com/easybuilders/easybuild/wiki/EasyBuild-Tech-Talks-I:-Open-MPI

The wiki page also has links to the slides and video from the first session.
We've also linked the slides and video on the main Open MPI web 
site.

Additionally, Ralph and I decided that we have so much material that we're 
actually extending to have a *third* session on Wednesday August 5th, 2020 (in 
the same time slot).

Please share this information with others who may be interested in attending 
the 2nd and/or 3rd sessions.



On Jun 22, 2020, at 12:10 PM, Jeff Squyres <jsquy...@cisco.com> wrote:

After assembling the content for this online presentation (based on questions 
and comments from the user community), we have so much material to cover that 
we're going to split it into two sessions.

The first part will be **this Wednesday (24 June 2020)** at:

- 8am US Pacific time
- 11am US Eastern time
- 3pm UTC
- 5pm CEST

The second part will be two weeks later, on Wednesday, 8 July, 2020, in the 
same time slot.

   
https://github.com/easybuilders/easybuild/wiki/EasyBuild-Tech-Talks-I:-Open-MPI

Anyone is free to join either / both parts.

Hope to see you this Wednesday!




On Jun 14, 2020, at 2:05 PM, Jeff Squyres (jsquyres) via announce <annou...@lists.open-mpi.org> wrote:

In conjunction with the EasyBuild community, Ralph Castain (Intel, Open MPI, 
PMIx) and Jeff Squyres (Cisco, Open MPI) will host an online presentation about 
Open MPI on **Wednesday June 24th 2020** at:

- 11am US Eastern time
- 8am US Pacific time
- 3pm UTC
- 5pm CEST

The general scope of the presentation will be to demystify the alphabet soup of 
the Open MPI ecosystem: the user-facing frameworks and components, the 3rd 
party dependencies, etc.  More information, including topics to be covered and 
WebEx connection details, is available at:

https://github.com/easybuilders/easybuild/wiki/EasyBuild-Tech-Talks-I:-Open-MPI

The presentation is open for anyone to join.  There is no need to register up 
front, just show up!

The session will be recorded and will be available after the fact.

Please share this information with others who may be interested in attending.


--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] Signal code: Non-existant physical address (2)

2020-07-06 Thread Jeff Squyres (jsquyres) via users
Greetings Prentice.

This is a very generic error; it basically just indicates "somewhere in the 
program, we got a bad pointer address."

It's very difficult to know whether this issue is in Open MPI or in the application 
itself (e.g., memory corruption by the application eventually leads to bad data 
being used as a pointer, and then... kaboom).

You *may* be able to upgrade at least to the latest release of the 1.10 series: 
1.10.7.  It should be ABI-compatible with 1.10.3; if the user's application is 
dynamically linked against 1.10.3, you might just be able to change 
LD_LIBRARY_PATH to point to a 1.10.7 installation.  That way, if the bus 
error was caused by Open MPI itself, upgrading to v1.10.7 may fix it.
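
Something along these lines might work; the install prefix and application name 
below are just placeholders for illustration:

    # Install Open MPI 1.10.7 into its own prefix (example path only)
    ./configure --prefix=/opt/openmpi-1.10.7
    make -j8 install

    # Confirm the user's application is dynamically linked against libmpi
    ldd ./users_app | grep libmpi

    # Point the dynamic linker at 1.10.7 instead of 1.10.3 and re-run
    export LD_LIBRARY_PATH=/opt/openmpi-1.10.7/lib:$LD_LIBRARY_PATH
    mpirun -np 16 ./users_app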

Other than that, based on the situation you're describing, if the problem 
consistently happens only on nodes of a specific type in your cluster, it could also 
be that the application was compiled on a machine with a newer architecture 
than the "problem" nodes in your cluster.  In that case, the compiler/assembler may 
have emitted instructions in the Open MPI library and/or executable that 
simply do not exist on the "problem" nodes.  When the older/"problem" nodes 
try to execute those instructions... kaboom.
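
One rough way to check that hypothesis on Linux nodes is to diff the CPU feature 
flags between a node where the job runs cleanly and one of the "problem" nodes 
(the node names below are just examples):

    # Collect the CPU feature flags from a known-good node and a "problem" node
    ssh good-node    'grep -m1 ^flags /proc/cpuinfo' | tr ' ' '\n' | sort > good.flags
    ssh problem-node 'grep -m1 ^flags /proc/cpuinfo' | tr ' ' '\n' | sort > problem.flags

    # Flags present only on the good node are instructions the "problem" node lacks
    diff problem.flags good.flags | grep '^>'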

This is admittedly unlikely; I would expect to see a different kind of error 
message in these kinds of situations.  But given the nature of your 
heterogeneous cluster, such things are definitely possible (e.g., an invalid 
instruction causes the MPI processes on the "problem" nodes to fail and abort; 
before Open MPI can kill the surviving processes, other MPI processes end up in 
error states because of that unexpected failure, and at least one of them hits 
a bus error).

The rule of thumb for jobs that span heterogeneous nodes in a cluster is to 
compile/link everything on the oldest node, so that the compiler/linker doesn't 
emit instructions that won't work on the older machines.  You can also compile on 
newer nodes and use specific compiler/linker flags to restrict the generated 
instructions, but it can be difficult to track down the precise flags that you 
need.
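
For example, with GCC-style flags (the exact -march target depends on what the 
oldest CPUs in your cluster actually are):

    # Build anywhere, but target the oldest CPU architecture in the cluster;
    # "nehalem" is just an example -- substitute whatever matches your oldest nodes
    mpicc -O2 -march=nehalem -o users_app users_app.c

    # Or restrict code generation to baseline x86-64 instructions only
    mpicc -O2 -march=x86-64 -o users_app users_app.c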



> On Jul 2, 2020, at 10:22 AM, Prentice Bisbal via users 
>  wrote:
> 
> I manage a very heterogeneous cluster. I have nodes of different ages with 
> different processors, different amounts of RAM, etc. One user is reporting 
> that on certain nodes, his jobs keep crashing with the errors below. His 
> application is using OpenMPI 1.10.3, which I know is an ancient version of 
> OpenMPI, but someone else in his research group built the code with that, so 
> that's what he's stuck with.
> 
> I did a Google search of "Signal code: Non-existant physical address", and it 
> appears that this may be a bug in 1.10.3 that happens on certain hardware. 
> Can anyone else confirm this? The full error message is below:
> 
> [dawson120:29064] *** Process received signal ***
> [dawson120:29062] *** Process received signal ***
> [dawson120:29062] Signal: Bus error (7)
> [dawson120:29062] Signal code: Non-existant physical address (2)
> [dawson120:29062] Failing at address: 0x7ff3f030f180
> [dawson120:29067] *** Process received signal ***
> [dawson120:29067] Signal: Bus error (7)
> [dawson120:29067] Signal code: Non-existant physical address (2)
> [dawson120:29067] Failing at address: 0x7fb2b8a61d18
> [dawson120:29077] *** Process received signal ***
> [dawson120:29078] *** Process received signal ***
> [dawson120:29078] Signal: Bus error (7)
> [dawson120:29078] Signal code: Non-existant physical address (2)
> [dawson120:29078] Failing at address: 0x7f60a13d2c98
> [dawson120:29078] [ 0] /lib64/libpthread.so.0(+0xf7e0)[0x7f60b7efd7e0]
> [dawson120:29078] [ 1] 
> /usr/pppl/intel/2015-pkgs/openmpi-1.10.3/lib/openmpi/mca_allocator_bucket.so(mca_allocator_bucket_alloc_align+0x84)[0x7f60b20f6ea4]
> 
> I've asked the user to switch to a newer version of OpenMPI, but since his 
> research group is all using the same application and someone else built it, 
> he's not in a position to do that. For now, he's excluding the "bad" nodes 
> with Slurm -x option.
> 
> I just want to know if this is in fact a bug in 1.10.3, or if there's 
> something we can do to fix this error.
> 
> Thanks,
> 
> -- 
> Prentice
> 


-- 
Jeff Squyres
jsquy...@cisco.com