I'm having an issue with OpenMPI that just started today. A couple of days
ago everything was fine: I could run mpiexec/mpirun using the --hostfile flag.
I didn't touch the system in between. I'm just messing around
learning MPI using C. These are simple programs from "Parallel Programming
with MPI".

Specs:

1 RPi 4 8GB - Ubuntu 20.04 - OpenMPI 4.0.3, designated as node00
1 RPi 4 4GB - Ubuntu 20.04 - OpenMPI 4.0.3, designated as node01

I'm using an NFS share whose storage is physically connected to node00. Any
changes made in the NFS directory are seen by both nodes. The NFS directory
is owned by the user/group used on both nodes, with permissions set to 774.

I can SSH from one to the other and back again, i.e. node00 to node01 to
node00.

I can mpiexec/mpirun on a single node without an issue. Initially I could
not run with the --hostfile flag and would get this error when executing
from either node:

ORTE_ERROR_LOG: Error in file base/plm_base_launch_support.c at line 1200
--------------------------------------------------------------------------

ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.

* the inability to write startup files into /tmp
(--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).

Magically, node01 started working with the hostfile and will now run across
both nodes using all 8 processes.

I'm still getting the error above when I launch from node00 using the
--hostfile flag.

Both node00 and node01 use the same username and group, and their respective
uid/gid are identical. The LD_LIBRARY_PATH and PATH variables are set and
identical. Again, I can run 4 processes on a single node. It is only node00
that will not allow me to start a job using both nodes.

Also, as previously stated, this was all working a couple of days ago.

The RPis are in the same room but on separate switches. (I intended to use
node00 for something and wanted it on my desk rather than on the shelf next
to node01.)

I've rebooted several times to no avail.

Any thoughts?
