[OMPI users] Another OpenMPI 5.0.1. Installation Failure

2024-04-21 Thread Kook Jin Noh via users
Hi,

I'm installing OpenMPI on Arch Linux (kernel 6.7.0). Everything goes well until:


Testsuite summary for Open MPI 5.0.1

# TOTAL: 13
# PASS:  13
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0

make[4]: Leaving directory 
'/home/vorlket/build/openmpi-ucx/src/openmpi-5.0.1/test/datatype'
make[3]: Leaving directory 
'/home/vorlket/build/openmpi-ucx/src/openmpi-5.0.1/test/datatype'
make[2]: Leaving directory 
'/home/vorlket/build/openmpi-ucx/src/openmpi-5.0.1/test/datatype'
Making check in util
make[2]: Entering directory 
'/home/vorlket/build/openmpi-ucx/src/openmpi-5.0.1/test/util'
make  opal_bit_ops opal_path_nfs bipartite_graph opal_sha256
make[3]: Entering directory 
'/home/vorlket/build/openmpi-ucx/src/openmpi-5.0.1/test/util'
  CCLD opal_bit_ops
  CCLD opal_path_nfs
  CCLD bipartite_graph
  CCLD opal_sha256
make[3]: Leaving directory 
'/home/vorlket/build/openmpi-ucx/src/openmpi-5.0.1/test/util'
make  check-TESTS
make[3]: Entering directory 
'/home/vorlket/build/openmpi-ucx/src/openmpi-5.0.1/test/util'
make[4]: Entering directory 
'/home/vorlket/build/openmpi-ucx/src/openmpi-5.0.1/test/util'
PASS: opal_bit_ops
FAIL: opal_path_nfs
PASS: bipartite_graph
PASS: opal_sha256

Testsuite summary for Open MPI 5.0.1

# TOTAL: 4
# PASS:  3
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0

See test/util/test-suite.log
Please report to https://www.open-mpi.org/community/help/

make[4]: *** [Makefile:1838: test-suite.log] Error 1
make[4]: Leaving directory 
'/home/vorlket/build/openmpi-ucx/src/openmpi-5.0.1/test/util'
make[3]: *** [Makefile:1946: check-TESTS] Error 2
make[3]: Leaving directory 
'/home/vorlket/build/openmpi-ucx/src/openmpi-5.0.1/test/util'
make[2]: *** [Makefile:2040: check-am] Error 2
make[2]: Leaving directory 
'/home/vorlket/build/openmpi-ucx/src/openmpi-5.0.1/test/util'
make[1]: *** [Makefile:1416: check-recursive] Error 1
make[1]: Leaving directory 
'/home/vorlket/build/openmpi-ucx/src/openmpi-5.0.1/test'
make: *** [Makefile:1533: check-recursive] Error 1
make: Leaving directory '/home/vorlket/build/openmpi-ucx/src/openmpi-5.0.1'
==> ERROR: A failure occurred in check().
Aborting...
[vorlket@miniserver openmpi-ucx]$

Could you please help me understand what's going on? Thanks.


[OMPI users] PRTE mismatch

2024-04-21 Thread Kook Jin Noh via users
Hi, while trying to run an MPI program, I received the following:

[vorlket@server ~]$ mpirun -host server:4,midiserver:4,miniserver:1 -np 9 
/home/vorlket/sharedfolder/mpi-prime
--
PRTE detected a mismatch in versions between two processes.  This
typically means that you executed "mpirun" (or "mpiexec") from one
version of PRTE on one node, but your default path on one of the
other nodes upon which you launched found a different version of Open
MPI.

PRTE only supports running exactly the same version between all
processes in a single job.

This will almost certainly cause unpredictable behavior, and may end
up aborting your job.

  Local host: server
  Local process name: [prterun-server-1089@0,0]
  Local PRTE version: 3.0.3
  Peer host:  Unknown
  Peer process name:  [prterun-server-1089@0,1]
  Peer PRTE version:  3.0.5
--
--
PRTE has lost communication with a remote daemon.

  HNP daemon   : [prterun-server-1089@0,0] on node server
  Remote daemon: [prterun-server-1089@0,1] on node midiserver

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.

Could you please tell me what I can do to either get the same PRTE version 
across the servers, or otherwise make the program run?

Thanks.


Re: [OMPI users] Another OpenMPI 5.0.1. Installation Failure

2024-04-21 Thread Kook Jin Noh via users
I ran makepkg -s again and it installed. The problem is solved, but it would be 
nice if someone could explain the situation. Thanks.



Re: [OMPI users] "MCW rank 0 is not bound (or bound to all available processors)" when running multiple jobs concurrently

2024-04-21 Thread Mehmet Oren via users
Hi Greg,

Sorry for the late response; I have strict working hours and a tight project schedule.
I do not have much to offer on this problem, but based on your description there are 
a few points worth checking.
You stated that you don't have problems with Intel MPI. As far as I know, with the 
--bind-to-none option Intel MPI allocates task affinity internally, so this could be 
a failure point for Open MPI. Do you have any task affinity or cgroup settings 
anywhere in the CI pipeline? Such settings could prevent Open MPI from setting 
task affinity for the tasks. A quick way to check from inside the container is 
sketched below.
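
A rough way to see whether the CI is already constraining affinity (a sketch only, 
assuming a standard procfs/util-linux setup inside the container; the commands are 
generic, not specific to your pipeline):

  # Logical CPUs visible to the current shell
  nproc
  # CPU affinity list of the current shell (taskset is from util-linux)
  taskset -cp $$
  grep Cpus_allowed_list /proc/self/status
  # Which cgroup the process sits in (cpuset limits often come from here)
  cat /proc/self/cgroup

If these already show a restricted CPU set before mpiexec runs, Open MPI's binding 
logic may be fighting an externally imposed affinity.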

Best regards,
Mehmet


Mehmet OREN


From: Greg Samonds 
Sent: Thursday, April 18, 2024 2:55 AM
To: Mehmet Oren; Open MPI Users; Gilles Gouaillardet 
Cc: Adnane Khattabi; Philippe Rouchon 
Subject: RE: [OMPI users] "MCW rank 0 is not bound (or bound to all available 
processors)" when running multiple jobs concurrently


Hi Mehmet, Gilles,



Thanks for your support on this topic.



  *   I gave "--mca pml ^ucx" a try but unfortunately the jobs failed with “ 
MPI_INIT has failed because at least one MPI process is unreachable from 
another”.
  *   We use a Python-based launcher which launches an mpiexec command through 
a subprocess, and the SIGILL error occurs immediately after this - before our 
Fortran application prints out any information or begins the simulation.
  *   The “[podman-ci-rocky-8.8:09900] MCW rank 3 is not bound (or bound to all 
available processors)” messages can also occur in cases which are successful, 
and do not crash with a SIGILL (I only just realized this, so the subject of 
this email is not correct, sorry about that).
  *   We do actually apply the full HPC-X environment.  Our Python launcher has 
a mechanism that launches a shell, runs "source hpcx-init.sh" and "hpcx_load", 
and then copies that environment back so it can be passed to the mpiexec command 
(roughly the shell-level equivalent is sketched after this list).
  *   We don't encounter this issue when running in the same context using the 
Intel MPI/x86_64 version of our software, which uses the same source code 
branch and only differs in the libraries it's linked with.
  *   The full execution context is: Jenkins (Groovy-based) -> Python script -> 
Podman container -> Shell script -> Make test -> Python (launcher application) 
-> MPI Fortran application
  *   We can’t reproduce the issue when omitting the first two steps by running 
on a similar machine outside of the CI (starting from the “Podman container” 
step)
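
For reference, the shell-level equivalent of that environment-capture step looks 
roughly like this (a sketch only; our launcher actually does this from Python via 
a subprocess, and the HPC-X install path shown is a placeholder):

  # Source the HPC-X environment in a child shell and dump the resulting
  # variables so the parent process can reuse them for mpiexec.
  bash -c 'source /opt/hpcx/hpcx-init.sh && hpcx_load && env' > hpcx-env.txt

The variables captured in hpcx-env.txt are then injected into the environment of 
the mpiexec subprocess.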



It’s very bizarre that we can only reproduce this within our Jenkins CI system, 
but not while running it directly, even with the same container, hardware, OS, 
kernel, etc.  For lack of a better idea, I wonder if there could possibly 
be some strange interaction between the JVM (for Jenkins) running on the 
machine and MPI, but I don’t see how the operating system could allow something 
like this to happen.



Perhaps we can try increasing the verbosity of MPI’s output and comparing what 
we get from within the CI to what we get locally.  Would “--mca 
btl_base_verbose 100” and “--mca pml_base_verbose 100” be the best way to do 
this, or would you recommend something more specific for this situation?
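
Concretely, the kind of invocation we would try is something like the following 
(illustrative only; the real host list, rank count, and binary path come from our 
launcher, and --report-bindings is added on the assumption it may help correlate 
the binding messages with the failures):

  mpiexec --mca btl_base_verbose 100 --mca pml_base_verbose 100 \
          --report-bindings \
          -np 4 ./our_mpi_app 2>&1 | tee mpi-verbose.log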



Regards,

Greg



From: Mehmet Oren 
Sent: Wednesday, April 17, 2024 5:11 PM
To: Open MPI Users 
Cc: Greg Samonds; Adnane Khattabi; Philippe Rouchon 
Subject: Re: [OMPI users] "MCW rank 0 is not bound (or bound to all available 
processors)" when running multiple jobs concurrently



Hi Greg,



I am not an openmpi expert but I just wanted to share my experience with HPC-X.

  1.  Default HPC-X builds that ship with the MOFED drivers are built with UCX, 
and as Gilles stated, specifying ob1 will not change the layer for Open MPI. You 
can try to discard UCX and let Open MPI pick the layer itself by adding 
"--mca pml ^ucx" to your command line (a minimal invocation is sketched after 
this list).
  2.  HPC-X comes with two scripts named mpivars.sh and mpivars.csh under its bin 
folder. It may be better to source mpivars.sh before running your job instead of 
setting LD_LIBRARY_PATH by hand. Sourcing this script sets up all the required 
paths and environment variables and fixes most runtime problems.
  3.  Also, please check hwloc and its dependencies, which are usually not 
present in default OS installations and container images.
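
To make points 1 and 2 concrete, a minimal sketch (the HPC-X location and the 
application name are placeholders; use whatever bin directory your HPC-X package 
actually provides):

  # Point 2: source the environment script shipped in the HPC-X bin folder
  source /path/to/hpcx/bin/mpivars.sh    # placeholder path

  # Point 1: exclude the UCX PML and let Open MPI choose another one
  mpirun --mca pml ^ucx -np 4 ./your_app

If the job runs cleanly this way, the issue is most likely in the UCX layer or 
its environment rather than in Open MPI itself.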



Regards,

Mehmet



From: users <users-boun...@lists.open-mpi.org> on behalf of Greg Samonds via 
users <users@lists.open-mpi.org>
Sent: Tuesday, April 16, 2024 5:50 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Greg Samonds <greg.samo...@esi-group.com>; Adnane Khattabi 
<adnane.khatt...@esi-group.com>; Philippe Rouchon <philippe.rouc...@esi-group.com>
Subject: Re: [OMPI users] "MCW rank 0 is not bound (or bound to all available 
processors)" when running multiple jobs concurrently



Hi Gilles,



Thanks for your assistance.



I tried the rec

Re: [OMPI users] PRTE mismatch

2024-04-21 Thread Kook Jin Noh via users
Running pacman -R prrte and then pacman -Syu prrte solved it.
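
To confirm the versions now match everywhere, something like this should work 
(a sketch, assuming all three hosts run Arch and are reachable over ssh):

  for h in server midiserver miniserver; do
      ssh "$h" pacman -Q prrte openmpi
  done

All hosts should report the same prrte (and openmpi) versions before launching 
across them again.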



Re: [OMPI users] Another OpenMPI 5.0.1. Installation Failure

2024-04-21 Thread Kook Jin Noh via users
After updating prrte, I tried to install openmpi again and the problem arose 
again. Please help. Thanks.



Re: [OMPI users] Another OpenMPI 5.0.1. Installation Failure

2024-04-21 Thread Gilles Gouaillardet via users
Hi,

Is there any reason why you do not build the latest 5.0.2 package?

Anyway, the issue could be related to an unknown filesystem.
Do you get a meaningful error if you manually run
/.../test/util/opal_path_nfs?
If not, can you share the output of

mount | cut -f3,5 -d' '
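
For example, with the build tree shown in your log (a sketch; adjust the path if 
your makepkg build directory differs):

  cd /home/vorlket/build/openmpi-ucx/src/openmpi-5.0.1/test/util
  ./opal_path_nfs            # run the failing test by hand
  cat test-suite.log         # the log referenced by make check

  mount | cut -f3,5 -d' '    # filesystem type per mount point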

Cheers,

Gilles

On Sun, Apr 21, 2024 at 10:04 PM Kook Jin Noh via users <
users@lists.open-mpi.org> wrote:

> After updating prrte, I tried to install openmpi again and the problem arose
> again. Please help. Thanks.