Re: [OMPI users] OpenMPI-5.0.6: -x LD_LIBRARY_PATH not able to load shared objects

2025-02-14 Thread Patrick Begou

On 14/02/2025 at 13:22, Sangam B wrote:

Hi,

OpenMPI-5.0.6 is compiled with ucx-1.18 and the Intel oneAPI 2024 v2.1 
compilers. An MPI program is compiled with this openmpi-5.0.6.

When submitting a job through PBS on a Linux cluster, the Intel compiler 
environment is sourced and passed through Open MPI's mpirun command 
option: " -x LD_LIBRARY_PATH= ". But 
the job still fails with the following error:


prted: error while loading shared libraries: libimf.so: cannot open 
shared object file: No such file or directory


PRTE has lost communication with a remote daemon.

  HNP daemon   : [prterun-cn19-2146925@0,0] on node cn19
  Remote daemon: [prterun-cn19-2146925@0,2] on node cn21

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.

However, if I put "source vars.sh" in 
~/.bashrc, then the job works fine. But this is not the right way to do it.

My question is: after passing -x LD_LIBRARY_PATH to the mpirun 
command, why is it not able to find "libimf.so" on all the 
nodes? Is this a bug in OpenMPI-5.0.6?


Thanks
To unsubscribe from this group and stop receiving emails from it, send 
an email to users+unsubscr...@lists.open-mpi.org.



Hi Sangam,

The "-x" option propagates your LD_LIBRARY_PATH as it is set on the 
execution node. So maybe you only need to pass "-x LD_LIBRARY_PATH" 
after sourcing your vars.sh in your PBS script?


Patrick (not using PBS but Slurm, sorry)




[OMPI users] OpenMPI-5.0.6: -x LD_LIBRARY_PATH not able to load shared objects

2025-02-14 Thread Sangam B
Hi,

OpenMPI-5.0.6 is compiled with ucx-1.18 and the Intel oneAPI 2024 v2.1 compilers. 
An MPI program is compiled with this openmpi-5.0.6. 

When submitting a job through PBS on a Linux cluster, the Intel compiler 
environment is sourced and passed through Open MPI's mpirun command option: " -x 
LD_LIBRARY_PATH= ". But the job still fails 
with the following error: 

prted: error while loading shared libraries: libimf.so: cannot open shared 
object file: No such file or directory

PRTE has lost communication with a remote daemon.

  HNP daemon   : [prterun-cn19-2146925@0,0] on node cn19
  Remote daemon: [prterun-cn19-2146925@0,2] on node cn21

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.

However, if I put "source vars.sh" in ~/.bashrc, 
then the job works fine. But this is not the right way to do it.

My question is: after passing -x LD_LIBRARY_PATH to the mpirun 
command, why is it not able to find "libimf.so" on all the nodes? Is 
this a bug in OpenMPI-5.0.6?

Thanks



Re: [OMPI users] OpenMPI-5.0.6: -x LD_LIBRARY_PATH not able to load shared objects

2025-02-14 Thread Gilles Gouaillardet
Sangam,

-x LD_LIBRARY_PATH won't do the trick here.

mpirun spawns prted daemons on the other nodes (via the tm interface or
whatever the latest PBS uses if support was built into Open MPI, or SSH
otherwise), and the daemons fail to start because the Intel runtime cannot
be found.
You can run chrpath -l prted to check whether prted was built with an rpath.
If so, make sure the runtime is available at the same location.

Another option is to rebuild prrte with gcc compilers so it does not
depend on the Intel runtime.
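
A minimal sketch of that kind of check, assuming ldd and awk are available on the nodes. The helper name missing_libs is hypothetical and not part of Open MPI; it only filters ldd output for libraries the loader could not resolve. The node name cn21 in the usage comment comes from the error output earlier in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not an Open MPI tool): read `ldd` output on stdin
# and print the name of every shared library reported as "not found".
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Usage sketch (assumes ssh access to the compute node and prted on PATH):
#   ssh cn21 'ldd "$(command -v prted)"' | missing_libs
# If libimf.so is listed, the daemon cannot load the Intel runtime in that
# non-interactive environment, regardless of what -x passes to the ranks.
```

Running the check over a non-interactive ssh matters because that is closer to the environment the daemon actually starts in than an interactive login shell.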

Cheers,

Gilles

On Sat, Feb 15, 2025 at 3:00 AM Sangam B  wrote:

> Hi Patrick,
>
> Thanks for your reply.
> Of course, the Intel vars.sh is sourced inside the PBS script, and I've
> tried multiple ways to resolve this issue:
>
> -x LD_LIBRARY_PATH
> and
> -x LD_LIBRARY_PATH=/opt/intel/oneapi/2024/v2.1/compiler/2024.2/opt/compiler/lib:/opt/intel/oneapi/2024/v2.1/compiler/2024.2/lib:${LD_LIBRARY_PATH}
>
> And then I copied libimf.so to the job's working directory and set
> -x LD_LIBRARY_PATH=.:/opt/intel/oneapi/2024/v2.1/compiler/2024.2/opt/compiler/lib:/opt/intel/oneapi/2024/v2.1/compiler/2024.2/lib:${LD_LIBRARY_PATH}
>
> But none of these worked.


Re: [OMPI users] OpenMPI-5.0.6: -x LD_LIBRARY_PATH not able to load shared objects

2025-02-14 Thread Patrick Begou

Hi Sangam

Could you check that the install location of the library is the same on 
all the nodes? Maybe check LD_LIBRARY_PATH after sourcing the 
Intel vars.sh file?

I'm using OpenMPI 5.0.6 but in a Slurm context and it works fine.

Patrick

On 14/02/2025 at 19:00, Sangam B wrote:

Hi Patrick,

Thanks for your reply.
Of course, the Intel vars.sh is sourced inside the PBS script, and I've 
tried multiple ways to resolve this issue:

-x LD_LIBRARY_PATH
and
-x LD_LIBRARY_PATH=/opt/intel/oneapi/2024/v2.1/compiler/2024.2/opt/compiler/lib:/opt/intel/oneapi/2024/v2.1/compiler/2024.2/lib:${LD_LIBRARY_PATH}

And then I copied libimf.so to the job's working directory and set
-x LD_LIBRARY_PATH=.:/opt/intel/oneapi/2024/v2.1/compiler/2024.2/opt/compiler/lib:/opt/intel/oneapi/2024/v2.1/compiler/2024.2/lib:${LD_LIBRARY_PATH}

But none of these worked.



[OMPI users] Disable PMPI bindings?

2025-02-14 Thread Joshua Strodtbeck

Hi everyone,

I am trying to use Open MPI built with IBM's Open XLF 17.x, and I get
compile-time errors in the application (WRF) due to an apparent mismatch
between the PMPI argument list and what was actually compiled into the
module, e.g.:

"mpif-sizeof.h", line 2463.6: 1514-699 (S) Procedure
"pmpi_sizeof_real32_r6" must have a nonoptional dummy argument that
corresponds by position in the argument list to a dummy argument not
present in procedure "pmpi_sizeof_real32_r6", present and type
incompatible, present with different kind type parameters, or present
with a different rank.

Open XLF is generally a lot stricter about the Fortran standards than
gfortran or ifort. Is it possible to disable PMPI bindings at compile
time so they don't appear at all? I am also happy to triage the bug,
since I have a Power10 system and the Open XLF compiler.

-JPS




Re: [OMPI users] OpenMPI-5.0.6: -x LD_LIBRARY_PATH not able to load shared objects

2025-02-14 Thread Sangam B
Hi Patrick,

Thanks for your reply.
Of course, the Intel vars.sh is sourced inside the PBS script, and I've tried
multiple ways to resolve this issue:

-x LD_LIBRARY_PATH
and
-x LD_LIBRARY_PATH=/opt/intel/oneapi/2024/v2.1/compiler/2024.2/opt/compiler/lib:/opt/intel/oneapi/2024/v2.1/compiler/2024.2/lib:${LD_LIBRARY_PATH}

And then I copied libimf.so to the job's working directory and set
-x LD_LIBRARY_PATH=.:/opt/intel/oneapi/2024/v2.1/compiler/2024.2/opt/compiler/lib:/opt/intel/oneapi/2024/v2.1/compiler/2024.2/lib:${LD_LIBRARY_PATH}

But none of these worked.



Re: [OMPI users] OpenMPI-5.0.6: -x LD_LIBRARY_PATH not able to load shared objects

2025-02-14 Thread Patrick Begou

Bad answer, sorry: I had not realized that prted was part of the Open MPI stack.

On 14/02/2025 at 19:19, Patrick Begou wrote:

Hi Sangam

Could you check that the install location of the library is the same 
on all the nodes? Maybe check LD_LIBRARY_PATH after sourcing the 
Intel vars.sh file?

I'm using OpenMPI 5.0.6 but in a Slurm context and it works fine.

Patrick



Re: [OMPI users] OpenMPI-5.0.6: -x LD_LIBRARY_PATH not able to load shared objects

2025-02-14 Thread Sangam B
Thanks Gilles & Patrick.

As Gilles mentioned, when Open MPI spawns the prted daemons on the compute
nodes, they fail to launch because the Intel runtime is not available.

To resolve this issue, I loaded the Intel runtime in the terminal session
before job submission and used #PBS -V in the job script.
That resolved it.

Other solutions can be:
(1) If Open MPI is built with the Intel compilers, use a static build [
link the Intel libs statically].
(2) Or build Open MPI with the gcc compilers [OS default] and use OMPI_CC=icc
etc.
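
A small sketch of how one might sanity-check the environment a job inherits with #PBS -V, assuming bash. The helper lib_on_path is hypothetical; it only tests whether a library file exists in one of the LD_LIBRARY_PATH directories and does not reproduce the loader's full search (rpath, ld.so cache).

```shell
#!/usr/bin/env bash
# Hypothetical helper: return 0 if the named library file exists in one of
# the directories listed in LD_LIBRARY_PATH, 1 otherwise.
lib_on_path() {
    local lib=$1 dir
    local IFS=:
    for dir in $LD_LIBRARY_PATH; do
        # An empty entry in LD_LIBRARY_PATH means the current directory.
        [ -e "${dir:-.}/$lib" ] && return 0
    done
    return 1
}

# Usage sketch inside a job script that contains "#PBS -V":
#   lib_on_path libimf.so || echo "libimf.so not on LD_LIBRARY_PATH" >&2
```

This only confirms what the job script itself sees; whether the prted daemons on the other nodes get the same environment depends on how they are launched.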

Thanks



On Fri, Feb 14, 2025 at 11:53 PM Patrick Begou <
patrick.be...@univ-grenoble-alpes.fr> wrote:

> Bad answer, sorry: I had not realized that prted was part of the Open MPI stack.