Hi all,

I am observing some strange behaviour with a dynamically linked binary inside 
an sbatch job. Among other libraries, this binary is linked against MPICH, 
so when I run "ldd" on it I get

$ ldd /path/to/binary

        linux-vdso.so.1 =>  (0x00007ffd817c5000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002ae4a3152000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ae4a3356000)
        libmpi.so.12 => not found
        libmpifort.so.12 => not found
        libm.so.6 => /lib64/libm.so.6 (0x00002ae4a3572000)
        librt.so.1 => /lib64/librt.so.1 (0x00002ae4a3874000)
        libc.so.6 => /lib64/libc.so.6 (0x00002ae4a3a7c000)
        /lib64/ld-linux-x86-64.so.2 (0x00002ae4a2f2e000)

showing me that it cannot find those shared objects, as I have not yet loaded 
any modules into my environment. (This is expected.)
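
To compare runs more easily, the unresolved entries can be filtered out of the 
ldd output. This is only a minimal sketch: missing_libs is a hypothetical 
helper name (not part of ldd or Slurm), and it assumes the "=> not found" 
format shown above.

```shell
#!/bin/bash
# Hypothetical helper: print the names of libraries that ldd reports as
# "not found". It reads ldd-style output on stdin, so it can be tried on
# saved output as well as on a live "ldd /path/to/binary" call.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Demo on two of the lines from the output above.
missing_libs <<'EOF'
        libmpi.so.12 => not found
        libm.so.6 => /lib64/libm.so.6 (0x00002ae4a3572000)
EOF
# prints: libmpi.so.12
```

With the real binary this would be called as: ldd /path/to/binary | missing_libs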

Now, if I allocate some resources and start an interactive Slurm session via 
e.g. 

$ srun -N 1 -c 4 -t 10:00 --pty bash

and load the appropriate modules (Lmod, by the way) into my environment, e.g.

$ module load GCC/10.3.0
$ module load MPICH/3.4.2

and then again check the linked libraries, I get

$ ldd /path/to/binary

        linux-vdso.so.1 =>  (0x00007fffe3d2c000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002b4f58b6d000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b4f58d71000)
        libmpi.so.12 => /Applic.HPC/Easybuild/skylake/2021a/software/MPICH/3.4.2-GCC-10.3.0/lib/libmpi.so.12 (0x00002b4f58f8d000)
        libmpifort.so.12 => /Applic.HPC/Easybuild/skylake/2021a/software/MPICH/3.4.2-GCC-10.3.0/lib/libmpifort.so.12 (0x00002b4f58977000)
        libm.so.6 => /lib64/libm.so.6 (0x00002b4f59ee4000)
        librt.so.1 => /lib64/librt.so.1 (0x00002b4f5a1e6000)
        libc.so.6 => /lib64/libc.so.6 (0x00002b4f5a3ee000)
        /lib64/ld-linux-x86-64.so.2 (0x00002b4f58949000)

Now ldd finds the correct paths to the MPI libraries.

HOWEVER, I cannot reproduce this inside an sbatch job I submitted. When the 
job checks the shared libs via ldd, the MPI libraries are reported as not 
found. The job script looks more or less like this:

####################################################
#!/bin/bash
#SBATCH --partition admin
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --time=10:00

module load GCC/10.3.0
module load MPICH/3.4.2

ldd /path/to/binary 
####################################################

So nothing too complicated. I tested this with other, self-compiled, binaries 
which all seem to work just fine. Unfortunately this is a closed source binary 
blob - so I cannot recompile.

One interesting thing is, when I do not load any environment modules, but just 
directly set the LD_LIBRARY_PATH variable to the correct path before calling 
ldd, i.e.

LD_LIBRARY_PATH=/Applic.HPC/Easybuild/skylake/2021a/software/MPICH/3.4.2-GCC-10.3.0/lib ldd /path/to/binary

it works as intended, also in a batch job.
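
One way to narrow this down would be to check, inside the job and right after 
the module load lines, whether the MPICH lib directory actually ended up on 
LD_LIBRARY_PATH. The following is only a sketch: in_ld_path is a hypothetical 
helper name, and the directory is the one taken from the ldd output above.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the directory given as $1 is one of the
# colon-separated entries of LD_LIBRARY_PATH (an unset variable counts as empty).
in_ld_path() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Directory copied from the interactive ldd output.
mpich_lib=/Applic.HPC/Easybuild/skylake/2021a/software/MPICH/3.4.2-GCC-10.3.0/lib

if in_ld_path "$mpich_lib"; then
    echo "MPICH lib dir is on LD_LIBRARY_PATH"
else
    echo "MPICH lib dir is NOT on LD_LIBRARY_PATH"
fi
```

Running the same check in the interactive session and in the batch job would 
show whether the two environments differ after the module load.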


Can anyone make sense of this? Could there be something hard-coded into the 
binary that prevents it from using an exported LD_LIBRARY_PATH? And why would 
it work interactively, but not in a batch job? 

Many thanks
Sebastian
