Open MPI List,

I recently encountered an odd bug with Open MPI 1.8.1 and GCC 4.9.1 on our
cluster (reported on this list), and decided to try it with 1.8.2. However,
we seem to be having an issue with Open MPI 1.8.2 and SLURM. Even weirder,
Open MPI 1.8.2rc4 doesn't show the bug. The bug is this: I get no stdout
with Open MPI 1.8.2. Even a simple HelloWorld prints nothing.

To wit, our sysadmin has two tarballs:

(1441) $ sha1sum openmpi-1.8.2rc4.tar.bz2
7e7496913c949451f546f22a1a159df25f8bb683  openmpi-1.8.2rc4.tar.bz2
(1442) $ sha1sum openmpi-1.8.2.tar.gz
cf2b1e45575896f63367406c6c50574699d8b2e1  openmpi-1.8.2.tar.gz

I then built each one with a script that follows our sysadmin's usual method:

#!/bin/sh
set -x
export PREFIX=/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/nlocal/slurm/2.6.3/lib64
build() {
  echo `pwd`
  ./configure --with-slurm --disable-wrapper-rpath --enable-shared --enable-mca-no-build=btl-usnic \
      CC=gcc CXX=g++ F77=gfortran FC=gfortran \
      CFLAGS="-mtune=generic -fPIC -m64" CXXFLAGS="-mtune=generic -fPIC -m64" FFLAGS="-mtune=generic -fPIC -m64" \
      F77FLAGS="-mtune=generic -fPIC -m64" FCFLAGS="-mtune=generic -fPIC -m64" F90FLAGS="-mtune=generic -fPIC -m64" \
      LDFLAGS="-L/usr/nlocal/slurm/2.6.3/lib64" CPPFLAGS="-I/usr/nlocal/slurm/2.6.3/include" LIBS="-lpciaccess" \
      --prefix=${PREFIX} 2>&1 | tee configure.1.8.2.log
  make 2>&1 | tee make.1.8.2.log
  make check 2>&1 | tee makecheck.1.8.2.log
  make install 2>&1 | tee makeinstall.1.8.2.log
}
echo "calling build"
build
echo "exiting"
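
For anyone who wants to reproduce both builds from one script, here is a
minimal sketch of how the version-dependent names could be derived from a
single variable. VERSION, prefix_for, and log_for are illustrative names of
my own and are not part of the script above.

```shell
#!/bin/sh
# Sketch only: VERSION, prefix_for and log_for are hypothetical helpers.
VERSION=${VERSION:-1.8.2}

# Install prefix for a given Open MPI version string.
prefix_for() {
  echo "/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_$1"
}

# Log file name: $1 = step (configure, make, makecheck, makeinstall),
# $2 = version string.
log_for() {
  echo "$1.$2.log"
}

PREFIX=$(prefix_for "$VERSION")
export PREFIX
echo "PREFIX=$PREFIX"
echo "configure log: $(log_for configure "$VERSION")"
```

Running it as VERSION=1.8.2rc4 ./names.sh (the file name is arbitrary) would
print the rc4 prefix and log name instead.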


The only difference between the two is '1.8.2' or '1.8.2rc4' in the PREFIX
and log file tees.  Now, let us test. First, I grab some nodes with slurm:

$ salloc --nodes=6 --ntasks-per-node=16 --constraint=sand --time=09:00:00 \
    --account=g0620 --mail-type=BEGIN


Once I get my nodes, I run with 1.8.2rc4:

(1142) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpifort -o helloWorld.182rc4.x helloWorld.F90
(1143) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun -np 8 ./helloWorld.182rc4.x
Process    0 of    8 is on borg01w044
Process    5 of    8 is on borg01w044
Process    3 of    8 is on borg01w044
Process    7 of    8 is on borg01w044
Process    1 of    8 is on borg01w044
Process    2 of    8 is on borg01w044
Process    4 of    8 is on borg01w044
Process    6 of    8 is on borg01w044


Now 1.8.2:

(1144) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpifort -o helloWorld.182.x helloWorld.F90
(1145) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpirun -np 8 ./helloWorld.182.x
(1146) $


No output at all. But if I take helloWorld.182.x (the binary built with
1.8.2) and run it with 1.8.2rc4's mpirun:

(1146) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun -np 8 ./helloWorld.182.x
Process    5 of    8 is on borg01w044
Process    7 of    8 is on borg01w044
Process    2 of    8 is on borg01w044
Process    4 of    8 is on borg01w044
Process    1 of    8 is on borg01w044
Process    3 of    8 is on borg01w044
Process    6 of    8 is on borg01w044
Process    0 of    8 is on borg01w044


So... any idea what is happening here? There did seem to be a few
SLURM-related changes between the two tarballs involving /dev/null, but
deciphering them is a bit beyond me.
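
In case it helps anyone reproduce this, here is a rough sketch of a script
that runs each mpirun against each binary and counts the lines of stdout.
count_stdout_lines is a hypothetical helper of my own, not anything from
Open MPI, and the paths are simply the ones from the runs above. Combinations
whose mpirun is not installed are skipped.

```shell
#!/bin/sh
# Sketch only: count_stdout_lines is a hypothetical helper.
# It runs the command given as its arguments, discards stderr,
# and prints the number of stdout lines the command produced.
count_stdout_lines() {
  "$@" 2>/dev/null | wc -l | tr -d ' '
}

BASE=/discover/nobackup/mathomp4/MPI
for ver in 1.8.2rc4 1.8.2; do
  mpirun="$BASE/gcc_4.9.1-openmpi_$ver/bin/mpirun"
  # Skip versions that are not installed on this machine.
  [ -x "$mpirun" ] || continue
  for exe in ./helloWorld.182rc4.x ./helloWorld.182.x; do
    n=$(count_stdout_lines "$mpirun" -np 8 "$exe")
    echo "mpirun $ver, $exe: $n line(s) of stdout"
  done
done
```

If the behavior above holds, 1.8.2's mpirun with helloWorld.182.x would
report 0 lines, while 1.8.2rc4's mpirun would report 8 for either binary.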

You can find the ompi_info, build, make, config, and other logs at these
links (they are ~300kB, which is over the mailing list limit according to
the Open MPI web page):

https://dl.dropboxusercontent.com/u/61696/OMPI-1.8.2rc4-Output.tar.bz2
https://dl.dropboxusercontent.com/u/61696/OMPI-1.8.2-Output.tar.bz2

Thank you for any help and please let me know if you need more information,
Matt

-- 
"And, isn't sanity really just a one-trick pony anyway? I mean all you
 get is one trick: rational thinking. But when you're good and crazy,
 oooh, oooh, oooh, the sky is the limit!" -- The Tick
