All,

I've been struggling here at NASA Goddard trying to get PGI 16.5 + Open MPI
1.10.3 working on the Discover cluster. What was happening was I'd run our
climate model at, say, 4x24 and it would work sometimes. Most of the time.
Every once in a while, it'd throw a segfault. If we changed the layout or
number of processors, more (and sometimes different) segfaults were triggered.

As we could build with PGI 15.7 + Open MPI 1.10.3 (where Open MPI is built
exactly the same) and run perfectly, I was focusing on the Open MPI build.
I tried compiling it with -O3, -O, -O0, all sorts of things, and was about
to throw in the towel, as every build failed.

But then I saw Open MPI 2.0.0 was out and figured I may as well try the
latest before reporting to the mailing list. I built it and, huzzah!, it works!
I'm happy! Except that every time I execute 'mpirun' I get odd errors:

(1034) $ mpirun -np 4 ./helloWorld.mpi2.exe
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   borgr074
  Local device: mlx5_0
--------------------------------------------------------------------------
[borgr074][[35244,1],1][btl_openib_component.c:1618:init_one_device] error
obtaining device attributes for mlx5_0 errno says Cannot allocate memory
[borgr074][[35244,1],3][btl_openib_component.c:1618:init_one_device] error
obtaining device attributes for mlx5_0 errno says Cannot allocate memory
[borgr074][[35244,1],0][btl_openib_component.c:1618:init_one_device] error
obtaining device attributes for mlx5_0 errno says Cannot allocate memory
[borgr074][[35244,1],2][btl_openib_component.c:1618:init_one_device] error
obtaining device attributes for mlx5_0 errno says Cannot allocate memory
MPI Version: 3.1
MPI Library Version: Open MPI v2.0.0, package: Open MPI mathomp4@borg01z239
Distribution, ident: 2.0.0, repo rev: v2.x-dev-1570-g0a4a5d7, Jul 12, 2016
Process    0 of    4 is on borgr074
Process    3 of    4 is on borgr074
Process    1 of    4 is on borgr074
Process    2 of    4 is on borgr074
[borgr074:29032] 3 more processes have sent help message
help-mpi-btl-openib.txt / error in device init
[borgr074:29032] Set MCA parameter "orte_base_help_aggregate" to 0 to see
all help / error messages
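
For what it's worth, following the hint in that last message, the aggregated
help messages can be expanded with the usual --mca syntax (I don't expect
this to change anything except how much output I see):

  mpirun --mca orte_base_help_aggregate 0 -np 4 ./helloWorld.mpi2.exe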

If I run with --mca btl_base_verbose 1 and use more than one node, I see
that the openib/verbs btl (still not sure what to call it) isn't being
used; tcp is used instead:

[borgr075:14374] mca: bml: Using tcp btl for send to [[35628,1],15] on node
borgr074
[borgr075:14374] mca: bml: Using tcp btl for send to [[35628,1],15] on node
borgr074

which makes sense since it can't find an Infiniband device.
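
To make the failure loud instead of silent, I believe I can take tcp off
the table by restricting btl selection; a sketch, assuming vader and self
are still the right names for the shared-memory and loopback btls in 2.0:

  mpirun --mca btl openib,vader,self -np 4 ./helloWorld.mpi2.exe

With that, I'd expect the job to abort outright rather than quietly fall
back to tcp, which should at least make the openib problem unambiguous.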

My first thought is that my build/configure procedure from the past doesn't
quite jibe with what Open MPI 2.0.0 expects. I build Open MPI as:

export CC=pgcc
export CXX=pgc++
export FC=pgfortran

export CFLAGS="-fpic -m64"
export CXXFLAGS="-fpic -m64"
export FCFLAGS="-m64 -fpic"
export PREFIX=/discover/swdev/mathomp4/MPI/openmpi/2.0.0/pgi-16.5-k40

export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/slurm/lib64
export LDFLAGS="-L/usr/slurm/lib64"
export CPPFLAGS="-I/usr/slurm/include"

export LIBS="-lpciaccess"

build() {
  echo "$(pwd)"
  ./configure --with-slurm --disable-wrapper-rpath --enable-shared \
      --prefix=${PREFIX}
  make -j8
  make install
}

echo "calling build"
build
echo "exiting"

This is a build script that has grown over time; it might have things that
are unnecessary for an Open MPI 2.0 build, but perhaps now it needs more
info? I can say
that in the past (say with 1.10.3) it definitely found the openib/verbs btl
and used it!

Per the website, I'm attaching links to my config.log and "ompi_info --all"
information:

https://dl.dropboxusercontent.com/u/61696/Open%20MPI/config.log.gz
https://dl.dropboxusercontent.com/u/61696/Open%20MPI/build.pgi16.5.log.gz
https://dl.dropboxusercontent.com/u/61696/Open%20MPI/ompi_info.txt.gz

I tried to run "ompi_info -v ompi full --parsable" as asked but that
doesn't seem possible anymore:

(1053) $ ompi_info -v ompi full --parsable
ompi_info: Error: unknown option "-v"
Type 'ompi_info --help' for usage.
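
The closest equivalent I can find in 2.0.0 seems to be the plain long
options (assuming --all and --parsable still behave as documented):

  ompi_info --all --parsable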

I am asking our machine gurus about the Infiniband network per:
https://www.open-mpi.org/faq/?category=openfabrics#ofa-troubleshoot
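
Since errno says "Cannot allocate memory", I'll also ask them to check the
locked-memory limit on the compute nodes, which that FAQ page calls out.
Something like this on a node (borgr074 from the logs above):

  # the FAQ suggests this should report "unlimited"
  ulimit -l
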
-- 
Matt Thompson

Man Among Men
Fulcrum of History
