All, I've been struggling here at NASA Goddard trying to get PGI 16.5 + Open MPI 1.10.3 working on the Discover cluster. What was happening was I'd run our climate model at, say, 4x24 and it would work sometimes. Most of the time. Every once in a while, it'd throw a segfault. If we changed the layout or number of processors, more (and sometimes different) segfaults were triggered.
As we could build with PGI 15.7 + Open MPI 1.10.3 (where Open MPI is built exactly the same) and run perfectly, I was focusing on the Open MPI build. I tried compiling it at -O3, -O, -O0, all sorts of things, and was about to throw in the towel as all failed. But I saw Open MPI 2.0.0 was out and figured I might as well try the latest before reporting to the mailing list. I built it and, huzzah!, it works! I'm happy! Except that every time I execute 'mpirun' I get odd errors:

(1034) $ mpirun -np 4 ./helloWorld.mpi2.exe
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   borgr074
  Local device: mlx5_0
--------------------------------------------------------------------------
[borgr074][[35244,1],1][btl_openib_component.c:1618:init_one_device] error obtaining device attributes for mlx5_0 errno says Cannot allocate memory
[borgr074][[35244,1],3][btl_openib_component.c:1618:init_one_device] error obtaining device attributes for mlx5_0 errno says Cannot allocate memory
[borgr074][[35244,1],0][btl_openib_component.c:1618:init_one_device] error obtaining device attributes for mlx5_0 errno says Cannot allocate memory
[borgr074][[35244,1],2][btl_openib_component.c:1618:init_one_device] error obtaining device attributes for mlx5_0 errno says Cannot allocate memory
MPI Version: 3.1
MPI Library Version: Open MPI v2.0.0, package: Open MPI mathomp4@borg01z239 Distribution, ident: 2.0.0, repo rev: v2.x-dev-1570-g0a4a5d7, Jul 12, 2016
Process 0 of 4 is on borgr074
Process 3 of 4 is on borgr074
Process 1 of 4 is on borgr074
Process 2 of 4 is on borgr074
[borgr074:29032] 3 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[borgr074:29032] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

If I run with --mca btl_base_verbose 1 and use more than one node, I see that the openib/verbs btl (still not sure what to call this) isn't being used, but rather tcp:

[borgr075:14374] mca: bml: Using tcp btl for send to [[35628,1],15] on node borgr074
[borgr075:14374] mca: bml: Using tcp btl for send to [[35628,1],15] on node borgr074

which makes sense since it can't find an InfiniBand device. My first thought is that the build/configure procedure of the past doesn't quite jibe with what Open MPI 2.0.0 is expecting? I build Open MPI as:

export CC=pgcc
export CXX=pgc++
export FC=pgfortran
export CFLAGS="-fpic -m64"
export CXXFLAGS="-fpic -m64"
export FCFLAGS="-m64 -fpic"

export PREFIX=/discover/swdev/mathomp4/MPI/openmpi/2.0.0/pgi-16.5-k40
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/slurm/lib64
export LDFLAGS="-L/usr/slurm/lib64"
export CPPFLAGS="-I/usr/slurm/include"
export LIBS="-lpciaccess"

build() {
  echo `pwd`
  ./configure --with-slurm --disable-wrapper-rpath --enable-shared --prefix=${PREFIX}
  make -j8
  make install
}

echo "calling build"
build
echo "exiting"

This is a build script built up over time; it might have things unnecessary for an Open MPI 2.0 build, but perhaps now it needs more info? I can say that in the past (say, with 1.10.3) it definitely found the openib/verbs btl and used it!

Per the website, I'm attaching links to my config.log and "ompi_info --all" information:

https://dl.dropboxusercontent.com/u/61696/Open%20MPI/config.log.gz
https://dl.dropboxusercontent.com/u/61696/Open%20MPI/build.pgi16.5.log.gz
https://dl.dropboxusercontent.com/u/61696/Open%20MPI/ompi_info.txt.gz

I tried to run "ompi_info -v ompi full --parsable" as asked, but that doesn't seem possible anymore:

(1053) $ ompi_info -v ompi full --parsable
ompi_info: Error: unknown option "-v"
Type 'ompi_info --help' for usage.

I am asking our machine gurus about the InfiniBand network per:

https://www.open-mpi.org/faq/?category=openfabrics#ofa-troubleshoot

--
Matt Thompson
   Man Among Men
   Fulcrum of History
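
P.S. For anyone following along, here is a rough sketch of the kind of checks I'm running through while I wait to hear back from our admins. The host/device names come from my output above; exact commands and available BTL names may vary by system and Open MPI version, so treat this as a checklist, not gospel:

# Check that the verbs stack can see the HCA at all (mlx5_0 in my case);
# this is one of the things the OpenFabrics troubleshooting FAQ asks about:
ibv_devinfo

# Confirm the openib BTL component was actually built into this install:
ompi_info | grep btl

# Force the openib BTL (plus self/sm for local traffic) so a silent
# fallback to tcp becomes a hard, visible error instead:
mpirun --mca btl openib,self,sm -np 4 ./helloWorld.mpi2.exe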