Well, by turning off UCX compilation per Howard's suggestion, things get a bit better in that something happens! It's not a good something, though: the run now dies with what looks like an InfiniBand error. Since this is an Omnipath system, is Open MPI perhaps finding libverbs somewhere and compiling in openib support?
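(A couple of things I plan to poke at on my end; treat the exact component names and flags below as my guesses rather than anything verified:)

  # Did the openib/verbs BTL actually get built in?
  ompi_info | grep -i openib

  # If so, try excluding it at run time and forcing the PSM2 path
  # (cm PML + psm2 MTL), which I assume is what we want on Omnipath:
  mpirun --mca btl ^openib --mca pml cm --mca mtl psm2 -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe

  # Or rebuild with verbs support off entirely, alongside Howard's flag:
  ./configure --without-verbs --with-ucx=no ...

In any case, to wit: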
(1006)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device are
not used by default. The intent is to use UCX for these devices. You can
override this policy by setting the btl_openib_allow_ib MCA parameter to
true.

  Local host:    borgc129
  Local adapter: hfi1_0
  Local port:    1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   borgc129
  Local device: hfi1_0
--------------------------------------------------------------------------
Compiler Version: Intel(R) Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 18.0.5.274 Build 20180823
MPI Version: 3.1
MPI Library Version: Open MPI v4.0.0, package: Open MPI mathomp4@discover23 Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018
[borgc129:260830] *** An error occurred in MPI_Barrier
[borgc129:260830] *** reported by process [140736833716225,46909632806913]
[borgc129:260830] *** on communicator MPI_COMM_WORLD
[borgc129:260830] *** MPI_ERR_OTHER: known error not in list
[borgc129:260830] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[borgc129:260830] ***    and potentially your MPI job)
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
helloWorld.mpi3.S  000000000040A38E  for__signal_handl     Unknown  Unknown
libpthread-2.22.s  00002AAAAB9CCB20  Unknown               Unknown  Unknown
libpthread-2.22.s  00002AAAAB9C90CD  pthread_cond_wait     Unknown  Unknown
libpmix.so.2.1.11  00002AAAB1D780A1  PMIx_Abort            Unknown  Unknown
mca_pmix_ext2x.so  00002AAAB1B3AA75  ext2x_abort           Unknown  Unknown
mca_ess_pmi.so     00002AAAB1724BC0  Unknown               Unknown  Unknown
libopen-rte.so.40  00002AAAAC3E941C  orte_errmgr_base_     Unknown  Unknown
mca_errmgr_defaul  00002AAABC401668  Unknown               Unknown  Unknown
libmpi.so.40.20.0  00002AAAAB3CDBC4  ompi_mpi_abort        Unknown  Unknown
libmpi.so.40.20.0  00002AAAAB3BB1EF  ompi_mpi_errors_a     Unknown  Unknown
libmpi.so.40.20.0  00002AAAAB3B99C9  ompi_errhandler_i     Unknown  Unknown
libmpi.so.40.20.0  00002AAAAB3E4576  MPI_Barrier           Unknown  Unknown
libmpi_mpifh.so.4  00002AAAAB15EE53  MPI_Barrier_f08       Unknown  Unknown
libmpi_usempif08.  00002AAAAACE7732  mpi_barrier_f08_      Unknown  Unknown
helloWorld.mpi3.S  000000000040939F  Unknown               Unknown  Unknown
helloWorld.mpi3.S  000000000040915E  Unknown               Unknown  Unknown
libc-2.22.so       00002AAAABBF96D5  __libc_start_main     Unknown  Unknown
helloWorld.mpi3.S  0000000000409069  Unknown               Unknown  Unknown

On Sun, Jan 20, 2019 at 4:19 PM Howard Pritchard <hpprit...@gmail.com> wrote:

> Hi Matt
>
> Definitely do not include the ucx option for an omnipath cluster.
> Actually, if you accidentally installed ucx in its default location on
> the system, switch to this config option:
>
> --with-ucx=no
>
> Otherwise you will hit
>
> https://github.com/openucx/ucx/issues/750
>
> Howard
>
>
> Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote on Sat, Jan 19,
> 2019 at 18:41:
>
>> Matt,
>>
>> There are two ways of using PMIx
>>
>> - if you use mpirun, then the MPI app (e.g. the PMIx client) will talk
>> to mpirun and orted daemons (e.g. the PMIx server)
>> - if you use SLURM srun, then the MPI app will directly talk to the
>> PMIx server provided by SLURM.
>> (note you might have to srun --mpi=pmix_v2 or something)
>>
>> In the former case, it does not matter whether you use the embedded or
>> external PMIx.
>> In the latter case, Open MPI and SLURM have to use compatible PMIx
>> libraries, and you can either check the cross-version compatibility
>> matrix, or build Open MPI with the same PMIx used by SLURM to be on the
>> safe side (not a bad idea IMHO).
>>
>> Regarding the hang, I suggest you try different things:
>> - use mpirun in a SLURM job (e.g. sbatch instead of salloc so mpirun
>>   runs on a compute node rather than on a frontend node)
>> - try something even simpler such as mpirun hostname (both with sbatch
>>   and salloc)
>> - explicitly specify the network to be used for the wire-up. You can
>>   for example mpirun --mca oob_tcp_if_include 192.168.0.0/24 if this is
>>   the network subnet by which all the nodes (e.g. compute nodes and
>>   frontend node if you use salloc) communicate.
>>
>> Cheers,
>>
>> Gilles
>>
>> On Sat, Jan 19, 2019 at 3:31 AM Matt Thompson <fort...@gmail.com> wrote:
>> >
>> > On Fri, Jan 18, 2019 at 1:13 PM Jeff Squyres (jsquyres) via users
>> > <users@lists.open-mpi.org> wrote:
>> >>
>> >> On Jan 18, 2019, at 12:43 PM, Matt Thompson <fort...@gmail.com> wrote:
>> >> >
>> >> > With some help, I managed to build an Open MPI 4.0.0 with:
>> >>
>> >> We can discuss each of these params to let you know what they are.
>> >>
>> >> > ./configure --disable-wrapper-rpath --disable-wrapper-runpath
>> >>
>> >> Did you have a reason for disabling these? They're generally good
>> >> things. What they do is add linker flags to the wrapper compilers
>> >> (i.e., mpicc and friends) that basically put a default path to find
>> >> libraries at run time (that can/will in most cases override
>> >> LD_LIBRARY_PATH -- but you can override these linked-in-default-paths
>> >> if you want/need to).
>> >
>> > I've had these in my Open MPI builds for a while now. The reason was
>> > one of the libraries I need for the climate model I work on went nuts
>> > if both of them weren't there. It was originally the rpath one but
>> > then eventually (Open MPI 3?) I had to add the runpath one. But I have
>> > been updating the libraries more aggressively recently (due to OS
>> > upgrades) so it's possible this is no longer needed.
>> >
>> >> > --with-psm2
>> >>
>> >> Ensure that Open MPI can include support for the PSM2 library, and
>> >> abort configure if it cannot.
>> >>
>> >> > --with-slurm
>> >>
>> >> Ensure that Open MPI can include support for SLURM, and abort
>> >> configure if it cannot.
>> >>
>> >> > --enable-mpi1-compatibility
>> >>
>> >> Add support for MPI_Address and other MPI-1 functions that have since
>> >> been deleted from the MPI 3.x specification.
>> >>
>> >> > --with-ucx
>> >>
>> >> Ensure that Open MPI can include support for UCX, and abort configure
>> >> if it cannot.
>> >>
>> >> > --with-pmix=/usr/nlocal/pmix/2.1
>> >>
>> >> Tells Open MPI to use the PMIx that is installed at
>> >> /usr/nlocal/pmix/2.1 (instead of using the PMIx that is bundled
>> >> internally to Open MPI's source code tree/expanded tarball).
>> >>
>> >> Unless you have a reason to use the external PMIx, the
>> >> internal/bundled PMIx is usually sufficient.
>> >
>> > Ah. I did not know that. I figured if our SLURM was built linked to a
>> > specific PMIx v2 that I should build Open MPI with the same PMIx. I'll
>> > build an Open MPI 4 without specifying this.
>> >
>> >> > --with-libevent=/usr
>> >>
>> >> Same as previous; change "pmix" to "libevent" (i.e., use the external
>> >> libevent instead of the bundled libevent).
>> >>
>> >> > CC=icc CXX=icpc FC=ifort
>> >>
>> >> Specify the exact compilers to use.
>> >>
>> >> > The MPI 1 is because I need to build HDF5 eventually and I added
>> >> > psm2 because it's an Omnipath cluster. The libevent was probably a
>> >> > red herring as libevent-devel wasn't installed on the system. It
>> >> > was eventually, and I just didn't remove the flag. And I saw no
>> >> > errors in the build!
>> >>
>> >> Might as well remove the --with-libevent if you don't need it.
>> >>
>> >> > However, I seem to have built an Open MPI that doesn't work:
>> >> >
>> >> > (1099)(master) $ mpirun --version
>> >> > mpirun (Open MPI) 4.0.0
>> >> >
>> >> > Report bugs to http://www.open-mpi.org/community/help/
>> >> > (1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
>> >> >
>> >> > It just sits there...forever. Can the gurus here help me figure out
>> >> > what I managed to break? Perhaps I added too much to my configure
>> >> > line? Not enough?
>> >>
>> >> There could be a few things going on here.
>> >>
>> >> Are you running inside a SLURM job? E.g., in a "salloc" job, or in an
>> >> "sbatch" script?
>> >
>> > I have salloc'd 8 nodes of 40 cores each. Intel MPI 18 and 19 work
>> > just fine (as you'd hope on an Omnipath cluster), but for some reason
>> > Open MPI is twitchy on this cluster. I once managed to get Open MPI
>> > 3.0.1 working (a few months ago), and it had some interesting startup
>> > scaling I liked (slow at low core count, but getting close to Intel
>> > MPI at high core count), though it seemed to not work after about 100
>> > nodes (4000 processes) or so.
>> >
>> > --
>> > Matt Thompson
>> > “The fact is, this is about us identifying what we do best and
>> > finding more ways of doing less of it better” -- Director of Better
>> > Anna Rampton

--
Matt Thompson
“The fact is, this is about us identifying what we do best and
finding more ways of doing less of it better” -- Director of Better
Anna Rampton
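P.S. Re Gilles' srun/PMIx note above: before the next build I'll check what
our SLURM actually provides. My guess at the commands (the plugin names are
whatever srun reports on this system, not something I've verified here):

  # list the MPI/PMI plugins this SLURM was built with
  srun --mpi=list

  # if pmix_v2 shows up, try launching directly with srun instead of mpirun
  srun --mpi=pmix_v2 -n 4 ./helloWorld.mpi3.SLES12.OMPI400.exe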