Well,

By turning off UCX compilation per Howard's suggestion, things get a bit
better in that something happens now. It's not a good something, though: it
seems to die with an InfiniBand error. Since this is an Omnipath system, is
Open MPI perhaps seeing libverbs somewhere and compiling it in? To wit:

(1006)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA
parameter to true.

  Local host:              borgc129
  Local adapter:           hfi1_0
  Local port:              1

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   borgc129
  Local device: hfi1_0
--------------------------------------------------------------------------
Compiler Version: Intel(R) Fortran Intel(R) 64 Compiler for applications
running on Intel(R) 64, Version 18.0.5.274 Build 20180823
MPI Version: 3.1
MPI Library Version: Open MPI v4.0.0, package: Open MPI mathomp4@discover23
Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018
[borgc129:260830] *** An error occurred in MPI_Barrier
[borgc129:260830] *** reported by process [140736833716225,46909632806913]
[borgc129:260830] *** on communicator MPI_COMM_WORLD
[borgc129:260830] *** MPI_ERR_OTHER: known error not in list
[borgc129:260830] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[borgc129:260830] ***    and potentially your MPI job)
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
helloWorld.mpi3.S  000000000040A38E  for__signal_handl     Unknown  Unknown
libpthread-2.22.s  00002AAAAB9CCB20  Unknown               Unknown  Unknown
libpthread-2.22.s  00002AAAAB9C90CD  pthread_cond_wait     Unknown  Unknown
libpmix.so.2.1.11  00002AAAB1D780A1  PMIx_Abort            Unknown  Unknown
mca_pmix_ext2x.so  00002AAAB1B3AA75  ext2x_abort           Unknown  Unknown
mca_ess_pmi.so     00002AAAB1724BC0  Unknown               Unknown  Unknown
libopen-rte.so.40  00002AAAAC3E941C  orte_errmgr_base_     Unknown  Unknown
mca_errmgr_defaul  00002AAABC401668  Unknown               Unknown  Unknown
libmpi.so.40.20.0  00002AAAAB3CDBC4  ompi_mpi_abort        Unknown  Unknown
libmpi.so.40.20.0  00002AAAAB3BB1EF  ompi_mpi_errors_a     Unknown  Unknown
libmpi.so.40.20.0  00002AAAAB3B99C9  ompi_errhandler_i     Unknown  Unknown
libmpi.so.40.20.0  00002AAAAB3E4576  MPI_Barrier           Unknown  Unknown
libmpi_mpifh.so.4  00002AAAAB15EE53  MPI_Barrier_f08       Unknown  Unknown
libmpi_usempif08.  00002AAAAACE7732  mpi_barrier_f08_      Unknown  Unknown
helloWorld.mpi3.S  000000000040939F  Unknown               Unknown  Unknown
helloWorld.mpi3.S  000000000040915E  Unknown               Unknown  Unknown
libc-2.22.so       00002AAAABBF96D5  __libc_start_main     Unknown  Unknown
helloWorld.mpi3.S  0000000000409069  Unknown               Unknown  Unknown
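
For what it's worth, my next step is to check whether the openib BTL got
built in at all and to try excluding it at run time; something along these
lines (assuming the PSM2 MTL was built) should tell me more:

  ompi_info | grep -i btl
  mpirun --mca btl ^openib -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
  mpirun --mca pml cm --mca mtl psm2 -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe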

On Sun, Jan 20, 2019 at 4:19 PM Howard Pritchard <hpprit...@gmail.com>
wrote:

> Hi Matt
>
> Definitely do not include the ucx option for an Omnipath cluster.
> Actually, if you accidentally installed UCX in its default location on
> the system, switch to this config option:
>
> --with-ucx=no
>
> Otherwise you will hit
>
> https://github.com/openucx/ucx/issues/750
>
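> As a sketch (other flags and paths will vary by system), the configure
> line for an Omnipath cluster would then look something like:
>
>   ./configure --with-ucx=no --with-psm2 --with-slurm CC=icc CXX=icpc FC=ifort
>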
> Howard
>
>
> Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote on Sat, Jan 19,
> 2019 at 18:41:
>
>> Matt,
>>
>> There are two ways of using PMIx
>>
>> - if you use mpirun, then the MPI app (i.e. the PMIx client) will talk
>> to the mpirun and orted daemons (i.e. the PMIx server)
>> - if you use SLURM's srun, then the MPI app will talk directly to the
>> PMIx server provided by SLURM (note you might have to run srun
>> --mpi=pmix_v2 or something similar)
>>
>> In the former case, it does not matter whether you use the embedded or
>> external PMIx.
>> In the latter case, Open MPI and SLURM have to use compatible PMIx
>> libraries, and you can either check the cross-version compatibility
>> matrix,
>> or build Open MPI with the same PMIx used by SLURM to be on the safe
>> side (not a bad idea IMHO).
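>>
>> For example, to see which PMIx plugins your SLURM provides and to launch
>> directly with srun (the exact plugin name depends on your SLURM build):
>>
>>   srun --mpi=list
>>   srun --mpi=pmix_v2 -n 4 ./helloWorld.mpi3.SLES12.OMPI400.exe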
>>
>>
>> Regarding the hang, I suggest you try a few different things:
>> - use mpirun in a SLURM job (e.g. sbatch instead of salloc, so mpirun
>> runs on a compute node rather than on a frontend node)
>> - try something even simpler, such as mpirun hostname (both with sbatch
>> and salloc)
>> - explicitly specify the network to be used for the wire-up; you can,
>> for example, run mpirun --mca oob_tcp_if_include 192.168.0.0/24 if this
>> is the subnet over which all the nodes (i.e. compute nodes, plus the
>> frontend node if you use salloc) communicate
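>>
>> A minimal sbatch sketch along those lines (node/task counts and the
>> subnet are placeholders to adapt):
>>
>>   #!/bin/bash
>>   #SBATCH --nodes=2
>>   #SBATCH --ntasks-per-node=4
>>   mpirun hostname
>>   mpirun --mca oob_tcp_if_include 192.168.0.0/24 ./helloWorld.mpi3.SLES12.OMPI400.exe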
>>
>>
>> Cheers,
>>
>> Gilles
>>
>> On Sat, Jan 19, 2019 at 3:31 AM Matt Thompson <fort...@gmail.com> wrote:
>> >
>> > On Fri, Jan 18, 2019 at 1:13 PM Jeff Squyres (jsquyres) via users <
>> users@lists.open-mpi.org> wrote:
>> >>
>> >> On Jan 18, 2019, at 12:43 PM, Matt Thompson <fort...@gmail.com> wrote:
>> >> >
>> >> > With some help, I managed to build an Open MPI 4.0.0 with:
>> >>
>> >> We can discuss each of these params to let you know what they are.
>> >>
>> >> > ./configure --disable-wrapper-rpath --disable-wrapper-runpath
>> >>
>> >> Did you have a reason for disabling these?  They're generally good
>> things.  What they do is add linker flags to the wrapper compilers (i.e.,
>> mpicc and friends) that basically put a default path to find libraries at
>> run time (that can/will in most cases override LD_LIBRARY_PATH -- but you
>> can override these linked-in-default-paths if you want/need to).
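>> >>
>> >> For instance, you can inspect exactly what the wrappers add at link
>> >> time with something like this (output will vary by install):
>> >>
>> >>   mpicc --showme:link
>> >>   mpifort --showme:link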
>> >
>> >
>> > I've had these in my Open MPI builds for a while now. The reason was
>> that one of the libraries I need for the climate model I work on went nuts
>> if both of them weren't there. It was originally the rpath one, but then
>> eventually (Open MPI 3?) I had to add the runpath one. However, I have been
>> updating the libraries more aggressively recently (due to OS upgrades), so
>> it's possible this is no longer needed.
>> >
>> >>
>> >>
>> >> > --with-psm2
>> >>
>> >> Ensure that Open MPI can include support for the PSM2 library, and
>> abort configure if it cannot.
>> >>
>> >> > --with-slurm
>> >>
>> >> Ensure that Open MPI can include support for SLURM, and abort
>> configure if it cannot.
>> >>
>> >> > --enable-mpi1-compatibility
>> >>
>> >> Add support for MPI_Address and other MPI-1 functions that have since
>> been deleted from the MPI 3.x specification.
>> >>
>> >> > --with-ucx
>> >>
>> >> Ensure that Open MPI can include support for UCX, and abort configure
>> if it cannot.
>> >>
>> >> > --with-pmix=/usr/nlocal/pmix/2.1
>> >>
>> >> Tells Open MPI to use the PMIx that is installed at
>> /usr/nlocal/pmix/2.1 (instead of using the PMIx that is bundled internally
>> to Open MPI's source code tree/expanded tarball).
>> >>
>> >> Unless you have a reason to use the external PMIx, the
>> internal/bundled PMIx is usually sufficient.
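>> >>
>> >> (If you ever want a quick sanity check of which PMIx component your
>> >> Open MPI build ended up with, something like "ompi_info | grep -i pmix"
>> >> will show it; output varies by build.)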
>> >
>> >
>> > Ah. I did not know that. I figured that if our SLURM was built against
>> a specific PMIx v2, I should build Open MPI with the same PMIx. I'll
>> build an Open MPI 4 without specifying this.
>> >
>> >>
>> >>
>> >> > --with-libevent=/usr
>> >>
>> >> Same as previous; change "pmix" to "libevent" (i.e., use the external
>> libevent instead of the bundled libevent).
>> >>
>> >> > CC=icc CXX=icpc FC=ifort
>> >>
>> >> Specify the exact compilers to use.
>> >>
>> >> > The MPI-1 flag is because I need to build HDF5 eventually, and I
>> added psm2 because it's an Omnipath cluster. The libevent flag was probably
>> a red herring, as libevent-devel wasn't installed on the system. It was
>> installed eventually, and I just didn't remove the flag. And I saw no
>> errors in the build!
>> >>
>> >> Might as well remove the --with-libevent if you don't need it.
>> >>
>> >> > However, I seem to have built an Open MPI that doesn't work:
>> >> >
>> >> > (1099)(master) $ mpirun --version
>> >> > mpirun (Open MPI) 4.0.0
>> >> >
>> >> > Report bugs to http://www.open-mpi.org/community/help/
>> >> > (1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
>> >> >
>> >> > It just sits there...forever. Can the gurus here help me figure out
>> what I managed to break? Perhaps I added too much to my configure line? Not
>> enough?
>> >>
>> >> There could be a few things going on here.
>> >>
>> >> Are you running inside a SLURM job?  E.g., in a "salloc" job, or in an
>> "sbatch" script?
>> >
>> >
>> > I have salloc'd 8 nodes of 40 cores each. Intel MPI 18 and 19 work just
>> fine (as you'd hope on an Omnipath cluster), but for some reason Open MPI
>> is twitchy on this cluster. I once managed to get Open MPI 3.0.1 working (a
>> few months ago), and it had some interesting startup scaling I liked (slow
>> at low core count, but getting close to Intel MPI at high core count),
>> though it seemed to not work after about 100 nodes (4000 processes) or so.
>> >
>> > --
>> > Matt Thompson
>> >    “The fact is, this is about us identifying what we do best and
>> >    finding more ways of doing less of it better” -- Director of Better
>> Anna Rampton



-- 
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna
Rampton
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
