Just FYI: on master (and perhaps 4.0), child jobs do not inherit their parent's mapping policy by default. You have to add "-mca rmaps_base_inherit 1" to your mpirun command line.
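For example, a launch that keeps the "--map-by node" policy for spawned child jobs might look like the following (the process count and executable name are only placeholders based on the test case quoted below):

   mpirun -np 1 --map-by node -mca rmaps_base_inherit 1 ./test.exe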
> On Oct 6, 2018, at 10:00 AM, Andrew Benson <aben...@carnegiescience.edu> wrote:
>
> Thanks, I'll try this right away.
>
> Thanks,
> Andrew
>
> --
>
> * Andrew Benson: http://users.obs.carnegiescience.edu/abenson/contact.html
>
> * Galacticus: http://sites.google.com/site/galacticusmodel
>
> On Sat, Oct 6, 2018, 9:02 AM Ralph H Castain <r...@open-mpi.org> wrote:
>
> Sorry for the delay - this should be fixed by https://github.com/open-mpi/ompi/pull/5854
>
> > On Sep 19, 2018, at 8:00 AM, Andrew Benson <abenso...@gmail.com> wrote:
> >
> > On further investigation, removing the "preconnect_all" option does at least change the
> > problem. Without "preconnect_all" I no longer see:
> >
> > --------------------------------------------------------------------------
> > At least one pair of MPI processes are unable to reach each other for
> > MPI communications. This means that no Open MPI device has indicated
> > that it can be used to communicate between these processes. This is
> > an error; Open MPI requires that all MPI processes be able to reach
> > each other. This error can sometimes be the result of forgetting to
> > specify the "self" BTL.
> >
> >   Process 1 ([[32179,2],15]) is on host: node092
> >   Process 2 ([[32179,2],0]) is on host: unknown!
> >   BTLs attempted: self tcp vader
> >
> > Your MPI job is now going to abort; sorry.
> > --------------------------------------------------------------------------
> >
> > Instead it hangs for several minutes and finally aborts with:
> >
> > --------------------------------------------------------------------------
> > A request has timed out and will therefore fail:
> >
> >   Operation: LOOKUP: orted/pmix/pmix_server_pub.c:345
> >
> > Your job may terminate as a result of this problem. You may want to
> > adjust the MCA parameter pmix_server_max_wait and try again. If this
> > occurred during a connect/accept operation, you can adjust that time
> > using the pmix_base_exchange_timeout parameter.
> > --------------------------------------------------------------------------
> > [node091:19470] *** An error occurred in MPI_Comm_spawn
> > [node091:19470] *** reported by process [1614086145,0]
> > [node091:19470] *** on communicator MPI_COMM_WORLD
> > [node091:19470] *** MPI_ERR_UNKNOWN: unknown error
> > [node091:19470] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > [node091:19470] ***    and potentially your MPI job)
> >
> > I've tried increasing both pmix_server_max_wait and pmix_base_exchange_timeout
> > as suggested in the error message, but the result is unchanged (it just takes
> > longer to time out).
> >
> > Once again, if I remove "--map-by node" it runs successfully.
> >
> > -Andrew
> >
> > On Sunday, September 16, 2018 7:03:15 AM PDT Ralph H Castain wrote:
> >> I see you are using "preconnect_all" - that is the source of the trouble. I
> >> don't believe we have tested that option in years and the code is almost
> >> certainly dead. I'd suggest removing that option and things should work.
> >>
> >>> On Sep 15, 2018, at 1:46 PM, Andrew Benson <abenso...@gmail.com> wrote:
> >>>
> >>> I'm running into problems trying to spawn MPI processes across multiple
> >>> nodes on a cluster using recent versions of OpenMPI. Specifically, using
> >>> the attached Fortran code, compiled using OpenMPI 3.1.2 with:
> >>>
> >>>   mpif90 test.F90 -o test.exe
> >>>
> >>> and run via a PBS scheduler using the attached test1.pbs, it fails as can
> >>> be seen in the attached testFAIL.err file.
> >>>
> >>> If I do the same but using OpenMPI v1.10.3 then it works successfully,
> >>> giving me the output in the attached testSUCCESS.err file.
> >>>
> >>> From testing a few different versions of OpenMPI it seems that the behavior
> >>> changed between v1.10.7 and v2.0.4.
> >>>
> >>> Is there some change in options needed to make this work with newer
> >>> OpenMPIs?
> >>>
> >>> Output from ompi_info --all is attached. config.log can be found here:
> >>>
> >>> http://users.obs.carnegiescience.edu/abenson/config.log.bz2
> >>>
> >>> Thanks for any help you can offer!
> >>>
> >>> -Andrew
> >>>
> >>> <ompi_info.log.bz2> <test.F90> <test1.pbs> <testFAIL.err.bz2> <testSUCCESS.err.bz2>
> >
> > --
> >
> > * Andrew Benson: http://users.obs.carnegiescience.edu/abenson/contact.html
> >
> > * Galacticus: https://bitbucket.org/abensonca/galacticus
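For reference, the two MCA parameters named in the quoted timeout message can be raised either on the mpirun command line or through the environment; the values below are arbitrary illustrations, not recommended settings, and as noted above raising them only lengthens the time to failure in this case:

   mpirun -mca pmix_server_max_wait 600 -mca pmix_base_exchange_timeout 600 --map-by node ./test.exe

   # or equivalently, e.g. in the PBS script:
   export OMPI_MCA_pmix_server_max_wait=600
   export OMPI_MCA_pmix_base_exchange_timeout=600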