Just FYI: on master (and perhaps 4.0), child jobs do not inherit their parent's mapping policy by default. You have to add "-mca rmaps_base_inherit 1" to your mpirun command line.
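For example, a launch that keeps the "--map-by node" policy for spawned child jobs might look like the following (the process count and executable name are only placeholders based on the test case quoted below):

   mpirun -np 1 --map-by node -mca rmaps_base_inherit 1 ./test.exe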
> On Oct 6, 2018, at 10:00 AM, Andrew Benson <aben...@carnegiescience.edu> wrote:
>
> Thanks, I'll try this right away.
>
> Thanks,
> Andrew
>
> --
>
> * Andrew Benson: http://users.obs.carnegiescience.edu/abenson/contact.html
>
> * Galacticus: http://sites.google.com/site/galacticusmodel
>
> On Sat, Oct 6, 2018, 9:02 AM Ralph H Castain <r...@open-mpi.org> wrote:
>
> Sorry for the delay - this should be fixed by https://github.com/open-mpi/ompi/pull/5854
>
> > On Sep 19, 2018, at 8:00 AM, Andrew Benson <abenso...@gmail.com> wrote:
> >
> > On further investigation, removing the "preconnect_all" option does at least change the
> > problem. Without "preconnect_all" I no longer see:
> >
> > --------------------------------------------------------------------------
> > At least one pair of MPI processes are unable to reach each other for
> > MPI communications. This means that no Open MPI device has indicated
> > that it can be used to communicate between these processes. This is
> > an error; Open MPI requires that all MPI processes be able to reach
> > each other. This error can sometimes be the result of forgetting to
> > specify the "self" BTL.
> >
> >   Process 1 ([[32179,2],15]) is on host: node092
> >   Process 2 ([[32179,2],0]) is on host: unknown!
> >   BTLs attempted: self tcp vader
> >
> > Your MPI job is now going to abort; sorry.
> > --------------------------------------------------------------------------
> >
> > Instead it hangs for several minutes and finally aborts with:
> >
> > --------------------------------------------------------------------------
> > A request has timed out and will therefore fail:
> >
> >   Operation: LOOKUP: orted/pmix/pmix_server_pub.c:345
> >
> > Your job may terminate as a result of this problem. You may want to
> > adjust the MCA parameter pmix_server_max_wait and try again. If this
> > occurred during a connect/accept operation, you can adjust that time
> > using the pmix_base_exchange_timeout parameter.
> > --------------------------------------------------------------------------
> > [node091:19470] *** An error occurred in MPI_Comm_spawn
> > [node091:19470] *** reported by process [1614086145,0]
> > [node091:19470] *** on communicator MPI_COMM_WORLD
> > [node091:19470] *** MPI_ERR_UNKNOWN: unknown error
> > [node091:19470] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > [node091:19470] ***    and potentially your MPI job)
> >
> > I've tried increasing both pmix_server_max_wait and pmix_base_exchange_timeout
> > as suggested in the error message, but the result is unchanged (it just takes
> > longer to time out).
> >
> > Once again, if I remove "--map-by node" it runs successfully.
> >
> > -Andrew
> >
> > On Sunday, September 16, 2018 7:03:15 AM PDT Ralph H Castain wrote:
> >> I see you are using "preconnect_all" - that is the source of the trouble. I
> >> don't believe we have tested that option in years and the code is almost
> >> certainly dead. I'd suggest removing that option and things should work.
> >>
> >>> On Sep 15, 2018, at 1:46 PM, Andrew Benson <abenso...@gmail.com> wrote:
> >>>
> >>> I'm running into problems trying to spawn MPI processes across multiple
> >>> nodes on a cluster using recent versions of OpenMPI. Specifically, using
> >>> the attached Fortran code, compiled using OpenMPI 3.1.2 with:
> >>>
> >>>   mpif90 test.F90 -o test.exe
> >>>
> >>> and run via a PBS scheduler using the attached test1.pbs, it fails as can
> >>> be seen in the attached testFAIL.err file.
> >>>
> >>> If I do the same but using OpenMPI v1.10.3 then it works successfully,
> >>> giving me the output in the attached testSUCCESS.err file.
> >>>
> >>> From testing a few different versions of OpenMPI it seems that the behavior
> >>> changed between v1.10.7 and v2.0.4.
> >>>
> >>> Is there some change in options needed to make this work with newer
> >>> OpenMPIs?
> >>>
> >>> Output from ompi_info --all is attached. config.log can be found here:
> >>>
> >>> http://users.obs.carnegiescience.edu/abenson/config.log.bz2
> >>>
> >>> Thanks for any help you can offer!
> >>>
> >>> -Andrew
> >>>
> >>> <ompi_info.log.bz2> <test.F90> <test1.pbs> <testFAIL.err.bz2> <testSUCCESS.err.bz2>
> >
> > --
> >
> > * Andrew Benson: http://users.obs.carnegiescience.edu/abenson/contact.html
> >
> > * Galacticus: https://bitbucket.org/abensonca/galacticus
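For reference, the two MCA parameters named in the quoted timeout message can be raised either on the mpirun command line or through the environment; the values below are arbitrary illustrations, not recommended settings, and as noted above raising them only lengthens the time to failure in this case:

   mpirun -mca pmix_server_max_wait 600 -mca pmix_base_exchange_timeout 600 --map-by node ./test.exe

   # or equivalently, e.g. in the PBS script:
   export OMPI_MCA_pmix_server_max_wait=600
   export OMPI_MCA_pmix_base_exchange_timeout=600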