Ok, thanks - that's good to know.

-Andrew


--

* Andrew Benson: http://users.obs.carnegiescience.edu/abenson/contact.html

* Galacticus: http://sites.google.com/site/galacticusmodel

On Sat, Oct 6, 2018, 10:02 AM Ralph H Castain <r...@open-mpi.org> wrote:

> Just FYI: on master (and perhaps 4.0), child jobs do not inherit their
> parent's mapping policy by default. You have to add “-mca
> rmaps_base_inherit 1” to your mpirun cmd line.
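>
> For example, roughly (a sketch only; keep whatever other options you already
> use on your cmd line):
>
>   mpirun -mca rmaps_base_inherit 1 --map-by node ... ./test.exe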
>
>
> On Oct 6, 2018, at 10:00 AM, Andrew Benson <aben...@carnegiescience.edu>
> wrote:
>
> Thanks, I'll try this right away.
>
> Thanks,
> Andrew
>
>
> --
>
> * Andrew Benson: http://users.obs.carnegiescience.edu/abenson/contact.html
>
> * Galacticus: http://sites.google.com/site/galacticusmodel
>
> On Sat, Oct 6, 2018, 9:02 AM Ralph H Castain <r...@open-mpi.org> wrote:
>
>> Sorry for delay - this should be fixed by
>> https://github.com/open-mpi/ompi/pull/5854
>>
>> > On Sep 19, 2018, at 8:00 AM, Andrew Benson <abenso...@gmail.com> wrote:
>> >
>> > On further investigation removing the "preconnect_all" option does change
>> > the problem at least. Without "preconnect_all" I no longer see:
>> >
>> >
>> > --------------------------------------------------------------------------
>> > At least one pair of MPI processes are unable to reach each other for
>> > MPI communications.  This means that no Open MPI device has indicated
>> > that it can be used to communicate between these processes.  This is
>> > an error; Open MPI requires that all MPI processes be able to reach
>> > each other.  This error can sometimes be the result of forgetting to
>> > specify the "self" BTL.
>> >
>> >  Process 1 ([[32179,2],15]) is on host: node092
>> >  Process 2 ([[32179,2],0]) is on host: unknown!
>> >  BTLs attempted: self tcp vader
>> >
>> > Your MPI job is now going to abort; sorry.
>> >
>> > --------------------------------------------------------------------------
>> >
>> >
>> > Instead it hangs for several minutes and finally aborts with:
>> >
>> >
>> > --------------------------------------------------------------------------
>> > A request has timed out and will therefore fail:
>> >
>> >  Operation:  LOOKUP: orted/pmix/pmix_server_pub.c:345
>> >
>> > Your job may terminate as a result of this problem. You may want to
>> > adjust the MCA parameter pmix_server_max_wait and try again. If this
>> > occurred during a connect/accept operation, you can adjust that time
>> > using the pmix_base_exchange_timeout parameter.
>> >
>> > --------------------------------------------------------------------------
>> > [node091:19470] *** An error occurred in MPI_Comm_spawn
>> > [node091:19470] *** reported by process [1614086145,0]
>> > [node091:19470] *** on communicator MPI_COMM_WORLD
>> > [node091:19470] *** MPI_ERR_UNKNOWN: unknown error
>> > [node091:19470] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
>> > will now abort,
>> > [node091:19470] ***    and potentially your MPI job)
>> >
>> > I've tried increasing both pmix_server_max_wait and pmix_base_exchange_timeout
>> > as suggested in the error message, but the result is unchanged (it just
>> > takes longer to time out).
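>> >
>> > (That is, setting them along these lines on the mpirun command line:
>> >
>> >   mpirun --mca pmix_server_max_wait 600 --mca pmix_base_exchange_timeout 600 ... ./test.exe
>> >
>> > where 600 is just an example value; the exact numbers I used don't matter
>> > since the behaviour was the same.)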
>> >
>> > Once again, if I remove "--map-by node" it runs successfully.
>> >
>> > -Andrew
>> >
>> >
>> >
>> > On Sunday, September 16, 2018 7:03:15 AM PDT Ralph H Castain wrote:
>> >> I see you are using “preconnect_all” - that is the source of the trouble.
>> >> I don’t believe we have tested that option in years and the code is almost
>> >> certainly dead. I’d suggest removing that option and things should work.
>> >>> On Sep 15, 2018, at 1:46 PM, Andrew Benson <abenso...@gmail.com>
>> wrote:
>> >>>
>> >>> I'm running into problems trying to spawn MPI processes across multiple
>> >>> nodes on a cluster using recent versions of OpenMPI. Specifically, using
>> >>> the attached Fortran code, compiled using OpenMPI 3.1.2 with:
>> >>>
>> >>> mpif90 test.F90 -o test.exe
>> >>>
>> >>> and run via a PBS scheduler using the attached test1.pbs, it fails, as
>> >>> can be seen in the attached testFAIL.err file.
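>> >>>
>> >>> (For anyone reading without the attachments: the test is essentially a
>> >>> minimal MPI_Comm_spawn exercise along the lines of the sketch below;
>> >>> this is only an illustration, not the actual attached test.F90.)
>> >>>
>> >>>   program spawn_sketch
>> >>>     use mpi
>> >>>     implicit none
>> >>>     integer, parameter :: nchildren = 4   ! arbitrary count for this sketch
>> >>>     integer :: ierr, parentcomm, intercomm
>> >>>     integer :: errcodes(nchildren)
>> >>>
>> >>>     call MPI_Init(ierr)
>> >>>     call MPI_Comm_get_parent(parentcomm, ierr)
>> >>>     if (parentcomm == MPI_COMM_NULL) then
>> >>>       ! Parent: collectively spawn copies of this same executable.
>> >>>       call MPI_Comm_spawn('./test.exe', MPI_ARGV_NULL, nchildren, &
>> >>>            MPI_INFO_NULL, 0, MPI_COMM_WORLD, intercomm, errcodes, ierr)
>> >>>       call MPI_Comm_disconnect(intercomm, ierr)
>> >>>     else
>> >>>       ! Child: connect back to the parent, then disconnect.
>> >>>       call MPI_Comm_disconnect(parentcomm, ierr)
>> >>>     end if
>> >>>     call MPI_Finalize(ierr)
>> >>>   end program spawn_sketch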
>> >>>
>> >>> If I do the same but using OpenMPI v1.10.3 then it works successfully,
>> >>> giving me the output in the attached testSUCCESS.err file.
>> >>>
>> >>> From testing a few different versions of OpenMPI it seems that the
>> >>> behavior changed between v1.10.7 and v2.0.4.
>> >>>
>> >>> Is there some change in options needed to make this work with newer
>> >>> OpenMPIs?
>> >>>
>> >>> Output from ompi_info --all is attached. config.log can be found here:
>> >>>
>> >>> http://users.obs.carnegiescience.edu/abenson/config.log.bz2
>> >>>
>> >>> Thanks for any help you can offer!
>> >>>
>> >>>
>> >>> -Andrew
>> >>>
>> >>> [Attachments: ompi_info.log.bz2, test.F90, test1.pbs, testFAIL.err.bz2,
>> >>> testSUCCESS.err.bz2]
>> >>
>> >
>> >
>> > --
>> >
>> > * Andrew Benson: http://users.obs.carnegiescience.edu/abenson/contact.html
>> >
>> > * Galacticus: https://bitbucket.org/abensonca/galacticus
>> >
>>
>>
>
