Sorry for the delay - this should be fixed by:
https://github.com/open-mpi/ompi/pull/5854

> On Sep 19, 2018, at 8:00 AM, Andrew Benson <abenso...@gmail.com> wrote:
> 
> On further investigation, removing the "preconnect_all" option does at least
> change the problem. Without "preconnect_all" I no longer see:
> 
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
> 
>  Process 1 ([[32179,2],15]) is on host: node092
>  Process 2 ([[32179,2],0]) is on host: unknown!
>  BTLs attempted: self tcp vader
> 
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> 
> 
> Instead it hangs for several minutes and finally aborts with:
> 
> --------------------------------------------------------------------------
> A request has timed out and will therefore fail:
> 
>  Operation:  LOOKUP: orted/pmix/pmix_server_pub.c:345
> 
> Your job may terminate as a result of this problem. You may want to
> adjust the MCA parameter pmix_server_max_wait and try again. If this
> occurred during a connect/accept operation, you can adjust that time
> using the pmix_base_exchange_timeout parameter.
> --------------------------------------------------------------------------
> [node091:19470] *** An error occurred in MPI_Comm_spawn
> [node091:19470] *** reported by process [1614086145,0]
> [node091:19470] *** on communicator MPI_COMM_WORLD
> [node091:19470] *** MPI_ERR_UNKNOWN: unknown error
> [node091:19470] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [node091:19470] ***    and potentially your MPI job)
> 
> I've tried increasing both pmix_server_max_wait and pmix_base_exchange_timeout
> as suggested in the error message, but the result is unchanged (it just
> takes longer to time out).
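> 
> For reference, I was raising them on the mpirun command line along these
> lines (the parameter names come straight from the error message above; the
> values here are just examples, not anything I know to be meaningful):
> 
>   mpirun --mca pmix_server_max_wait 600 \
>          --mca pmix_base_exchange_timeout 600 ...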
> 
> Once again, if I remove "--map-by node" it runs successfully.
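> 
> That is, with the mpirun line from my test1.pbs (flags abbreviated here;
> the full command is in the attached script):
> 
>   mpirun --map-by node ./test.exe   <- hangs, then aborts as above
>   mpirun ./test.exe                 <- works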
> 
> -Andrew
> 
> 
> 
> On Sunday, September 16, 2018 7:03:15 AM PDT Ralph H Castain wrote:
>> I see you are using “preconnect_all” - that is the source of the trouble. I
>> don’t believe we have tested that option in years and the code is almost
>> certainly dead. I’d suggest removing that option and things should work.
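>> 
>> (If it is being turned on explicitly - e.g. "--mca mpi_preconnect_all 1"
>> on the mpirun line, or OMPI_MCA_mpi_preconnect_all set in the environment
>> - dropping that setting should be all that's needed.)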
>>> On Sep 15, 2018, at 1:46 PM, Andrew Benson <abenso...@gmail.com> wrote:
>>> 
>>> I'm running into problems trying to spawn MPI processes across multiple
>>> nodes on a cluster using recent versions of OpenMPI. Specifically, using
>>> the attached Fortran code, compiled using OpenMPI 3.1.2 with:
>>> 
>>> mpif90 test.F90 -o test.exe
>>> 
>>> and run via a PBS scheduler using the attached test1.pbs, it fails as can
>>> be seen in the attached testFAIL.err file.
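>>> 
>>> For anyone reading without the attachments: the test is essentially a
>>> minimal MPI_Comm_spawn call. A stripped-down sketch of the pattern (not
>>> the attached file itself; the executable name and spawn count are just
>>> placeholders) is:
>>> 
>>>   program spawn_test
>>>     use mpi
>>>     implicit none
>>>     integer :: ierr, parent, intercomm
>>> 
>>>     call MPI_Init(ierr)
>>>     ! Spawned children see a non-null parent communicator and skip the spawn.
>>>     call MPI_Comm_get_parent(parent, ierr)
>>>     if (parent == MPI_COMM_NULL) then
>>>       ! Parent: spawn 4 copies of this same executable.
>>>       call MPI_Comm_spawn('./test.exe', MPI_ARGV_NULL, 4, MPI_INFO_NULL, &
>>>                           0, MPI_COMM_WORLD, intercomm, &
>>>                           MPI_ERRCODES_IGNORE, ierr)
>>>     end if
>>>     call MPI_Finalize(ierr)
>>>   end program spawn_test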
>>> 
>>> If I do the same but using OpenMPI v1.10.3 then it works successfully,
>>> giving me the output in the attached testSUCCESS.err file.
>>> 
>>> From testing a few different versions of OpenMPI it seems that the
>>> behavior changed between v1.10.7 and v2.0.4.
>>> 
>>> Is there some change in options needed to make this work with newer
>>> OpenMPIs?
>>> 
>>> Output from ompi_info --all is attached. config.log can be found here:
>>> 
>>> http://users.obs.carnegiescience.edu/abenson/config.log.bz2
>>> 
>>> Thanks for any help you can offer!
>>> 
>>> -Andrew
>>> 
>>> <ompi_info.log.bz2> <test.F90> <test1.pbs> <testFAIL.err.bz2> <testSUCCESS.err.bz2>
> 
> 
> -- 
> 
> * Andrew Benson: http://users.obs.carnegiescience.edu/abenson/contact.html
> 
> * Galacticus: https://bitbucket.org/abensonca/galacticus
> 

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
