When next you run, I would just add "-mca plm rsh" to your cmd line. You don't 
need to rebuild OMPI to avoid issues with the slurm integration. This will 
still allow OMPI to read the slurm allocation so it knows which nodes to use, 
but won't use slurm to launch the job.
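For example, something like this (just a sketch; the process count and
executable name are taken from the trace you posted below, and the rest of
your usual options would stay the same):

    mpirun -mca plm rsh -np 8192 ./diag_OMPI-INTEL.x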

If it is a slurm PMI issue, this should resolve it.
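You can also check whether your build actually pulled in the slurm PMI
components (assuming the component names contain "pmi"; adjust the pattern
if needed):

    ompi_info | grep -i pmi

If nothing shows up, the PMI support was most likely not built in at all.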


On May 28, 2014, at 12:03 AM, Filippo Spiga <spiga.fili...@gmail.com> wrote:

> Dear Ralph,
> 
> On May 27, 2014, at 6:31 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> So out of curiosity - how was this job launched? Via mpirun or directly 
>> using srun?
> 
> 
> The job was submitted using mpirun. However, Open MPI is compiled with SLURM 
> support (and I am starting to believe this might not be ideal after all!). 
> I have a partial job trace dumped by the process when it died:
> 
> --------------------------------------------------------------------------
> mpirun noticed that process rank 8190 with PID 29319 on node sand-8-39 exited 
> on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> 
> forrtl: error (78): process killed (SIGTERM)
> Image              PC                Routine            Line        Source
> diag_OMPI-INTEL.x  0000000000537349  Unknown               Unknown  Unknown
> diag_OMPI-INTEL.x  0000000000535C1E  Unknown               Unknown  Unknown
> diag_OMPI-INTEL.x  000000000050CF52  Unknown               Unknown  Unknown
> diag_OMPI-INTEL.x  00000000004F0BB3  Unknown               Unknown  Unknown
> diag_OMPI-INTEL.x  00000000004BEB99  Unknown               Unknown  Unknown
> libpthread.so.0    00007FE5B5BE5710  Unknown               Unknown  Unknown
> libmlx4-rdmav2.so  00007FE5A8C0A867  Unknown               Unknown  Unknown
> mca_btl_openib.so  00007FE5ADA36644  Unknown               Unknown  Unknown
> libopen-pal.so.6   00007FE5B288262A  Unknown               Unknown  Unknown
> mca_pml_ob1.so     00007FE5AC344FAF  Unknown               Unknown  Unknown
> libmpi.so.1        00007FE5B5064E7D  Unknown               Unknown  Unknown
> libmpi_mpifh.so.2  00007FE5B531919B  Unknown               Unknown  Unknown
> libelpa.so.0       00007FE5B82EC0CE  Unknown               Unknown  Unknown
> libelpa.so.0       00007FE5B82EBE36  Unknown               Unknown  Unknown
> libelpa.so.0       00007FE5B82EBDFD  Unknown               Unknown  Unknown
> libelpa.so.0       00007FE5B82EC2CD  Unknown               Unknown  Unknown
> libelpa.so.0       00007FE5B82EB798  Unknown               Unknown  Unknown
> libelpa.so.0       00007FE5B82E571A  Unknown               Unknown  Unknown
> diag_OMPI-INTEL.x  00000000004101C2  MAIN__                    562  dirac_exomol_eigen.f90
> diag_OMPI-INTEL.x  000000000040A1A6  Unknown               Unknown  Unknown
> libc.so.6          00007FE5B4A89D1D  Unknown               Unknown  Unknown
> diag_OMPI-INTEL.x  000000000040A099  Unknown               Unknown  Unknown
> 
> (plus many more trace lines like this)
> 
> No more information than this, unfortunately, because not every library has 
> been built with debug flags. The computation is all concentrated in ScaLAPACK 
> and ELPA, which I recompiled myself. I ran over 8192 MPI processes and the 
> memory allocated per MPI process was below 1 GByte. My compute nodes have 
> 64 GByte of RAM and two eight-core Intel Sandy Bridge processors. Since 512 
> nodes are 80% of the cluster I have available for this test, I cannot easily 
> reschedule a repetition of the test.
> 
> I wonder whether the message that may be related to libevent could in 
> principle cause this segmentation fault. I am working to understand the cause 
> on my side, but so far a reduced problem size using fewer nodes has never 
> failed.
> 
> Any help is much appreciated!
> 
> Regards,
> F
> 
> --
> Mr. Filippo SPIGA, M.Sc.
> http://www.linkedin.com/in/filippospiga ~ skype: filippo.spiga
> 
> «Nobody will drive us out of Cantor's paradise.» ~ David Hilbert
> 
> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
