Dear Ralph,

On May 27, 2014, at 6:31 PM, Ralph Castain <r...@open-mpi.org> wrote:
> So out of curiosity - how was this job launched? Via mpirun or directly using 
> srun?


The job was submitted using mpirun; however, Open MPI is compiled with SLURM support (and I am starting to believe this might not be ideal after all!). A rough sketch of the batch script follows the trace below. Here is a partial job trace dumped by the process when it died:

--------------------------------------------------------------------------
mpirun noticed that process rank 8190 with PID 29319 on node sand-8-39 exited 
on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
diag_OMPI-INTEL.x  0000000000537349  Unknown               Unknown  Unknown
diag_OMPI-INTEL.x  0000000000535C1E  Unknown               Unknown  Unknown
diag_OMPI-INTEL.x  000000000050CF52  Unknown               Unknown  Unknown
diag_OMPI-INTEL.x  00000000004F0BB3  Unknown               Unknown  Unknown
diag_OMPI-INTEL.x  00000000004BEB99  Unknown               Unknown  Unknown
libpthread.so.0    00007FE5B5BE5710  Unknown               Unknown  Unknown
libmlx4-rdmav2.so  00007FE5A8C0A867  Unknown               Unknown  Unknown
mca_btl_openib.so  00007FE5ADA36644  Unknown               Unknown  Unknown
libopen-pal.so.6   00007FE5B288262A  Unknown               Unknown  Unknown
mca_pml_ob1.so     00007FE5AC344FAF  Unknown               Unknown  Unknown
libmpi.so.1        00007FE5B5064E7D  Unknown               Unknown  Unknown
libmpi_mpifh.so.2  00007FE5B531919B  Unknown               Unknown  Unknown
libelpa.so.0       00007FE5B82EC0CE  Unknown               Unknown  Unknown
libelpa.so.0       00007FE5B82EBE36  Unknown               Unknown  Unknown
libelpa.so.0       00007FE5B82EBDFD  Unknown               Unknown  Unknown
libelpa.so.0       00007FE5B82EC2CD  Unknown               Unknown  Unknown
libelpa.so.0       00007FE5B82EB798  Unknown               Unknown  Unknown
libelpa.so.0       00007FE5B82E571A  Unknown               Unknown  Unknown
diag_OMPI-INTEL.x  00000000004101C2  MAIN__                    562  dirac_exomol_eigen.f90
diag_OMPI-INTEL.x  000000000040A1A6  Unknown               Unknown  Unknown
libc.so.6          00007FE5B4A89D1D  Unknown               Unknown  Unknown
diag_OMPI-INTEL.x  000000000040A099  Unknown               Unknown  Unknown

(plus many more trace lines like this)
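
For reference, the batch script is roughly the following (a sketch, not the exact production script; module setup and time limit omitted):

#!/bin/bash
#SBATCH --nodes=512
#SBATCH --ntasks-per-node=16

# Open MPI is built with SLURM support, so mpirun picks up the
# allocation on its own -- no hostfile or machine list is passed.
mpirun ./diag_OMPI-INTEL.x

# (the alternative would be a direct launch with "srun ./diag_OMPI-INTEL.x",
#  which I have not used for this run)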

No more information than this, unfortunately, because not every library was built with debug flags. The computation is concentrated in ScaLAPACK and ELPA, which I recompiled myself. The run used 8192 MPI processes, and the memory allocated per MPI process was below 1 GByte. My compute nodes have 64 GByte of RAM and two eight-core Intel Sandy Bridge sockets each. Since 512 nodes is 80% of the cluster I have available for this test, I cannot easily reschedule a repetition of the test.
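
As a sanity check on the "below 1 GByte per MPI process" figure, a back-of-the-envelope estimate looks like this (N is a hypothetical matrix dimension, not the actual one of this run):

# rough per-rank memory for the block-cyclic distributed matrices,
# ignoring ELPA/ScaLAPACK workspace
N=500000   # global matrix dimension (hypothetical placeholder)
P=8192     # number of MPI processes
# two N x N double-precision matrices (the input plus the eigenvectors),
# 8 bytes per element, split evenly across the ranks
echo "$N $P" | awk '{printf "approx. %.2f GByte per MPI process\n", 2*8*$1*$1/$2/(1024^3)}'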

I wonder whether the message that might be related to libevent could, in principle, cause this segfault. I am working to understand the cause on my side, but so far a reduced problem size on fewer nodes has never failed.

Any help is much appreciated!

Regards,
F

--
Mr. Filippo SPIGA, M.Sc.
http://www.linkedin.com/in/filippospiga ~ skype: filippo.spiga

«Nobody will drive us out of Cantor's paradise.» ~ David Hilbert

*****
Disclaimer: "Please note this message and any attachments are CONFIDENTIAL and 
may be privileged or otherwise protected from disclosure. The contents are not 
to be disclosed to anyone other than the addressee. Unauthorized recipients are 
requested to preserve this confidentiality and to advise the sender immediately 
of any error in transmission."

