Dear all,

I am using an Open MPI v1.8.2 nightly snapshot compiled with SLURM support
(SLURM version 14.03pre5). The two messages below appeared during a job with
2048 MPI processes that died after 24 hours!

[warn] Epoll ADD(1) on fd 0 failed.  Old events were 0; read change was 1 
(add); write change was 0 (none): Operation not permitted

[warn] Epoll ADD(4) on fd 2 failed.  Old events were 0; read change was 0 
(none); write change was 1 (add): Operation not permitted


The first one appeared immediately at the beginning and had no effect. The
application started to compute and successfully called a big parallel
eigensolver. The second message appeared after 18-19 hours of non-stop
computation, and the application crashed without showing any other error
message! I was regularly checking that the MPI processes were not stuck; after
this message, all the processes were aborted without dumping anything to
stdout/stderr. It is quite weird.

I believe these messages come from Open MPI (but correct me if I am wrong!). I
am going to look at the application and the various libraries to find out if
something is wrong. In the meantime, it would be a great help if anyone could
clarify the exact meaning of these warning messages.
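
In case it helps to narrow this down, here is a minimal sketch (not taken from
Open MPI or from my application) of one way the kernel can answer an epoll
registration with "Operation not permitted". According to the epoll_ctl(2) man
page, EPERM is returned when the target file descriptor does not support epoll,
for example when it refers to a regular file. Whether this is what happened to
fd 0 and fd 2 in my job is only an assumption. The helper name try_epoll_add is
hypothetical, and the sketch is Linux-only.

```python
import errno
import os
import select
import tempfile


def try_epoll_add(fd, events):
    """Register fd with a fresh epoll instance; return 0 or the errno raised."""
    ep = select.epoll()
    try:
        ep.register(fd, events)
        return 0
    except OSError as exc:
        return exc.errno
    finally:
        ep.close()


# A regular file does not support epoll, so the registration is refused.
with tempfile.TemporaryFile() as regular_file:
    file_result = try_epoll_add(regular_file.fileno(), select.EPOLLIN)

# A pipe does support epoll, so the same call is accepted.
read_end, write_end = os.pipe()
pipe_result = try_epoll_add(read_end, select.EPOLLIN)
os.close(read_end)
os.close(write_end)

print("regular file:", errno.errorcode.get(file_result, "OK"))
print("pipe:", errno.errorcode.get(pipe_result, "OK"))
```

If this is the mechanism, it would suggest looking at what stdin and stderr of
the failing process were attached to when the warnings were printed.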

Many thanks in advance.

Regards,
Filippo

--
Mr. Filippo SPIGA, M.Sc.
http://www.linkedin.com/in/filippospiga ~ skype: filippo.spiga

«Nobody will drive us out of Cantor's paradise.» ~ David Hilbert


