That's a weird one -- it looks like having too many open files on your system 
is causing a cascading set of failures.  

Are you saying that your program runs for a while and then, on iteration 32, 
it fails with errors like this?  If so, I'd look for a file descriptor leak in 
your program.
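
One way to look for such a leak is to watch the number of open file 
descriptors grow from one iteration to the next.  The sketch below is only an 
illustration and assumes Linux with /proc mounted; the helper names 
count_open_fds and system_file_usage are made up for this example and are not 
part of Open MPI.  On Linux, "Error 23" is ENFILE ("Too many open files in 
system"), i.e. the system-wide file table, which is why the second helper reads 
/proc/sys/fs/file-nr.

```python
import os


def count_open_fds(pid="self"):
    """Count the open file descriptors of a process (Linux only).

    Hypothetical helper: it lists /proc/<pid>/fd, so the directory handle
    used for the listing itself is included in every count.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


def system_file_usage():
    """Return (allocated, maximum) file handles for the whole system.

    Reads /proc/sys/fs/file-nr, whose three fields are: allocated handles,
    allocated-but-unused handles, and the system-wide maximum.
    """
    with open("/proc/sys/fs/file-nr") as f:
        allocated, _unused, maximum = f.read().split()
    return int(allocated), int(maximum)
```

Calling count_open_fds() at the end of each iteration (or 
count_open_fds(pid) from another shell for a running process) should give a 
roughly constant number; a count that climbs steadily suggests something is 
being opened and never closed.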


On Jan 4, 2013, at 12:48 PM, Mariana Vargas Magana <mmaria...@yahoo.com.mx> 
wrote:

> Hello Open MPI users:
> 
> I was just running a program that usually works well on the cluster, and 
> suddenly, in the 32nd iteration, I got this strange set of errors. I would 
> appreciate it if someone could give me a hint about the problem and how to 
> solve it.
> 
> Thanks!
> 
> Mariana
> 
> 
> /usr/bin/ssh: error while loading shared libraries: libcrypt.so.1: cannot 
> open shared object file: Error 23
> /usr/bin/ssh: error while loading shared libraries: libutil.so.1: cannot open 
> shared object file: Error 23
> /usr/bin/ssh: error while loading shared libraries: libfipscheck.so.1: cannot 
> open shared object file: Error 23
> /usr/bin/ssh: error while loading shared libraries: libkrb5.so.3: cannot open 
> shared object file: Error 23
> --------------------------------------------------------------------------
> A daemon (pid 1486) died unexpectedly with status 127 while attempting
> to launch so we are aborting.
> 
> There may be more information reported by the environment (see above).
> 
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> Sorry!  You were supposed to get help about:
>    no-hostfile
> But I couldn't open the help file:
>    /home/mvargas/openmpi/share/openmpi/help-hostfile.txt: Too many open files 
> in system.  Sorry!
> --------------------------------------------------------------------------
> [ferrari:01490] [[65228,0],0] ORTE_ERROR_LOG: Not found in file 
> base/ras_base_allocate.c at line 200
> [ferrari:01490] [[65228,0],0] ORTE_ERROR_LOG: Not found in file 
> base/plm_base_launch_support.c at line 99
> [ferrari:01490] [[65228,0],0] ORTE_ERROR_LOG: Not found in file 
> plm_rsh_module.c at line 1167
> --------------------------------------------------------------------------
> Sorry!  You were supposed to get help about:
>    no-hostfile
> But I couldn't open the help file:
>    /home/mvargas/openmpi/share/openmpi/help-hostfile.txt: Too many open files 
> in system.  Sorry!
> --------------------------------------------------------------------------
> [ferrari:01491] [[65229,0],0] ORTE_ERROR_LOG: Not found in file 
> base/ras_base_allocate.c at line 200
> [ferrari:01491] [[65229,0],0] ORTE_ERROR_LOG: Not found in file 
> base/plm_base_launch_support.c at line 99
> [ferrari:01491] [[65229,0],0] ORTE_ERROR_LOG: Not found in file 
> plm_rsh_module.c at line 1167
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

