That's a weird one -- it looks like having too many open files on your system is causing a cascading set of failures.
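For what it's worth, "Error 23" on Linux is ENFILE, which matches the "Too many open files in system" text in the output: that is the system-wide file table, not a per-process limit. Here is a rough sketch for checking both numbers. It assumes a Linux /proc filesystem, and the two function names are made up for illustration; nothing here is part of Open MPI.

```python
import os


def count_open_fds(pid="self"):
    """Count descriptors open in a process (assumes Linux /proc).

    The count includes the descriptor that listdir itself uses while
    reading the directory, so compare successive readings rather than
    trusting the absolute value.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


def system_file_handles():
    """Return (allocated, maximum) from /proc/sys/fs/file-nr (Linux only)."""
    with open("/proc/sys/fs/file-nr") as f:
        allocated, _unused, maximum = f.read().split()
    return int(allocated), int(maximum)


if __name__ == "__main__":
    allocated, maximum = system_file_handles()
    print(f"system-wide file handles: {allocated} allocated, {maximum} max")

    before = count_open_fds()
    # Simulate a leak: open files and never close them.
    leaked = [open(os.devnull) for _ in range(5)]
    after = count_open_fds()
    print(f"this process: {before} fds before, {after} after the leak")

    for f in leaked:
        f.close()
```

Printing `count_open_fds()` once per iteration of the real program (or pointing it at the PID of the running job) should show whether the count climbs steadily as the run approaches the failing iteration.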
Are you saying that your program runs for a while and then, on iteration 32, it fails with errors like this? If so, I'd look for a file descriptor leak in your program.

On Jan 4, 2013, at 12:48 PM, Mariana Vargas Magana <mmaria...@yahoo.com.mx> wrote:

> Hello open MPI users:
>
> I was just running a program that usually works well in the cluster and
> suddenly in the 32 iteration I get this strange set of errors associated
> with. I will appreciate if someone could give me some hint of the problem and
> how to solve
>
> Thanks!
>
> Mariana
>
>
> /usr/bin/ssh: error while loading shared libraries: libcrypt.so.1: cannot
> open shared object file: Error 23
> /usr/bin/ssh: error while loading shared libraries: libutil.so.1: cannot open
> shared object file: Error 23
> /usr/bin/ssh: error while loading shared libraries: libfipscheck.so.1: cannot
> open shared object file: Error 23
> /usr/bin/ssh: error while loading shared libraries: libkrb5.so.3: cannot open
> shared object file: Error 23
> --------------------------------------------------------------------------
> A daemon (pid 1486) died unexpectedly with status 127 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> Sorry! You were supposed to get help about:
> no-hostfile
> But I couldn't open the help file:
> /home/mvargas/openmpi/share/openmpi/help-hostfile.txt: Too many open files
> in system. Sorry!
> --------------------------------------------------------------------------
> [ferrari:01490] [[65228,0],0] ORTE_ERROR_LOG: Not found in file
> base/ras_base_allocate.c at line 200
> [ferrari:01490] [[65228,0],0] ORTE_ERROR_LOG: Not found in file
> base/plm_base_launch_support.c at line 99
> [ferrari:01490] [[65228,0],0] ORTE_ERROR_LOG: Not found in file
> plm_rsh_module.c at line 1167
> --------------------------------------------------------------------------
> Sorry! You were supposed to get help about:
> no-hostfile
> But I couldn't open the help file:
> /home/mvargas/openmpi/share/openmpi/help-hostfile.txt: Too many open files
> in system. Sorry!
> --------------------------------------------------------------------------
> [ferrari:01491] [[65229,0],0] ORTE_ERROR_LOG: Not found in file
> base/ras_base_allocate.c at line 200
> [ferrari:01491] [[65229,0],0] ORTE_ERROR_LOG: Not found in file
> base/plm_base_launch_support.c at line 99
> [ferrari:01491] [[65229,0],0] ORTE_ERROR_LOG: Not found in file
> plm_rsh_module.c at line 1167
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/