What is even stranger is that the error occurs when attempting to launch a daemon! Does your program do a series of comm_spawns?
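A rough way to check for the descriptor leak suggested in the quoted message below is to log how many descriptors the process holds once per iteration and see whether the number keeps growing. This is only a sketch and assumes a Linux node with /proc mounted; the function names are made up for illustration. "Error 23" in the ssh output is ENFILE on Linux, the system-wide limit rather than the per-process one, so the system-wide count in /proc/sys/fs/file-nr is worth watching too.

```python
import os


def count_open_fds(pid="self"):
    """Count open file descriptors of a process (assumes Linux /proc).

    The count includes the descriptor briefly used to read the directory,
    so compare values across calls rather than reading them as absolute.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


def system_file_handles():
    """Return (allocated, maximum) system-wide file handles (assumes Linux).

    /proc/sys/fs/file-nr holds three numbers: allocated handles,
    allocated-but-unused handles, and the system-wide maximum.
    """
    with open("/proc/sys/fs/file-nr") as f:
        allocated, _unused, maximum = f.read().split()
    return int(allocated), int(maximum)


def report(iteration):
    """Print one line per iteration; a steadily rising count suggests a leak."""
    allocated, maximum = system_file_handles()
    print(f"iter {iteration}: process fds={count_open_fds()}, "
          f"system handles={allocated}/{maximum}")
```

Calling report(i) at the top of each iteration (or running count_open_fds with the pid of mpirun from another shell) should show whether the count climbs toward the limit before iteration 32.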
Sent from my iPad

On Jan 10, 2013, at 7:28 AM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote:

> That's a weird one -- it looks like having too many open files on your system
> is causing a cascading set of failures.
>
> Are you saying that your program runs for a while and then on iteration 32,
> it fails with errors like this?  If so, I'd look for a file descriptor leak
> in your program.
>
>
> On Jan 4, 2013, at 12:48 PM, Mariana Vargas Magana <mmaria...@yahoo.com.mx>
> wrote:
>
>> Hello open MPI users:
>>
>> I was just running a program that usually works well in the cluster and
>> suddenly in the 32nd iteration I get this strange set of errors associated
>> with. I would appreciate it if someone could give me some hint of the
>> problem and how to solve it.
>>
>> Thanks!
>>
>> Mariana
>>
>>
>> /usr/bin/ssh: error while loading shared libraries: libcrypt.so.1: cannot
>> open shared object file: Error 23
>> /usr/bin/ssh: error while loading shared libraries: libutil.so.1: cannot
>> open shared object file: Error 23
>> /usr/bin/ssh: error while loading shared libraries: libfipscheck.so.1:
>> cannot open shared object file: Error 23
>> /usr/bin/ssh: error while loading shared libraries: libkrb5.so.3: cannot
>> open shared object file: Error 23
>> --------------------------------------------------------------------------
>> A daemon (pid 1486) died unexpectedly with status 127 while attempting
>> to launch so we are aborting.
>>
>> There may be more information reported by the environment (see above).
>>
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> Sorry!  You were supposed to get help about:
>>     no-hostfile
>> But I couldn't open the help file:
>>     /home/mvargas/openmpi/share/openmpi/help-hostfile.txt: Too many open files
>> in system.  Sorry!
>> --------------------------------------------------------------------------
>> [ferrari:01490] [[65228,0],0] ORTE_ERROR_LOG: Not found in file
>> base/ras_base_allocate.c at line 200
>> [ferrari:01490] [[65228,0],0] ORTE_ERROR_LOG: Not found in file
>> base/plm_base_launch_support.c at line 99
>> [ferrari:01490] [[65228,0],0] ORTE_ERROR_LOG: Not found in file
>> plm_rsh_module.c at line 1167
>> --------------------------------------------------------------------------
>> Sorry!  You were supposed to get help about:
>>     no-hostfile
>> But I couldn't open the help file:
>>     /home/mvargas/openmpi/share/openmpi/help-hostfile.txt: Too many open files
>> in system.  Sorry!
>> --------------------------------------------------------------------------
>> [ferrari:01491] [[65229,0],0] ORTE_ERROR_LOG: Not found in file
>> base/ras_base_allocate.c at line 200
>> [ferrari:01491] [[65229,0],0] ORTE_ERROR_LOG: Not found in file
>> base/plm_base_launch_support.c at line 99
>> [ferrari:01491] [[65229,0],0] ORTE_ERROR_LOG: Not found in file
>> plm_rsh_module.c at line 1167
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/