What is even stranger is that the error occurs when attempting to launch a 
daemon! Does your program do a series of comm_spawns?
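
A note on the quoted output below: on Linux, "Error 23" and "Too many open files in system" both correspond to ENFILE, the kernel-wide file table limit, rather than EMFILE (24), the per-process limit. A minimal sketch, assuming a Linux node with /proc mounted, of how one might watch both counts between iterations; the helper names are illustrative and not part of Open MPI:

```python
import errno
import os


def is_system_wide_limit(err_number):
    """True for ENFILE (system-wide table full); False for EMFILE and others."""
    return err_number == errno.ENFILE


def parse_file_nr(text):
    """Parse the three fields of /proc/sys/fs/file-nr: allocated, unused, max."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return {"allocated": allocated, "unused": unused, "max": maximum}


def system_file_handles(path="/proc/sys/fs/file-nr"):
    """System-wide file handle usage as reported by the Linux kernel."""
    with open(path) as f:
        return parse_file_nr(f.read())


def process_open_fds(pid="self"):
    """Number of descriptors held by a process (requires read access to /proc/<pid>/fd)."""
    return len(os.listdir(f"/proc/{pid}/fd"))
```

Logging system_file_handles() and process_open_fds(pid) for the mpirun process once per iteration should show whether the count climbs steadily, which would point to a leak, or jumps only at launch time.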

Sent from my iPad

On Jan 10, 2013, at 7:28 AM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> 
wrote:

> That's a weird one -- it looks like having too many open files on your system 
> is causing a cascading set of failures.  
> 
> Are you saying that your program runs for a while and then on iteration 32, 
> it fails with errors like this?  If so, I'd look for a file descriptor leak 
> in your program.
> 
> 
> On Jan 4, 2013, at 12:48 PM, Mariana Vargas Magana <mmaria...@yahoo.com.mx> 
> wrote:
> 
>> Hello Open MPI users:
>> 
>> I was just running a program that usually works well on the cluster, and 
>> suddenly, in the 32nd iteration, I got this strange set of errors. I would 
>> appreciate it if someone could give me a hint about the problem and how to 
>> solve it.
>> 
>> Thanks!
>> 
>> Mariana
>> 
>> 
>> /usr/bin/ssh: error while loading shared libraries: libcrypt.so.1: cannot 
>> open shared object file: Error 23
>> /usr/bin/ssh: error while loading shared libraries: libutil.so.1: cannot 
>> open shared object file: Error 23
>> /usr/bin/ssh: error while loading shared libraries: libfipscheck.so.1: 
>> cannot open shared object file: Error 23
>> /usr/bin/ssh: error while loading shared libraries: libkrb5.so.3: cannot 
>> open shared object file: Error 23
>> --------------------------------------------------------------------------
>> A daemon (pid 1486) died unexpectedly with status 127 while attempting
>> to launch so we are aborting.
>> 
>> There may be more information reported by the environment (see above).
>> 
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> Sorry!  You were supposed to get help about:
>>   no-hostfile
>> But I couldn't open the help file:
>>   /home/mvargas/openmpi/share/openmpi/help-hostfile.txt: Too many open files 
>> in system.  Sorry!
>> --------------------------------------------------------------------------
>> [ferrari:01490] [[65228,0],0] ORTE_ERROR_LOG: Not found in file 
>> base/ras_base_allocate.c at line 200
>> [ferrari:01490] [[65228,0],0] ORTE_ERROR_LOG: Not found in file 
>> base/plm_base_launch_support.c at line 99
>> [ferrari:01490] [[65228,0],0] ORTE_ERROR_LOG: Not found in file 
>> plm_rsh_module.c at line 1167
>> --------------------------------------------------------------------------
>> Sorry!  You were supposed to get help about:
>>   no-hostfile
>> But I couldn't open the help file:
>>   /home/mvargas/openmpi/share/openmpi/help-hostfile.txt: Too many open files 
>> in system.  Sorry!
>> --------------------------------------------------------------------------
>> [ferrari:01491] [[65229,0],0] ORTE_ERROR_LOG: Not found in file 
>> base/ras_base_allocate.c at line 200
>> [ferrari:01491] [[65229,0],0] ORTE_ERROR_LOG: Not found in file 
>> base/plm_base_launch_support.c at line 99
>> [ferrari:01491] [[65229,0],0] ORTE_ERROR_LOG: Not found in file 
>> plm_rsh_module.c at line 1167
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
