One of our users/friends has also sent us some example code to do this
internally - I hope to find the time to include that capability in the code
base shortly. I'll advise when we do.
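
For the curious: one way to do this internally is to raise RLIMIT_NOFILE
with setrlimit(2) early in startup, before the TCP OOB starts accepting
connections. A minimal sketch of that approach (purely illustrative - this
is not the contributed code, and raise_fd_limit is just a name made up for
this example):

  #include <stdio.h>
  #include <sys/resource.h>

  /* Raise this process's soft open-file limit to the hard maximum. */
  static int raise_fd_limit(void)
  {
      struct rlimit rl;

      if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
          perror("getrlimit");
          return -1;
      }
      rl.rlim_cur = rl.rlim_max;  /* bump soft limit up to the hard limit */
      if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
          perror("setrlimit");
          return -1;
      }
      return 0;
  }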


On 11/22/06 2:16 PM, "Rolf Vandevaart" <rolf.vandeva...@sun.com> wrote:

> 
> Hi Lydia:
> 
> errno 24 means "Too many open files".  When we have seen this, I believe
> we increased the number of file descriptors available to the mpirun process
> to get past this.
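> (errno 24 is EMFILE on both Solaris and Linux.)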
> 
> In my case, my shell (tcsh) defaults to 256.  I increase it with a call
> to "limit descriptors" as shown below.  I think other shells may have
> other commands.
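> (In Bourne-style shells such as bash, the equivalent is "ulimit -n",
> e.g. "ulimit -n 64000".)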
> 
>  burl-ct-v40z-0 41 =>limit
> cputime         unlimited
> filesize        unlimited
> datasize        unlimited
> stacksize       10240 kbytes
> coredumpsize    0 kbytes
> vmemoryuse      unlimited
> descriptors     256
>  burl-ct-v40z-0 42 =>limit descriptors 64000
>  burl-ct-v40z-0 43 =>limit
> cputime         unlimited
> filesize        unlimited
> datasize        unlimited
> stacksize       10240 kbytes
> coredumpsize    0 kbytes
> vmemoryuse      unlimited
> descriptors     64000
>  burl-ct-v40z-0 44 =>
> 
> 
> Lydia Heck wrote On 11/22/06 15:45,:
> 
>> I have - again - successfully built and installed
>> mx and openmpi, and I can run 64- and 128-CPU jobs on a 256-CPU cluster.
>> The version of openmpi is 1.2b1.
>> 
>> compiler used: studio11
>> 
>> The code is a benchmark, b_eff, which usually runs fine - I have used
>> it extensively for benchmarking.
>> 
>> When I try 192 CPUs I get
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [... the same message repeated many more times ...]
>> 
>> The myrinet ports have been opened and the job is running,
>> as the process list on one of the nodes shows:
>> 
>> ps -eaf | grep dph0elh
>> dph0elh  1068     1   0 20:40:00 ??          0:00 /opt/ompi/bin/orted
>> --bootproxy 1 --name 0.0.64 --num_procs 65 --vpid_start 0 -
>>    root  1110  1106   0 20:43:46 pts/4       0:00 grep dph0elh
>> dph0elh  1070  1068   0 20:40:02 ??          0:00 ../b_eff
>> dph0elh  1074  1068   0 20:40:02 ??          0:00 ../b_eff
>> dph0elh  1072  1068   0 20:40:02 ??          0:00 ../b_eff
>> 
>> Any ideas?
>> 
>> Lydia
>> 
>> 
>> ------------------------------------------
>> Dr E L  Heck
>> 
>> University of Durham
>> Institute for Computational Cosmology
>> Ogden Centre
>> Department of Physics
>> South Road
>> 
>> DURHAM, DH1 3LE
>> United Kingdom
>> 
>> e-mail: lydia.h...@durham.ac.uk
>> 
>> Tel.: + 44 191 - 334 3628
>> Fax.: + 44 191 - 334 3645
>> ___________________________________________
>>  
>> 

