Hi Lydia:

errno 24 means "Too many open files" (EMFILE).  When we have seen this before,
I believe we got past it by increasing the number of file descriptors
available to the mpirun process.

In my case, my shell (tcsh) defaults to 256 descriptors. I raise the limit with
"limit descriptors", as shown below.  I think other shells use different commands.

burl-ct-v40z-0 41 =>limit
cputime         unlimited
filesize        unlimited
datasize        unlimited
stacksize       10240 kbytes
coredumpsize    0 kbytes
vmemoryuse      unlimited
descriptors     256
burl-ct-v40z-0 42 =>limit descriptors 64000
burl-ct-v40z-0 43 =>limit
cputime         unlimited
filesize        unlimited
datasize        unlimited
stacksize       10240 kbytes
coredumpsize    0 kbytes
vmemoryuse      unlimited
descriptors     64000
burl-ct-v40z-0 44 =>
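
For Bourne-style shells (sh, bash, ksh) the counterpart of tcsh's "limit descriptors" is the "ulimit -n" builtin. What follows is only a minimal sketch, not something taken from the original report: the TARGET value of 64000 simply mirrors the tcsh session above, and it assumes the hard limit on your system is high enough (otherwise an administrator has to raise it). The limit is inherited by child processes, so it has to be raised in the same shell that later launches mpirun.

```shell
# Sketch: raise the soft open-file limit in a Bourne-style shell.
# TARGET is an illustrative value copied from the tcsh example, not a recommendation.
TARGET=64000

soft=$(ulimit -Sn)   # current soft limit (what the process actually gets)
hard=$(ulimit -Hn)   # ceiling an unprivileged user may raise the soft limit to

# Never ask for more than the hard limit allows.
if [ "$hard" != "unlimited" ] && [ "$TARGET" -gt "$hard" ]; then
    TARGET=$hard
fi

# Only raise the limit, never lower it.
if [ "$soft" != "unlimited" ] && [ "$soft" -lt "$TARGET" ]; then
    ulimit -Sn "$TARGET" || echo "could not raise descriptor limit" >&2
fi

echo "descriptors: soft=$(ulimit -Sn) hard=$(ulimit -Hn)"
```

Run these commands in the interactive shell or job script before calling mpirun; starting them in a separate subshell would not change the limit of the shell that launches the job.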


Lydia Heck wrote on 11/22/06 15:45:

I have, again, successfully built and installed MX and Open MPI,
and I can run 64- and 128-CPU jobs on a 256-CPU cluster.
The version of Open MPI is 1.2b1.

compiler used: studio11

The code is the b_eff benchmark, which usually runs fine; I have used it
extensively for benchmarking.

When I try 192 CPUs I get:
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
...........
..............
..............

The Myrinet ports have been opened and the job is running,
as the process list on one of the nodes shows:

ps -eaf | grep dph0elh
dph0elh  1068     1   0 20:40:00 ??          0:00 /opt/ompi/bin/orted
--bootproxy 1 --name 0.0.64 --num_procs 65 --vpid_start 0 -
   root  1110  1106   0 20:43:46 pts/4       0:00 grep dph0elh
dph0elh  1070  1068   0 20:40:02 ??          0:00 ../b_eff
dph0elh  1074  1068   0 20:40:02 ??          0:00 ../b_eff
dph0elh  1072  1068   0 20:40:02 ??          0:00 ../b_eff

Any ideas?

Lydia


------------------------------------------
Dr E L  Heck

University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
South Road

DURHAM, DH1 3LE
United Kingdom

e-mail: lydia.h...@durham.ac.uk

Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645
___________________________________________

--

=========================
rolf.vandeva...@sun.com
781-442-3043
=========================
