Hi Lydia:
errno 24 is EMFILE, "Too many open files". When we have seen this before,
I believe we got past it by increasing the number of file descriptors
available to the mpirun process.
In my case, my shell (tcsh) defaults to 256 descriptors. I raise the limit
with a call to "limit descriptors", as shown below. I think other shells
use other commands (Bourne-style shells use "ulimit -n").
burl-ct-v40z-0 41 =>limit
cputime unlimited
filesize unlimited
datasize unlimited
stacksize 10240 kbytes
coredumpsize 0 kbytes
vmemoryuse unlimited
descriptors 256
burl-ct-v40z-0 42 =>limit descriptors 64000
burl-ct-v40z-0 43 =>limit
cputime unlimited
filesize unlimited
datasize unlimited
stacksize 10240 kbytes
coredumpsize 0 kbytes
vmemoryuse unlimited
descriptors 64000
burl-ct-v40z-0 44 =>
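If it is easier to do this from a launcher script than from the shell, the same check and adjustment can be sketched in Python. This is only a rough illustration, not something we have run on your cluster: the helper names describe_errno and raise_descriptor_limit are made up for this example, and 64000 is simply the value from my session above. It uses the standard errno, os and resource modules, and it can only raise the soft limit as far as the hard limit allows.

```python
import errno
import os
import resource


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno such as 24."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


def raise_descriptor_limit(wanted):
    """Raise the soft open-file limit toward `wanted` (hypothetical helper).

    The soft limit is never pushed past the hard limit, because an
    unprivileged process is not allowed to do that.
    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # RLIM_INFINITY means "no cap", so only clamp against a finite hard limit.
    if hard == resource.RLIM_INFINITY:
        target = wanted
    else:
        target = min(wanted, hard)

    # Only ever raise the limit; leave an unlimited or larger soft limit alone.
    if soft != resource.RLIM_INFINITY and target > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        except (ValueError, OSError):
            # Some systems refuse values above a kernel-wide maximum;
            # in that case keep the current limit.
            pass

    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print(describe_errno(24))
    print(raise_descriptor_limit(64000))
```

Resource limits are inherited by child processes, so the new limit only helps mpirun if mpirun is started from the process that raised it, after the call.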
Lydia Heck wrote on 11/22/06 15:45:
I have, again, successfully built and installed MX and Open MPI, and I can
run 64- and 128-CPU jobs on a 256-CPU cluster.
Open MPI version: 1.2b1
Compiler used: studio11
The code is the b_eff benchmark, which usually runs fine; I have used it
extensively for benchmarking.
When I try 192 CPUs I get:
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
...........
..............
..............
The Myrinet ports have been opened and the job is running,
as the process list on one of the nodes shows:
ps -eaf | grep dph0elh
dph0elh 1068 1 0 20:40:00 ?? 0:00 /opt/ompi/bin/orted
--bootproxy 1 --name 0.0.64 --num_procs 65 --vpid_start 0 -
root 1110 1106 0 20:43:46 pts/4 0:00 grep dph0elh
dph0elh 1070 1068 0 20:40:02 ?? 0:00 ../b_eff
dph0elh 1074 1068 0 20:40:02 ?? 0:00 ../b_eff
dph0elh 1072 1068 0 20:40:02 ?? 0:00 ../b_eff
Any ideas?
Lydia
------------------------------------------
Dr E L Heck
University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
South Road
DURHAM, DH1 3LE
United Kingdom
e-mail: lydia.h...@durham.ac.uk
Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645
___________________________________________
--
=========================
rolf.vandeva...@sun.com
781-442-3043
=========================