One of our users/friends has also sent us some example code to do this internally - I hope to find the time to include that capability in the code base shortly. I'll advise when we do.
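To give a rough idea of what "internally" could mean here, the following is only a sketch, and it is neither the code that was sent to us nor anything that exists in Open MPI today. It assumes a POSIX system and uses Python's standard `resource` module to raise the soft limit on open file descriptors toward the hard limit from inside the process. The function name `raise_fd_limit` and the default of 64000 (borrowed from the tcsh example quoted below) are placeholders chosen for illustration.

```python
import resource


def raise_fd_limit(wanted=64000):
    """Try to raise the soft RLIMIT_NOFILE; return (old_soft, new_soft).

    Hypothetical helper for illustration only. The soft limit can be
    raised up to the hard limit without extra privileges.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY

    # Nothing to do if the soft limit is already unlimited.
    if soft == unlimited:
        return soft, soft

    # Never ask for more than the hard limit allows.
    target = wanted if hard == unlimited else min(wanted, hard)
    if target <= soft:
        return soft, soft

    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        # The system refused the request; keep the current limit.
        return soft, soft

    new_soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft, new_soft


if __name__ == "__main__":
    old, new = raise_fd_limit()
    print(f"descriptors: {old} -> {new}")
```

A launcher would have to make a call like this before it starts accepting connections, since the limit only matters at the moment a new descriptor is opened. If the hard limit itself is too low, it still has to be raised by the administrator.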

On 11/22/06 2:16 PM, "Rolf Vandevaart" <rolf.vandeva...@sun.com> wrote:

> Hi Lydia:
>
> errno 24 means "Too many open files". When we have seen this, I believe
> we increased the number of file descriptors available to the mpirun process
> to get past this.
>
> In my case, my shell (tcsh) defaults to 256. I increase it with a call
> to "limit descriptors"
> as shown below. I think other shells may have other commands.
>
> burl-ct-v40z-0 41 =>limit
> cputime         unlimited
> filesize        unlimited
> datasize        unlimited
> stacksize       10240 kbytes
> coredumpsize    0 kbytes
> vmemoryuse      unlimited
> descriptors     256
> burl-ct-v40z-0 42 =>limit descriptors 64000
> burl-ct-v40z-0 43 =>limit
> cputime         unlimited
> filesize        unlimited
> datasize        unlimited
> stacksize       10240 kbytes
> coredumpsize    0 kbytes
> vmemoryuse      unlimited
> descriptors     64000
> burl-ct-v40z-0 44 =>
>
>
> Lydia Heck wrote On 11/22/06 15:45,:
>
>> I have - again - successfully built and installed
>> mx and openmpi and I can run 64 and 128 cpus jobs on a 256 CPU cluster
>> version of openmpi is 1.2b1
>>
>> compiler used: studio11
>>
>> The code is a benchmark b_eff which runs usually fine - I have used
>> extensively
>> it for benchmarking
>>
>> When I try 192 CPUs I get
>> m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> ...........
>> ..............
>> ..............
>>
>> The myrinet ports have been opened and the job is running
>> as one of the nodes shows ....
>>
>> ps -eaf | grep dph0elh
>> dph0elh 1068 1 0 20:40:00 ?? 0:00 /opt/ompi/bin/orted
>> --bootproxy 1 --name 0.0.64 --num_procs 65 --vpid_start 0 -
>> root 1110 1106 0 20:43:46 pts/4 0:00 grep dph0elh
>> dph0elh 1070 1068 0 20:40:02 ?? 0:00 ../b_eff
>> dph0elh 1074 1068 0 20:40:02 ?? 0:00 ../b_eff
>> dph0elh 1072 1068 0 20:40:02 ?? 0:00 ../b_eff
>>
>> any idea ?
>>
>> Lydia
>>
>>
>> ------------------------------------------
>> Dr E L Heck
>>
>> University of Durham
>> Institute for Computational Cosmology
>> Ogden Centre
>> Department of Physics
>> South Road
>>
>> DURHAM, DH1 3LE
>> United Kingdom
>>
>> e-mail: lydia.h...@durham.ac.uk
>>
>> Tel.: + 44 191 - 334 3628
>> Fax.: + 44 191 - 334 3645
>> ___________________________________________
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>