I have, once again, successfully built and installed MX and Open MPI, and I can run 64- and 128-CPU jobs on a 256-CPU cluster. The Open MPI version is 1.2b1.
Compiler used: studio11

The code is the b_eff benchmark, which usually runs fine; I have used it extensively for benchmarking.

When I try 192 CPUs I get

[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
........... .............. ..............

The Myrinet ports have been opened and the job is running, as one of the nodes shows:

ps -eaf | grep dph0elh
dph0elh 1068 1 0 20:40:00 ?? 0:00 /opt/ompi/bin/orted --bootproxy 1 --name 0.0.64 --num_procs 65 --vpid_start 0 -
root 1110 1106 0 20:43:46 pts/4 0:00 grep dph0elh
dph0elh 1070 1068 0 20:40:02 ?? 0:00 ../b_eff
dph0elh 1074 1068 0 20:40:02 ?? 0:00 ../b_eff
dph0elh 1072 1068 0 20:40:02 ?? 0:00 ../b_eff

Any ideas?

Lydia

------------------------------------------
Dr E L Heck
University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
South Road
DURHAM, DH1 3LE
United Kingdom
e-mail: lydia.h...@durham.ac.uk
Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645