In one of our big runs (512 CPUs) the code fails, and a number of nodes produce the type of error shown below. I have searched the FAQs but could not find an answer there. The code is difficult to run because of its sheer size, but there is no other indication of the problem.

Does the following error message mean that some of the nodes have given up? (The output from several nodes is interleaved.)

mca_btl_tcp_frag_send] mca_btl_tcp_frag_send: writev error ([361eca8[m2234][0,1,283][m2317, 16][0,) 1Bad address,422(3) ][[
/ws/hpc-ct-7.1/builds/7.1.build-ct7.1-003c/ompi-ct7.1/ompi/mca/btl/tcp/btl_tcp_frag.c:114:mca_btl_tcp_frag_send]
/ws/hpc-ct-7.1/builds/7.1.build-ct7.1-003c/ompi-ct7.1/ompi/mca/btl/tcp/btl_tcp_frag.c[m2241][0,1,430][m2140[m2152][0,1,150][mca_btl_tcp_frag_send: writev error (3c759a8, 16) Bad address(3)

Lydia

------------------------------------------
Dr E L Heck
University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
South Road
DURHAM, DH1 3LE
United Kingdom
e-mail: lydia.h...@durham.ac.uk
Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645