In one of our big runs (512 CPUs) the code fails, and a number of nodes
report errors of the type shown below.

I have searched the FAQs but could not find an answer there.
The code is difficult to run because of its sheer size,
but there is no other indication of the problem.

Does the following error message mean that some of the nodes have given up?


mca_btl_tcp_frag_send] mca_btl_tcp_frag_send: writev error
([361eca8[m2234][0,1,283][m2317, 16][0,)
        1Bad address,422(3)
][[
/ws/hpc-ct-7.1/builds/7.1.build-ct7.1-003c/ompi-ct7.1/ompi/mca/btl/tcp/btl_tcp_frag.c:114:mca_btl_tcp
_frag_send]
/ws/hpc-ct-7.1/builds/7.1.build-ct7.1-003c/ompi-ct7.1/ompi/mca/btl/tcp/btl_tcp_frag.c[m22
41][0,1,430][m2140[m2152][0,1,150][mca_btl_tcp_frag_send: writev error (3c759a8,
16)
        Bad address(3)


Lydia

------------------------------------------
Dr E L  Heck

University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
South Road

DURHAM, DH1 3LE
United Kingdom

e-mail: lydia.h...@durham.ac.uk

Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645
___________________________________________
