Hi,

I am running some matrix-algebra-based calculations on Amazon EC2 (HVM instances running Ubuntu 11.1 with OpenMPI 1.6.4 and Python bindings via mpi4py 1.3). I am using StarCluster to spin up instances, so all nodes in a given cluster are in the same placement group (i.e. on the same 10 Gb network).
My calculations are CPU-bound, and I use just a few collective operations (namely allgatherv, scatterv, bcast, and reduce) that exchange a non-trivial amount of data: the full distributed dense matrix reaches tens of gigabytes (e.g. I call allgatherv on that matrix).

For smaller matrix sizes everything works fine, but once I start increasing the number of parameters in my models and, as a result, increase the number of nodes/workers to keep up, I get errors like these:

[node005][[18726,1],125][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node008][[18726,1],8][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node008][[18726,1],108][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node008][[18726,1],28][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node007][[18726,1],7][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node001][[18726,1],21][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)

I've also seen other network-related errors such as "unable to find path to host". Whenever I get these errors, one or more of the EC2 nodes becomes "unreachable" according to the EC2 web UI (even though I can still log in to those nodes using their internal IP aliases). Such nodes typically recover from being "unreachable" after a few minutes, but my MPI job hangs anyway. The culprit is usually allgatherv, but I've seen reduce and bcast cause these errors as well.

I don't get these errors if I run on a single node (but that's way too slow even with 16 workers, so I need to run my jobs on at least 20 nodes).

Any idea how to fix this? Maybe by adjusting some Linux and/or OpenMPI parameters? Any help would be greatly appreciated!

Thanks,
Yevgeny
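
P.S. In case it helps, here is a stripped-down sketch of the Allgatherv pattern I use. The matrix dimensions, the row-block split, and the random data are just placeholders; the real matrices come out of the model itself and are far larger:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each worker owns a contiguous block of rows of the full dense matrix.
n_rows, n_cols = 4000, 1000   # placeholder sizes; the real matrix is tens of GB
counts = [(n_rows // size + (r < n_rows % size)) * n_cols for r in range(size)]
displs = [sum(counts[:r]) for r in range(size)]

# Stand-in for this rank's locally computed block of rows.
local_block = np.random.rand(counts[rank] // n_cols, n_cols)

# Gather all blocks so that every rank ends up with the full matrix.
full_matrix = np.empty((n_rows, n_cols), dtype='d')
comm.Allgatherv([local_block, MPI.DOUBLE],
                [full_matrix, counts, displs, MPI.DOUBLE])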