Hi,

 I am running some matrix-algebra-based calculations on Amazon EC2 (HVM
instances running Ubuntu 11.1 with OpenMPI 1.6.4 and Python bindings via
mpi4py 1.3). I am using StarCluster to spin up instances, so all nodes in
a given cluster are in the same placement group (i.e. on the same
10-gigabit network).

My calculations are CPU-bound and I use just a few collective operations
(namely allgatherv, scatterv, bcast, and reduce) that exchange a
non-trivial amount of data (the full distributed dense matrix reaches
tens of gigabytes in size - e.g. I use allgatherv on that matrix).
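
To give a sense of the communication pattern, it looks roughly like the
sketch below (toy sizes and made-up array names, not my actual code; in the
real runs the gathered matrix is tens of gigabytes):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Toy dimensions only; in the real runs the full matrix is tens of GB.
n_rows, n_cols = 1000, 100
rows_per_rank = [n_rows // size + (1 if r < n_rows % size else 0)
                 for r in range(size)]
counts = [r * n_cols for r in rows_per_rank]     # elements sent by each rank
displs = [sum(counts[:r]) for r in range(size)]  # element offsets

# Each worker fills in its own block of rows...
local_block = np.random.rand(rows_per_rank[rank], n_cols)

# ...and every rank assembles the full dense matrix.
full_matrix = np.empty((n_rows, n_cols), dtype='d')
comm.Allgatherv([local_block, MPI.DOUBLE],
                [full_matrix, counts, displs, MPI.DOUBLE])

# The other collectives follow the same buffer-based pattern, e.g.:
comm.Bcast([full_matrix, MPI.DOUBLE], root=0)
comm.Scatterv([full_matrix, counts, displs, MPI.DOUBLE],
              [local_block, MPI.DOUBLE], root=0)
partial = np.array([local_block.sum()])
total = np.zeros(1)
comm.Reduce([partial, MPI.DOUBLE], [total, MPI.DOUBLE], op=MPI.SUM, root=0)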

For smaller matrix sizes everything works fine, but once I start increasing
the number of parameters in my models and, as a result, increase the number
of nodes/workers to keep up, I get errors like this one:

[node005][[18726,1],125][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node008][[18726,1],8][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node008][[18726,1],108][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node008][[18726,1],28][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node007][[18726,1],7][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node001][[18726,1],21][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)

I've also seen other network-related errors such as "unable to find path to
host". Whenever I get these errors, one or more of the EC2 nodes becomes
"unreachable" according to the EC2 web UI (even though I can log in to those
nodes using their internal IP aliases). Such nodes typically recover from
being "unreachable" after a few minutes, but my MPI job hangs anyway. The
culprit is usually allgatherv, but I've seen reduce and bcast cause these
errors as well.

I don't get these errors if I run on a single node (but that's way too slow
even with 16 workers, so I need to run my jobs on at least 20 nodes).

Any idea how to fix this? Maybe by adjusting some Linux and/or OpenMPI
parameters?

Any help would be greatly appreciated!

Thanks,
Yevgeny
