Sounds like something is making the TCP connections unstable. Last time I looked at HVM instances, they were running something like 64G of memory? If you have more than one proc per node (as your output would indicate), and you are doing collectives on data that large, it's quite possible you are running out of memory due to the way the collective algorithms work - with allgatherv, every rank ends up buffering the full gathered matrix, not just its own slice - and perhaps trashing the connection (which would explain the node being unreachable until the OS can reset it).
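To see why the numbers can get away from you, here's a back-of-the-envelope sketch. All the figures in it (40G matrix, 16 procs/node, 64G nodes) are illustrative assumptions, not taken from your job:

    # Rough per-node memory demand after an allgatherv, where every
    # rank's receive buffer holds the FULL gathered matrix.
    full_matrix_gb = 40    # "tens of gigabytes" -- assumed value
    procs_per_node = 16    # one rank per core -- assumed value
    node_ram_gb = 64       # large HVM instance -- assumed value

    # local slice + full-size recv buffer per rank, times ranks per node
    per_rank_gb = full_matrix_gb / procs_per_node + full_matrix_gb
    per_node_gb = procs_per_node * per_rank_gb
    print("~%.0f GB needed per node vs %d GB available"
          % (per_node_gb, node_ram_gb))

That prints roughly 680 GB needed against 64 GB available - and that's before counting Open MPI's internal staging buffers. Even if your real matrix is smaller, you can see how a node gets driven into swap hard enough that the kernel stops servicing the TCP connections in time.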
You might try running with fewer procs/node to see if that helps - there's a sketch after the quoted message below.

On Apr 4, 2013, at 11:10 AM, Yevgeny Popkov <ypop...@gmail.com> wrote:

> Hi,
>
> I am running some matrix-algebra-based calculations on Amazon EC2 (HVM
> instances running Ubuntu 11.1 with OpenMPI 1.6.4 and Python bindings via
> mpi4py 1.3). I am using StarCluster to spin up instances, so all nodes in a
> given cluster are in the same placement group (i.e., on the same 10 Gb
> network).
>
> My calculations are CPU-bound and I use just a few collective operations
> (namely allgatherv, scatterv, bcast, and reduce) that exchange a non-trivial
> amount of data (the size of the full distributed dense matrix reaches tens of
> gigabytes - e.g., I use allgatherv on that matrix).
>
> For smaller matrix sizes everything works fine, but once I start increasing
> the number of parameters in my models and, as a result, increase the number
> of nodes/workers to keep up, I get errors like this one:
>
> [node005][[18726,1],125][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node008][[18726,1],8][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node008][[18726,1],108][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node008][[18726,1],28][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node007][[18726,1],7][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node001][[18726,1],21][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>
> I've also seen other network-related errors such as "unable to find path to
> host". Whenever I get these errors, one or more of the EC2 nodes becomes
> "unreachable" according to the EC2 Web UI (even though I can log in to those
> nodes using internal IP aliases). Such nodes typically recover from being
> "unreachable" after a few minutes, but my MPI job hangs anyway. The culprit
> is usually allgatherv, but I've seen reduce and bcast cause these errors as
> well.
>
> I don't get these errors if I run on a single node (but that's way too slow
> even with 16 workers, so I need to run my jobs on at least 20 nodes).
>
> Any idea how to fix this? Maybe by adjusting some Linux and/or OpenMPI
> parameters?
>
> Any help would be greatly appreciated!
>
> Thanks,
> Yevgeny
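For the fewer-procs-per-node experiment, something along these lines should work. The -npernode option is standard Open MPI mpirun (including the 1.6 series); the hostfile name and script name here are placeholders, not from your setup:

    mpirun -npernode 8 -hostfile hosts python my_solver.py

That caps the launch at 8 ranks per node instead of 16, halving the per-node footprint of the collectives; you'd need a few more nodes to keep the same total rank count, but if the errors go away it confirms memory pressure as the cause.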