I see - then the problem is that at least one node is unable to communicate via
TCP back to where mpirun is executing. Might be a firewall, or it could be that
we are selecting the wrong network if multiple NICs are around. I assume you
use additional nodes when running against the larger dataset?
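If the wrong network is the culprit, one quick test is to pin the OOB layer to
the interface (or subnet) that actually reaches the node running mpirun - a
sketch, assuming the cluster's private interface is eth0 (substitute your own
interface name or CIDR range):

    mpirun --mca oob_tcp_if_include eth0 ...
    mpirun --mca oob_tcp_if_include 192.168.1.0/24 ...

The btl_tcp_if_include parameter does the same for the MPI traffic itself, if
you are also using the TCP BTL.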
On Thu, 17 Jan 2013 11:54:13 -0800
Ralph Castain wrote:
> Or is this happening on startup of the larger job, or during a call to
> MPI_Comm_spawn?
This happens at startup: mpirun spawns the processes, and when they start
talking to each other during the setup phase, I get this kind of error. Running t
On Jan 17, 2013, at 2:25 AM, Jure Pečar wrote:
> On Wed, 16 Jan 2013 07:46:41 -0800
> Ralph Castain wrote:
>
>> This one means that a backend node lost its connection to mpirun. We use a
>> TCP socket between the daemon on a node and mpirun to launch the processes
>> and to detect if/when that node fails for some reason.
On Wed, 16 Jan 2013 07:46:41 -0800
Ralph Castain wrote:
> This one means that a backend node lost its connection to mpirun. We use a
> TCP socket between the daemon on a node and mpirun to launch the processes
> and to detect if/when that node fails for some reason.
Hm. And what would be the reason for that?
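One way to watch this daemon-to-mpirun OOB traffic, and to see which
connection is dropping, is to raise Open MPI's framework verbosity - a sketch,
with illustrative verbosity levels (adjust to taste):

    mpirun --mca oob_base_verbose 10 --mca plm_base_verbose 5 ...

This should print the OOB connection setup and the launch daemons' state
transitions, which usually narrows the failure down to a specific node.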
On Jan 16, 2013, at 7:41 AM, Jure Pečar wrote:
>
> Hello,
>
> I have a large Fortran code processing data (weather forecast). It runs OK
> with a smaller dataset, but on a larger dataset I get some errors I've never
> seen before:
>
> [node061:05144] [[55141,0],11]->[[55141,0],0] mca_oob_tcp_msg