Re: [OMPI users] help me understand these error msgs

2013-01-22 Thread Ralph Castain
I see - then the problem is that at least one node is unable to communicate via TCP back to where mpirun is executing. Might be a firewall, or could be that we are selecting the wrong network if multiple NICs are around. I assume that you use additional nodes when running against the larger data
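One way to check the TCP theory is to try opening a connection from a compute node back to the host where mpirun runs. The following is only a minimal sketch and not part of Open MPI: the helper name `can_reach` is hypothetical, and you have to supply a host and a port on which something is actually listening on the mpirun side.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened.

    Hypothetical helper for illustration; host and port are whatever
    listener you start yourself on the node where mpirun executes.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts (typical when a firewall
        # silently drops packets) and name-resolution failures.
        return False
```

Calling `can_reach("headnode", 5000)` from each compute node (both values are placeholders) shows which nodes cannot get back to the launch host. If the check only fails over one of several interfaces, restricting Open MPI to the working one with the `oob_tcp_if_include` and `btl_tcp_if_include` MCA parameters may help; check `ompi_info` to confirm they are available in your version.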

Re: [OMPI users] help me understand these error msgs

2013-01-22 Thread Jure Pečar
On Thu, 17 Jan 2013 11:54:13 -0800 Ralph Castain wrote:
> Or is this happening on startup of the larger job, or during a call to
> MPI_Comm_spawn?

This happens on startup. mpirun spawns processes, and when they start talking to each other during the setup phase, I get this kind of error. Running t

Re: [OMPI users] help me understand these error msgs

2013-01-17 Thread Ralph Castain
On Jan 17, 2013, at 2:25 AM, Jure Pečar wrote:
> On Wed, 16 Jan 2013 07:46:41 -0800
> Ralph Castain wrote:
>
>> This one means that a backend node lost its connection to mpirun. We use a
>> TCP socket between the daemon on a node and mpirun to launch the processes
>> and to detect if/when th

Re: [OMPI users] help me understand these error msgs

2013-01-17 Thread Jure Pečar
On Wed, 16 Jan 2013 07:46:41 -0800 Ralph Castain wrote:
> This one means that a backend node lost its connection to mpirun. We use a
> TCP socket between the daemon on a node and mpirun to launch the processes
> and to detect if/when that node fails for some reason.

Hm. And what would be the r

Re: [OMPI users] help me understand these error msgs

2013-01-16 Thread Ralph Castain
On Jan 16, 2013, at 7:41 AM, Jure Pečar wrote:
>
> Hello,
>
> I have a large fortran code processing data (weather forecast). It runs ok
> with smaller dataset, but on larger dataset I get some errors I've never seen
> before:
>
> node061:05144] [[55141,0],11]->[[55141,0],0] mca_oob_tcp_msg