Re: [OMPI users] reported number of processes emitting error much larger than number started/spawned by mpiexec?

2015-09-20 Thread Lev Givon
Received from Ralph Castain on Sun, Sep 20, 2015 at 06:54:41PM EDT: (snip) > > On a closer look, it seems that the "17" corresponds to the number of times the error was emitted after its occurrence regardless of how many actual MPI processes were running (each of the MPI process

Re: [OMPI users] reported number of processes emitting error much larger than number started/spawned by mpiexec?

2015-09-20 Thread Ralph Castain
> On Sep 20, 2015, at 2:30 PM, Lev Givon wrote: > > Received from Ralph Castain on Sun, Sep 20, 2015 at 05:08:10PM EDT: >>> On Sep 20, 2015, at 12:57 PM, Lev Givon wrote: >>> >>> While debugging a problem that is causing emission of a non-fatal OpenMPI error message to stderr, the err

Re: [OMPI users] send() to socket 9 failed with the 1.10.0 version but not with 1.8.7 one.

2015-09-20 Thread Jorge D'Elia
Hi Ralph, Many thanks for your fast answer! - Original message - > From: "Ralph Castain" > To: "Open MPI Users" > Sent: Sunday, September 20, 2015 18:16:56 > Subject: Re: [OMPI users] send() to socket 9 failed with the 1.10.0 version > but not with 1.8.7 one. > > Is the connecti

Re: [OMPI users] reported number of processes emitting error much larger than number started/spawned by mpiexec?

2015-09-20 Thread Lev Givon
Received from Ralph Castain on Sun, Sep 20, 2015 at 05:08:10PM EDT: > > On Sep 20, 2015, at 12:57 PM, Lev Givon wrote: > > > > While debugging a problem that is causing emission of a non-fatal OpenMPI error message to stderr, the error message is followed by a line similar to the fol

Re: [OMPI users] send() to socket 9 failed with the 1.10.0 version but not with 1.8.7 one.

2015-09-20 Thread Ralph Castain
Is the connection from node1 to the head node a direct one, or is there a difference in the ethernet subnets between them? Can you show us the output of ifconfig from each node? > On Sep 20, 2015, at 12:19 PM, Jorge D'Elia wrote: > > Hi all, > > We have used the Open MPI distributions up to

Re: [OMPI users] reported number of processes emitting error much larger than number started/spawned by mpiexec?

2015-09-20 Thread Ralph Castain
Just to be clear: you are starting the single process using “srun -n 1 ./app”, and the app calls MPI_Comm_spawn? I’m not sure that’s really supported…I think there might be something in Slurm behind that call, but I have no idea if it really works. > On Sep 20, 2015, at 12:57 PM, Lev Givon wr

[OMPI users] reported number of processes emitting error much larger than number started/spawned by mpiexec?

2015-09-20 Thread Lev Givon
While debugging a problem that is causing emission of a non-fatal OpenMPI error message to stderr, the error message is followed by a line similar to the following (I have help message aggregation turned on): [myhost:10008] 17 more processes have sent help message some_file.txt / blah blah failed

[OMPI users] send() to socket 9 failed with the 1.10.0 version but not with 1.8.7 one.

2015-09-20 Thread Jorge D'Elia
Hi all, We have used the Open MPI distributions up to the 1.8.7 version without any problem in a small LINUX cluster built with diskless nodes (x86_64, Fedora 17, Linux version 4.1.1 (gcc version 4.7.2 20120921 (Red Hat 4.7.2-2) (GCC))). However, from the 1.8.8 version, we have a problem with

Re: [OMPI users] C/R Enabled Debugging

2015-09-20 Thread Ralph Castain
Hi Zhang We have seen little interest in binary level CR over the years, which is the primary reason the support has lapsed. The approach just doesn’t scale very well. Once the graduate student who wrote it received his degree, there simply wasn’t enough user-level interest to motivate the deve