*MPI_ERR_PROC_FAILED is not yet a valid error in MPI. It is coming from ULFM, an extension to MPI that is not yet in the OMPI master.*
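For reference, a minimal sketch (generic MPI error handling, not taken from Daniel's test, which is not shown here) of one way an application can ask MPI to return error codes from a collective and translate them into a readable message such as the "74: MPI_ERR_PROC_FAILED: Process Failure" reported below:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Ask MPI to return error codes instead of aborting; the default
         * handler (MPI_ERRORS_ARE_FATAL) would terminate the job on error. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        int rc = MPI_Barrier(MPI_COMM_WORLD);
        if (rc != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING];
            int len, eclass;
            MPI_Error_class(rc, &eclass);    /* numeric error class */
            MPI_Error_string(rc, msg, &len); /* human-readable description */
            fprintf(stderr, "error %d: %s\n", eclass, msg);
        }

        MPI_Finalize();
        return 0;
    }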
*Daniel, what version of Open MPI are you using? Are you sure you are not mixing multiple versions due to PATH/LD_LIBRARY_PATH?*

*George.*

On Mon, Jan 11, 2021 at 21:31 Gilles Gouaillardet via users <users@lists.open-mpi.org> wrote:

> Daniel,
>
> the test works in my environment (1 node, 32 GB memory) with all the
> mentioned parameters.
>
> Did you check the memory usage on your nodes and make sure the oom
> killer did not kill any process?
>
> Cheers,
>
> Gilles
>
> On Tue, Jan 12, 2021 at 1:48 AM Daniel Torres via users
> <users@lists.open-mpi.org> wrote:
> >
> > Hi.
> >
> > Thanks for responding. I have taken the most important parts of my
> > code and created a test that reproduces the behavior I described
> > previously.
> >
> > I attach to this e-mail the compressed file "test.tar.gz". Inside it,
> > you can find:
> >
> > 1.- The .c source code "test.c", which I compiled with "mpicc -g -O3
> > test.c -o test -lm". The main work is performed in the function
> > "work_on_grid", starting at line 162.
> > 2.- Four execution examples on two different machines (my own and a
> > cluster machine), which I executed with "mpiexec -np 16 --machinefile
> > hostfile --map-by node --mca btl tcp,vader,self --mca btl_base_verbose 100
> > ./test 4096 4096", varying the last two arguments with 4096, 8192 and 16384
> > (a matrix size). The error appears with the bigger sizes (8192 on my
> > machine, 16384 on the cluster).
> > 3.- The "ompi_info -a" output from the two machines.
> > 4.- The hostfile.
> >
> > The duration of the delay is just a few seconds, about 3 ~ 4.
> >
> > Essentially, the first error message I get from a waiting process is
> > "74: MPI_ERR_PROC_FAILED: Process Failure".
> >
> > Hope this information can help.
> >
> > Thanks a lot for your time.
> >
> > On 08/01/21 at 18:40, George Bosilca via users wrote:
> >
> > Daniel,
> >
> > There are no timeouts in OMPI with the exception of the initial
> > connection over TCP, where we use the socket timeout to prevent deadlocks.
> > As you already did quite a few communicator duplications and other
> > collective communications before you see the timeout, we need more info
> > about this. As Gilles indicated, having the complete output might help.
> > What is the duration of the delay for the waiting process? Also, can you
> > post a reproducer of this issue?
> >
> > George.
> >
> >
> > On Fri, Jan 8, 2021 at 9:03 AM Gilles Gouaillardet via users
> > <users@lists.open-mpi.org> wrote:
> >>
> >> Daniel,
> >>
> >> Can you please post the full error message and share a reproducer for
> >> this issue?
> >>
> >> Cheers,
> >>
> >> Gilles
> >>
> >> On Fri, Jan 8, 2021 at 10:25 PM Daniel Torres via users
> >> <users@lists.open-mpi.org> wrote:
> >> >
> >> > Hi all.
> >> >
> >> > I'm implementing an algorithm that creates a process grid and
> >> > divides it into row and column communicators as follows:
> >> >
> >> >              col_comm0  col_comm1  col_comm2  col_comm3
> >> > row_comm0       P0         P1         P2         P3
> >> > row_comm1       P4         P5         P6         P7
> >> > row_comm2       P8         P9        P10        P11
> >> > row_comm3      P12        P13        P14        P15
> >> >
> >> > Then, every process works on its own column communicator and
> >> > broadcasts data on the row communicators.
> >> > While column operations are being executed, processes not included in
> >> > the current column communicator just wait for results.
> >> >
> >> > At some point, a column communicator may be split to create a temporary
> >> > communicator so that only the right processes work on it.
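As an illustration of the decomposition described above, here is a minimal sketch assuming 16 ranks laid out row-major as in the table (the variable names are illustrative, not taken from test.c):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, q = 4;                 /* assumes a q x q grid on 16 processes */
        MPI_Comm row_comm, col_comm;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int my_row = rank / q;           /* P0..P3 -> row 0, P4..P7 -> row 1, ... */
        int my_col = rank % q;

        /* ranks with the same row index share a row communicator,
         * ranks with the same column index share a column communicator */
        MPI_Comm_split(MPI_COMM_WORLD, my_row, my_col, &row_comm);
        MPI_Comm_split(MPI_COMM_WORLD, my_col, my_row, &col_comm);

        /* ... column work on col_comm, broadcasts of the results on row_comm ... */

        MPI_Comm_free(&row_comm);
        MPI_Comm_free(&col_comm);
        MPI_Finalize();
        return 0;
    }

The third argument of MPI_Comm_split only fixes the rank order inside each new communicator; here it makes a process's rank in row_comm equal to its column index.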
> >> >
> >> > At the end of a step, a call to MPI_Barrier (on a duplicate of
> >> > MPI_COMM_WORLD) is executed to sync all processes and avoid bad results.
> >> >
> >> > With a small amount of data (a small matrix) the MPI_Barrier call
> >> > syncs correctly on the communicator that includes all processes and
> >> > processing ends fine.
> >> > But when the amount of data is increased (a big matrix), operations
> >> > on the column communicators take more time to finish, and hence the
> >> > waiting time for the waiting processes also increases.
> >> >
> >> > After a while, the waiting processes return an error when they have
> >> > not yet received the broadcast (MPI_Bcast) on the row communicators, or
> >> > when they have finished their work and reached the sync point
> >> > (MPI_Barrier). Then, when the operations on the current column
> >> > communicator end, the still-active processes try to broadcast on the row
> >> > communicators and fail because the waiting processes have already
> >> > returned an error. So all processes fail, at different moments in time.
> >> >
> >> > So my problem is that the waiting processes "believe" that the current
> >> > operations have failed (but they have not finished yet!) and they fail too.
> >> >
> >> > So I have a question about MPI_Bcast/MPI_Barrier:
> >> >
> >> > Is there a way to increase the timeout a process can wait for a
> >> > broadcast or barrier to complete?
> >> >
> >> > Here is my machine and Open MPI info:
> >> > - Open MPI version: Open MPI 4.1.0u1a1
> >> > - OS: Linux Daniel 5.4.0-52-generic #57-Ubuntu SMP Thu Oct 15
> >> > 10:57:00 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
> >> >
> >> > Thanks in advance for reading my description/question.
> >> >
> >> > Best regards.
> >> >
> >> > --
> >> > Daniel Torres
> >> > LIPN - Université Sorbonne Paris Nord
> >
> > --
> > Daniel Torres
> > LIPN - Université Sorbonne Paris Nord
>
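For reference, a sketch of one step of the pattern described in the original message, assuming row_comm and col_comm were built as in the earlier sketch; do_column_work, buf and count are hypothetical placeholders, not names from test.c:

    #include <mpi.h>

    /* Hypothetical helper standing in for the heavy per-column computation. */
    void do_column_work(MPI_Comm col_comm, double *buf, int count);

    /* One step: the active column computes, its result is broadcast along each
     * row, and everyone synchronizes on a duplicate of MPI_COMM_WORLD. */
    void grid_step(int step, int my_col, MPI_Comm row_comm, MPI_Comm col_comm,
                   MPI_Comm world_dup, double *buf, int count)
    {
        if (my_col == step) {
            /* work confined to the current column communicator (possibly on a
             * temporary communicator split off from col_comm) */
            do_column_work(col_comm, buf, count);
        }

        /* within each row, the rank belonging to column `step` is the root;
         * the other row members block here until the broadcast arrives */
        MPI_Bcast(buf, count, MPI_DOUBLE, step, row_comm);

        /* global synchronization point at the end of the step */
        MPI_Barrier(world_dup);
    }

As George notes, neither MPI_Bcast nor MPI_Barrier has a timeout in Open MPI, so on a healthy run the waiting ranks should simply block at these two calls rather than return an error such as MPI_ERR_PROC_FAILED.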