Daniel,

The test works in my environment (1 node, 32 GB memory) with all the mentioned parameters.
Did you check the memory usage on your nodes and make sure the OOM killer did not kill any process?

Cheers,

Gilles

On Tue, Jan 12, 2021 at 1:48 AM Daniel Torres via users <users@lists.open-mpi.org> wrote:
>
> Hi.
>
> Thanks for responding. I have taken the most important parts from my code and created a test that reproduces the behavior I described previously.
>
> I attach to this e-mail the compressed file "test.tar.gz". Inside it, you can find:
>
> 1.- The .c source code "test.c", which I compiled with "mpicc -g -O3 test.c -o test -lm". The main work is performed in the function "work_on_grid", starting at line 162.
> 2.- Four execution examples on two different machines (my own and a cluster machine), which I executed with "mpiexec -np 16 --machinefile hostfile --map-by node --mca btl tcp,vader,self --mca btl_base_verbose 100 ./test 4096 4096", varying the last two arguments over 4096, 8192 and 16384 (the matrix size). The error appears with the larger sizes (8192 on my machine, 16384 on the cluster).
> 3.- The "ompi_info -a" output from the two machines.
> 4.- The hostfile.
>
> The duration of the delay is just a few seconds, about 3~4.
>
> Essentially, the first error message I get from a waiting process is "74: MPI_ERR_PROC_FAILED: Process Failure".
>
> Hope this information can help.
>
> Thanks a lot for your time.
>
> On 08/01/21 at 18:40, George Bosilca via users wrote:
>
> Daniel,
>
> There are no timeouts in OMPI with the exception of the initial connection over TCP, where we use the socket timeout to prevent deadlocks. As you already did quite a few communicator duplications and other collective communications before you see the timeout, we need more info about this. As Gilles indicated, having the complete output might help. What is the duration of the delay for the waiting process? Also, can you post a reproducer of this issue?
>
> George.
>
> On Fri, Jan 8, 2021 at 9:03 AM Gilles Gouaillardet via users <users@lists.open-mpi.org> wrote:
>>
>> Daniel,
>>
>> Can you please post the full error message and share a reproducer for this issue?
>>
>> Cheers,
>>
>> Gilles
>>
>> On Fri, Jan 8, 2021 at 10:25 PM Daniel Torres via users <users@lists.open-mpi.org> wrote:
>> >
>> > Hi all.
>> >
>> > Actually, I'm implementing an algorithm that creates a process grid and divides it into row and column communicators as follows:
>> >
>> >              col_comm0   col_comm1   col_comm2   col_comm3
>> > row_comm0    P0          P1          P2          P3
>> > row_comm1    P4          P5          P6          P7
>> > row_comm2    P8          P9          P10         P11
>> > row_comm3    P12         P13         P14         P15
>> >
>> > Then, every process works on its own column communicator and broadcasts data on its row communicator. While column operations are being executed, the processes not included in the current column communicator just wait for the results.
>> >
>> > At some point, a column communicator may be split to create a temporary communicator so that only the right processes work on it.
>> >
>> > At the end of a step, a call to MPI_Barrier (on a duplicate of MPI_COMM_WORLD) is executed to sync all processes and avoid bad results.
>> >
>> > With a small amount of data (a small matrix), the MPI_Barrier call syncs correctly on the communicator that includes all processes and processing ends fine. But when the amount of data (a big matrix) increases, the operations on the column communicators take more time to finish and hence the waiting time also increases for the waiting processes.
>> >
>> > After some time, the waiting processes return an error when they have not yet received the broadcast (MPI_Bcast) on their row communicators, or when they have finished their work and are waiting at the sync point (MPI_Barrier). But when the operations on the current column communicator end, the still-active processes try to broadcast on the row communicators and they fail because the waiting processes have already returned an error. So all processes fail at different moments in time.
>> >
>> > So my problem is that the waiting processes "believe" that the current operations have failed (but they have not finished yet!) and so they fail too.
>> >
>> > So I have a question about MPI_Bcast/MPI_Barrier:
>> >
>> > Is there a way to increase the timeout a process can wait for a broadcast or barrier to complete?
>> >
>> > Here is my machine and Open MPI info:
>> > - Open MPI version: Open MPI 4.1.0u1a1
>> > - OS: Linux Daniel 5.4.0-52-generic #57-Ubuntu SMP Thu Oct 15 10:57:00 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
>> >
>> > Thanks in advance for reading my description/question.
>> >
>> > Best regards.
>> >
>> > --
>> > Daniel Torres
>> > LIPN - Université Sorbonne Paris Nord
>
> --
> Daniel Torres
> LIPN - Université Sorbonne Paris Nord
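
For readers who do not have the attached test.tar.gz, the communication pattern Daniel describes can be sketched roughly as below. This is a minimal, hypothetical C example (not the actual test.c): 16 ranks arranged as a 4x4 grid, split into row and column communicators with MPI_Comm_split, a broadcast along each row, and an end-of-step MPI_Barrier on a duplicate of MPI_COMM_WORLD. The names GRID_DIM and current_col, and the placeholder payload, are illustrative only.

/* Minimal sketch (not the attached test.c): 4x4 grid of 16 processes,
 * split into row/column communicators, with a row broadcast and a
 * barrier on a duplicate of MPI_COMM_WORLD, as in the description above. */
#include <mpi.h>
#include <stdio.h>

#define GRID_DIM 4   /* illustrative: 4x4 grid, 16 ranks */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != GRID_DIM * GRID_DIM) {
        if (rank == 0) fprintf(stderr, "run with %d ranks\n", GRID_DIM * GRID_DIM);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int row = rank / GRID_DIM;   /* row index in the grid    */
    int col = rank % GRID_DIM;   /* column index in the grid */

    /* row_comm groups ranks of the same row; col_comm groups ranks of the same column */
    MPI_Comm row_comm, col_comm, world_dup;
    MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm);
    MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);
    MPI_Comm_dup(MPI_COMM_WORLD, &world_dup);

    /* The "column work" would happen here on col_comm (omitted); the owner of the
     * current column then broadcasts its result along each row. */
    double result = (double)rank;   /* placeholder payload */
    int current_col = 0;            /* illustrative: column 0 owns this step */
    MPI_Bcast(&result, 1, MPI_DOUBLE, current_col, row_comm);

    /* end-of-step synchronization on a duplicate of MPI_COMM_WORLD */
    MPI_Barrier(world_dup);

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    MPI_Comm_free(&world_dup);
    MPI_Finalize();
    return 0;
}

In the reported failure, the ranks blocked in the MPI_Bcast or MPI_Barrier above are the ones that return MPI_ERR_PROC_FAILED while other ranks are still busy with their column work.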